Enterprise Voice Agent Integration

Connect a voice agent with a secure backend-to-frontend session flow.

This guide shows how to create an API key, configure an agent, generate a session from your backend, and pass only the short-lived credentials to the browser.

Browser receives

Short-lived session token

LiveKit websocket URL

Keep x-api-key and any backend credentials server-side. The frontend only needs enough information to connect to the audio room.

Secure by design

The enterprise API key belongs only on your backend. Every client-side interaction should use temporary session credentials returned by your server.

Flexible session overrides

Sessions can inherit agent defaults or override language, prompt, greeting, and voice for a single call.

Voice and text in one room

Once your backend returns the session payload, the frontend uses livekit-client to join the room, play the agent's voice from an audio track, and exchange text-chat messages over the room's data channel.

Setup flow

From API key generation to a live voice session

01
Generate an API key
Open app.oshara.ai/settings?tab=developer and create an API key (it starts with sk_...). Store it only in your backend secrets store. The key is shown once and must never be exposed to the browser.
02
Create a voice agent
Go to app.oshara.ai/agents and create an agent. Configure the default reference voice, system prompt, greeting message, and language. Keep the Agent ID from the agent detail page because you will send it when creating a session.
03
Enrich the agent
Optionally attach documents to the Knowledge Base and connect an MCP server if the agent needs external tools or actions during calls.
04
Create a session from your backend
Call the enterprise API directly from your backend. Send the API key in x-api-key, include the required agent field, and override language, system prompt, greeting, or reference_audio_url only when needed for that session.
05
Receive session credentials
A successful response returns a short-lived token, the LiveKit websocket URL, and a unique room name. These credentials are generated server-side for a single session handshake.
06
Forward credentials to the frontend
Return only the token and livekit_url to the browser. Keep the API key and any other backend-only secrets away from the client at all times.
07
Connect voice and text chat in the frontend
Install livekit-client in your frontend app and join the room with the temporary session credentials. The same room delivers the agent's voice over an audio track and exchanges chat messages over the data channel — publish user messages on the voice.user_text topic and render agent replies received on voice.reply.

Backend session request

Call the session endpoint from your backend service. Send the API key as a request header, and include the agent ID as the required JSON field.

POST /api/agents/agent-session/

POST https://api.oshara.ai/api/agents/agent-session/
x-api-key: sk_...
Content-Type: application/json

{
  "agent": "agent_id",
  "language": "en",
  "system_prompt": "...",
  "greeting": "...",
  "reference_audio_url": "..."
}
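
A minimal sketch of this request in TypeScript, assuming a backend runtime with a fetch-compatible HTTP client (for example the global fetch in Node 18+). The names createAgentSession, SessionOverrides, SessionData, and FetchLike are illustrative and not part of the Oshara API. The fetch implementation is passed in as an argument so the function can be exercised without network access, and the response shape is assumed to match the sample in the Session response section.

```typescript
// Hypothetical helper names; only the URL, header, and field names come from this guide.
type SessionOverrides = {
  language?: string;
  system_prompt?: string;
  greeting?: string;
  reference_audio_url?: string;
};

type SessionData = { token: string; livekit_url: string; room_name?: string };

// Minimal structural type so the sketch does not depend on DOM or Node typings.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

const SESSION_URL = "https://api.oshara.ai/api/agents/agent-session/";

export async function createAgentSession(
  fetchImpl: FetchLike,
  apiKey: string,
  agentId: string,
  overrides: SessionOverrides = {},
): Promise<SessionData> {
  const response = await fetchImpl(SESSION_URL, {
    method: "POST",
    headers: { "x-api-key": apiKey, "Content-Type": "application/json" },
    // Only "agent" is required; overrides left undefined are dropped by JSON.stringify.
    body: JSON.stringify({ agent: agentId, ...overrides }),
  });

  if (!response.ok) {
    throw new Error(`Session request failed with status ${response.status}`);
  }

  const payload = (await response.json()) as { data?: Partial<SessionData> };
  const token = payload.data?.token;
  const livekitUrl = payload.data?.livekit_url;
  if (!token || !livekitUrl) {
    throw new Error("Session response is missing token or livekit_url");
  }
  return { token, livekit_url: livekitUrl, room_name: payload.data?.room_name };
}
```

In a real service the call would look something like createAgentSession(fetch, apiKeyFromSecretsStore, agentId, { language: "en" }), where apiKeyFromSecretsStore stands for however your backend loads the key.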

Important notes

The API key is backend-only and should never be embedded in client code, NEXT_PUBLIC_ variables, or browser network requests.

The session payload can override the agent defaults, which makes it easy to personalize the same agent for different users or workflows.

For production, store the session response, room name, and call metadata on your server for audit trails and troubleshooting.

Typed messages are echoed back on the voice.reply topic with type: "user_text", the same shape used for STT transcripts of spoken input. If you render typed messages optimistically, dedupe the echo (as the frontend snippet does) so the same message does not appear twice.
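
The echo behaviour in the last note can be illustrated in isolation. The sketch below is a standalone version of the dedupe idea, assuming frames carry the type and text fields described above. EchoDeduper and ReplyFrame are hypothetical names introduced only for this example.

```typescript
// Hypothetical shape of a decoded "voice.reply" frame.
type ReplyFrame = { type?: string; text?: string };

class EchoDeduper {
  private pending: string[] = [];

  // Call when the user submits a typed message that was rendered optimistically.
  notePending(text: string): void {
    this.pending.push(text.trim());
  }

  // Returns true if the frame should be rendered, false if it is the echo
  // of a typed message that is already on screen.
  shouldRender(frame: ReplyFrame): boolean {
    if (frame.type !== "user_text") return true; // not a user message
    const index = this.pending.indexOf((frame.text ?? "").trim());
    if (index === -1) return true; // no typed match: treat as a spoken transcript
    this.pending.splice(index, 1); // consume exactly one pending entry
    return false;
  }
}

const deduper = new EchoDeduper();
deduper.notePending("hello");

// First "hello" is the echo of the typed message, so it is skipped.
const echoRendered = deduper.shouldRender({ type: "user_text", text: "hello" }); // false
// A later "hello" has no pending match, so it is treated as spoken input.
const spokenRendered = deduper.shouldRender({ type: "user_text", text: "hello" }); // true
// Frames that are not user_text are always rendered.
const replyRendered = deduper.shouldRender({ text: "Hi there" }); // true
```

Matching on text alone is a simplification: if the user types and then speaks the same words before the echo arrives, the first matching frame is dropped regardless of which one it was.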

Session response

After the backend creates the session, return only the credentials needed by the frontend. The token authorizes the room connection, and the LiveKit URL tells the client where to connect.

200 OK · JSON

{
  "success": true,
  "message": "Request successful",
  "data": {
    "token": "eyJ...",
    "livekit_url": "wss://audio-inference.oshara.ai",
    "room_name": "unique_room_id"
  }
}
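
One way to make sure nothing else leaks to the client is to copy the two fields explicitly instead of forwarding the upstream response. The following is a sketch under that assumption. toBrowserPayload and BrowserPayload are hypothetical names, and the data wrapper is kept only because the frontend snippet in this guide reads payload.data.

```typescript
// Hypothetical type for what the backend route returns to the browser.
type BrowserPayload = { data: { token: string; livekit_url: string } };

function toBrowserPayload(apiResponse: unknown): BrowserPayload {
  const data = (apiResponse as { data?: Record<string, unknown> } | null)?.data;
  const token = data?.token;
  const livekitUrl = data?.livekit_url;

  if (typeof token !== "string" || !token || typeof livekitUrl !== "string" || !livekitUrl) {
    throw new Error("Session response is missing token or livekit_url");
  }

  // Copy the two fields by name rather than spreading the upstream object,
  // so room_name and any future fields stay on the server.
  return { data: { token, livekit_url: livekitUrl } };
}

const forwarded = toBrowserPayload({
  success: true,
  message: "Request successful",
  data: {
    token: "eyJ...",
    livekit_url: "wss://audio-inference.oshara.ai",
    room_name: "unique_room_id",
  },
});
// forwarded.data contains only token and livekit_url.
```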

Frontend connection (voice + text chat)

Install livekit-client in the frontend app and connect using the credentials returned by your backend. The same room handles both the spoken audio track and a text-chat channel — publish user messages on the voice.user_text topic and listen for agent replies on voice.reply.

voice-agent.ts

import { Room, RoomEvent, Track } from "livekit-client";

type AgentSession = {
  token: string;
  livekit_url: string;
  room_name?: string;
};

type ChatMessage = {
  role: "user" | "assistant";
  text: string;
};

// The agent echoes typed messages back on "voice.reply" with
// type: "user_text" — the same shape it uses for STT transcripts of
// spoken input. We track texts the user just typed so we can render them
// optimistically and skip the matching echo when it arrives.
const pendingUserTextsByRoom = new WeakMap<Room, string[]>();

async function getAgentSession(agentId: string): Promise<AgentSession> {
  const response = await fetch("/api/voice-agent/session", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      agent: agentId,
      language: "en",
      system_prompt: "...",
      greeting: "...",
      reference_audio_url: "...",
    }),
  });

  if (!response.ok) {
    const message = await response.text();
    throw new Error(message || "Failed to create voice-agent session");
  }

  const payload = await response.json();
  return payload.data as AgentSession;
}

export async function startVoiceAgentCall(
  agentId: string,
  onMessage?: (message: ChatMessage) => void,
) {
  const room = new Room({
    adaptiveStream: true,
    dynacast: true,
  });
  pendingUserTextsByRoom.set(room, []);

  room.on(RoomEvent.Connected, () => {
    console.log("Connected to voice agent room");
  });

  room.on(RoomEvent.Disconnected, () => {
    pendingUserTextsByRoom.delete(room);
    console.log("Disconnected from room");
  });

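  // Play the agent's spoken audio track once it is subscribed.
  room.on(RoomEvent.TrackSubscribed, (track, publication, participant) => {
    if (track.kind === Track.Kind.Audio) {
      const element = track.attach();
      element.autoplay = true;
      document.body.appendChild(element);
      console.log("Subscribed to audio track from", participant.identity);
    }
  });

  // Remove the attached <audio> elements again so they do not pile up in the DOM.
  room.on(RoomEvent.TrackUnsubscribed, (track) => {
    track.detach().forEach((element) => element.remove());
  });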

  // Receive agent replies and STT transcripts on the "voice.reply" topic.
  room.on(RoomEvent.DataReceived, (payload, _participant, _kind, topic) => {
    if (topic !== "voice.reply") return;

    try {
      const data = JSON.parse(new TextDecoder().decode(payload));
      const text = (data.text || data.message || "").trim();
      if (!text) return;

      // Drop echoes of messages the user just typed (already rendered).
      if (data.type === "user_text") {
        const pending = pendingUserTextsByRoom.get(room);
        if (pending) {
          const index = pending.indexOf(text);
          if (index !== -1) {
            pending.splice(index, 1);
            return;
          }
        }
      }

      onMessage?.({
        role: data.type === "user_text" ? "user" : "assistant",
        text,
      });
    } catch {
      // Ignore unparseable data frames.
    }
  });

  try {
    const session = await getAgentSession(agentId);

    if (!session.token || !session.livekit_url) {
      throw new Error("Session response is missing token or livekit_url");
    }

    await room.connect(session.livekit_url, session.token, {
      autoSubscribe: true,
    });

    await room.localParticipant.setMicrophoneEnabled(true);

    return room;
  } catch (error) {
    pendingUserTextsByRoom.delete(room);
    room.disconnect();
    throw error;
  }
}

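// Send a user-typed message into the live agent session. Render the
// message in your UI optimistically; the matching echo from the agent
// is deduped by the DataReceived handler above.
export async function sendTextMessage(room: Room, text: string) {
  const trimmed = text.trim();
  if (!trimmed) return;

  const pending = pendingUserTextsByRoom.get(room);
  pending?.push(trimmed);

  try {
    await room.localParticipant.publishData(
      new TextEncoder().encode(
        JSON.stringify({
          type: "user_text",
          text: trimmed,
        }),
      ),
      {
        reliable: true,
        topic: "voice.user_text",
      },
    );
  } catch (error) {
    // The message never reached the agent, so no echo will arrive.
    // Forget it, otherwise a later spoken transcript with the same text
    // would be dropped by mistake.
    const index = pending?.lastIndexOf(trimmed) ?? -1;
    if (index !== -1) pending?.splice(index, 1);
    throw error;
  }
}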

export async function endVoiceAgentCall(room: Room | null) {
  if (!room) return;
  pendingUserTextsByRoom.delete(room);
  try {
    await room.localParticipant.setMicrophoneEnabled(false);
  } finally {
    // Always leave the room, even if muting the microphone fails.
    await room.disconnect();
  }
}

Minimal call view UI

A drop-in React component that wires the helpers above to a small UI: a connection indicator, start/end buttons, a message list, and a text input. Render it in any client route and pass an agentId.

call-view.tsx

"use client";

import { useEffect, useRef, useState } from "react";
import type { Room } from "livekit-client";
import {
  endVoiceAgentCall,
  sendTextMessage,
  startVoiceAgentCall,
} from "./voice-agent";

type ChatMessage = {
  role: "user" | "assistant";
  text: string;
};

export function CallView({ agentId }: { agentId: string }) {
  const roomRef = useRef<Room | null>(null);
  const [status, setStatus] = useState<"idle" | "connecting" | "live">("idle");
  const [messages, setMessages] = useState<ChatMessage[]>([]);
  const [draft, setDraft] = useState("");

  useEffect(() => {
    return () => {
      endVoiceAgentCall(roomRef.current);
      roomRef.current = null;
    };
  }, []);

  const handleStart = async () => {
    if (status !== "idle") return;
    setStatus("connecting");
    try {
      const room = await startVoiceAgentCall(agentId, (message) => {
        setMessages((prev) => [...prev, message]);
      });
      roomRef.current = room;
      setStatus("live");
    } catch (error) {
      console.error(error);
      setStatus("idle");
    }
  };

  const handleEnd = async () => {
    await endVoiceAgentCall(roomRef.current);
    roomRef.current = null;
    setStatus("idle");
  };

  const handleSend = async (event: React.FormEvent) => {
    event.preventDefault();
    const text = draft.trim();
    if (!text || !roomRef.current) return;
    setMessages((prev) => [...prev, { role: "user", text }]);
    setDraft("");
    await sendTextMessage(roomRef.current, text);
  };

  return (
    <div className="mx-auto flex h-[520px] w-full max-w-md flex-col rounded-2xl border border-slate-200 bg-white shadow-sm">
      <header className="flex items-center justify-between border-b border-slate-200 px-4 py-3">
        <div className="flex items-center gap-2">
          <span
            className={
              "h-2.5 w-2.5 rounded-full " +
              (status === "live"
                ? "bg-emerald-500"
                : status === "connecting"
                  ? "bg-amber-500"
                  : "bg-slate-300")
            }
          />
          <p className="text-sm font-medium text-slate-700">
            {status === "live"
              ? "Connected"
              : status === "connecting"
                ? "Connecting…"
                : "Not connected"}
          </p>
        </div>
        {status === "idle" ? (
          <button
            type="button"
            onClick={handleStart}
            className="rounded-full bg-orange-500 px-3 py-1.5 text-sm font-medium text-white"
          >
            Start call
          </button>
        ) : (
          <button
            type="button"
            onClick={handleEnd}
            className="rounded-full bg-slate-900 px-3 py-1.5 text-sm font-medium text-white"
          >
            End call
          </button>
        )}
      </header>

      <div className="flex-1 space-y-3 overflow-y-auto px-4 py-4">
        {messages.length === 0 ? (
          <p className="text-sm text-slate-400">
            Start the call and speak — or type a message below.
          </p>
        ) : (
          messages.map((message, index) => (
            <div
              key={index}
              className={
                "flex " +
                (message.role === "user" ? "justify-end" : "justify-start")
              }
            >
              <span
                className={
                  "max-w-[80%] rounded-2xl px-3 py-2 text-sm " +
                  (message.role === "user"
                    ? "bg-orange-500 text-white"
                    : "bg-slate-100 text-slate-800")
                }
              >
                {message.text}
              </span>
            </div>
          ))
        )}
      </div>

      <form
        onSubmit={handleSend}
        className="flex items-center gap-2 border-t border-slate-200 px-4 py-3"
      >
        <input
          value={draft}
          onChange={(event) => setDraft(event.target.value)}
          placeholder="Type a message"
          disabled={status !== "live"}
          className="flex-1 rounded-full border border-slate-200 bg-slate-50 px-4 py-2 text-sm outline-none focus:border-orange-300"
        />
        <button
          type="submit"
          disabled={status !== "live" || !draft.trim()}
          className="rounded-full bg-orange-500 px-4 py-2 text-sm font-medium text-white disabled:opacity-50"
        >
          Send
        </button>
      </form>
    </div>
  );
}
