Enterprise Voice Agent Integration

Connect a voice agent with a secure backend-to-frontend session flow.

This guide shows how to create an API key, configure an agent, generate a session from your backend, and pass only the short-lived credentials to the browser.

Browser receives

Short-lived session token

LiveKit websocket URL

Keep `x-api-key` and any backend credentials server-side. The frontend only needs enough information to connect to the audio room.

Secure by design

The enterprise API key belongs only on your backend. Every client-side interaction should use temporary session credentials returned by your server.

Flexible session overrides

Sessions can inherit agent defaults or override language, prompt, greeting, and voice for a single call.

Voice and text in one room

Once your backend returns the session payload, the frontend uses livekit-client to play the agent's voice and exchange text-chat messages in the same LiveKit room: audio arrives on an audio track, and text travels over the data channel.

Setup flow

From API key generation to a live voice session

Step 01

Generate an API key

Open app.oshara.ai/settings?tab=developer and create an API key (keys start with sk_...). Store it only in your backend secrets store. The key is shown only once and must never be exposed to the browser.

Step 02

Create a voice agent

Go to app.oshara.ai/agents and create an agent. Configure the default reference voice, system prompt, greeting message, and language. Note the Agent ID shown on the agent detail page; you will send it when creating a session.

Step 03

Enrich the agent

Optionally attach documents to the Knowledge Base and connect an MCP server if the agent needs external tools or actions during calls.

Step 04

Create a session from your backend

Call the enterprise API directly from your backend. Send the API key in the x-api-key header, include the required agent field, and set language, system_prompt, greeting, or reference_audio_url only when that session needs to override the agent defaults.

Step 05

Receive session credentials

A successful response returns a short-lived token, the LiveKit websocket URL, and a unique room name. These credentials are generated server-side for a single session handshake.

Step 06

Forward credentials to the frontend

Return only the token and livekit_url to the browser. Keep the API key and any other backend-only secrets away from the client at all times.

Step 07

Connect voice and text chat in the frontend

Install livekit-client in your frontend app and join the room with the temporary session credentials. The same room delivers the agent's voice over an audio track and exchanges chat messages over the data channel — publish user messages on the voice.user_text topic and render agent replies received on voice.reply.

Backend session request

Call the session endpoint from your backend service. Send the API key in the x-api-key request header, and pass the agent ID in the required agent field of the JSON body. The remaining fields are optional per-session overrides.

POST https://api.oshara.ai/api/agents/agent-session/
x-api-key: sk_...
Content-Type: application/json

{
  "agent": "agent_id",
  "language": "en",
  "system_prompt": "...",
  "greeting": "...",
  "reference_audio_url": "..."
}
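A minimal backend sketch of this request in TypeScript. Only the URL, the x-api-key header, and the field names come from the request above; the helper names `buildSessionBody` and `createAgentSession`, the `SessionOverrides` and `FetchLike` types, and the choice to drop empty overrides are illustrative assumptions, not part of the documented API. The fetch implementation is passed in so the function can be exercised without network access.

```typescript
// Hypothetical helpers. The endpoint, header, and field names are taken from the request above.
const SESSION_URL = "https://api.oshara.ai/api/agents/agent-session/";

type SessionOverrides = {
  language?: string;
  system_prompt?: string;
  greeting?: string;
  reference_audio_url?: string;
};

// Minimal structural type so any fetch-compatible function can be injected.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Build the JSON body: `agent` is required; overrides are included only
// when they are non-empty strings, so unset fields fall back to agent defaults.
function buildSessionBody(agentId: string, overrides: SessionOverrides = {}) {
  const body: Record<string, string> = { agent: agentId };
  for (const [key, value] of Object.entries(overrides)) {
    if (typeof value === "string" && value.trim() !== "") body[key] = value;
  }
  return body;
}

// Call the session endpoint from the backend. The API key never leaves the server.
async function createAgentSession(
  fetchImpl: FetchLike,
  apiKey: string,
  agentId: string,
  overrides: SessionOverrides = {},
): Promise<unknown> {
  const response = await fetchImpl(SESSION_URL, {
    method: "POST",
    headers: { "x-api-key": apiKey, "Content-Type": "application/json" },
    body: JSON.stringify(buildSessionBody(agentId, overrides)),
  });
  if (!response.ok) {
    throw new Error(`Session request failed with status ${response.status}`);
  }
  return response.json();
}
```

In a real route this might be called as `createAgentSession(fetch, process.env.OSHARA_API_KEY!, agentId)`, where `OSHARA_API_KEY` is a placeholder name for wherever your secrets store exposes the key.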

Session response

After the backend creates the session, return only the credentials needed by the frontend. The token authorizes the room connection, and the LiveKit URL tells the client where to connect.

{
  "success": true,
  "message": "Request successful",
  "data": {
    "token": "eyJ...",
    "livekit_url": "wss://audio-inference.oshara.ai",
    "room_name": "unique_room_id"
  }
}
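One way to enforce this on the backend is to copy the two browser-safe fields out of the response rather than forwarding the whole payload. The sketch below assumes the envelope shape shown above and treats anything other than `success: true` as a failure, which this guide does not confirm for error responses; `toBrowserCredentials` and `BrowserCredentials` are hypothetical names.

```typescript
type BrowserCredentials = { token: string; livekit_url: string };

// Validate the session envelope and keep only what the browser needs.
function toBrowserCredentials(payload: unknown): BrowserCredentials {
  const envelope = payload as {
    success?: unknown;
    message?: unknown;
    data?: { token?: unknown; livekit_url?: unknown };
  } | null;

  // Assumption: a failed request does not carry `success: true` with a `data` object.
  if (!envelope || envelope.success !== true || !envelope.data) {
    const reason =
      typeof envelope?.message === "string" ? envelope.message : "unknown error";
    throw new Error(`Session creation failed: ${reason}`);
  }

  const { token, livekit_url } = envelope.data;
  if (typeof token !== "string" || typeof livekit_url !== "string") {
    throw new Error("Session response is missing token or livekit_url");
  }

  // Build a fresh object so no extra fields reach the client.
  return { token, livekit_url };
}
```

The frontend helper later in this guide reads `payload.data`, so a route built on this sketch would respond with `{ data: toBrowserCredentials(payload) }` to stay compatible with it.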

Frontend connection (voice + text chat)

Install `livekit-client` in the frontend app and connect using the credentials returned by your backend. The same room handles both the spoken audio track and a text-chat channel — publish user messages on the `voice.user_text` topic and listen for agent replies on `voice.reply`.

import { Room, RoomEvent, Track } from "livekit-client";

type AgentSession = {
  token: string;
  livekit_url: string;
  room_name?: string;
};

type ChatMessage = {
  role: "user" | "assistant";
  text: string;
};

// The agent echoes typed messages back on "voice.reply" with
// type: "user_text" — the same shape it uses for STT transcripts of
// spoken input. We track texts the user just typed so we can render them
// optimistically and skip the matching echo when it arrives.
const pendingUserTextsByRoom = new WeakMap<Room, string[]>();

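async function getAgentSession(agentId: string): Promise<AgentSession> {
  // This route lives on your own backend, which holds the API key.
  // The browser never calls the enterprise API directly.
  const response = await fetch("/api/voice-agent/session", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      agent: agentId,
      // Optional per-session overrides. Omit a field to keep the agent default.
      language: "en",
      system_prompt: "...",
      greeting: "...",
      reference_audio_url: "...",
    }),
  });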

  if (!response.ok) {
    const message = await response.text();
    throw new Error(message || "Failed to create voice-agent session");
  }

  const payload = await response.json();
  return payload.data as AgentSession;
}

export async function startVoiceAgentCall(
  agentId: string,
  onMessage?: (message: ChatMessage) => void,
) {
  const room = new Room({
    adaptiveStream: true,
    dynacast: true,
  });
  pendingUserTextsByRoom.set(room, []);

  room.on(RoomEvent.Connected, () => {
    console.log("Connected to voice agent room");
  });

  room.on(RoomEvent.Disconnected, () => {
    pendingUserTextsByRoom.delete(room);
    console.log("Disconnected from room");
  });

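  // Play the agent's spoken audio track as soon as it is subscribed.
  room.on(RoomEvent.TrackSubscribed, (track, publication, participant) => {
    if (track.kind === Track.Kind.Audio) {
      const element = track.attach();
      element.autoplay = true;
      document.body.appendChild(element);
      console.log("Subscribed to audio track from", participant.identity);
    }
  });

  // Remove the attached media elements when a track goes away, so
  // repeated calls do not leave stale <audio> nodes in the document.
  room.on(RoomEvent.TrackUnsubscribed, (track) => {
    track.detach().forEach((element) => element.remove());
  });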

  // Receive agent replies and STT transcripts on the "voice.reply" topic.
  room.on(RoomEvent.DataReceived, (payload, _participant, _kind, topic) => {
    if (topic !== "voice.reply") return;

    try {
      const data = JSON.parse(new TextDecoder().decode(payload));
      const text = (data.text || data.message || "").trim();
      if (!text) return;

      // Drop echoes of messages the user just typed (already rendered).
      if (data.type === "user_text") {
        const pending = pendingUserTextsByRoom.get(room);
        if (pending) {
          const index = pending.indexOf(text);
          if (index !== -1) {
            pending.splice(index, 1);
            return;
          }
        }
      }

      onMessage?.({
        role: data.type === "user_text" ? "user" : "assistant",
        text,
      });
    } catch {
      // Ignore unparseable data frames.
    }
  });

  try {
    const session = await getAgentSession(agentId);

    if (!session.token || !session.livekit_url) {
      throw new Error("Session response is missing token or livekit_url");
    }

    await room.connect(session.livekit_url, session.token, {
      autoSubscribe: true,
    });

    await room.localParticipant.setMicrophoneEnabled(true);

    return room;
  } catch (error) {
    pendingUserTextsByRoom.delete(room);
    room.disconnect();
    throw error;
  }
}

// Send a user-typed message into the live agent session. Render the
// message in your UI immediately after calling this — the matching echo
// from the agent will be deduped by the DataReceived handler above.
export async function sendTextMessage(room: Room, text: string) {
  const trimmed = text.trim();
  if (!trimmed) return;

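  const pending = pendingUserTextsByRoom.get(room);
  pending?.push(trimmed);

  try {
    await room.localParticipant.publishData(
      new TextEncoder().encode(
        JSON.stringify({
          type: "user_text",
          text: trimmed,
        }),
      ),
      {
        reliable: true,
        topic: "voice.user_text",
      },
    );
  } catch (error) {
    // The message never left, so no echo will arrive. Forget it, or a
    // later spoken transcript with the same text would be dropped.
    const index = pending?.lastIndexOf(trimmed) ?? -1;
    if (index !== -1) pending?.splice(index, 1);
    throw error;
  }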
}

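// Stop the microphone and leave the room. disconnect() returns a promise,
// so await it to make sure teardown has finished before the caller moves on.
export async function endVoiceAgentCall(room: Room | null) {
  if (!room) return;
  pendingUserTextsByRoom.delete(room);
  await room.localParticipant.setMicrophoneEnabled(false);
  await room.disconnect();
}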
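The echo handling above can be easier to reason about and unit-test when it is separated from the LiveKit event handlers. The sketch below is one possible refactor, not the guide's own structure: `PendingEchoes`, `encodeUserText`, and `decodeReply` are hypothetical names, and the payload fields (`type`, `text`, `message`) are the ones read by the DataReceived handler above.

```typescript
type ChatMessage = { role: "user" | "assistant"; text: string };

// Tracks typed messages that were rendered optimistically and still await their echo.
class PendingEchoes {
  private texts: string[] = [];

  add(text: string): void {
    this.texts.push(text);
  }

  // Returns true (and forgets the entry) if `text` matches a pending typed message.
  consume(text: string): boolean {
    const index = this.texts.indexOf(text);
    if (index === -1) return false;
    this.texts.splice(index, 1);
    return true;
  }
}

// Encode a typed message for publishing on the "voice.user_text" topic.
function encodeUserText(text: string): Uint8Array {
  return new TextEncoder().encode(JSON.stringify({ type: "user_text", text }));
}

// Decode a "voice.reply" frame. Returns null for unparseable frames,
// empty text, or echoes of messages the user just typed.
function decodeReply(payload: Uint8Array, pending: PendingEchoes): ChatMessage | null {
  let data: { type?: unknown; text?: unknown; message?: unknown };
  try {
    data = JSON.parse(new TextDecoder().decode(payload));
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;

  const raw = data.text || data.message;
  const text = typeof raw === "string" ? raw.trim() : "";
  if (!text) return null;

  const isUser = data.type === "user_text";
  if (isUser && pending.consume(text)) return null;
  return { role: isUser ? "user" : "assistant", text };
}
```

With helpers like these, the DataReceived handler reduces to checking the topic, calling `decodeReply`, and forwarding any non-null result to `onMessage`. Only the first echo of a typed message is dropped, so a spoken transcript with the same words still shows up.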

Minimal call view UI

A drop-in React component that wires the helpers above to a small UI — a connection indicator, start/end buttons, a message list, and a text input. Drop it into any client route and pass an agentId.

"use client";

import { useEffect, useRef, useState } from "react";
import type { Room } from "livekit-client";
import {
  endVoiceAgentCall,
  sendTextMessage,
  startVoiceAgentCall,
} from "./voice-agent";

type ChatMessage = {
  role: "user" | "assistant";
  text: string;
};

export function CallView({ agentId }: { agentId: string }) {
  const roomRef = useRef<Room | null>(null);
  const [status, setStatus] = useState<"idle" | "connecting" | "live">("idle");
  const [messages, setMessages] = useState<ChatMessage[]>([]);
  const [draft, setDraft] = useState("");

  useEffect(() => {
    return () => {
      endVoiceAgentCall(roomRef.current);
      roomRef.current = null;
    };
  }, []);

  const handleStart = async () => {
    if (status !== "idle") return;
    setStatus("connecting");
    try {
      const room = await startVoiceAgentCall(agentId, (message) => {
        setMessages((prev) => [...prev, message]);
      });
      roomRef.current = room;
      setStatus("live");
    } catch (error) {
      console.error(error);
      setStatus("idle");
    }
  };

  const handleEnd = async () => {
    await endVoiceAgentCall(roomRef.current);
    roomRef.current = null;
    setStatus("idle");
  };

  const handleSend = async (event: React.FormEvent) => {
    event.preventDefault();
    const text = draft.trim();
    if (!text || !roomRef.current) return;
    setMessages((prev) => [...prev, { role: "user", text }]);
    setDraft("");
    await sendTextMessage(roomRef.current, text);
  };

  return (
    <div className="mx-auto flex h-[520px] w-full max-w-md flex-col rounded-2xl border border-slate-200 bg-white shadow-sm">
      <header className="flex items-center justify-between border-b border-slate-200 px-4 py-3">
        <div className="flex items-center gap-2">
          <span
            className={
              "h-2.5 w-2.5 rounded-full " +
              (status === "live"
                ? "bg-emerald-500"
                : status === "connecting"
                  ? "bg-amber-500"
                  : "bg-slate-300")
            }
          />
          <p className="text-sm font-medium text-slate-700">
            {status === "live"
              ? "Connected"
              : status === "connecting"
                ? "Connecting…"
                : "Not connected"}
          </p>
        </div>
        {status === "idle" ? (
          <button
            type="button"
            onClick={handleStart}
            className="rounded-full bg-orange-500 px-3 py-1.5 text-sm font-medium text-white"
          >
            Start call
          </button>
        ) : (
          <button
            type="button"
            onClick={handleEnd}
            className="rounded-full bg-slate-900 px-3 py-1.5 text-sm font-medium text-white"
          >
            End call
          </button>
        )}
      </header>

      <div className="flex-1 space-y-3 overflow-y-auto px-4 py-4">
        {messages.length === 0 ? (
          <p className="text-sm text-slate-400">
            Start the call and speak — or type a message below.
          </p>
        ) : (
          messages.map((message, index) => (
            <div
              key={index}
              className={
                "flex " +
                (message.role === "user" ? "justify-end" : "justify-start")
              }
            >
              <span
                className={
                  "max-w-[80%] rounded-2xl px-3 py-2 text-sm " +
                  (message.role === "user"
                    ? "bg-orange-500 text-white"
                    : "bg-slate-100 text-slate-800")
                }
              >
                {message.text}
              </span>
            </div>
          ))
        )}
      </div>

      <form
        onSubmit={handleSend}
        className="flex items-center gap-2 border-t border-slate-200 px-4 py-3"
      >
        <input
          value={draft}
          onChange={(event) => setDraft(event.target.value)}
          placeholder="Type a message"
          disabled={status !== "live"}
          className="flex-1 rounded-full border border-slate-200 bg-slate-50 px-4 py-2 text-sm outline-none focus:border-orange-300"
        />
        <button
          type="submit"
          disabled={status !== "live" || !draft.trim()}
          className="rounded-full bg-orange-500 px-4 py-2 text-sm font-medium text-white disabled:opacity-50"
        >
          Send
        </button>
      </form>
    </div>
  );
}