Overview
The VoiceAgent class manages real-time voice conversations with AI models. Each instance holds its own conversation history, input queue, speech state, and WebSocket connection.
Important: Each VoiceAgent instance is designed for a single user. To support multiple concurrent users, create a separate VoiceAgent for each connection:
wss.on("connection", (socket) => {
const agent = new VoiceAgent({ model, ... });
agent.handleSocket(socket);
agent.on("disconnected", () => agent.destroy());
});
Sharing a single instance across multiple users will cause conversation history cross-contamination, interleaved audio, and unpredictable behavior.
Constructor
new VoiceAgent(options: VoiceAgentOptions)
Creates a new voice agent instance with the specified configuration.
VoiceAgentOptions
model
LanguageModel
AI SDK Language Model for chat completions (e.g., openai('gpt-4o'), anthropic('claude-3.5-sonnet'))
transcriptionModel
TranscriptionModel
AI SDK Transcription Model for converting speech to text (e.g., openai.transcription('whisper-1'))
speechModel
SpeechModel
AI SDK Speech Model for converting text to speech (e.g., openai.speech('tts-1'))
instructions
string
default:"You are a helpful voice assistant."
System instructions for the AI model
stopWhen
(event: StepEvent) => boolean
default:"stepCountIs(5)"
Condition to stop the stream (from the AI SDK). Controls when multi-step tool calling terminates
AI SDK tools available to the agent. Can be added later with registerTools()
WebSocket server URL to connect to (e.g., "ws://localhost:8080")
voice
string
Voice ID or name for speech synthesis (provider-specific)
Instructions for speech synthesis behavior
Audio output format (e.g., "mp3", "opus", "pcm")
streamingSpeech
Partial<StreamingSpeechConfig>
Configuration for streaming speech generation:
minChunkSize (number, default: 50): Minimum characters before generating speech for a chunk
maxChunkSize (number, default: 200): Maximum characters per chunk (splits at sentence boundary)
parallelGeneration (boolean, default: true): Whether to enable parallel TTS generation
maxParallelRequests (number, default: 3): Maximum number of parallel TTS requests
Configuration for conversation history memory limits:
maxMessages (number, default: 100): Maximum number of messages to keep. Set to 0 for unlimited
maxTotalChars (number, default: 0): Maximum total characters across all messages. Set to 0 for unlimited
Maximum audio input size in bytes (default: 10 MB)
Example
import { VoiceAgent } from "@dimension-labs/voice-agent-ai";
import { openai } from "@ai-sdk/openai";
const agent = new VoiceAgent({
model: openai("gpt-4o"),
transcriptionModel: openai.transcription("whisper-1"),
speechModel: openai.speech("tts-1"),
instructions: "You are a helpful assistant.",
voice: "alloy",
streamingSpeech: {
minChunkSize: 30,
maxChunkSize: 150,
},
history: {
maxMessages: 50,
},
});
Methods
Connection Management
connect()
async connect(url?: string): Promise<void>
Connect to a WebSocket server by URL.
WebSocket URL to connect to. If not provided, uses the endpoint from constructor options or defaults to "ws://localhost:8080"
Resolves when connection is established
Example:
await agent.connect("ws://localhost:8080");
handleSocket()
handleSocket(socket: WebSocket): void
Attach an existing WebSocket (server-side usage). Use this when handling incoming WebSocket connections in a server.
The WebSocket connection to attach
Example:
import { WebSocketServer } from "ws";
const wss = new WebSocketServer({ port: 8080 });
wss.on("connection", (socket) => {
const agent = new VoiceAgent({ model, ... });
agent.handleSocket(socket);
});
disconnect()
Disconnect from the WebSocket and stop all in-flight work.
destroy()
Permanently destroy the agent, releasing all resources. Clears conversation history, removes all event listeners, and disconnects the WebSocket.
sendText()
async sendText(text: string): Promise<string>
Send text input for processing (bypasses transcription). The text is queued and processed serially.
The text input to process
The full text response from the AI model
Example:
const response = await agent.sendText("What's the weather today?");
console.log(response);
sendAudio()
async sendAudio(audioData: string): Promise<void>
Send base64-encoded audio data to be transcribed and processed.
Base64-encoded audio data
Example:
const base64Audio = "...";
await agent.sendAudio(base64Audio);
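If your audio starts out as a raw buffer, it must be base64-encoded before calling sendAudio() (or use sendAudioBuffer() directly). A minimal sketch of the conversion in Node.js (encodeAudioBase64 is a helper defined here, not part of the library):

```typescript
// Convert a raw audio buffer to the base64 string sendAudio() expects.
function encodeAudioBase64(audio: Buffer | Uint8Array): string {
  return Buffer.from(audio).toString("base64");
}

// Usage (assumes an already-constructed `agent`):
// import { readFileSync } from "node:fs";
// await agent.sendAudio(encodeAudioBase64(readFileSync("audio.wav")));
```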
sendAudioBuffer()
async sendAudioBuffer(audioBuffer: Buffer | Uint8Array): Promise<void>
Send raw audio buffer to be transcribed and processed.
audioBuffer
Buffer | Uint8Array
required
Raw audio buffer
Example:
const audioBuffer = fs.readFileSync("audio.wav");
await agent.sendAudioBuffer(audioBuffer);
Audio Processing
transcribeAudio()
async transcribeAudio(audioData: Buffer | Uint8Array): Promise<string>
Transcribe audio data to text using the configured transcription model.
audioData
Buffer | Uint8Array
required
Raw audio data to transcribe
Example:
const transcript = await agent.transcribeAudio(audioBuffer);
console.log("User said:", transcript);
generateAndSendSpeechFull()
async generateAndSendSpeechFull(text: string): Promise<void>
Generate speech for the full text at once (non-streaming fallback). Useful when you need to generate speech outside the normal streaming flow.
Text to convert to speech
Example:
await agent.generateAndSendSpeechFull("Hello, how can I help you?");
generateSpeechFromText()
async generateSpeechFromText(text: string, abortSignal?: AbortSignal): Promise<Uint8Array>
Generate speech from text using the configured speech model. Returns the raw audio data as a Uint8Array.
Text to convert to speech
Optional abort signal to cancel generation
Example:
const audioData = await agent.generateSpeechFromText("Hello world");
// Process or save the audio data
fs.writeFileSync("output.mp3", audioData);
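The abortSignal parameter pairs naturally with a timeout. A sketch of a small helper (withTimeout is defined here, not part of the library) that cancels generation if it runs too long:

```typescript
// Run an abortable operation with a deadline. The signal is passed
// through so generateSpeechFromText() can cancel mid-generation.
async function withTimeout<T>(
  run: (signal: AbortSignal) => Promise<T>,
  ms: number,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    return await run(controller.signal);
  } finally {
    clearTimeout(timer);
  }
}

// Usage (assumes an already-constructed `agent`):
// const audio = await withTimeout(
//   (signal) => agent.generateSpeechFromText("Hello world", signal),
//   5_000,
// );
```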
Speech Control
interruptSpeech()
interruptSpeech(reason?: string): void
Interrupt ongoing speech generation and playback (barge-in support). This stops any currently playing audio and clears the speech queue.
reason
string
default:"interrupted"
Reason for interruption (used in events)
Example:
agent.interruptSpeech("user_speaking");
interruptCurrentResponse()
interruptCurrentResponse(reason?: string): void
Interrupt both the current LLM stream and ongoing speech. This is more comprehensive than interruptSpeech() as it also stops the AI model from generating more text.
reason
string
default:"interrupted"
Reason for interruption
Example:
agent.interruptCurrentResponse("user_speaking");
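A common use is barge-in: cut off the in-flight response as soon as new user audio arrives. A sketch wiring this up with the documented audio_received event and speaking property (the structural type below is only for illustration):

```typescript
// Interrupt the in-flight LLM stream and speech whenever user audio
// arrives while the agent is still speaking.
function enableBargeIn(agent: {
  on(event: "audio_received", listener: () => void): unknown;
  readonly speaking: boolean;
  interruptCurrentResponse(reason?: string): void;
}): void {
  agent.on("audio_received", () => {
    if (agent.speaking) agent.interruptCurrentResponse("user_speaking");
  });
}

// Usage (assumes an already-constructed `agent`):
// enableBargeIn(agent);
```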
startListening()
Start listening for voice input. This method emits a listening event that you can use to update the UI or trigger other actions.
Example:
agent.startListening();
agent.on("listening", () => {
console.log("Agent is now listening for voice input");
});
stopListening()
Stop listening for voice input. This method emits a stopped event.
Example:
agent.stopListening();
agent.on("stopped", () => {
console.log("Agent stopped listening");
});
History Management
clearHistory()
Clear the conversation history. Useful for starting a new conversation with the same agent.
Example:
agent.clearHistory();
getHistory()
getHistory(): ModelMessage[]
Get the current conversation history.
Array of conversation messages with role and content fields
Example:
const history = agent.getHistory();
console.log(`Conversation has ${history.length} messages`);
setHistory()
setHistory(history: ModelMessage[]): void
Set the conversation history (useful for restoring sessions).
Array of conversation messages to restore
Example:
// Save history
const history = agent.getHistory();
saveToDatabase(userId, history);
// Restore later
const savedHistory = loadFromDatabase(userId);
agent.setHistory(savedHistory);
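Because the history is plain message objects, it round-trips through JSON, which is usually all persistence needs. A sketch (the Message type below is a stand-in for the AI SDK's ModelMessage, and the storage layer is up to you):

```typescript
// Stand-in for the AI SDK's ModelMessage shape.
type Message = { role: string; content: unknown };

// Serialize history for storage (a database row, Redis, a file, ...).
function serializeHistory(history: Message[]): string {
  return JSON.stringify(history);
}

// Restore history from storage.
function deserializeHistory(raw: string): Message[] {
  return JSON.parse(raw) as Message[];
}

// Usage (assumes an already-constructed `agent`):
// const raw = serializeHistory(agent.getHistory());
// agent.setHistory(deserializeHistory(raw));
```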
Tool Management
registerTools()
registerTools(tools: Record<string, Tool>): void
Register additional tools for the agent to use. Tools are merged with existing tools.
tools
Record<string, Tool>
required
AI SDK tools to register
Example:
import { tool } from "ai";
import { z } from "zod";
agent.registerTools({
getWeather: tool({
description: "Get the weather for a location",
parameters: z.object({
location: z.string(),
}),
execute: async ({ location }) => {
// Fetch weather data
return { temperature: 72, condition: "sunny" };
},
}),
});
Properties
All properties are read-only.
connected
get connected(): boolean
Whether the agent is currently connected to a WebSocket.
Example:
if (agent.connected) {
console.log("Agent is connected");
}
processing
get processing(): boolean
Whether the agent is currently processing input (LLM is generating a response).
Example:
if (agent.processing) {
console.log("Agent is thinking...");
}
speaking
get speaking(): boolean
Whether the agent is currently speaking (generating or playing audio).
Example:
if (agent.speaking) {
console.log("Agent is speaking...");
}
pendingSpeechChunks
get pendingSpeechChunks(): number
Number of speech chunks pending generation or playback.
Example:
console.log(`${agent.pendingSpeechChunks} chunks in queue`);
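One use for this counter is waiting for the speech queue to drain before, say, closing a session. A sketch combining it with the speaking property and the speech_complete event (waitForSpeechDrain is defined here, not part of the library; the structural type is only for illustration):

```typescript
// Resolve once the agent has nothing left to say: either immediately,
// or when the next speech_complete event fires.
function waitForSpeechDrain(agent: {
  readonly speaking: boolean;
  readonly pendingSpeechChunks: number;
  once(event: "speech_complete", listener: () => void): unknown;
}): Promise<void> {
  if (!agent.speaking && agent.pendingSpeechChunks === 0) {
    return Promise.resolve();
  }
  return new Promise((resolve) => {
    agent.once("speech_complete", () => resolve());
  });
}

// Usage (assumes an already-constructed `agent`):
// await waitForSpeechDrain(agent);
// agent.disconnect();
```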
destroyed
get destroyed(): boolean
Whether the agent has been permanently destroyed.
Example:
if (agent.destroyed) {
console.log("Agent has been destroyed");
}
Events
The VoiceAgent extends EventEmitter and emits various events during operation. Listen to events using the standard .on() method.
Connection Events
connected
Emitted when WebSocket connection is established.
agent.on("connected", () => {
console.log("Agent connected");
});
disconnected
Emitted when WebSocket connection is closed.
agent.on("disconnected", () => {
console.log("Agent disconnected");
agent.destroy();
});
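Client-side, the disconnected event is also a natural place to schedule a reconnect. A sketch with exponential backoff (backoffDelay is a helper defined here, not part of the library):

```typescript
// Exponential backoff: 500 ms, 1 s, 2 s, ... capped at maxMs.
function backoffDelay(attempt: number, baseMs = 500, maxMs = 10_000): number {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

// Usage (assumes a client-side `agent` constructed with an endpoint):
// let attempt = 0;
// agent.on("connected", () => (attempt = 0));
// agent.on("disconnected", () => {
//   setTimeout(() => agent.connect().catch(() => {}), backoffDelay(attempt++));
// });
```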
error
Emitted when an error occurs.
agent.on("error", (error: Error) => {
console.error("Agent error:", error);
});
warning
Emitted for non-fatal warnings.
agent.on("warning", (message: string) => {
console.warn("Warning:", message);
});
listening
Emitted when startListening() is called.
agent.on("listening", () => {
console.log("Agent started listening for voice input");
});
stopped
Emitted when stopListening() is called.
agent.on("stopped", () => {
console.log("Agent stopped listening");
});
Conversation Events
text
Emitted when text is sent or received.
agent.on("text", ({ role, text }) => {
console.log(`${role}: ${text}`);
});
text_delta
Emitted for each text chunk during streaming.
agent.on("text_delta", ({ delta }) => {
process.stdout.write(delta);
});
tool_call
Emitted when the AI calls a tool.
agent.on("tool_call", ({ name, args, toolCallId }) => {
console.log(`Tool called: ${name}`, args);
});
tool_result
Emitted when a tool execution completes.
agent.on("tool_result", ({ name, result, toolCallId }) => {
console.log(`Tool result from ${name}:`, result);
});
Speech Events
speech_start
Emitted when speech generation starts.
agent.on("speech_start", () => {
console.log("Started speaking");
});
speech_complete
Emitted when all speech has finished playing.
agent.on("speech_complete", () => {
console.log("Finished speaking");
});
speech_interrupted
Emitted when speech is interrupted.
agent.on("speech_interrupted", ({ reason }) => {
console.log(`Speech interrupted: ${reason}`);
});
speech_chunk_queued
Emitted when a speech chunk is added to the generation queue.
agent.on("speech_chunk_queued", ({ chunkId, text }) => {
console.log(`Queued chunk ${chunkId}: ${text}`);
});
audio_chunk
Emitted when an audio chunk is generated.
agent.on("audio_chunk", ({ chunkId, audio }) => {
console.log(`Generated audio chunk ${chunkId}`);
});
audio
Emitted when complete audio data is ready.
agent.on("audio", ({ audio, format }) => {
console.log(`Audio ready: ${audio.length} bytes, format: ${format}`);
});
Transcription Events
transcription
Emitted when audio is transcribed to text.
agent.on("transcription", ({ text, duration }) => {
console.log(`Transcribed: ${text}`);
});
audio_received
Emitted when audio input is received.
agent.on("audio_received", ({ size, format }) => {
console.log(`Received audio: ${size} bytes, format: ${format}`);
});
History Events
history_cleared
Emitted when conversation history is cleared.
agent.on("history_cleared", () => {
console.log("History cleared");
});
history_trimmed
Emitted when conversation history is automatically trimmed.
agent.on("history_trimmed", ({ removedCount, reason }) => {
console.log(`Trimmed ${removedCount} messages: ${reason}`);
});