Overview
The VoiceAgent class manages real-time voice conversations with AI models. Each instance holds its own conversation history, input queue, speech state, and WebSocket connection.
Important: Each VoiceAgent instance is designed for a single user. To support multiple concurrent users, create a separate VoiceAgent for each connection:
wss.on("connection", (socket) => {
const agent = new VoiceAgent({ model, ... });
agent.handleSocket(socket);
agent.on("disconnected", () => agent.destroy());
});
Sharing a single instance across multiple users will cause conversation history cross-contamination, interleaved audio, and unpredictable behavior.
Constructor
new VoiceAgent(options: VoiceAgentOptions)
Creates a new voice agent instance with the specified configuration.
VoiceAgentOptions
model
LanguageModel
AI SDK Language Model for chat completions (e.g., openai('gpt-4o'), anthropic('claude-3.5-sonnet'))
transcriptionModel
TranscriptionModel
AI SDK Transcription Model for converting speech to text (e.g., openai.transcription('whisper-1'))
speechModel
SpeechModel
AI SDK Speech Model for converting text to speech (e.g., openai.speech('tts-1'))
instructions
string
default:"You are a helpful voice assistant."
System instructions for the AI model
stopWhen
(event: StepEvent) => boolean
default:"stepCountIs(5)"
Condition to stop the stream (from the AI SDK). Controls when multi-step tool calling terminates
AI SDK tools available to the agent. Can be added later with registerTools()
WebSocket server URL to connect to (e.g., "ws://localhost:8080")
voice
string
Voice ID or name for speech synthesis (provider-specific)
Instructions for speech synthesis behavior
Audio output format (e.g., "mp3", "opus", "pcm")
streamingSpeech
Partial<StreamingSpeechConfig>
Configuration for streaming speech generation:
minChunkSize (number, default: 50): Minimum characters before generating speech for a chunk
maxChunkSize (number, default: 200): Maximum characters per chunk (splits at sentence boundary)
parallelGeneration (boolean, default: true): Whether to enable parallel TTS generation
maxParallelRequests (number, default: 3): Maximum number of parallel TTS requests
Configuration for conversation history memory limits:
maxMessages (number, default: 100): Maximum number of messages to keep. Set to 0 for unlimited
maxTotalChars (number, default: 0): Maximum total characters across all messages. Set to 0 for unlimited
Maximum audio input size in bytes (default: 10 MB)
Example
import { VoiceAgent } from "@dimension-labs/voice-agent-ai";
import { openai } from "@ai-sdk/openai";
const agent = new VoiceAgent({
model: openai("gpt-4o"),
transcriptionModel: openai.transcription("whisper-1"),
speechModel: openai.speech("tts-1"),
instructions: "You are a helpful assistant.",
voice: "alloy",
streamingSpeech: {
minChunkSize: 30,
maxChunkSize: 150,
},
history: {
maxMessages: 50,
},
});
Methods
Connection Management
connect()
async connect(url?: string): Promise<void>
Connect to a WebSocket server by URL.
WebSocket URL to connect to. If not provided, uses the endpoint from constructor options or defaults to "ws://localhost:8080"
Resolves when connection is established
Example:
await agent.connect("ws://localhost:8080");
handleSocket()
handleSocket(socket: WebSocket): void
Attach an existing WebSocket (server-side usage). Use this when handling incoming WebSocket connections in a server.
The WebSocket connection to attach
Example:
import { WebSocketServer } from "ws";
const wss = new WebSocketServer({ port: 8080 });
wss.on("connection", (socket) => {
const agent = new VoiceAgent({ model, ... });
agent.handleSocket(socket);
});
disconnect()
Disconnect from the WebSocket and stop all in-flight work.
destroy()
Permanently destroy the agent, releasing all resources. Clears conversation history, removes all event listeners, and disconnects the WebSocket.
sendText()
async sendText(text: string): Promise<string>
Send text input for processing (bypasses transcription). The text is queued and processed serially.
The text input to process
The full text response from the AI model
Example:
const response = await agent.sendText("What's the weather today?");
console.log(response);
sendAudio()
async sendAudio(audioData: string): Promise<void>
Send base64-encoded audio data to be transcribed and processed.
Base64-encoded audio data
Example:
const base64Audio = "...";
await agent.sendAudio(base64Audio);
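If your audio starts out as a raw buffer, it must be base64-encoded before calling sendAudio() (or use sendAudioBuffer() directly). A minimal sketch of the conversion in Node.js (encodeAudioBase64 is a helper defined here, not part of the library):

```typescript
// Convert a raw audio buffer to the base64 string sendAudio() expects.
function encodeAudioBase64(audio: Buffer | Uint8Array): string {
  return Buffer.from(audio).toString("base64");
}

// Usage (assumes an already-constructed `agent`):
// import { readFileSync } from "node:fs";
// await agent.sendAudio(encodeAudioBase64(readFileSync("audio.wav")));
```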
sendAudioBuffer()
async sendAudioBuffer(audioBuffer: Buffer | Uint8Array): Promise<void>
Send raw audio buffer to be transcribed and processed.
audioBuffer
Buffer | Uint8Array
required
Raw audio buffer
Example:
const audioBuffer = fs.readFileSync("audio.wav");
await agent.sendAudioBuffer(audioBuffer);
Audio Processing
transcribeAudio()
async transcribeAudio(audioData: Buffer | Uint8Array): Promise<string>
Transcribe audio data to text using the configured transcription model.
audioData
Buffer | Uint8Array
required
Raw audio data to transcribe
Example:
const transcript = await agent.transcribeAudio(audioBuffer);
console.log("User said:", transcript);
generateAndSendSpeechFull()
async generateAndSendSpeechFull(text: string): Promise<void>
Generate speech for the full text at once (non-streaming fallback). Useful when you need to generate speech outside the normal streaming flow.
Text to convert to speech
Example:
await agent.generateAndSendSpeechFull("Hello, how can I help you?");
generateSpeechFromText()
async generateSpeechFromText(text: string, abortSignal?: AbortSignal): Promise<Uint8Array>
Generate speech from text using the configured speech model. Returns the raw audio data as a Uint8Array.
Text to convert to speech
Optional abort signal to cancel generation
Example:
const audioData = await agent.generateSpeechFromText("Hello world");
// Process or save the audio data
fs.writeFileSync("output.mp3", audioData);
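The abortSignal parameter pairs naturally with a timeout. A sketch of a small helper (withTimeout is defined here, not part of the library) that cancels generation if it runs too long:

```typescript
// Run an abortable operation with a deadline. The signal is passed
// through so generateSpeechFromText() can cancel mid-generation.
async function withTimeout<T>(
  run: (signal: AbortSignal) => Promise<T>,
  ms: number,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    return await run(controller.signal);
  } finally {
    clearTimeout(timer);
  }
}

// Usage (assumes an already-constructed `agent`):
// const audio = await withTimeout(
//   (signal) => agent.generateSpeechFromText("Hello world", signal),
//   5_000,
// );
```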
Speech Control
interruptSpeech()
interruptSpeech(reason?: string): void
Interrupt ongoing speech generation and playback (barge-in support). This stops any currently playing audio and clears the speech queue.
reason
string
default:"interrupted"
Reason for interruption (used in events)
Example:
agent.interruptSpeech("user_speaking");
interruptCurrentResponse()
interruptCurrentResponse(reason?: string): void
Interrupt both the current LLM stream and ongoing speech. This is more comprehensive than interruptSpeech() as it also stops the AI model from generating more text.
reason
string
default:"interrupted"
Reason for interruption
Example:
agent.interruptCurrentResponse("user_speaking");
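A common use is barge-in: cut off the in-flight response as soon as new user audio arrives. A sketch wiring this up with the documented audio_received event and speaking property (the structural type below is only for illustration):

```typescript
// Interrupt the in-flight LLM stream and speech whenever user audio
// arrives while the agent is still speaking.
function enableBargeIn(agent: {
  on(event: "audio_received", listener: () => void): unknown;
  readonly speaking: boolean;
  interruptCurrentResponse(reason?: string): void;
}): void {
  agent.on("audio_received", () => {
    if (agent.speaking) agent.interruptCurrentResponse("user_speaking");
  });
}

// Usage (assumes an already-constructed `agent`):
// enableBargeIn(agent);
```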
startListening()
Start listening for voice input. This method emits a listening event that you can use to update the UI or trigger other actions.
Example:
agent.startListening();
agent.on("listening", () => {
console.log("Agent is now listening for voice input");
});
stopListening()
Stop listening for voice input. This method emits a stopped event.
Example:
agent.stopListening();
agent.on("stopped", () => {
console.log("Agent stopped listening");
});
History Management
clearHistory()
Clear the conversation history. Useful for starting a new conversation with the same agent.
Example:
agent.clearHistory();
getHistory()
getHistory(): ModelMessage[]
Get the current conversation history.
Array of conversation messages with role and content fields
Example:
const history = agent.getHistory();
console.log(`Conversation has ${history.length} messages`);
setHistory()
setHistory(history: ModelMessage[]): void
Set the conversation history (useful for restoring sessions).
Array of conversation messages to restore
Example:
// Save history
const history = agent.getHistory();
saveToDatabase(userId, history);
// Restore later
const savedHistory = loadFromDatabase(userId);
agent.setHistory(savedHistory);
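Because the history is plain message objects, it round-trips through JSON, which is usually all persistence needs. A sketch (the Message type below is a stand-in for the AI SDK's ModelMessage, and the storage layer is up to you):

```typescript
// Stand-in for the AI SDK's ModelMessage shape.
type Message = { role: string; content: unknown };

// Serialize history for storage (a database row, Redis, a file, ...).
function serializeHistory(history: Message[]): string {
  return JSON.stringify(history);
}

// Restore history from storage.
function deserializeHistory(raw: string): Message[] {
  return JSON.parse(raw) as Message[];
}

// Usage (assumes an already-constructed `agent`):
// const raw = serializeHistory(agent.getHistory());
// agent.setHistory(deserializeHistory(raw));
```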
Tool Management
registerTools()
registerTools(tools: Record<string, Tool>): void
Register additional tools for the agent to use. Tools are merged with existing tools.
tools
Record<string, Tool>
required
AI SDK tools to register
Example:
import { tool } from "ai";
import { z } from "zod";
agent.registerTools({
getWeather: tool({
description: "Get the weather for a location",
parameters: z.object({
location: z.string(),
}),
execute: async ({ location }) => {
// Fetch weather data
return { temperature: 72, condition: "sunny" };
},
}),
});
Properties
All properties are read-only.
connected
get connected(): boolean
Whether the agent is currently connected to a WebSocket.
Example:
if (agent.connected) {
console.log("Agent is connected");
}
processing
get processing(): boolean
Whether the agent is currently processing input (LLM is generating a response).
Example:
if (agent.processing) {
console.log("Agent is thinking...");
}
speaking
get speaking(): boolean
Whether the agent is currently speaking (generating or playing audio).
Example:
if (agent.speaking) {
console.log("Agent is speaking...");
}
pendingSpeechChunks
get pendingSpeechChunks(): number
Number of speech chunks pending generation or playback.
Example:
console.log(`${agent.pendingSpeechChunks} chunks in queue`);
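One use for this counter is waiting for the speech queue to drain before, say, closing a session. A sketch combining it with the speaking property and the speech_complete event (waitForSpeechDrain is defined here, not part of the library; the structural type is only for illustration):

```typescript
// Resolve once the agent has nothing left to say: either immediately,
// or when the next speech_complete event fires.
function waitForSpeechDrain(agent: {
  readonly speaking: boolean;
  readonly pendingSpeechChunks: number;
  once(event: "speech_complete", listener: () => void): unknown;
}): Promise<void> {
  if (!agent.speaking && agent.pendingSpeechChunks === 0) {
    return Promise.resolve();
  }
  return new Promise((resolve) => {
    agent.once("speech_complete", () => resolve());
  });
}

// Usage (assumes an already-constructed `agent`):
// await waitForSpeechDrain(agent);
// agent.disconnect();
```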
destroyed
get destroyed(): boolean
Whether the agent has been permanently destroyed.
Example:
if (agent.destroyed) {
console.log("Agent has been destroyed");
}
Events
The VoiceAgent extends EventEmitter and emits various events during operation. Listen to events using the standard .on() method.
Connection Events
connected
Emitted when WebSocket connection is established.
agent.on("connected", () => {
console.log("Agent connected");
});
disconnected
Emitted when WebSocket connection is closed.
agent.on("disconnected", () => {
console.log("Agent disconnected");
agent.destroy();
});
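Client-side, the disconnected event is also a natural place to schedule a reconnect. A sketch with exponential backoff (backoffDelay is a helper defined here, not part of the library):

```typescript
// Exponential backoff: 500 ms, 1 s, 2 s, ... capped at maxMs.
function backoffDelay(attempt: number, baseMs = 500, maxMs = 10_000): number {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

// Usage (assumes a client-side `agent` constructed with an endpoint):
// let attempt = 0;
// agent.on("connected", () => (attempt = 0));
// agent.on("disconnected", () => {
//   setTimeout(() => agent.connect().catch(() => {}), backoffDelay(attempt++));
// });
```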
error
Emitted when an error occurs.
agent.on("error", (error: Error) => {
console.error("Agent error:", error);
});
warning
Emitted for non-fatal warnings.
agent.on("warning", (message: string) => {
console.warn("Warning:", message);
});
listening
Emitted when startListening() is called.
agent.on("listening", () => {
console.log("Agent started listening for voice input");
});
stopped
Emitted when stopListening() is called.
agent.on("stopped", () => {
console.log("Agent stopped listening");
});
Conversation Events
text
Emitted when text is sent or received.
agent.on("text", ({ role, text }) => {
console.log(`${role}: ${text}`);
});
text_delta
Emitted for each text chunk during streaming.
agent.on("text_delta", ({ delta }) => {
process.stdout.write(delta);
});
tool_call
Emitted when the AI calls a tool.
agent.on("tool_call", ({ name, args, toolCallId }) => {
console.log(`Tool called: ${name}`, args);
});
tool_result
Emitted when a tool execution completes.
agent.on("tool_result", ({ name, result, toolCallId }) => {
console.log(`Tool result from ${name}:`, result);
});
Speech Events
speech_start
Emitted when speech generation starts.
agent.on("speech_start", () => {
console.log("Started speaking");
});
speech_complete
Emitted when all speech has finished playing.
agent.on("speech_complete", () => {
console.log("Finished speaking");
});
speech_interrupted
Emitted when speech is interrupted.
agent.on("speech_interrupted", ({ reason }) => {
console.log(`Speech interrupted: ${reason}`);
});
speech_chunk_queued
Emitted when a speech chunk is added to the generation queue.
agent.on("speech_chunk_queued", ({ chunkId, text }) => {
console.log(`Queued chunk ${chunkId}: ${text}`);
});
audio_chunk
Emitted when an audio chunk is generated.
agent.on("audio_chunk", ({ chunkId, audio }) => {
console.log(`Generated audio chunk ${chunkId}`);
});
audio
Emitted when complete audio data is ready.
agent.on("audio", ({ audio, format }) => {
console.log(`Audio ready: ${audio.length} bytes, format: ${format}`);
});
Transcription Events
transcription
Emitted when audio is transcribed to text.
agent.on("transcription", ({ text, duration }) => {
console.log(`Transcribed: ${text}`);
});
audio_received
Emitted when audio input is received.
agent.on("audio_received", ({ size, format }) => {
console.log(`Received audio: ${size} bytes, format: ${format}`);
});
History Events
history_cleared
Emitted when conversation history is cleared.
agent.on("history_cleared", () => {
console.log("History cleared");
});
history_trimmed
Emitted when conversation history is automatically trimmed.
agent.on("history_trimmed", ({ removedCount, reason }) => {
console.log(`Trimmed ${removedCount} messages: ${reason}`);
});