
Overview

The VoiceAgent class manages real-time voice conversations with AI models. Each instance holds its own conversation history, input queue, speech state, and WebSocket connection. Important: each VoiceAgent instance is designed for a single user. To support multiple concurrent users, create a separate VoiceAgent for each connection:
wss.on("connection", (socket) => {
  const agent = new VoiceAgent({ model, ... });
  agent.handleSocket(socket);
  agent.on("disconnected", () => agent.destroy());
});
Sharing a single instance across multiple users will cause conversation history cross-contamination, interleaved audio, and unpredictable behavior.

Constructor

new VoiceAgent(options: VoiceAgentOptions)
Creates a new voice agent instance with the specified configuration.

VoiceAgentOptions

model
LanguageModel
required
AI SDK Language Model for chat completions (e.g., openai('gpt-4o'), anthropic('claude-3-5-sonnet-latest'))
transcriptionModel
TranscriptionModel
AI SDK Transcription Model for converting speech to text (e.g., openai.transcription('whisper-1'))
speechModel
SpeechModel
AI SDK Speech Model for converting text to speech (e.g., openai.speech('tts-1'))
instructions
string
default:"You are a helpful voice assistant."
System instructions for the AI model
stopWhen
(event: StepEvent) => boolean
default:"stepCountIs(5)"
Condition to stop the stream (from AI SDK). Controls when multi-step tool calling should terminate
tools
Record<string, Tool>
AI SDK tools available to the agent. Can be added later with registerTools()
endpoint
string
WebSocket server URL to connect to (e.g., "ws://localhost:8080")
voice
string
Voice ID or name for speech synthesis (provider-specific)
speechInstructions
string
Instructions for speech synthesis behavior
outputFormat
string
Audio output format (e.g., "mp3", "opus", "pcm")
streamingSpeech
Partial<StreamingSpeechConfig>
Configuration for streaming speech generation:
  • minChunkSize (number, default: 50): Minimum characters before generating speech for a chunk
  • maxChunkSize (number, default: 200): Maximum characters per chunk (splits at sentence boundary)
  • parallelGeneration (boolean, default: true): Whether to enable parallel TTS generation
  • maxParallelRequests (number, default: 3): Maximum number of parallel TTS requests
history
Partial<HistoryConfig>
Configuration for conversation history memory limits:
  • maxMessages (number, default: 100): Maximum number of messages to keep. Set to 0 for unlimited
  • maxTotalChars (number, default: 0): Maximum total characters across all messages. Set to 0 for unlimited
maxAudioInputSize
number
default:"10485760"
Maximum audio input size in bytes (default: 10 MB)

Example

import { VoiceAgent } from "@dimension-labs/voice-agent-ai";
import { openai } from "@ai-sdk/openai";

const agent = new VoiceAgent({
  model: openai("gpt-4o"),
  transcriptionModel: openai.transcription("whisper-1"),
  speechModel: openai.speech("tts-1"),
  instructions: "You are a helpful assistant.",
  voice: "alloy",
  streamingSpeech: {
    minChunkSize: 30,
    maxChunkSize: 150,
  },
  history: {
    maxMessages: 50,
  },
});

Methods

Connection Management

connect()

async connect(url?: string): Promise<void>
Connect to a WebSocket server at the given URL.
url
string
WebSocket URL to connect to. If not provided, uses the endpoint from constructor options or defaults to "ws://localhost:8080"
void
Promise<void>
Resolves when connection is established
Example:
await agent.connect("ws://localhost:8080");

handleSocket()

handleSocket(socket: WebSocket): void
Attach an existing WebSocket (server-side usage). Use this when handling incoming WebSocket connections in a server.
socket
WebSocket
required
The WebSocket connection to attach
Example:
import { WebSocketServer } from "ws";

const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (socket) => {
  const agent = new VoiceAgent({ model, ... });
  agent.handleSocket(socket);
});

disconnect()

disconnect(): void
Disconnect from WebSocket and stop all in-flight work.

destroy()

destroy(): void
Permanently destroy the agent, releasing all resources. Clears conversation history, removes all event listeners, and disconnects the WebSocket.

Input Methods

sendText()

async sendText(text: string): Promise<string>
Send text input for processing (bypasses transcription). The text is queued and processed serially.
text
string
required
The text input to process
response
Promise<string>
The full text response from the AI model
Example:
const response = await agent.sendText("What's the weather today?");
console.log(response);

sendAudio()

async sendAudio(audioData: string): Promise<void>
Send base64-encoded audio data to be transcribed and processed.
audioData
string
required
Base64-encoded audio data
Example:
const base64Audio = "...";
await agent.sendAudio(base64Audio);

sendAudioBuffer()

async sendAudioBuffer(audioBuffer: Buffer | Uint8Array): Promise<void>
Send raw audio buffer to be transcribed and processed.
audioBuffer
Buffer | Uint8Array
required
Raw audio buffer
Example:
import fs from "node:fs";

const audioBuffer = fs.readFileSync("audio.wav");
await agent.sendAudioBuffer(audioBuffer);

Audio Processing

transcribeAudio()

async transcribeAudio(audioData: Buffer | Uint8Array): Promise<string>
Transcribe audio data to text using the configured transcription model.
audioData
Buffer | Uint8Array
required
Raw audio data to transcribe
transcript
Promise<string>
The transcribed text
Example:
const transcript = await agent.transcribeAudio(audioBuffer);
console.log("User said:", transcript);

generateAndSendSpeechFull()

async generateAndSendSpeechFull(text: string): Promise<void>
Generate speech for full text at once (non-streaming fallback). Useful when you need to generate speech outside the normal streaming flow.
text
string
required
Text to convert to speech
Example:
await agent.generateAndSendSpeechFull("Hello, how can I help you?");

generateSpeechFromText()

async generateSpeechFromText(text: string, abortSignal?: AbortSignal): Promise<Uint8Array>
Generate speech from text using the configured speech model. Returns the raw audio data as a Uint8Array.
text
string
required
Text to convert to speech
abortSignal
AbortSignal
Optional abort signal to cancel generation
audio
Promise<Uint8Array>
The generated audio data
Example:
import fs from "node:fs";

const audioData = await agent.generateSpeechFromText("Hello world");
// Process or save the audio data
fs.writeFileSync("output.mp3", audioData);

Speech Control

interruptSpeech()

interruptSpeech(reason?: string): void
Interrupt ongoing speech generation and playback (barge-in support). This stops any currently playing audio and clears the speech queue.
reason
string
default:"interrupted"
Reason for interruption (used in events)
Example:
agent.interruptSpeech("user_speaking");

interruptCurrentResponse()

interruptCurrentResponse(reason?: string): void
Interrupt both the current LLM stream and ongoing speech. This is more comprehensive than interruptSpeech() as it also stops the AI model from generating more text.
reason
string
default:"interrupted"
Reason for interruption
Example:
agent.interruptCurrentResponse("user_speaking");

startListening()

startListening(): void
Start listening for voice input. This method emits a listening event that you can use to update UI or trigger other actions. Example:
agent.startListening();
agent.on("listening", () => {
  console.log("Agent is now listening for voice input");
});

stopListening()

stopListening(): void
Stop listening for voice input. This method emits a stopped event. Example:
agent.stopListening();
agent.on("stopped", () => {
  console.log("Agent stopped listening");
});

History Management

clearHistory()

clearHistory(): void
Clear the conversation history. Useful for starting a new conversation with the same agent. Example:
agent.clearHistory();

getHistory()

getHistory(): ModelMessage[]
Get the current conversation history.
history
ModelMessage[]
Array of conversation messages with role and content fields
Example:
const history = agent.getHistory();
console.log(`Conversation has ${history.length} messages`);

setHistory()

setHistory(history: ModelMessage[]): void
Set the conversation history (useful for restoring sessions).
history
ModelMessage[]
required
Array of conversation messages to restore
Example:
// Save history
const history = agent.getHistory();
saveToDatabase(userId, history);

// Restore later
const savedHistory = loadFromDatabase(userId);
agent.setHistory(savedHistory);

Tool Management

registerTools()

registerTools(tools: Record<string, Tool>): void
Register additional tools for the agent to use. Tools are merged with existing tools.
tools
Record<string, Tool>
required
AI SDK tools to register
Example:
import { tool } from "ai";
import { z } from "zod";

agent.registerTools({
  getWeather: tool({
    description: "Get the weather for a location",
    parameters: z.object({
      location: z.string(),
    }),
    execute: async ({ location }) => {
      // Fetch weather data
      return { temperature: 72, condition: "sunny" };
    },
  }),
});

Properties

All properties are read-only.

connected

get connected(): boolean
Whether the agent is currently connected to a WebSocket. Example:
if (agent.connected) {
  console.log("Agent is connected");
}

processing

get processing(): boolean
Whether the agent is currently processing input (LLM is generating a response). Example:
if (agent.processing) {
  console.log("Agent is thinking...");
}

speaking

get speaking(): boolean
Whether the agent is currently speaking (generating or playing audio). Example:
if (agent.speaking) {
  console.log("Agent is speaking...");
}

pendingSpeechChunks

get pendingSpeechChunks(): number
Number of speech chunks pending generation or playback. Example:
console.log(`${agent.pendingSpeechChunks} chunks in queue`);

destroyed

get destroyed(): boolean
Whether the agent has been permanently destroyed. Example:
if (agent.destroyed) {
  console.log("Agent has been destroyed");
}

Events

The VoiceAgent extends EventEmitter and emits various events during operation. Listen to events using the standard .on() method.

Connection Events

connected

Emitted when WebSocket connection is established.
agent.on("connected", () => {
  console.log("Agent connected");
});

disconnected

Emitted when WebSocket connection is closed.
agent.on("disconnected", () => {
  console.log("Agent disconnected");
  agent.destroy();
});

error

Emitted when an error occurs.
agent.on("error", (error: Error) => {
  console.error("Agent error:", error);
});

warning

Emitted for non-fatal warnings.
agent.on("warning", (message: string) => {
  console.warn("Warning:", message);
});

listening

Emitted when startListening() is called.
agent.on("listening", () => {
  console.log("Agent started listening for voice input");
});

stopped

Emitted when stopListening() is called.
agent.on("stopped", () => {
  console.log("Agent stopped listening");
});

Conversation Events

text

Emitted when text is sent or received.
agent.on("text", ({ role, text }) => {
  console.log(`${role}: ${text}`);
});

text_delta

Emitted for each text chunk during streaming.
agent.on("text_delta", ({ delta }) => {
  process.stdout.write(delta);
});

tool_call

Emitted when the AI calls a tool.
agent.on("tool_call", ({ name, args, toolCallId }) => {
  console.log(`Tool called: ${name}`, args);
});

tool_result

Emitted when a tool execution completes.
agent.on("tool_result", ({ name, result, toolCallId }) => {
  console.log(`Tool result from ${name}:`, result);
});

Speech Events

speech_start

Emitted when speech generation starts.
agent.on("speech_start", () => {
  console.log("Started speaking");
});

speech_complete

Emitted when all speech has finished playing.
agent.on("speech_complete", () => {
  console.log("Finished speaking");
});

speech_interrupted

Emitted when speech is interrupted.
agent.on("speech_interrupted", ({ reason }) => {
  console.log(`Speech interrupted: ${reason}`);
});

speech_chunk_queued

Emitted when a speech chunk is added to the generation queue.
agent.on("speech_chunk_queued", ({ chunkId, text }) => {
  console.log(`Queued chunk ${chunkId}: ${text}`);
});

audio_chunk

Emitted when an audio chunk is generated.
agent.on("audio_chunk", ({ chunkId, audio }) => {
  console.log(`Generated audio chunk ${chunkId}`);
});

audio

Emitted when complete audio data is ready.
agent.on("audio", ({ audio, format }) => {
  console.log(`Audio ready: ${audio.length} bytes, format: ${format}`);
});

Transcription Events

transcription

Emitted when audio is transcribed to text.
agent.on("transcription", ({ text, duration }) => {
  console.log(`Transcribed: ${text}`);
});

audio_received

Emitted when audio input is received.
agent.on("audio_received", ({ size, format }) => {
  console.log(`Received audio: ${size} bytes, format: ${format}`);
});

History Events

history_cleared

Emitted when conversation history is cleared.
agent.on("history_cleared", () => {
  console.log("History cleared");
});

history_trimmed

Emitted when conversation history is automatically trimmed.
agent.on("history_trimmed", ({ removedCount, reason }) => {
  console.log(`Trimmed ${removedCount} messages: ${reason}`);
});
