Overview

The VoiceAgent is the primary class for building voice-enabled AI assistants. It orchestrates streaming text generation, audio transcription, text-to-speech synthesis, and WebSocket communication through a modular, event-driven architecture.

Core Architecture

One Instance Per User

Critical: Each VoiceAgent instance is designed to serve exactly one user. The agent maintains its own:
  • Conversation history
  • Input queue
  • Speech state
  • WebSocket connection
Sharing a single VoiceAgent instance across multiple users will cause:
  • Conversation history cross-contamination
  • Interleaved audio playback
  • Race conditions and unpredictable behavior
For multi-user applications, create a separate agent for each connection:
import { VoiceAgent } from 'voice-agent-ai-sdk';
import { openai } from '@ai-sdk/openai';

wss.on('connection', (socket) => {
  const agent = new VoiceAgent({
    model: openai('gpt-4o'),
    transcriptionModel: openai.transcription('whisper-1'),
    speechModel: openai.speech('gpt-4o-mini-tts'),
    instructions: 'You are a helpful voice assistant.',
  });
  
  agent.handleSocket(socket);
  
  // Clean up when user disconnects
  agent.on('disconnected', () => {
    agent.destroy();
  });
});

Manager-Based Architecture

The VoiceAgent delegates responsibilities to specialized managers:

WebSocketManager

Handles WebSocket lifecycle, connection state, and message routing.

SpeechManager

Manages streaming speech generation, chunk queueing, and parallel TTS requests.

ConversationManager

Maintains conversation history with automatic trimming based on HistoryConfig.

TranscriptionManager

Handles audio input validation, size limits, and transcription via AI SDK models.
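The validation step can be sketched as follows. The function name and error messages are illustrative, not the library's actual implementation; only the 10 MB default (the `maxAudioInputSize` option documented below) comes from this SDK.

```typescript
// Sketch of audio input validation: decode base64, then enforce a size cap.
// Illustrative only -- the default of 10485760 bytes matches the SDK's
// documented maxAudioInputSize, everything else is hypothetical.
const DEFAULT_MAX_AUDIO_INPUT_SIZE = 10 * 1024 * 1024; // 10485760 bytes

function validateAudioInput(
  base64Audio: string,
  maxBytes: number = DEFAULT_MAX_AUDIO_INPUT_SIZE,
): Buffer {
  if (!base64Audio) {
    throw new Error('Empty audio input');
  }
  const buffer = Buffer.from(base64Audio, 'base64');
  if (buffer.length === 0) {
    throw new Error('Audio input decoded to zero bytes');
  }
  if (buffer.length > maxBytes) {
    throw new Error(
      `Audio input of ${buffer.length} bytes exceeds limit of ${maxBytes}`,
    );
  }
  return buffer;
}
```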

InputQueue

Serializes concurrent requests to prevent race conditions. All sendText() and sendAudio() calls are queued and processed one at a time.
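The serialization idea can be approximated with a promise chain. The class below is a minimal sketch, not the SDK's actual InputQueue: each enqueued task starts only after the previous task's promise settles, so concurrent calls cannot interleave.

```typescript
// Minimal sketch of a serializing input queue (illustrative, not the
// SDK's InputQueue): tasks run strictly one at a time, in FIFO order.
class SerialQueue {
  private tail: Promise<unknown> = Promise.resolve();

  enqueue<T>(task: () => Promise<T>): Promise<T> {
    // Chain onto the tail; run the task whether the previous one
    // resolved or rejected.
    const result = this.tail.then(task, task);
    // Swallow rejections on the tail so one failure does not poison
    // the chain for later tasks.
    this.tail = result.catch(() => undefined);
    return result;
  }
}
```

Under this model, a call like sendText() would wrap its whole LLM round-trip in enqueue(), guaranteeing one response is fully processed before the next request begins.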

Event-Driven Design

The agent uses Node.js EventEmitter to bubble events from managers to consumers:
// Event bubbling setup
this.bubbleEvents(this.speech, [
  'speech_start',
  'speech_complete',
  'speech_interrupted',
  'speech_chunk_queued',
  'audio_chunk',
]);

this.bubbleEvents(this.conversation, [
  'history_cleared',
  'history_trimmed',
]);

Key Events

Text Events

  • text - User input or assistant response
  • chunk:text_delta - Streaming text tokens
  • chunk:reasoning_delta - Reasoning tokens

Speech Events

  • speech_start - TTS begins
  • audio_chunk - Streaming audio ready
  • speech_interrupted - Barge-in occurred

Tool Events

  • chunk:tool_call - Tool invocation
  • tool_result - Tool execution complete

Lifecycle Events

  • connected / disconnected - WebSocket state
  • history_trimmed - Memory limit reached
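The bubbling pattern behind these events can be reproduced with plain Node.js EventEmitters. The standalone helper below is a sketch of what a bubbleEvents method might do, not the SDK's internal implementation:

```typescript
import { EventEmitter } from 'node:events';

// Sketch of an event-bubbling helper: re-emit a fixed set of event
// names from a child manager on the parent agent, so consumers only
// ever subscribe on the agent itself.
function bubbleEvents(
  target: EventEmitter,
  source: EventEmitter,
  events: string[],
): void {
  for (const event of events) {
    source.on(event, (...args) => target.emit(event, ...args));
  }
}
```

The payoff of this design is a single subscription surface: application code listens on the agent and never needs references to the underlying managers.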

Agent Lifecycle

Initialization

const agent = new VoiceAgent({
  model: openai('gpt-4o'),
  transcriptionModel: openai.transcription('whisper-1'),
  speechModel: openai.speech('gpt-4o-mini-tts'),
  instructions: 'You are a helpful assistant.',
  voice: 'alloy',
  streamingSpeech: {
    minChunkSize: 50,
    maxChunkSize: 200,
    parallelGeneration: true,
    maxParallelRequests: 3,
  },
  history: {
    maxMessages: 100,
    maxTotalChars: 50000,
  },
});
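The history limits above can be illustrated with a trimming sketch. The function below is hypothetical: it simply drops the oldest messages until both limits hold, whereas the real ConversationManager may behave differently (for example, preserving the system prompt).

```typescript
interface Message {
  role: 'user' | 'assistant';
  content: string;
}

// Hypothetical trimming pass: drop the oldest messages until both the
// message-count limit and the total-character limit are satisfied.
function trimHistory(
  messages: Message[],
  maxMessages: number,
  maxTotalChars: number,
): Message[] {
  const trimmed = [...messages];
  const totalChars = () =>
    trimmed.reduce((sum, m) => sum + m.content.length, 0);
  while (
    trimmed.length > maxMessages ||
    (trimmed.length > 0 && totalChars() > maxTotalChars)
  ) {
    trimmed.shift(); // remove the oldest message first
  }
  return trimmed;
}
```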

Connection Handling

Client-side (connect to server):
await agent.connect('ws://localhost:8080');
Server-side (accept connections):
wss.on('connection', (socket) => {
  agent.handleSocket(socket);
});

Input Processing

// Text input (bypasses transcription)
const response = await agent.sendText('What is the weather?');

// Audio input (base64)
await agent.sendAudio(base64AudioData);

// Audio input (buffer)
await agent.sendAudioBuffer(audioBuffer);

State Properties

agent.connected      // WebSocket connected?
agent.processing     // LLM stream active?
agent.speaking       // TTS generation active?
agent.pendingSpeechChunks  // Queued TTS chunks
agent.destroyed      // Agent destroyed?

Interruption Support

// Interrupt speech only (LLM keeps running)
agent.interruptSpeech('user_speaking');

// Interrupt both LLM stream and speech (barge-in)
agent.interruptCurrentResponse('user_speaking');

Cleanup

// Graceful disconnect (aborts in-flight work)
agent.disconnect();

// Permanent destruction (cannot be reused)
agent.destroy();

Processing Flow

When a user sends input (text or audio):
  1. Input arrives via sendText(), sendAudio(), or WebSocket message
  2. Queue serialization - Request enters InputQueue for serial processing
  3. Audio transcription (if audio input) - TranscriptionManager converts to text
  4. History management - User message added, trimming applied if needed
  5. LLM streaming - streamText() called with conversation history
  6. Speech chunking - Text streamed to SpeechManager for sentence extraction
  7. Parallel TTS - Multiple chunks generate audio simultaneously
  8. Sequential playback - Audio chunks sent in order via WebSocket
  9. History update - Assistant response added to conversation
  10. Queue next - Next queued request begins processing
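Step 6 can be sketched as a sentence extractor that respects the minChunkSize/maxChunkSize settings shown in the initialization example. The function below is illustrative, not the SpeechManager's actual algorithm:

```typescript
// Illustrative sentence chunker: emit a chunk at a sentence boundary
// once at least minChunkSize characters have accumulated, or force a
// split at maxChunkSize. Not the SDK's actual SpeechManager logic.
function extractChunks(
  text: string,
  minChunkSize: number,
  maxChunkSize: number,
): string[] {
  const chunks: string[] = [];
  let current = '';
  for (const char of text) {
    current += char;
    const atSentenceEnd = /[.!?]/.test(char);
    if (
      (atSentenceEnd && current.length >= minChunkSize) ||
      current.length >= maxChunkSize
    ) {
      chunks.push(current.trim());
      current = '';
    }
  }
  if (current.trim().length > 0) chunks.push(current.trim());
  return chunks;
}
```

Chunking at sentence boundaries is what makes parallel TTS (step 7) possible: each complete sentence can be synthesized independently while later text is still streaming from the LLM.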

Abort Controllers

The agent uses AbortController for clean cancellation:
// Current LLM stream
private currentStreamAbortController?: AbortController;

// Interrupt both stream and speech
public interruptCurrentResponse(reason: string = 'interrupted'): void {
  if (this.currentStreamAbortController) {
    this.currentStreamAbortController.abort();
    this.currentStreamAbortController = undefined;
  }
  this.speech.interruptSpeech(reason);
}
This ensures:
  • LLM generation stops immediately
  • In-flight TTS requests are cancelled
  • No wasted API calls or tokens
  • Fast response to user interruption
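The cancellation mechanics are standard AbortController usage. The loop below is a simplified stand-in for the LLM stream, showing how an abort signal stops token emission mid-stream:

```typescript
// Simplified stand-in for an abortable streaming loop: emit one token
// per tick until the signal fires, mirroring how the agent cancels an
// in-flight LLM stream on interruption.
async function streamTokens(
  tokens: string[],
  signal: AbortSignal,
): Promise<string[]> {
  const emitted: string[] = [];
  for (const token of tokens) {
    if (signal.aborted) break; // stop immediately on interruption
    emitted.push(token);
    await new Promise((resolve) => setImmediate(resolve));
  }
  return emitted;
}

// Barge-in: abort after the first token has been emitted.
const controller = new AbortController();
const pending = streamTokens(['Hel', 'lo', ' wor', 'ld'], controller.signal);
controller.abort(); // remaining tokens are never produced
```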

Error Handling

Errors are emitted via the error event:
agent.on('error', (error) => {
  console.error('Agent error:', error);
  // Handle transcription failures, TTS errors, etc.
});

agent.on('warning', (message) => {
  console.warn('Agent warning:', message);
  // Handle non-fatal issues (empty input, etc.)
});

Configuration Options

model (LanguageModel, required)
  AI SDK chat model (e.g., openai('gpt-4o'))

transcriptionModel (TranscriptionModel)
  AI SDK transcription model (e.g., openai.transcription('whisper-1'))

speechModel (SpeechModel)
  AI SDK speech model (e.g., openai.speech('gpt-4o-mini-tts'))

instructions (string, default: "You are a helpful voice assistant.")
  System prompt for the LLM

voice (string, default: "alloy")
  TTS voice identifier

outputFormat (string, default: "opus")
  Audio output format (mp3, opus, wav, etc.)

streamingSpeech (StreamingSpeechConfig)
  Configuration for streaming TTS behavior. See Streaming Speech.

history (HistoryConfig)
  Conversation memory limits. See Memory Management.

maxAudioInputSize (number, default: 10485760)
  Maximum audio input size in bytes (10 MB)

Next Steps

VideoAgent

Learn about vision model support and frame processing

Streaming Speech

Understand chunked TTS and parallel generation

Memory Management

Configure history limits and trimming behavior

WebSocket Protocol

Explore message types and protocol specification