Overview
The VoiceAgent class is the primary entry point for building voice-enabled AI assistants. It orchestrates streaming text generation, audio transcription, text-to-speech synthesis, and WebSocket communication through a modular, event-driven architecture.
Core Architecture
One Instance Per User
Critical: each VoiceAgent instance is designed to serve a single user. The agent maintains its own:
- Conversation history
- Input queue
- Speech state
- WebSocket connection
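Because all of this state lives on the instance, a server should create one agent per connected user rather than sharing a single instance. A minimal sketch of that pattern; the placeholder class below stands in for the real VoiceAgent (whose construction options are covered under Configuration Options):

```typescript
// Placeholder standing in for the real VoiceAgent class; each
// instance owns its own conversation history and speech state.
class VoiceAgent {
  history: string[] = [];
}

// One agent per user: never share a single instance across users.
const agents = new Map<string, VoiceAgent>();

function agentForUser(userId: string): VoiceAgent {
  let agent = agents.get(userId);
  if (!agent) {
    agent = new VoiceAgent();
    agents.set(userId, agent);
  }
  return agent;
}
```

Each user then gets an isolated conversation history and speech state, and the agent can be dropped from the map when the user disconnects.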
Manager-Based Architecture
The VoiceAgent delegates responsibilities to specialized managers:
WebSocketManager
Handles WebSocket lifecycle, connection state, and message routing.
SpeechManager
Manages streaming speech generation, chunk queueing, and parallel TTS requests.
ConversationManager
Maintains conversation history with automatic trimming based on HistoryConfig.
TranscriptionManager
Handles audio input validation, size limits, and transcription via AI SDK models.
InputQueue
Serializes concurrent requests to prevent race conditions. All sendText() and sendAudio() calls are queued and processed one at a time.
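The serialization the InputQueue provides can be pictured as a simple promise chain: each task waits for the previous one to settle before starting. This is an illustrative reimplementation of the idea, not the library's actual code:

```typescript
// Minimal serial queue: tasks run one at a time, in arrival order.
class InputQueue {
  private tail: Promise<unknown> = Promise.resolve();

  enqueue<T>(task: () => Promise<T>): Promise<T> {
    // Run the task whether the previous one resolved or rejected.
    const next = this.tail.then(task, task);
    // Swallow errors on the chain itself so one failure
    // doesn't poison later tasks; callers still see `next` reject.
    this.tail = next.catch(() => undefined);
    return next;
  }
}
```

Even if a second sendText() arrives while the first is mid-stream, it waits its turn, which is what prevents two responses from interleaving in the conversation history.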
Event-Driven Design
The agent uses the Node.js EventEmitter to bubble events from managers to consumers:
Key Events
Text Events
- text - User input or assistant response
- chunk:text_delta - Streaming text tokens
- chunk:reasoning_delta - Reasoning tokens
Speech Events
- speech_start - TTS begins
- audio_chunk - Streaming audio ready
- speech_interrupted - Barge-in occurred
Tool Events
- chunk:tool_call - Tool invocation
- tool_result - Tool execution complete
Lifecycle Events
- connected / disconnected - WebSocket state
- history_trimmed - Memory limit reached
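Consumers subscribe with the standard EventEmitter API. The event names below come from the tables above; since agent construction is covered elsewhere, a bare EventEmitter stands in for a constructed agent here:

```typescript
import { EventEmitter } from "node:events";

// Stand-in for a constructed VoiceAgent, which bubbles manager events.
const agent = new EventEmitter();

const transcript: string[] = [];
agent.on("text", (t: string) => transcript.push(t));
agent.on("chunk:text_delta", (d: string) => process.stdout.write(d));
agent.on("speech_interrupted", () => console.log("user barged in"));
agent.on("error", (err: Error) => console.error("agent error:", err.message));

// Internally, managers emit and the agent re-emits to consumers:
agent.emit("text", "Hello!");
```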
Agent Lifecycle
Initialization
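A typical construction might look like the following. The AI SDK model factories shown are real, but the option names (chatModel, transcriptionModel, speechModel, instructions, voice, audioFormat) are illustrative assumptions; see Configuration Options below for the actual fields.

```typescript
// import { VoiceAgent } from ... (package import omitted)
import { openai } from "@ai-sdk/openai";

// Option names here are illustrative; consult Configuration Options.
const agent = new VoiceAgent({
  chatModel: openai("gpt-4o"),                          // AI SDK chat model
  transcriptionModel: openai.transcription("whisper-1"), // speech-to-text
  speechModel: openai.speech("gpt-4o-mini-tts"),         // text-to-speech
  instructions: "You are a concise voice assistant.",    // system prompt
  voice: "alloy",                                        // TTS voice identifier
  audioFormat: "mp3",                                    // output format
});
```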
Connection Handling
Client-side, connect to the server over a WebSocket to stream microphone audio and receive responses.
Input Processing
State Properties
Interruption Support
Cleanup
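The lifecycle sections above can be sketched end to end. The method names used here (connect, sendText, interrupt, cleanup) are assumptions about the agent's surface; a recording stub stands in for the real class purely to illustrate the expected call sequence:

```typescript
// Stub with assumed method names; the real VoiceAgent's API may differ.
class VoiceAgentStub {
  calls: string[] = [];
  async connect(url: string) { this.calls.push(`connect:${url}`); }
  async sendText(text: string) { this.calls.push(`sendText:${text}`); }
  interrupt() { this.calls.push("interrupt"); }   // barge-in: abort LLM + TTS
  async cleanup() { this.calls.push("cleanup"); } // close socket, clear queues
}

const agent = new VoiceAgentStub();
await agent.connect("wss://example.com/agent");   // hypothetical URL
await agent.sendText("What's the weather?");
agent.interrupt();                                 // user barged in mid-response
await agent.cleanup();                             // always clean up on disconnect
```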
Processing Flow
When a user sends input (text or audio):
1. Input arrives - via sendText(), sendAudio(), or a WebSocket message
2. Queue serialization - the request enters the InputQueue for serial processing
3. Audio transcription (audio input only) - the TranscriptionManager converts audio to text
4. History management - the user message is added; trimming is applied if needed
5. LLM streaming - streamText() is called with the conversation history
6. Speech chunking - text is streamed to the SpeechManager for sentence extraction
7. Parallel TTS - multiple chunks generate audio simultaneously
8. Sequential playback - audio chunks are sent in order via WebSocket
9. History update - the assistant response is added to the conversation
10. Queue next - the next queued request begins processing
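Step 6, sentence extraction, is what keeps latency low: instead of waiting for the full LLM response, complete sentences are peeled off the stream and handed to TTS as soon as they appear. An illustrative chunker (not the library's actual implementation):

```typescript
// Accumulates streamed text deltas and yields complete sentences for TTS.
class SentenceChunker {
  private buffer = "";

  // Feed one delta; returns any complete sentences now available.
  push(delta: string): string[] {
    this.buffer += delta;
    const sentences: string[] = [];
    const re = /[^.!?]*[.!?]+\s*/g;
    let consumed = 0;
    let match: RegExpExecArray | null;
    while ((match = re.exec(this.buffer)) !== null) {
      sentences.push(match[0].trim());
      consumed = re.lastIndex;
    }
    this.buffer = this.buffer.slice(consumed); // keep the incomplete tail
    return sentences;
  }

  // Flush whatever remains when the stream ends.
  flush(): string | null {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest.length > 0 ? rest : null;
  }
}
```

Each returned sentence can be submitted to TTS immediately (step 7) while later tokens are still streaming in.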
Abort Controllers
The agent uses AbortController for clean cancellation:
- LLM generation stops immediately
- In-flight TTS requests are cancelled
- No wasted API calls or tokens
- Fast response to user interruption
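The pattern can be demonstrated with a plain AbortController: one controller's signal fans out to every in-flight task, so a single abort() cancels the LLM generation and all pending TTS requests at once. A self-contained sketch with simulated tasks standing in for real API calls:

```typescript
// Simulated long-running task (e.g., an LLM or TTS request) that
// resolves after `ms` unless the shared signal aborts first.
function task(name: string, ms: number, signal: AbortSignal): Promise<string> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => resolve(`${name} done`), ms);
    signal.addEventListener("abort", () => {
      clearTimeout(timer); // no wasted work after cancellation
      reject(new Error(`${name} aborted`));
    }, { once: true });
  });
}

const controller = new AbortController();
const inFlight = [
  task("llm", 1000, controller.signal),
  task("tts-chunk-1", 1000, controller.signal),
  task("tts-chunk-2", 1000, controller.signal),
];

// User barges in: one call cancels everything sharing the signal.
controller.abort();

const results = await Promise.allSettled(inFlight);
```

In the real agent the same signal would be passed to the AI SDK calls, which accept standard abort signals.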
Error Handling
Errors are emitted via the error event:
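As with any Node.js EventEmitter, attach an error listener before starting the agent; an unhandled 'error' event throws and can crash the process. A sketch using a bare emitter in place of a constructed agent:

```typescript
import { EventEmitter } from "node:events";

const agent = new EventEmitter(); // stand-in for a constructed VoiceAgent

const seen: string[] = [];
agent.on("error", (err: Error) => {
  seen.push(err.message); // log, notify the client, or reconnect here
});

agent.emit("error", new Error("TTS request failed")); // hypothetical failure
```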
Configuration Options
- AI SDK chat model (e.g., openai('gpt-4o'))
- AI SDK transcription model (e.g., openai.transcription('whisper-1'))
- AI SDK speech model (e.g., openai.speech('gpt-4o-mini-tts'))
- System prompt for the LLM
- TTS voice identifier
- Audio output format (mp3, opus, wav, etc.)
- Configuration for streaming TTS behavior. See Streaming Speech.
- Conversation memory limits. See Memory Management.
- Maximum audio input size in bytes (default: 10 MB)
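The conversation memory limit can be pictured as a sliding window over messages. A sketch of count-based trimming, assuming a maxMessages-style limit (the real HistoryConfig fields are documented in Memory Management):

```typescript
interface Message { role: "user" | "assistant" | "system"; content: string }

// Keep the system prompt plus the most recent `maxMessages` turns.
// `maxMessages` is an assumed name for the configured limit.
function trimHistory(history: Message[], maxMessages: number): Message[] {
  const system = history.filter(m => m.role === "system");
  const rest = history.filter(m => m.role !== "system");
  return [...system, ...rest.slice(-maxMessages)];
}
```

When trimming actually occurs, the agent signals it through the history_trimmed event listed above.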
Next Steps
VideoAgent
Learn about vision model support and frame processing
Streaming Speech
Understand chunked TTS and parallel generation
Memory Management
Configure history limits and trimming behavior
WebSocket Protocol
Explore message types and protocol specification