Overview
The `SpeechManager` class handles all text-to-speech (TTS) operations for voice agents. It supports both full-text and streaming speech generation, parallel TTS requests for reduced latency, and speech interruption for natural conversation flow (barge-in).
Key responsibilities:
- Generate speech from text using AI SDK speech models
- Stream speech as text arrives from LLM (sentence-by-sentence)
- Manage parallel TTS requests to minimize latency
- Handle speech interruption when user speaks
- Queue and process speech chunks in order
- Extract sentences from streaming text with intelligent chunking
src/core/SpeechManager.ts:24
Constructor
Configuration object for the speech manager:
- AI SDK speech model instance (e.g., `openai.speech('tts-1')`)
- Voice identifier for the TTS model (e.g., 'alloy', 'echo', 'nova')
- Optional instructions to guide speech generation (model-specific)
- Audio format for generated speech (e.g., 'opus', 'mp3', 'pcm')
- Configuration for streaming speech behavior:
  - `minChunkSize`: Minimum characters before generating speech (default: 50)
  - `maxChunkSize`: Maximum characters in a chunk (default: 300)
  - `parallelGeneration`: Enable parallel TTS requests (default: true)
  - `maxParallelRequests`: Max concurrent TTS requests (default: 3)
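The streaming options above can be captured as a plain config object. A minimal sketch; the interface name `StreamingSpeechConfig` is an assumption for illustration, while the field names and defaults come from the list above:

```typescript
// Streaming speech options as documented above.
// The interface name is an assumption; field names/defaults are from the docs.
interface StreamingSpeechConfig {
  minChunkSize: number;        // minimum characters before generating speech
  maxChunkSize: number;        // maximum characters in a chunk
  parallelGeneration: boolean; // enable parallel TTS requests
  maxParallelRequests: number; // max concurrent TTS requests
}

// Documented defaults.
const defaultStreaming: StreamingSpeechConfig = {
  minChunkSize: 50,
  maxChunkSize: 300,
  parallelGeneration: true,
  maxParallelRequests: 3,
};
```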
Properties
isSpeaking
Returns `true` if the manager is currently processing and sending speech chunks.
pendingChunkCount
Returns the number of speech chunks queued for generation/sending.
hasSpeechModel
Returns `true` if a speech model is configured. If `false`, no TTS will be generated.
queueDonePromise
Returns a Promise that resolves when the speech queue is fully drained, or `undefined` if no speech is queued.
sendMessage
Callback function to send messages over WebSocket. Must be set by the parent agent.
Methods
generateSpeechFromText()
Generate speech audio from text using the configured speech model.
- The text to convert to speech
- Optional AbortSignal to cancel the generation request
Returns the generated audio as a Uint8Array.
Throws: Error if no speech model is configured.
Example:
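A self-contained sketch of the call shape, using a stubbed manager and a fake model so it runs without API keys. The stub class and fake model are assumptions; only the signature mirrors the description above:

```typescript
// Stub mirroring the documented signature (an assumption; not the real class).
class SpeechManagerStub {
  constructor(private model?: (text: string) => Promise<Uint8Array>) {}

  async generateSpeechFromText(text: string, signal?: AbortSignal): Promise<Uint8Array> {
    if (!this.model) throw new Error("No speech model configured");
    // Honor cancellation before doing any work.
    if (signal?.aborted) throw new Error("Speech generation aborted");
    return this.model(text);
  }
}

// Fake "TTS" that just encodes the text's bytes.
const manager = new SpeechManagerStub(async (t) => new TextEncoder().encode(t));
const controller = new AbortController();
const audio = await manager.generateSpeechFromText("Hello there.", controller.signal);
```

Passing `controller.signal` lets the caller (e.g. barge-in handling) cancel an in-flight request by calling `controller.abort()`.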
generateAndSendSpeechFull()
Generate speech for full text at once (non-streaming fallback).
- The complete text to convert to speech
Emits, in order:
- `speech_start` - before generation begins
- `audio` - when audio is ready
- `speech_complete` - after sending audio
- `error` - if generation fails
processTextDelta()
Process a text chunk from streaming LLM output. Automatically extracts sentences and queues speech generation.
- A chunk of text from the LLM stream
- Accumulates text in internal buffer
- Extracts complete sentences using regex patterns
- Respects `minChunkSize` and `maxChunkSize` config
- Handles incomplete sentences gracefully
- Automatically starts speech queue processing
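A rough sketch of the delta-buffering behavior described above, using the documented boundary regex and the 50-character minimum. The helper and variable names are illustrative, not the real implementation:

```typescript
// Illustrative re-implementation of the buffering behavior (not the real code).
const MIN_CHUNK_SIZE = 50;
let pendingTextBuffer = "";
const queuedChunks: string[] = [];

function processTextDelta(delta: string): void {
  pendingTextBuffer += delta;
  // Find the last complete sentence boundary in the buffer.
  const boundary = /[.!?]+(?:\s+|$)/g;
  let lastEnd = -1;
  for (const m of pendingTextBuffer.matchAll(boundary)) {
    lastEnd = m.index! + m[0].length;
  }
  // Only flush once enough complete-sentence text has accumulated.
  if (lastEnd >= MIN_CHUNK_SIZE) {
    queuedChunks.push(pendingTextBuffer.slice(0, lastEnd).trim());
    pendingTextBuffer = pendingTextBuffer.slice(lastEnd);
  }
}

processTextDelta("Hello there! This is a streaming ");
processTextDelta("response that keeps arriving. And it continues.");
```

The first call leaves everything buffered (only 13 characters sit before a sentence boundary); the second call pushes the accumulated sentences as one chunk and empties the buffer.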
flushPendingText()
Flush any remaining text in the buffer to speech. Call this when the LLM stream ends.
interruptSpeech()
Interrupt ongoing speech generation and playback (barge-in support).
- Reason for interruption (e.g., 'user_spoke', 'timeout')
- Aborts all pending TTS generation requests
- Clears the speech queue
- Clears the pending text buffer
- Sends `speech_interrupted` message to client
- Resolves the `queueDonePromise` immediately
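The abort-everything behavior can be sketched with a single AbortController guarding all in-flight requests. A minimal illustration with assumed variable names, not the real implementation:

```typescript
// Minimal barge-in sketch: one controller aborts every pending TTS request.
let abortController = new AbortController();
let speechQueue: string[] = ["chunk 1", "chunk 2"];
let pendingText = "partial sent";
let interruptedReason: string | null = null;

function interruptSpeech(reason: string): void {
  abortController.abort();                 // cancel all in-flight TTS requests
  abortController = new AbortController(); // fresh controller for future speech
  speechQueue = [];                        // drop queued chunks
  pendingText = "";                        // drop buffered text
  interruptedReason = reason;              // a speech_interrupted message would be sent here
}

interruptSpeech("user_spoke");
```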
reset()
Reset all speech state (used on disconnect or cleanup).
- Aborts current speech generation
- Clears all queues and buffers
- Resets speaking state
- Resolves pending promises
Events
The `SpeechManager` extends `EventEmitter` and emits the following events:
speech_start
Emitted when speech generation begins.
- The text being spoken (only for full-text mode)
- `true` for streaming mode, `false` for full-text mode
speech_complete
Emitted when all speech has been generated and sent.
speech_interrupted
Emitted when speech is interrupted.
speech_chunk_queued
Emitted when a speech chunk is queued for generation.
audio
Emitted when audio is generated (full-text mode).
- Base64-encoded audio data
- Audio format (e.g., 'opus', 'mp3')
- Raw audio data as Uint8Array
audio_chunk
Emitted when a speech chunk is ready (streaming mode).
- Sequential chunk identifier
- The text that was converted to speech
- Base64-encoded audio data
- Audio format
- Raw audio data
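The payload fields above can be modeled as an interface. The interface and field names here are assumptions for illustration; only the fields themselves come from the list above:

```typescript
// Assumed shape of the audio_chunk event payload (names illustrative).
interface AudioChunkEvent {
  sequence: number;    // sequential chunk identifier
  text: string;        // text that was converted to speech
  audioBase64: string; // base64-encoded audio data
  format: string;      // audio format, e.g. "opus"
  data: Uint8Array;    // raw audio data
}

const example: AudioChunkEvent = {
  sequence: 0,
  text: "Hello",
  audioBase64: "SGVsbG8=",
  format: "opus",
  data: new Uint8Array([72, 101, 108, 108, 111]),
};
```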
error
Emitted when speech generation fails.
Streaming Speech Architecture
The `SpeechManager` implements intelligent streaming speech with parallel generation:
Sentence Extraction
- Accumulates text in `pendingTextBuffer` as deltas arrive
- Extracts sentences using regex: `/[.!?]+(?:\s+|$)/g`
- Respects chunk size limits (minChunkSize: 50, maxChunkSize: 300)
- Handles clause splitting for long sentences using `/[,;:]\s+/g`
- Merges short sentences to avoid tiny speech chunks
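The two regexes can be exercised directly. A small sketch that extracts sentences with the documented boundary pattern and splits an over-long sentence at clause boundaries; the function names are illustrative and the logic is simplified relative to the real chunker:

```typescript
// The two regexes documented above.
const SENTENCE_BOUNDARY = /[.!?]+(?:\s+|$)/g;
const CLAUSE_BOUNDARY = /[,;:]\s+/g;
const MAX_CHUNK_SIZE = 300;

// Extract complete sentences using the documented boundary regex.
function extractSentences(text: string): string[] {
  const sentences: string[] = [];
  let start = 0;
  for (const m of text.matchAll(SENTENCE_BOUNDARY)) {
    const end = m.index! + m[0].length;
    sentences.push(text.slice(start, end).trim());
    start = end;
  }
  return sentences;
}

// Split an over-long sentence at clause boundaries.
function splitClauses(sentence: string): string[] {
  if (sentence.length <= MAX_CHUNK_SIZE) return [sentence];
  return sentence.split(CLAUSE_BOUNDARY).map((c) => c.trim());
}

const sentences = extractSentences("First point. Second point! A question? ");
```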
Parallel Generation
When `parallelGeneration: true` (default):
- Starts TTS immediately when sentence is extracted
- Limits concurrent requests (default: 3 parallel requests)
- Maintains order - sends chunks in sequence even if generated out of order
- Reduces latency - next chunk is ready by the time previous finishes
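The order-preserving trick above can be sketched as: fire requests eagerly, but release results strictly by sequence number. A simplified model (without the concurrency cap), not the real scheduler:

```typescript
// Simplified order-preserving parallel pipeline (illustrative).
async function generateAll(
  chunks: string[],
  tts: (text: string) => Promise<Uint8Array>,
  onChunk: (seq: number, audio: Uint8Array) => void,
): Promise<void> {
  // Start every request up front (the real scheduler would cap concurrency
  // at maxParallelRequests).
  const inflight = chunks.map((text) => tts(text));
  // Await in sequence order, so delivery order matches text order even
  // when later requests finish first.
  for (let seq = 0; seq < inflight.length; seq++) {
    onChunk(seq, await inflight[seq]);
  }
}

const delivered: number[] = [];
await generateAll(
  ["One.", "Two.", "Three."],
  // Fake TTS with randomized latency to simulate out-of-order completion.
  async (t) => {
    await new Promise((r) => setTimeout(r, Math.random() * 20));
    return new TextEncoder().encode(t);
  },
  (seq) => { delivered.push(seq); },
);
```

Even though the fake requests finish in arbitrary order, chunks are delivered as 0, 1, 2.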
Queue Processing
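A minimal model of sequential queue draining with a drain promise, assuming the `queueDonePromise` semantics described under Properties. The class and helper names are illustrative, not the real code:

```typescript
// Illustrative FIFO speech queue with a drain promise (not the real code).
class SpeechQueue {
  private queue: string[] = [];
  private processing = false;
  private resolveDone: (() => void) | null = null;
  queueDonePromise: Promise<void> | undefined;

  enqueue(text: string, send: (text: string) => Promise<void>): void {
    this.queue.push(text);
    if (!this.processing) {
      this.queueDonePromise = new Promise((r) => (this.resolveDone = r));
      void this.drain(send);
    }
  }

  private async drain(send: (text: string) => Promise<void>): Promise<void> {
    this.processing = true;
    while (this.queue.length > 0) {
      await send(this.queue.shift()!); // process chunks strictly in order
    }
    this.processing = false;
    this.resolveDone?.();              // queue fully drained
    this.queueDonePromise = undefined; // matches the documented undefined-when-empty
  }
}

const sent: string[] = [];
const q = new SpeechQueue();
q.enqueue("First.", async (t) => { sent.push(t); });
q.enqueue("Second.", async (t) => { sent.push(t); });
await q.queueDonePromise;
```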
Usage in Agent Architecture
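The typical flow, sketched against a stand-in manager. The real `SpeechManager` API surface is described above; this stub only mirrors the documented method names, and its bodies are placeholders:

```typescript
// Stand-in with the documented method names (assumption; not the real class).
class AgentSpeechStub {
  buffer = "";
  flushed: string[] = [];
  processTextDelta(delta: string): void { this.buffer += delta; }
  flushPendingText(): void {
    if (this.buffer) { this.flushed.push(this.buffer); this.buffer = ""; }
  }
  interruptSpeech(_reason: string): void { this.buffer = ""; this.flushed = []; }
}

// Typical agent wiring: feed LLM deltas in, flush when the stream ends,
// and call interruptSpeech("user_spoke") when the user barges in.
const speech = new AgentSpeechStub();
for (const delta of ["Hi ", "there."]) {
  speech.processTextDelta(delta); // sentence extraction would happen here
}
speech.flushPendingText();        // LLM stream ended; flush the remainder
```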
Performance Optimization
Tuning Parameters
- minChunkSize: 50 - Shorter = faster start, but more requests
- maxChunkSize: 300 - Prevents extremely long TTS requests
- maxParallelRequests: 3 - Balance between latency and API load
- parallelGeneration: true - Essential for low-latency streaming
Related
- WebSocketManager - Sends speech audio to clients
- StreamProcessor - Provides text deltas for speech generation
- VoiceAgent - Orchestrates speech generation with LLM responses