Overview
Streaming speech synthesis splits LLM-generated text into sentence-sized chunks and converts them to audio in parallel as the text streams in. This dramatically reduces time-to-first-audio compared to waiting for the full response.
Why Streaming Speech?
Low Latency: First audio plays in ~1-2 seconds instead of 10-30 seconds
Natural Flow: Audio starts playing before the LLM finishes generating
Parallel TTS: Multiple chunks generate audio simultaneously
Interruptible: Barge-in cancels pending chunks instantly
How It Works
The SpeechManager processes text deltas from streamText() in real-time:
Processing Flow
1. Text delta arrives from the LLM (e.g., "Hello! How")
2. Buffer accumulates text until a sentence boundary is found
3. Sentence extracted when . ! ? is detected (e.g., "Hello!")
4. Chunk queued for TTS generation
5. Parallel generation starts immediately (up to maxParallelRequests)
6. Audio ready: chunk sent to the client via WebSocket
7. Sequential playback: the client plays chunks in order
8. Repeat for the next sentence
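The buffering and extraction steps above can be sketched in a few lines. This is an illustrative stand-in, not the library's implementation: `extractSentences` and the driver loop are hypothetical names, using the same boundary pattern the SpeechManager relies on.

```typescript
// Illustrative sketch of delta buffering; extractSentences is a
// hypothetical helper, not the library's public API.
const sentenceEndPattern = /[.!?]+(?:\s+|$)/g;

function extractSentences(buffer: string): { sentences: string[]; remaining: string } {
  const sentences: string[] = [];
  let last = 0;
  sentenceEndPattern.lastIndex = 0; // reset shared global regex state
  let match: RegExpExecArray | null;
  while ((match = sentenceEndPattern.exec(buffer)) !== null) {
    sentences.push(buffer.slice(last, match.index + match[0].length).trim());
    last = match.index + match[0].length;
  }
  return { sentences, remaining: buffer.slice(last) };
}

// Simulate text deltas streaming in from the LLM:
let buffer = '';
for (const delta of ['Hello! How', ' are you doing', ' today?']) {
  buffer += delta;
  const { sentences, remaining } = extractSentences(buffer);
  buffer = remaining;
  sentences.forEach((s) => console.log('queued:', s)); // each sentence becomes a TTS chunk
}
```

Note that incomplete text ("How are you doing") simply stays in the buffer until a later delta completes the sentence.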
Configuration
StreamingSpeechConfig
interface StreamingSpeechConfig {
  /** Minimum characters before generating speech for a chunk */
  minChunkSize: number;
  /** Maximum characters per chunk (will split at sentence boundary before this) */
  maxChunkSize: number;
  /** Whether to enable parallel TTS generation */
  parallelGeneration: boolean;
  /** Maximum number of parallel TTS requests */
  maxParallelRequests: number;
}
Default Configuration
const DEFAULT_STREAMING_SPEECH_CONFIG = {
  minChunkSize: 50,
  maxChunkSize: 200,
  parallelGeneration: true,
  maxParallelRequests: 3,
};
Customizing
import { VoiceAgent } from 'voice-agent-ai-sdk';
import { openai } from '@ai-sdk/openai';

const agent = new VoiceAgent({
  model: openai('gpt-4o'),
  speechModel: openai.speech('gpt-4o-mini-tts'),
  streamingSpeech: {
    minChunkSize: 40,        // Start TTS after 40 chars (faster, more chunks)
    maxChunkSize: 180,       // Force split at 180 chars (smaller chunks)
    parallelGeneration: true,
    maxParallelRequests: 2,  // Limit concurrent TTS requests
  },
});
Sentence Boundary Detection
The SpeechManager extracts sentences using pattern matching:
// Match sentences ending with . ! ? followed by whitespace or end of string
const sentenceEndPattern = /[.!?]+(?:\s+|$)/g;
Examples
Text Buffer             Extracted      Remaining
"Hello! How are you"    "Hello!"       "How are you"
"I'm fine. Thanks!"     "I'm fine."    "Thanks!"
"Wait..."               "Wait..."      ""
"Almost there"          (none)         "Almost there" (no boundary yet)
Minimum Chunk Size
Sentences shorter than minChunkSize are appended to the previous chunk:
if (sentence.length >= this.streamingSpeechConfig.minChunkSize) {
  sentences.push(sentence);
} else if (sentences.length > 0) {
  // Append to the previous sentence
  sentences[sentences.length - 1] += ' ' + sentence;
}
This prevents generating TTS for very short fragments like "Yes." or "Ok!".
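A self-contained sketch of this merge step (the `mergeShort` helper is hypothetical; the manager does this inline while extracting sentences):

```typescript
// Hypothetical helper mirroring the merge logic above: short sentences
// are folded into the previous chunk instead of becoming tiny TTS requests.
function mergeShort(raw: string[], minChunkSize: number): string[] {
  const sentences: string[] = [];
  for (const sentence of raw) {
    if (sentence.length >= minChunkSize) {
      sentences.push(sentence);
    } else if (sentences.length > 0) {
      // Append to the previous sentence
      sentences[sentences.length - 1] += ' ' + sentence;
    } else {
      sentences.push(sentence); // nothing to merge into yet; keep it
    }
  }
  return sentences;
}

console.log(mergeShort(['That is a genuinely interesting question.', 'Yes.', 'Ok!'], 20));
// → ['That is a genuinely interesting question. Yes. Ok!']
```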
Clause Splitting (Fallback)
If remaining text exceeds maxChunkSize without a sentence boundary, the manager force-splits at clause boundaries (, ; :):
if (remaining.length > this.streamingSpeechConfig.maxChunkSize) {
  const clausePattern = /[,;:]\s+/g;
  // Find the first clause boundary after minChunkSize
  // and split there to avoid excessively long chunks
}
Example:
Input: "The quick brown fox jumps over the lazy dog, and then it runs through the forest, chasing a rabbit, until it reaches the river"
maxChunkSize = 100
Chunks:
1. "The quick brown fox jumps over the lazy dog,"
2. "and then it runs through the forest,"
3. "chasing a rabbit, until it reaches the river"
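One possible implementation of this fallback is sketched below. The `splitAtClauses` helper is illustrative only; the library's actual chunking may place boundaries differently than the example above, but every emitted chunk stays under maxChunkSize.

```typescript
// Illustrative clause-boundary fallback: while the remaining text is over
// maxChunkSize, split at the first clause boundary past minChunkSize.
function splitAtClauses(text: string, minChunkSize: number, maxChunkSize: number): string[] {
  const chunks: string[] = [];
  let remaining = text;
  while (remaining.length > maxChunkSize) {
    const clausePattern = /[,;:]\s+/g;
    clausePattern.lastIndex = minChunkSize; // only look past the minimum size
    const match = clausePattern.exec(remaining);
    if (!match) break; // no clause boundary found; keep the long chunk
    const end = match.index + match[0].length;
    chunks.push(remaining.slice(0, end).trimEnd());
    remaining = remaining.slice(end);
  }
  if (remaining.length > 0) chunks.push(remaining);
  return chunks;
}
```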
Parallel TTS Generation
When parallelGeneration is enabled, the manager starts generating audio for upcoming chunks while the current chunk plays:
// Start generating next chunks in parallel
if (this.streamingSpeechConfig.parallelGeneration) {
  const activeRequests = this.speechChunkQueue.filter(
    (c) => c.audioPromise
  ).length;
  const toStart = Math.min(
    this.streamingSpeechConfig.maxParallelRequests - activeRequests,
    this.speechChunkQueue.length
  );
  for (let i = 0; i < toStart; i++) {
    const nextChunk = this.speechChunkQueue.find((c) => !c.audioPromise);
    if (nextChunk) {
      nextChunk.audioPromise = this.generateChunkAudio(nextChunk);
    }
  }
}
Benefits
Reduced wait time between chunks (audio ready when previous finishes)
Better throughput for long responses
Configurable concurrency to control API rate limits
Limits
maxParallelRequests controls how many TTS requests run concurrently:
Too low (1): No parallelism, chunks generate sequentially
Optimal (2-3): Good balance of speed and API usage
Too high (5+): May hit rate limits, no significant gain
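The effect of maxParallelRequests can be pictured as a bounded worker pool. The generic sketch below is not the library's implementation; it just shows how a fixed number of workers keeps at most `limit` tasks in flight while preserving result order:

```typescript
// Generic bounded-concurrency runner: at most `limit` tasks in flight,
// results returned in submission order (illustrative sketch only).
async function runLimited<T>(tasks: Array<() => Promise<T>>, limit: number): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const i = next++; // claim the next task index
      results[i] = await tasks[i]();
    }
  }
  const workers = Array.from({ length: Math.min(limit, tasks.length) }, () => worker());
  await Promise.all(workers);
  return results;
}
```

With limit = 1 this degenerates to sequential generation; raising it past the TTS provider's rate limit buys nothing.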
Speech Queue
Chunks are queued and processed in order:
interface SpeechChunk {
  id: number;                                // Sequential chunk ID
  text: string;                              // Text to convert to speech
  audioPromise?: Promise<Uint8Array | null>; // TTS generation promise
}

private speechChunkQueue: SpeechChunk[] = [];
private nextChunkId = 0;
Queue Processing
while (this.speechChunkQueue.length > 0) {
  const chunk = this.speechChunkQueue[0];

  // Ensure audio generation has started
  if (!chunk.audioPromise) {
    chunk.audioPromise = this.generateChunkAudio(chunk);
  }

  // Wait for this chunk's audio
  const audioData = await chunk.audioPromise;

  // Check if interrupted while waiting
  if (!this._isSpeaking) break;

  // Send chunk to client (base64Audio is audioData encoded as base64)
  this.sendMessage({
    type: 'audio_chunk',
    chunkId: chunk.id,
    data: base64Audio,
    format: this.outputFormat,
    text: chunk.text,
  });

  // Remove from queue
  this.speechChunkQueue.shift();

  // Start generating next chunks in parallel
  // ...
}
Events
speech_chunk_queued: A text chunk was queued for TTS generation
{
  id: number;    // Chunk ID
  text: string;  // Chunk text
}

audio_chunk: TTS audio for a chunk is ready
{
  chunkId: number;        // Matches speech_chunk_queued id
  data: string;           // Base64-encoded audio
  format: string;         // 'mp3', 'opus', 'wav', etc.
  text: string;           // Original chunk text
  uint8Array: Uint8Array; // Raw audio bytes
}

TTS generation started
{
  streaming: boolean; // true for chunked, false for full text
}

speech_interrupted: Speech generation cancelled (barge-in)
{
  reason: string; // 'user_speaking', 'interrupted', etc.
}
Interruption Support
The agent can interrupt ongoing speech generation:
// Interrupt speech only (LLM keeps running)
agent.interruptSpeech('user_speaking');

// Interrupt both LLM stream and speech (full barge-in)
agent.interruptCurrentResponse('user_speaking');
What Happens on Interrupt
Abort controller cancels all pending TTS requests
Speech queue is cleared
Pending text buffer is emptied
Client receives speech_interrupted message
Client stops playing audio immediately
interruptSpeech(reason: string = 'interrupted'): void {
  // Abort pending TTS generation
  if (this.currentSpeechAbortController) {
    this.currentSpeechAbortController.abort();
  }

  // Clear queue and state
  this.speechChunkQueue = [];
  this.pendingTextBuffer = '';
  this._isSpeaking = false;

  // Notify client
  this.sendMessage({
    type: 'speech_interrupted',
    reason,
  });
}
Non-Streaming Fallback
For full-text TTS (non-chunked):
await agent.generateAndSendSpeechFull('Complete response text here.');
This generates audio for the entire text at once (higher latency, simpler).
Example: Listening to Speech Events
import { VoiceAgent } from 'voice-agent-ai-sdk';
import { openai } from '@ai-sdk/openai';
import fs from 'fs';

const agent = new VoiceAgent({
  model: openai('gpt-4o'),
  speechModel: openai.speech('gpt-4o-mini-tts'),
  streamingSpeech: {
    minChunkSize: 40,
    maxChunkSize: 180,
    parallelGeneration: true,
    maxParallelRequests: 2,
  },
});

// Track chunk generation
let chunkCount = 0;
agent.on('speech_chunk_queued', ({ id, text }) => {
  console.log(`Chunk #${id} queued: "${text.substring(0, 50)}..."`);
  chunkCount++;
});

// Save audio chunks to files
agent.on('audio_chunk', ({ chunkId, uint8Array, format }) => {
  fs.writeFileSync(`chunk_${chunkId}.${format}`, uint8Array);
  console.log(`Chunk #${chunkId} saved (${uint8Array.length} bytes)`);
});

agent.on('speech_complete', () => {
  console.log(`Speech complete! Generated ${chunkCount} chunks.`);
});

// Test it
await agent.sendText('Tell me a story about a robot learning to paint.');
For Lower Latency
streamingSpeech: {
  minChunkSize: 30,        // Start TTS sooner
  maxChunkSize: 150,       // Smaller chunks
  parallelGeneration: true,
  maxParallelRequests: 3,  // More parallelism
}
Trade-offs:
More API requests (higher cost)
More network messages
Potentially choppy if chunks too small
For Lower Cost
streamingSpeech: {
  minChunkSize: 80,        // Larger chunks
  maxChunkSize: 300,
  parallelGeneration: true,
  maxParallelRequests: 1,  // Sequential generation
}
Trade-offs:
Higher latency between chunks
Fewer API requests
Less responsive to interruption
For Voice Quality
streamingSpeech: {
  minChunkSize: 60,
  maxChunkSize: 200,       // ~1-2 sentences per chunk
  parallelGeneration: true,
  maxParallelRequests: 2,
}
Balances natural pacing with responsiveness.
Limitations
Sentence detection is heuristic-based and may split incorrectly on:
Abbreviations (“Dr. Smith”)
Decimals (“3.14”)
Ellipses (“Wait…maybe”)
For most conversational use cases, this is acceptable.
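You can see the abbreviation failure mode directly with the boundary pattern from above. The `splitNaive` helper here is an illustrative re-implementation, not the library's code:

```typescript
// Demonstrates the heuristic mis-splitting on abbreviations: "Dr." looks
// like a sentence end, while the decimal in "3.14" happens to survive
// because a digit (not whitespace) follows the period.
const sentenceEndPattern = /[.!?]+(?:\s+|$)/g;

function splitNaive(text: string): string[] {
  const out: string[] = [];
  let last = 0;
  sentenceEndPattern.lastIndex = 0; // reset shared global regex state
  let m: RegExpExecArray | null;
  while ((m = sentenceEndPattern.exec(text)) !== null) {
    out.push(text.slice(last, m.index + m[0].length).trim());
    last = m.index + m[0].length;
  }
  if (last < text.length) out.push(text.slice(last));
  return out;
}

console.log(splitNaive('Dr. Smith arrived. Pi is 3.14 exactly.'));
// "Dr." is split off as its own (too-short) chunk
```

In practice the minChunkSize merge usually hides this: a stray "Dr." fragment gets appended to an adjacent chunk before TTS runs.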
Parallel generation requires AbortSignal support in your TTS provider. The AI SDK’s experimental_generateSpeech supports this via abortSignal.
Best Practices
The default configuration works well for most use cases:

streamingSpeech: {
  minChunkSize: 50,
  maxChunkSize: 200,
  parallelGeneration: true,
  maxParallelRequests: 3,
}
Log speech_chunk_queued events to see how many chunks are generated:

agent.on('speech_chunk_queued', ({ id }) => {
  console.log(`Chunk #${id} queued`);
});
If you see too many chunks (>10 for a typical response), increase minChunkSize.
Always support barge-in for better UX:

agent.on('speech_interrupted', ({ reason }) => {
  console.log(`Speech interrupted: ${reason}`);
});
Different LLMs have different output styles. Test with your actual prompts to tune chunk sizes.
Next Steps
VoiceAgent: Learn about the voice agent architecture
Memory Management: Configure conversation history limits
WebSocket Protocol: Explore speech message types
API Reference: Full VoiceAgent API documentation