Overview
The `SpeechManager` class handles all text-to-speech (TTS) operations for voice agents. It supports both full-text and streaming speech generation, parallel TTS requests for reduced latency, and speech interruption for natural conversation flow (barge-in).
Key responsibilities:
- Generate speech from text using AI SDK speech models
- Stream speech as text arrives from LLM (sentence-by-sentence)
- Manage parallel TTS requests to minimize latency
- Handle speech interruption when user speaks
- Queue and process speech chunks in order
- Extract sentences from streaming text with intelligent chunking
src/core/SpeechManager.ts:24
Constructor
Configuration object for the speech manager:
- AI SDK speech model instance (e.g., `openai.speech('tts-1')`)
- Voice identifier for the TTS model (e.g., 'alloy', 'echo', 'nova')
- Optional instructions to guide speech generation (model-specific)
- Audio format for generated speech (e.g., 'opus', 'mp3', 'pcm')
- Configuration for streaming speech behavior:
  - `minChunkSize`: Minimum characters before generating speech (default: 50)
  - `maxChunkSize`: Maximum characters in a chunk (default: 300)
  - `parallelGeneration`: Enable parallel TTS requests (default: true)
  - `maxParallelRequests`: Max concurrent TTS requests (default: 3)
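The streaming options above can be captured as a plain config object. A minimal sketch; the interface name `StreamingSpeechConfig` is an assumption for illustration, while the field names and defaults come from the list above:

```typescript
// Streaming speech options as documented above.
// The interface name is an assumption; field names/defaults are from the docs.
interface StreamingSpeechConfig {
  minChunkSize: number;        // minimum characters before generating speech
  maxChunkSize: number;        // maximum characters in a chunk
  parallelGeneration: boolean; // enable parallel TTS requests
  maxParallelRequests: number; // max concurrent TTS requests
}

// Documented defaults.
const defaultStreaming: StreamingSpeechConfig = {
  minChunkSize: 50,
  maxChunkSize: 300,
  parallelGeneration: true,
  maxParallelRequests: 3,
};
```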
Properties
isSpeaking
Returns `true` if the manager is currently processing and sending speech chunks.
pendingChunkCount
Returns the number of speech chunks queued for generation/sending.
hasSpeechModel
Returns `true` if a speech model is configured. If `false`, no TTS will be generated.
queueDonePromise
Returns a Promise that resolves when the speech queue is fully drained, or `undefined` if no speech is queued.
sendMessage
Callback function to send messages over WebSocket. Must be set by the parent agent.
Methods
generateSpeechFromText()
Generate speech audio from text using the configured speech model.
- The text to convert to speech
- Optional AbortSignal to cancel the generation request
Returns the generated audio as a Uint8Array.
Throws: Error if no speech model is configured.
Example:
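A self-contained sketch of the call shape, using a stubbed manager and a fake model so it runs without API keys. The stub class and fake model are assumptions; only the signature mirrors the description above:

```typescript
// Stub mirroring the documented signature (an assumption; not the real class).
class SpeechManagerStub {
  constructor(private model?: (text: string) => Promise<Uint8Array>) {}

  async generateSpeechFromText(text: string, signal?: AbortSignal): Promise<Uint8Array> {
    if (!this.model) throw new Error("No speech model configured");
    // Honor cancellation before doing any work.
    if (signal?.aborted) throw new Error("Speech generation aborted");
    return this.model(text);
  }
}

// Fake "TTS" that just encodes the text's bytes.
const manager = new SpeechManagerStub(async (t) => new TextEncoder().encode(t));
const controller = new AbortController();
const audio = await manager.generateSpeechFromText("Hello there.", controller.signal);
```

Passing `controller.signal` lets the caller (e.g. barge-in handling) cancel an in-flight request by calling `controller.abort()`.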
generateAndSendSpeechFull()
Generate speech for full text at once (non-streaming fallback).
- The complete text to convert to speech
Emits, in order:
- `speech_start` - before generation begins
- `audio` - when audio is ready
- `speech_complete` - after sending audio
- `error` - if generation fails
processTextDelta()
Process a text chunk from streaming LLM output. Automatically extracts sentences and queues speech generation.
- A chunk of text from the LLM stream
- Accumulates text in internal buffer
- Extracts complete sentences using regex patterns
- Respects `minChunkSize` and `maxChunkSize` config
- Handles incomplete sentences gracefully
- Automatically starts speech queue processing
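A rough sketch of the delta-buffering behavior described above, using the documented boundary regex and the 50-character minimum. The helper and variable names are illustrative, not the real implementation:

```typescript
// Illustrative re-implementation of the buffering behavior (not the real code).
const MIN_CHUNK_SIZE = 50;
let pendingTextBuffer = "";
const queuedChunks: string[] = [];

function processTextDelta(delta: string): void {
  pendingTextBuffer += delta;
  // Find the last complete sentence boundary in the buffer.
  const boundary = /[.!?]+(?:\s+|$)/g;
  let lastEnd = -1;
  for (const m of pendingTextBuffer.matchAll(boundary)) {
    lastEnd = m.index! + m[0].length;
  }
  // Only flush once enough complete-sentence text has accumulated.
  if (lastEnd >= MIN_CHUNK_SIZE) {
    queuedChunks.push(pendingTextBuffer.slice(0, lastEnd).trim());
    pendingTextBuffer = pendingTextBuffer.slice(lastEnd);
  }
}

processTextDelta("Hello there! This is a streaming ");
processTextDelta("response that keeps arriving. And it continues.");
```

The first call leaves everything buffered (only 13 characters sit before a sentence boundary); the second call pushes the accumulated sentences as one chunk and empties the buffer.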
flushPendingText()
Flush any remaining text in the buffer to speech. Call this when the LLM stream ends.
interruptSpeech()
Interrupt ongoing speech generation and playback (barge-in support).
- Reason for interruption (e.g., 'user_spoke', 'timeout')
- Aborts all pending TTS generation requests
- Clears the speech queue
- Clears the pending text buffer
- Sends `speech_interrupted` message to client
- Resolves the `queueDonePromise` immediately
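The abort-everything behavior can be sketched with a single AbortController guarding all in-flight requests. A minimal illustration with assumed variable names, not the real implementation:

```typescript
// Minimal barge-in sketch: one controller aborts every pending TTS request.
let abortController = new AbortController();
let speechQueue: string[] = ["chunk 1", "chunk 2"];
let pendingText = "partial sent";
let interruptedReason: string | null = null;

function interruptSpeech(reason: string): void {
  abortController.abort();                 // cancel all in-flight TTS requests
  abortController = new AbortController(); // fresh controller for future speech
  speechQueue = [];                        // drop queued chunks
  pendingText = "";                        // drop buffered text
  interruptedReason = reason;              // a speech_interrupted message would be sent here
}

interruptSpeech("user_spoke");
```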
reset()
Reset all speech state (used on disconnect or cleanup).
- Aborts current speech generation
- Clears all queues and buffers
- Resets speaking state
- Resolves pending promises
Events
The `SpeechManager` extends `EventEmitter` and emits the following events:
speech_start
Emitted when speech generation begins.
- The text being spoken (only for full-text mode)
- `true` for streaming mode, `false` for full-text mode
speech_complete
Emitted when all speech has been generated and sent.
speech_interrupted
Emitted when speech is interrupted.
speech_chunk_queued
Emitted when a speech chunk is queued for generation.
audio
Emitted when audio is generated (full-text mode).
- Base64-encoded audio data
- Audio format (e.g., 'opus', 'mp3')
- Raw audio data as Uint8Array
audio_chunk
Emitted when a speech chunk is ready (streaming mode).
- Sequential chunk identifier
- The text that was converted to speech
- Base64-encoded audio data
- Audio format
- Raw audio data
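The payload fields above can be modeled as an interface. The interface and field names here are assumptions for illustration; only the fields themselves come from the list above:

```typescript
// Assumed shape of the audio_chunk event payload (names illustrative).
interface AudioChunkEvent {
  sequence: number;    // sequential chunk identifier
  text: string;        // text that was converted to speech
  audioBase64: string; // base64-encoded audio data
  format: string;      // audio format, e.g. "opus"
  data: Uint8Array;    // raw audio data
}

const example: AudioChunkEvent = {
  sequence: 0,
  text: "Hello",
  audioBase64: "SGVsbG8=",
  format: "opus",
  data: new Uint8Array([72, 101, 108, 108, 111]),
};
```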
error
Emitted when speech generation fails.
Streaming Speech Architecture
The `SpeechManager` implements intelligent streaming speech with parallel generation:
Sentence Extraction
- Accumulates text in `pendingTextBuffer` as deltas arrive
- Extracts sentences using regex: `/[.!?]+(?:\s+|$)/g`
- Respects chunk size limits (minChunkSize: 50, maxChunkSize: 300)
- Handles clause splitting for long sentences using `/[,;:]\s+/g`
- Merges short sentences to avoid tiny speech chunks
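The two regexes can be exercised directly. A small sketch that extracts sentences with the documented boundary pattern and splits an over-long sentence at clause boundaries; the function names are illustrative and the logic is simplified relative to the real chunker:

```typescript
// The two regexes documented above.
const SENTENCE_BOUNDARY = /[.!?]+(?:\s+|$)/g;
const CLAUSE_BOUNDARY = /[,;:]\s+/g;
const MAX_CHUNK_SIZE = 300;

// Extract complete sentences using the documented boundary regex.
function extractSentences(text: string): string[] {
  const sentences: string[] = [];
  let start = 0;
  for (const m of text.matchAll(SENTENCE_BOUNDARY)) {
    const end = m.index! + m[0].length;
    sentences.push(text.slice(start, end).trim());
    start = end;
  }
  return sentences;
}

// Split an over-long sentence at clause boundaries.
function splitClauses(sentence: string): string[] {
  if (sentence.length <= MAX_CHUNK_SIZE) return [sentence];
  return sentence.split(CLAUSE_BOUNDARY).map((c) => c.trim());
}

const sentences = extractSentences("First point. Second point! A question? ");
```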
Parallel Generation
When `parallelGeneration: true` (default):
- Starts TTS immediately when sentence is extracted
- Limits concurrent requests (default: 3 parallel requests)
- Maintains order - sends chunks in sequence even if generated out of order
- Reduces latency - next chunk is ready by the time previous finishes
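The order-preserving trick above can be sketched as: fire requests eagerly, but release results strictly by sequence number. A simplified model (without the concurrency cap), not the real scheduler:

```typescript
// Simplified order-preserving parallel pipeline (illustrative).
async function generateAll(
  chunks: string[],
  tts: (text: string) => Promise<Uint8Array>,
  onChunk: (seq: number, audio: Uint8Array) => void,
): Promise<void> {
  // Start every request up front (the real scheduler would cap concurrency
  // at maxParallelRequests).
  const inflight = chunks.map((text) => tts(text));
  // Await in sequence order, so delivery order matches text order even
  // when later requests finish first.
  for (let seq = 0; seq < inflight.length; seq++) {
    onChunk(seq, await inflight[seq]);
  }
}

const delivered: number[] = [];
await generateAll(
  ["One.", "Two.", "Three."],
  // Fake TTS with randomized latency to simulate out-of-order completion.
  async (t) => {
    await new Promise((r) => setTimeout(r, Math.random() * 20));
    return new TextEncoder().encode(t);
  },
  (seq) => { delivered.push(seq); },
);
```

Even though the fake requests finish in arbitrary order, chunks are delivered as 0, 1, 2.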
Queue Processing
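A minimal model of sequential queue draining with a drain promise, assuming the `queueDonePromise` semantics described under Properties. The class and helper names are illustrative, not the real code:

```typescript
// Illustrative FIFO speech queue with a drain promise (not the real code).
class SpeechQueue {
  private queue: string[] = [];
  private processing = false;
  private resolveDone: (() => void) | null = null;
  queueDonePromise: Promise<void> | undefined;

  enqueue(text: string, send: (text: string) => Promise<void>): void {
    this.queue.push(text);
    if (!this.processing) {
      this.queueDonePromise = new Promise((r) => (this.resolveDone = r));
      void this.drain(send);
    }
  }

  private async drain(send: (text: string) => Promise<void>): Promise<void> {
    this.processing = true;
    while (this.queue.length > 0) {
      await send(this.queue.shift()!); // process chunks strictly in order
    }
    this.processing = false;
    this.resolveDone?.();              // queue fully drained
    this.queueDonePromise = undefined; // matches the documented undefined-when-empty
  }
}

const sent: string[] = [];
const q = new SpeechQueue();
q.enqueue("First.", async (t) => { sent.push(t); });
q.enqueue("Second.", async (t) => { sent.push(t); });
await q.queueDonePromise;
```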
Usage in Agent Architecture
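The typical flow, sketched against a stand-in manager. The real `SpeechManager` API surface is described above; this stub only mirrors the documented method names, and its bodies are placeholders:

```typescript
// Stand-in with the documented method names (assumption; not the real class).
class AgentSpeechStub {
  buffer = "";
  flushed: string[] = [];
  processTextDelta(delta: string): void { this.buffer += delta; }
  flushPendingText(): void {
    if (this.buffer) { this.flushed.push(this.buffer); this.buffer = ""; }
  }
  interruptSpeech(_reason: string): void { this.buffer = ""; this.flushed = []; }
}

// Typical agent wiring: feed LLM deltas in, flush when the stream ends,
// and call interruptSpeech("user_spoke") when the user barges in.
const speech = new AgentSpeechStub();
for (const delta of ["Hi ", "there."]) {
  speech.processTextDelta(delta); // sentence extraction would happen here
}
speech.flushPendingText();        // LLM stream ended; flush the remainder
```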
Performance Optimization
Tuning Parameters
- minChunkSize: 50 - Shorter = faster start, but more requests
- maxChunkSize: 300 - Prevents extremely long TTS requests
- maxParallelRequests: 3 - Balance between latency and API load
- parallelGeneration: true - Essential for low-latency streaming
Related
- WebSocketManager - Sends speech audio to clients
- StreamProcessor - Provides text deltas for speech generation
- VoiceAgent - Orchestrates speech generation with LLM responses