Overview

The TranscriptionManager class provides speech-to-text (STT) transcription for voice agents. It receives incoming audio data, validates it, transcribes it using AI SDK transcription models (e.g., Whisper), and sends the results back to clients. Key responsibilities:
  • Transcribe audio data to text using AI SDK transcription models
  • Validate incoming audio size against configured limits
  • Decode base64-encoded audio from WebSocket messages
  • Send transcription results back to clients
  • Emit transcription events for agent processing
Location: src/core/TranscriptionManager.ts:17

Constructor

new TranscriptionManager(options?: TranscriptionManagerOptions)
  • options (TranscriptionManagerOptions, optional) - Configuration object
  • options.transcriptionModel (TranscriptionModel) - AI SDK transcription model instance (e.g., openai.transcription('whisper-1'))
  • options.maxAudioInputSize (number, default: 26214400) - Maximum audio size in bytes (default: 25 MB). Audio larger than this will be rejected.
Example:
import { openai } from '@ai-sdk/openai';

const transcriptionManager = new TranscriptionManager({
  transcriptionModel: openai.transcription('whisper-1'),
  maxAudioInputSize: 10 * 1024 * 1024 // 10 MB
});

Properties

hasTranscriptionModel

get hasTranscriptionModel(): boolean
Returns true if a transcription model is configured. If false, transcription requests will fail.
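A caller can use this getter to fail fast before accepting audio. A minimal sketch (the function name and `TranscriptionLike` shape are illustrative; the error message mirrors the configuration error documented under the WebSocket protocol below):

```typescript
// Sketch: reject audio early when no transcription model is configured.
// `TranscriptionLike` is any object exposing the documented surface.
interface TranscriptionLike {
  hasTranscriptionModel: boolean;
  sendMessage: (message: Record<string, unknown>) => void;
}

function guardAudioInput(manager: TranscriptionLike): boolean {
  if (!manager.hasTranscriptionModel) {
    // Mirrors the documented configuration-error message
    manager.sendMessage({
      type: "error",
      error: "Transcription model not configured for audio input",
    });
    return false;
  }
  return true;
}
```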

sendMessage

public sendMessage: (message: Record<string, unknown>) => void
Required. Callback function used to send messages over the WebSocket. Must be set by the parent agent.
Example:
transcriptionManager.sendMessage = (msg) => {
  wsManager.send(msg);
};

Methods

transcribeAudio()

Transcribe audio data to text using the configured transcription model.
transcribeAudio(audioData: Buffer | Uint8Array): Promise<string>
  • audioData (Buffer | Uint8Array, required) - Raw audio data in a format supported by the transcription model (e.g., WAV, MP3, OGG)
Returns: Promise resolving to the transcribed text.
Throws: Error if no transcription model is configured or if transcription fails.
Example:
const audioBuffer = fs.readFileSync('recording.wav');
const text = await transcriptionManager.transcribeAudio(audioBuffer);
console.log('Transcribed:', text);
Events emitted:
  • transcription - When transcription succeeds
WebSocket messages sent:
  • transcription_result - Contains transcribed text and detected language
Implementation notes:
  • Uses AI SDK's experimental_transcribe function
  • Logs audio size and transcription result
  • Returns the detected language in the transcription event
  • Automatically sends result to client via WebSocket
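The notes above can be sketched as a flow with the transcribe call injected, so the side effects are visible without a live model. transcribeFlow and doTranscribe are illustrative names, not part of the class, and the ordering of event vs. message is an assumption:

```typescript
// Sketch of the internal flow. In the real class, `doTranscribe` would be
// AI SDK's experimental_transcribe bound to the configured model.
interface TranscribeResult { text: string; language?: string }

async function transcribeFlow(
  audioData: Uint8Array,
  doTranscribe: (audio: Uint8Array) => Promise<TranscribeResult>,
  emit: (event: string, data: unknown) => void,
  sendMessage: (m: Record<string, unknown>) => void,
): Promise<string> {
  const result = await doTranscribe(audioData);
  // Emit the transcription event with the detected language for agent processing
  emit("transcription", { text: result.text, language: result.language });
  // Automatically notify the client over the WebSocket
  sendMessage({ type: "transcription_result", text: result.text, language: result.language });
  return result.text;
}
```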

processAudioInput()

Process incoming base64-encoded audio: validate, decode, and transcribe.
processAudioInput(
  base64Audio: string,
  format?: string
): Promise<string | null>
  • base64Audio (string, required) - Base64-encoded audio data from the client
  • format (string, optional) - Audio format hint (e.g., 'wav', 'mp3', 'ogg', 'webm')
Returns: Promise resolving to the transcribed text, or null if validation fails or the transcription result is empty.
Example:
// Handle incoming audio from WebSocket
wsManager.on('message', async (msg) => {
  if (msg.type === 'audio_input') {
    const text = await transcriptionManager.processAudioInput(
      msg.data,
      msg.format
    );
    
    if (text) {
      console.log('User said:', text);
      await handleUserInput(text);
    }
  }
});
Validation steps:
  1. Check model configured - Returns null if no transcription model
  2. Decode base64 - Converts to Buffer
  3. Validate size - Rejects if exceeds maxAudioInputSize
  4. Check empty - Returns null if buffer is empty
  5. Transcribe - Calls transcribeAudio()
  6. Validate result - Returns null if transcription is empty
Events emitted:
  • audio_received - When audio is received and validated
  • transcription - When transcription succeeds
  • warning - For empty audio or empty transcription results
  • error - For size limits or transcription failures
WebSocket messages sent:
  • transcription_result - On success
  • transcription_error - On failure
  • error - For configuration errors

Events

The TranscriptionManager extends EventEmitter and emits the following events:

transcription

Emitted when audio is successfully transcribed.
transcriptionManager.on('transcription', (data) => {
  console.log(`Transcribed (${data.language}): ${data.text}`);
});
  • data.text (string) - The transcribed text
  • data.language (string | undefined) - The detected language code (e.g., 'en', 'es', 'fr') if available

audio_received

Emitted when audio data is received and validated.
transcriptionManager.on('audio_received', (data) => {
  console.log(`Received ${data.size} bytes of ${data.format} audio`);
});
  • data.size (number) - Audio data size in bytes
  • data.format (string | undefined) - Audio format if provided

warning

Emitted for non-critical issues (empty audio, empty transcription).
transcriptionManager.on('warning', (message) => {
  console.warn('Transcription warning:', message);
});

error

Emitted when transcription fails or validation errors occur.
transcriptionManager.on('error', (error) => {
  console.error('Transcription error:', error);
});

WebSocket Message Protocol

Incoming Messages

Audio input from client:
{
  "type": "audio_input",
  "data": "<base64-encoded-audio>",
  "format": "webm" // optional
}
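A client can build this message from raw audio bytes like so (the helper name is illustrative; the base64 step is shown with Node's Buffer, while a browser would use FileReader or btoa):

```typescript
// Sketch: package raw audio bytes into the documented audio_input message.
// The format hint is optional, matching the protocol above.
function buildAudioInputMessage(
  audio: Uint8Array,
  format?: string,
): Record<string, unknown> {
  const message: Record<string, unknown> = {
    type: "audio_input",
    data: Buffer.from(audio).toString("base64"),
  };
  if (format) message.format = format; // omit the key entirely when no hint is given
  return message;
}
```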

Outgoing Messages

Transcription result:
{
  "type": "transcription_result",
  "text": "Hello, how are you?",
  "language": "en"
}
Transcription error:
{
  "type": "transcription_error",
  "error": "Transcription failed: Model timeout"
}
Configuration error:
{
  "type": "error",
  "error": "Transcription model not configured for audio input"
}
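A client-side handler over these three message types might look like the following sketch (the returned strings are placeholders for real UI actions):

```typescript
// Sketch: dispatch over the three documented outgoing message types.
type ServerMessage =
  | { type: "transcription_result"; text: string; language?: string }
  | { type: "transcription_error"; error: string }
  | { type: "error"; error: string };

function handleServerMessage(msg: ServerMessage): string {
  switch (msg.type) {
    case "transcription_result":
      return `transcribed: ${msg.text}`; // show the text to the user
    case "transcription_error":
      return `retry: ${msg.error}`;      // transient failure; offer a retry
    case "error":
      return `fatal: ${msg.error}`;      // configuration problem; surface to the developer
  }
}
```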

Usage in Agent Architecture

class VoiceAgent {
  private transcriptionManager: TranscriptionManager;
  private wsManager: WebSocketManager;

  constructor(config: VoiceAgentConfig) {
    this.transcriptionManager = new TranscriptionManager({
      transcriptionModel: config.transcriptionModel,
      maxAudioInputSize: config.maxAudioInputSize
    });
    
    // Connect to WebSocket
    this.transcriptionManager.sendMessage = (msg) => {
      this.wsManager.send(msg);
    };
    
    // Handle transcription results
    this.transcriptionManager.on('transcription', (data) => {
      console.log(`User said: ${data.text}`);
      this.processUserInput(data.text);
    });
    
    // Handle transcription errors
    this.transcriptionManager.on('error', (error) => {
      console.error('Transcription failed:', error);
    });
  }

  private setupWebSocket() {
    this.wsManager.on('message', async (msg) => {
      if (msg.type === 'audio_input') {
        // Interrupt agent speech when user starts speaking
        if (this.speechManager.isSpeaking) {
          this.speechManager.interruptSpeech('user_spoke');
        }
        
        // Process the audio
        const text = await this.transcriptionManager.processAudioInput(
          msg.data,
          msg.format
        );
        
        if (text) {
          await this.handleUserInput(text);
        }
      }
    });
  }
}

Audio Format Support

The TranscriptionManager supports various audio formats depending on the underlying transcription model:

Common Formats (Whisper)

  • WAV - Uncompressed audio (large file size)
  • MP3 - Compressed, widely supported
  • OGG/Opus - Efficient compression, good for real-time
  • WebM - Modern format, browser-friendly
  • FLAC - Lossless compression
The actual format support depends on the transcription model provider. OpenAI's Whisper supports most common formats.
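When recording in the browser, the format hint can be derived from the recorder's MIME type. A sketch with an assumed mapping (verify it against what your recorder actually produces):

```typescript
// Sketch: derive a format hint from a MediaRecorder mimeType string.
function formatFromMimeType(mimeType: string): string | undefined {
  const container = mimeType.split(";")[0] || ""; // e.g. "audio/webm" from "audio/webm;codecs=opus"
  const subtype = container.split("/")[1] || "";  // e.g. "webm"
  const known = ["wav", "mp3", "ogg", "webm", "flac", "mpeg"];
  if (!known.includes(subtype)) return undefined; // no hint rather than a wrong one
  return subtype === "mpeg" ? "mp3" : subtype;    // normalize audio/mpeg to "mp3"
}
```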

Size Limits and Validation

Default Limit

The default maxAudioInputSize is 25 MB (26,214,400 bytes).

Choosing the Right Limit

When choosing a maxAudioInputSize, consider these factors:
  • API limits: OpenAI Whisper has a 25 MB limit
  • Network latency: Larger files take longer to upload
  • Use case: Short voice messages vs. long recordings
  • Format: Compressed formats (MP3, Opus) are smaller than WAV
Example limits:
// Short voice messages (< 30 seconds)
maxAudioInputSize: 5 * 1024 * 1024 // 5 MB

// Standard voice input (< 2 minutes)
maxAudioInputSize: 10 * 1024 * 1024 // 10 MB

// Extended recordings (< 10 minutes)
maxAudioInputSize: 25 * 1024 * 1024 // 25 MB
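To translate a byte limit into recording time, divide by the expected bitrate. A rough helper (the bitrate figures in the comments are ballpark assumptions):

```typescript
// Sketch: rough maximum recording duration for a given size limit,
// assuming an average bitrate in kilobits per second.
function maxDurationSeconds(maxBytes: number, bitrateKbps: number): number {
  const bytesPerSecond = (bitrateKbps * 1000) / 8;
  return Math.floor(maxBytes / bytesPerSecond);
}

// e.g. a 10 MB limit at ~32 kbps Opus:
// maxDurationSeconds(10 * 1024 * 1024, 32) -> 2621 seconds (~44 minutes)
```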

Size Validation Flow

// 1. Decode base64
const audioBuffer = Buffer.from(base64Audio, 'base64');

// 2. Check size
if (audioBuffer.length > this.maxAudioInputSize) {
  const sizeMB = (audioBuffer.length / (1024 * 1024)).toFixed(1);
  const maxMB = (this.maxAudioInputSize / (1024 * 1024)).toFixed(1);
  this.emit('error', new Error(
    `Audio input too large (${sizeMB} MB). Maximum allowed: ${maxMB} MB`
  ));
  return null;
}

// 3. Check empty
if (audioBuffer.length === 0) {
  this.emit('warning', 'Received empty audio data');
  return null;
}

Error Handling

Configuration Errors

if (!transcriptionModel) {
  throw new Error('Transcription model not configured');
}

Transcription Errors

try {
  const result = await transcribe({ model, audio });
} catch (error) {
  console.error('Whisper transcription failed:', error);
  throw error; // Propagates to caller
}

Empty Result Handling

if (!transcribedText.trim()) {
  emit('warning', 'Transcription returned empty text');
  sendMessage({
    type: 'transcription_error',
    error: 'Whisper returned empty text'
  });
  return null;
}

Performance Considerations

Latency

  • Network upload: Depends on audio size and connection speed
  • API processing: Typically 1-3 seconds for Whisper
  • Total latency: Usually 2-5 seconds for short audio clips
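Measuring actual latency in your deployment is more useful than these estimates. A small timing wrapper (a sketch; wrap any async call with it):

```typescript
// Sketch: wrap an async call and report its wall-clock duration.
async function timed<T>(fn: () => Promise<T>): Promise<{ result: T; ms: number }> {
  const start = Date.now();
  const result = await fn();
  return { result, ms: Date.now() - start };
}

// const { result: text, ms } = await timed(() => transcriptionManager.transcribeAudio(buf));
// console.log(`transcription took ${ms} ms`);
```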

Optimization Tips

  1. Use compressed formats - Opus/MP3 are much smaller than WAV
  2. Keep recordings short - Under 30 seconds for best UX
  3. Show loading indicators - Transcription is not instant
  4. Handle errors gracefully - Provide fallback UI for failures
