Overview
The TranscriptionManager class manages speech-to-text (STT) transcription for voice agents. It handles incoming audio data, validates it, transcribes it using AI SDK transcription models (e.g., Whisper), and sends results to clients.
Key responsibilities:
- Transcribe audio data to text using AI SDK transcription models
- Validate incoming audio size against configured limits
- Decode base64-encoded audio from WebSocket messages
- Send transcription results back to clients
- Emit transcription events for agent processing
src/core/TranscriptionManager.ts:17
Constructor
Accepts an optional configuration object with:
- An AI SDK transcription model instance (e.g., openai.transcription('whisper-1'))
- maxAudioInputSize - Maximum audio size in bytes (default: 25 MB). Audio larger than this will be rejected.
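As a sketch of how these options might be resolved, with a stubbed model type instead of the real AI SDK (the option names other than maxAudioInputSize are assumptions, not confirmed by the source):

```typescript
// Hypothetical config shape -- field names are assumptions based on the options above.
interface TranscriptionManagerConfig {
  transcriptionModel?: unknown; // AI SDK transcription model instance
  maxAudioInputSize?: number;   // bytes; defaults to 25 MB
}

const DEFAULT_MAX_AUDIO_INPUT_SIZE = 25 * 1024 * 1024; // 26,214,400 bytes

function resolveConfig(config: TranscriptionManagerConfig = {}) {
  return {
    transcriptionModel: config.transcriptionModel,
    maxAudioInputSize: config.maxAudioInputSize ?? DEFAULT_MAX_AUDIO_INPUT_SIZE,
  };
}

console.log(resolveConfig().maxAudioInputSize); // → 26214400
```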
Properties
hasTranscriptionModel
Returns true if a transcription model is configured. If false, transcription requests will fail.
sendMessage
Callback function to send messages over the WebSocket. Must be set by the parent agent.
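Wiring the callback might look like this minimal sketch (the message shape is an assumption):

```typescript
type SendMessage = (message: string) => void;

// Collect outgoing messages in an array instead of a real WebSocket, for illustration.
const sent: string[] = [];
const sendMessage: SendMessage = (msg) => sent.push(msg);

// The manager would invoke the callback like this once a result is ready.
sendMessage(JSON.stringify({ type: "transcription_result", text: "hello" }));
console.log(sent.length); // → 1
```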
Methods
transcribeAudio()
Transcribe audio data to text using the configured transcription model.
Parameters:
- Raw audio data in a format supported by the transcription model (e.g., WAV, MP3, OGG)
Emits:
- transcription - When transcription succeeds
Sends:
- transcription_result - Contains the transcribed text and detected language
Notes:
- Uses the AI SDK's experimental_transcribe function
- Logs audio size and transcription result
- Returns the detected language in the transcription event
- Automatically sends the result to the client via WebSocket
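The behavior above can be sketched as the following flow, with a stub standing in for the AI SDK's experimental_transcribe call; the event and message shapes are assumptions:

```typescript
import { EventEmitter } from "node:events";

// Stub standing in for the AI SDK's experimental_transcribe call.
async function fakeTranscribe(audio: Buffer): Promise<{ text: string; language?: string }> {
  return { text: `transcribed ${audio.length} bytes`, language: "en" };
}

async function transcribeAudio(
  emitter: EventEmitter,
  send: (msg: string) => void,
  audio: Buffer
) {
  console.log(`audio size: ${audio.length} bytes`);   // logs audio size
  const result = await fakeTranscribe(audio);          // real code calls the model here
  emitter.emit("transcription", result);               // includes the detected language
  send(JSON.stringify({ type: "transcription_result", ...result })); // auto-send to client
  return result;
}

const emitter = new EventEmitter();
const outbox: string[] = [];
emitter.on("transcription", (e) => console.log("event:", e.text));
transcribeAudio(emitter, (m) => outbox.push(m), Buffer.from("audio"));
```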
processAudioInput()
Process incoming base64-encoded audio: validate, decode, and transcribe.
Parameters:
- Base64-encoded audio data from the client
- Optional audio format hint (e.g., 'wav', 'mp3', 'ogg', 'webm')
Returns null if validation fails or the transcription is empty.
Processing steps:
- Check model configured - Returns null if no transcription model is set
- Decode base64 - Converts the input to a Buffer
- Validate size - Rejects audio that exceeds maxAudioInputSize
- Check empty - Returns null if the buffer is empty
- Transcribe - Calls transcribeAudio()
- Validate result - Returns null if the transcription is empty
Emits:
- audio_received - When audio is received and validated
- transcription - When transcription succeeds
- warning - For empty audio or empty transcription results
- error - For size limits or transcription failures
Sends:
- transcription_result - On success
- transcription_error - On failure
- error - For configuration errors
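The steps above can be sketched as follows; the transcription call is stubbed and the names are illustrative, not the actual implementation:

```typescript
const MAX_AUDIO_INPUT_SIZE = 25 * 1024 * 1024; // default limit

// Stub for the real transcribeAudio() call.
async function transcribe(audio: Buffer): Promise<string> {
  return audio.length > 0 ? "hello world" : "";
}

async function processAudioInput(base64Audio: string, hasModel = true): Promise<string | null> {
  if (!hasModel) return null;                            // 1. check model configured
  const buffer = Buffer.from(base64Audio, "base64");     // 2. decode base64
  if (buffer.length > MAX_AUDIO_INPUT_SIZE) return null; // 3. validate size
  if (buffer.length === 0) return null;                  // 4. check empty
  const text = await transcribe(buffer);                 // 5. transcribe
  return text.trim().length > 0 ? text : null;           // 6. validate result
}
```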
Events
The TranscriptionManager class extends EventEmitter and emits the following events:
transcription
Emitted when audio is successfully transcribed. The event payload includes:
- The transcribed text
- The detected language code (e.g., 'en', 'es', 'fr'), if available
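Subscribing to this event might look like the following sketch, with a plain EventEmitter standing in for the manager; the payload shape is inferred from the description above:

```typescript
import { EventEmitter } from "node:events";

// Stand-in for a TranscriptionManager instance, which extends EventEmitter.
const manager = new EventEmitter();

manager.on("transcription", (event: { text: string; language?: string }) => {
  console.log(`heard: ${event.text} (${event.language ?? "unknown"})`);
});

manager.emit("transcription", { text: "hello world", language: "en" });
```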
audio_received
Emitted when audio data is received and validated. The event payload includes:
- Audio data size in bytes
- Audio format, if provided
warning
Emitted for non-critical issues (empty audio, empty transcription).
error
Emitted when transcription fails or validation errors occur.
WebSocket Message Protocol
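The exact payloads are not reproduced here; a plausible sketch of both message shapes, in which every type and field name is an assumption, is:

```typescript
// Hypothetical message shapes -- field names are assumptions, not confirmed by the source.
interface AudioInputMessage {
  type: "audio_input";
  audio: string;    // base64-encoded audio data
  format?: string;  // optional format hint, e.g. "webm"
}

interface TranscriptionResultMessage {
  type: "transcription_result";
  text: string;
  language?: string; // detected language code, if available
}

// Example round-trip over a JSON wire format.
const incoming: AudioInputMessage = {
  type: "audio_input",
  audio: Buffer.from("fake audio bytes").toString("base64"),
  format: "wav",
};
const wire = JSON.stringify(incoming);
const parsed = JSON.parse(wire) as AudioInputMessage;
console.log(parsed.type, parsed.format); // → audio_input wav
```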
Incoming Messages
Audio input from the client (base64-encoded audio with an optional format hint).
Outgoing Messages
Transcription result sent back to the client (transcribed text and detected language).
Usage in Agent Architecture
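One plausible wiring of the manager inside an agent, sketched entirely with stubs; only the documented names sendMessage, processAudioInput, and the transcription event come from this page:

```typescript
import { EventEmitter } from "node:events";

// Minimal stand-in for the real TranscriptionManager, for illustration only.
class FakeTranscriptionManager extends EventEmitter {
  sendMessage?: (msg: string) => void;

  async processAudioInput(base64Audio: string): Promise<string | null> {
    // Stubbed "transcription": just decode the base64 payload as text.
    const text = Buffer.from(base64Audio, "base64").toString("utf8");
    if (!text) return null;
    this.emit("transcription", { text });
    this.sendMessage?.(JSON.stringify({ type: "transcription_result", text }));
    return text;
  }
}

// The parent agent sets sendMessage and reacts to transcription events.
const manager = new FakeTranscriptionManager();
const outbox: string[] = [];
manager.sendMessage = (msg) => outbox.push(msg); // wired to a WebSocketManager in practice
manager.on("transcription", ({ text }) => console.log("agent sees:", text));

manager.processAudioInput(Buffer.from("hi there").toString("base64"));
```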
Audio Format Support
The TranscriptionManager supports various audio formats, depending on the underlying transcription model.
Common Formats (Whisper)
- WAV - Uncompressed audio (large file size)
- MP3 - Compressed, widely supported
- OGG/Opus - Efficient compression, good for real-time
- WebM - Modern format, browser-friendly
- FLAC - Lossless compression
The actual format support depends on the transcription model provider. OpenAI’s Whisper supports most common formats.
Size Limits and Validation
Default Limit
The default maxAudioInputSize is 25 MB (26,214,400 bytes).
Choosing the Right Limit
Consider these factors:
- API limits: OpenAI Whisper has a 25 MB limit
- Network latency: Larger files take longer to upload
- Use case: Short voice messages vs. long recordings
- Format: Compressed formats (MP3, Opus) are smaller than WAV
Size Validation Flow
Error Handling
Configuration Errors
Transcription Errors
Empty Result Handling
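How the three cases above might be distinguished, as a sketch; the event names follow the Events section, while the control flow itself is an assumption:

```typescript
import { EventEmitter } from "node:events";

// Stub that can simulate success, failure, or an empty result.
async function fakeTranscribe(mode: "ok" | "fail" | "empty"): Promise<string> {
  if (mode === "fail") throw new Error("provider error");
  return mode === "empty" ? "" : "hello";
}

async function handleTranscription(
  emitter: EventEmitter,
  hasModel: boolean,
  mode: "ok" | "fail" | "empty"
): Promise<string | null> {
  if (!hasModel) {
    emitter.emit("error", new Error("no transcription model configured")); // configuration error
    return null;
  }
  try {
    const text = await fakeTranscribe(mode);
    if (text.trim() === "") {
      emitter.emit("warning", "empty transcription result"); // non-critical
      return null;
    }
    return text;
  } catch (err) {
    emitter.emit("error", err); // transcription error
    return null;
  }
}
```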
Performance Considerations
Latency
- Network upload: Depends on audio size and connection speed
- API processing: Typically 1-3 seconds for Whisper
- Total latency: Usually 2-5 seconds for short audio clips
Optimization Tips
- Use compressed formats - Opus/MP3 are much smaller than WAV
- Keep recordings short - Under 30 seconds for best UX
- Show loading indicators - Transcription is not instant
- Handle errors gracefully - Provide fallback UI for failures
Related
- WebSocketManager - Receives audio data from clients
- SpeechManager - Works with TranscriptionManager for barge-in
- VoiceAgent - Orchestrates transcription with conversation flow
- AI SDK Transcription - Underlying transcription API