Core Features
Speech to Text (STT)
Convert spoken audio into text using on-device Whisper models. Supports both single-pass transcription and streaming modes with real-time results.
- Multilingual support: 96+ languages with automatic language detection
- Streaming transcription: Real-time audio processing with committed and non-committed results
- Word-level timestamps: Detailed timing information when verbose mode is enabled
- Audio format: Requires 16kHz mono audio as Float32Array
Text to Speech (TTS)
Generate natural-sounding speech from text using the Kokoro TTS model. Supports both complete audio generation and streaming playback.
- Multiple voices: English (US and GB) with customizable voice embeddings
- Speed control: Adjustable speech rate for different use cases
- Streaming mode: Start playback before full synthesis completes
- Audio output: Returns 22kHz mono audio as Float32Array
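To persist generated audio, the Float32Array output can be wrapped in a standard WAV container. A minimal sketch, assuming the 22kHz figure corresponds to the common 22,050 Hz rate; the `floatTo16BitWav` helper below is illustrative, not part of the library API:

```typescript
// Wrap mono Float32Array samples in a 16-bit PCM WAV container.
// `sampleRate` defaults to 22050 to match the TTS output described above.
function floatTo16BitWav(samples: Float32Array, sampleRate = 22050): Uint8Array {
  const headerSize = 44;
  const dataSize = samples.length * 2; // 16-bit = 2 bytes per sample
  const buffer = new ArrayBuffer(headerSize + dataSize);
  const view = new DataView(buffer);

  const writeAscii = (offset: number, text: string) => {
    for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i));
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true);   // remaining chunk size
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true);             // fmt chunk size
  view.setUint16(20, 1, true);              // PCM format
  view.setUint16(22, 1, true);              // mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true); // byte rate
  view.setUint16(32, 2, true);              // block align
  view.setUint16(34, 16, true);             // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  // Clamp to [-1.0, 1.0] and convert to signed 16-bit little-endian.
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(headerSize + i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return new Uint8Array(buffer);
}
```

The resulting bytes can be written to a file or fed to any WAV-capable player.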
Voice Activity Detection (VAD)
Detect speech segments in audio streams with precise timestamp boundaries. Essential for building voice-activated features and efficient audio processing.
- Segment detection: Identifies start and end times of speech activity
- Low latency: Fast on-device processing for real-time applications
- Audio format: Requires 16kHz mono audio as Float32Array
- Timestamp precision: Returns segments in seconds
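Segment timestamps can be mapped back to sample indices for downstream processing. A hedged sketch, assuming segments arrive as `{ start, end }` objects in seconds (the exact result shape may vary):

```typescript
interface SpeechSegment {
  start: number; // seconds
  end: number;   // seconds
}

const SAMPLE_RATE = 16_000;

// Extract only the speech portions of a 16kHz mono buffer,
// given VAD segments expressed in seconds.
function extractSpeech(audio: Float32Array, segments: SpeechSegment[]): Float32Array[] {
  return segments.map(({ start, end }) => {
    const from = Math.max(0, Math.floor(start * SAMPLE_RATE));
    const to = Math.min(audio.length, Math.ceil(end * SAMPLE_RATE));
    return audio.subarray(from, to); // zero-copy view into the original buffer
  });
}
```

Each returned chunk can be passed directly to transcription, so silence is never processed.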
Audio Format Requirements
All speech and audio models require audio in a specific format:
- Sample rate: 16kHz (except TTS output, which is 22kHz)
- Channels: Mono (single channel)
- Data type: Float32Array with values normalized between -1.0 and 1.0
- Buffer format: Contiguous samples in time order
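Microphone APIs commonly deliver signed 16-bit integer PCM; a minimal sketch of converting such a buffer into the normalized Float32Array format described above (the `int16ToFloat32` helper is illustrative, not part of the library):

```typescript
// Convert signed 16-bit PCM samples to Float32Array in [-1.0, 1.0].
function int16ToFloat32(pcm: Int16Array): Float32Array {
  const out = new Float32Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) {
    // Divide by 32768 so that -32768 maps exactly to -1.0.
    out[i] = pcm[i] / 32768;
  }
  return out;
}
```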
Common Use Cases
Voice Assistant
Combine VAD, STT, and TTS to build a complete voice assistant that listens, understands, and responds.
Live Transcription
Use streaming STT with VAD to provide real-time captions for meetings, lectures, or media.
Audio Books
Generate natural-sounding narration from text content with speed control.
Voice Commands
Detect when users speak and transcribe commands for hands-free interaction.
Best Practices
Audio Preprocessing
- Always resample audio to 16kHz before processing
- Convert stereo audio to mono by averaging channels
- Normalize audio samples to the range [-1.0, 1.0]
- Remove DC offset and apply appropriate filtering
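The steps above can be sketched with plain Float32Array helpers. Note that the linear-interpolation resampler is a simplification; production code should apply a low-pass (anti-aliasing) filter before downsampling:

```typescript
// Average interleaved stereo samples (L, R, L, R, ...) down to mono.
function stereoToMono(interleaved: Float32Array): Float32Array {
  const mono = new Float32Array(interleaved.length / 2);
  for (let i = 0; i < mono.length; i++) {
    mono[i] = (interleaved[2 * i] + interleaved[2 * i + 1]) / 2;
  }
  return mono;
}

// Subtract the mean so the signal is centered on zero (DC offset removal).
function removeDcOffset(audio: Float32Array): Float32Array {
  let mean = 0;
  for (const s of audio) mean += s;
  mean /= audio.length;
  return audio.map((s) => s - mean);
}

// Naive linear-interpolation resampler (no anti-aliasing filter).
function resampleLinear(audio: Float32Array, fromRate: number, toRate: number): Float32Array {
  const outLength = Math.round((audio.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const ratio = fromRate / toRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * ratio;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, audio.length - 1);
    const frac = pos - i0;
    out[i] = audio[i0] * (1 - frac) + audio[i1] * frac;
  }
  return out;
}
```

A typical pipeline chains these: `resampleLinear(removeDcOffset(stereoToMono(raw)), inputRate, 16000)`.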
Memory Management
- Reuse Float32Array buffers when possible to reduce allocations
- Process audio in chunks for long recordings
- Clean up resources when components unmount
- Monitor download progress for large model files
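Chunked processing and reduced allocation can be combined by iterating zero-copy views over a long recording; a minimal sketch:

```typescript
// Yield fixed-size windows over a long recording as zero-copy views.
// The final chunk may be shorter than `chunkSize`.
function* audioChunks(audio: Float32Array, chunkSize: number): Generator<Float32Array> {
  for (let offset = 0; offset < audio.length; offset += chunkSize) {
    yield audio.subarray(offset, Math.min(offset + chunkSize, audio.length));
  }
}
```

Because `subarray` returns a view into the same underlying buffer, no per-chunk allocation occurs; call `chunk.slice()` explicitly if a chunk must outlive the source buffer.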
Performance Optimization
- Use streaming modes for real-time requirements
- Batch audio processing when latency is not critical
- Leverage VAD to process only speech segments
- Cache models to avoid repeated downloads
Model Downloads
All speech models support automatic downloading from remote sources.
Error Handling
All speech hooks provide error states for handling failures.
Next Steps
Speech to Text
Convert audio to text
Text to Speech
Generate spoken audio
Voice Activity Detection
Detect speech segments