Overview
The STT module provides offline speech recognition. Create an engine with createSTT, then transcribe audio from files or from float samples. Both methods return a detailed result with text, tokens, timestamps, detected language, emotion, and event labels (field availability is model-dependent).
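In outline, usage looks like the sketch below. Only createSTT is named by this module; the option fields and the transcribeFile method name are assumptions for illustration, so check the SDK's type definitions for the real signatures.

```typescript
// Sketch of the create-then-transcribe flow. The engine interface is a
// stand-in that mirrors the flow described above, not the SDK's real types.
interface STTEngine {
  transcribeFile(path: string): Promise<{ text: string; lang: string }>;
}
type CreateSTT = (opts: { modelDir: string; modelType: string }) => Promise<STTEngine>;

async function demo(createSTT: CreateSTT): Promise<string> {
  // 'auto' model-type detection is described later on this page.
  const stt = await createSTT({ modelDir: "models/whisper", modelType: "auto" });
  const result = await stt.transcribeFile("hello.wav");
  return `${result.text} [${result.lang}]`;
}
```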
Quick Start
Transcribe from File
Transcribe a WAV file (16 kHz mono recommended).

Result Fields
| Field | Type | Description |
|---|---|---|
| text | string | Transcribed text |
| tokens | string[] | Token strings |
| timestamps | number[] | Per-token timestamps (model-dependent) |
| lang | string | Detected or specified language |
| emotion | string | Emotion label (e.g. SenseVoice) |
| event | string | Event label (model-dependent) |
| durations | number[] | Token durations (TDT models only) |
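The table can be summarized as a TypeScript interface. The interface name is ours, and marking the model-dependent fields optional is our reading of the table, not an official type.

```typescript
// Result shape per the field table above; optional fields are those the
// table marks as model-dependent.
interface STTResult {
  text: string;
  tokens: string[];
  timestamps?: number[];
  lang: string;
  emotion?: string;
  event?: string;
  durations?: number[]; // TDT models only
}

const sample: STTResult = {
  text: "hello world",
  tokens: ["hello", "world"],
  lang: "en",
};
```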
Transcribe from Samples
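The engine expects mono float PCM in [-1, 1]. The sketch below builds a synthetic one-second buffer; the commented-out transcribe call uses an assumed method name.

```typescript
// One second of mono float PCM at 16 kHz (a 440 Hz sine at half amplitude).
const sampleRate = 16000;
const samples = new Float32Array(sampleRate);
for (let i = 0; i < samples.length; i++) {
  samples[i] = 0.5 * Math.sin((2 * Math.PI * 440 * i) / sampleRate);
}
// const result = await stt.transcribeSamples(samples, sampleRate); // assumed name
```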
Transcribe from float PCM samples (mono, in the range [-1, 1]).

Supported Model Types
The SDK supports multiple STT model architectures:

| Model Type | Description | Files Required |
|---|---|---|
| transducer | Zipformer transducer | encoder.onnx, decoder.onnx, joiner.onnx, tokens.txt |
| nemo_transducer | NVIDIA NeMo transducer | encoder.onnx, decoder.onnx, joiner.onnx, tokens.txt |
| paraformer | Alibaba Paraformer | model.onnx, tokens.txt |
| whisper | OpenAI Whisper | encoder.onnx, decoder.onnx, tokens.txt |
| sense_voice | SenseVoice multilingual | model.onnx, tokens.txt |
| nemo_ctc | NVIDIA NeMo CTC | model.onnx, tokens.txt |
| wenet_ctc | WeNet CTC | model.onnx, tokens.txt |
| funasr_nano | FunASR Nano | encoder_adaptor, llm, embedding, tokenizer |
| moonshine | Moonshine | preprocess.onnx, encode.onnx, decode.onnx, tokens.txt |
| dolphin | Dolphin | model.onnx, tokens.txt |
| canary | Canary multilingual | encoder, decoder |
Set modelType: 'auto' for automatic detection based on the directory structure.
Model-Specific Options
Configure model-specific options via the modelOptions parameter:
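The exact option fields per model are not listed in this section; the shapes below are plausible examples only. Field names such as language, task, and useItn are assumptions, so check the SDK's type definitions.

```typescript
// Illustrative modelOptions shapes; field names are assumptions, not
// confirmed by this page.
const whisperOptions = { language: "en", task: "transcribe" };
const senseVoiceOptions = { language: "auto", useItn: true };
```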
Whisper
Call getWhisperLanguages() to get the full list of supported language objects ({ id, name }).
SenseVoice
Canary
FunASR Nano
Hotwords (Contextual Biasing)
Boost recognition of specific words or phrases. Only supported for transducer models (transducer, nemo_transducer).
Hotwords File Format
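A small parser sketch for this file format. The ':' boost separator and the default boost of 1.0 are assumptions; the doc only states one phrase per line with an optional boost score.

```typescript
// Parse a hotwords file: one phrase per line, optional ":boost" suffix.
function parseHotwords(text: string): { phrase: string; boost: number }[] {
  return text
    .split("\n")
    .map((l) => l.trim())
    .filter((l) => l.length > 0)
    .map((line) => {
      const m = line.match(/^(.*?)\s*:\s*([\d.]+)$/);
      return m
        ? { phrase: m[1], boost: Number(m[2]) }
        : { phrase: line, boost: 1.0 }; // assumed default boost
    });
}
```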
One phrase per line, with an optional boost score.

Runtime Config Updates
Update hotwords and decoding parameters without reloading the model.

Advanced Configuration
Threading and Performance
Execution Providers
Accelerate inference with hardware backends.

Inverse Text Normalization (ITN)
Convert spoken forms to written forms (e.g., “twenty twenty four” → “2024”).

Best Practices
Audio Format
- Sample rate: Most models expect 16 kHz; some support 8/16/48 kHz
- Channels: Mono (single channel)
- Format: 16-bit PCM WAV
- Pre-process: Use convertAudioToWav16k to ensure the correct format
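convertAudioToWav16k covers file conversion; when raw 16-bit PCM samples are already in hand, normalization to float [-1, 1] is simple. The helper name below is ours.

```typescript
// Normalize 16-bit signed PCM to float samples in [-1, 1].
function pcm16ToFloat(pcm: Int16Array): Float32Array {
  const out = new Float32Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) out[i] = pcm[i] / 32768;
  return out;
}
```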
Long Audio Files
For very long recordings, consider:

- Splitting into smaller chunks to reduce memory usage
- Using streaming STT for real-time processing
- Processing in background to avoid blocking UI
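The splitting step can be sketched as below; the chunk length (and whether to overlap chunks) is an application-level choice, not an SDK requirement.

```typescript
// Split long audio into fixed-length chunks for sequential transcription.
function chunkSamples(
  samples: Float32Array,
  sampleRate: number,
  chunkSeconds: number,
): Float32Array[] {
  const size = Math.floor(sampleRate * chunkSeconds);
  const chunks: Float32Array[] = [];
  for (let i = 0; i < samples.length; i += size) {
    chunks.push(samples.subarray(i, Math.min(i + size, samples.length)));
  }
  return chunks;
}
```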
Memory Management
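Releasing the engine when finished avoids leaking native resources. The pattern below assumes a release() method (substitute the SDK's actual cleanup call) and is shown synchronously for brevity; wrap awaited calls the same way.

```typescript
// Guarantee cleanup even when use() throws; release() is an assumed name.
function withEngine<T>(
  create: () => { release: () => void },
  use: (engine: { release: () => void }) => T,
): T {
  const engine = create();
  try {
    return use(engine);
  } finally {
    engine.release();
  }
}
```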
Error Handling
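Transcription can fail (missing model files, unsupported audio). A generic guard pattern follows; the SDK's actual error types are not documented on this page, so the sketch treats errors opaquely and is shown synchronously — wrap awaited calls in the same try/catch.

```typescript
// Return null instead of throwing; callers decide how to surface the error.
function safeTranscribe(run: () => { text: string }): string | null {
  try {
    return run().text;
  } catch (err) {
    console.error("STT failed:", err);
    return null;
  }
}
```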
Model Discovery
List available bundled models.

Next Steps
Streaming STT
Real-time speech recognition with live transcription
Model Setup
Download and configure STT models