Overview
The offline STT (Speech-to-Text) module provides complete audio file transcription using sherpa-onnx models. Use this when you have complete audio files to transcribe, as opposed to real-time streaming recognition.

Key features:

- Transcribe complete audio files or PCM samples
- Support for multiple model types (Whisper, Paraformer, Transducer, and more)
- Automatic model type detection
- Hotwords support for contextual biasing (transducer models)
- Token timestamps and language detection (model-dependent)
- Runtime configuration updates
Quick Start
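A minimal sketch of the flow this page describes. The SDK's import path is not named on this page, so a placeholder `createSTT` stub is defined inline to keep the example self-contained; `modelDir` is an assumed option name, and the model path is a placeholder.

```typescript
// Sketch of the offline transcription flow. The real createSTT comes from
// the SDK; a placeholder stub is defined here so the example runs standalone.
type SttResult = { text: string };
interface SttEngine {
  transcribeFile(path: string): Promise<SttResult>;
  destroy(): void;
}

// Placeholder stub standing in for the SDK's createSTT factory.
async function createSTT(_options: Record<string, unknown>): Promise<SttEngine> {
  return {
    transcribeFile: async (path) => ({ text: `stub transcript for ${path}` }),
    destroy: () => {},
  };
}

async function transcribeOnce(filePath: string): Promise<string> {
  const engine = await createSTT({
    modelDir: { type: 'auto', path: 'models/my-stt-model' }, // placeholder path
    modelType: 'auto',   // let the SDK detect the model type
    preferInt8: true,    // prefer quantized models when available
  });
  try {
    const { text } = await engine.transcribeFile(filePath);
    return text;
  } finally {
    engine.destroy();    // always release native resources
  }
}
```

Wrapping the call in `try`/`finally` ensures `destroy()` runs even if transcription throws.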
Supported Model Types
The following model types are supported with automatic detection:

| Model Type | Description | Typical Files |
|---|---|---|
| `transducer` | Transducer models | encoder.onnx, decoder.onnx, joiner.onnx, tokens.txt |
| `nemo_transducer` | NeMo Transducer | encoder.onnx, decoder.onnx, joiner.onnx, tokens.txt |
| `paraformer` | Paraformer models | model.onnx, tokens.txt |
| `whisper` | OpenAI Whisper | encoder.onnx, decoder.onnx, tokens.txt |
| `sense_voice` | SenseVoice models | model.onnx, tokens.txt |
| `nemo_ctc` | NeMo CTC models | model.onnx, tokens.txt |
| `zipformer_ctc` | Zipformer CTC | model.onnx, tokens.txt |
| `wenet_ctc` | WeNet CTC | model.onnx, tokens.txt |
| `funasr_nano` | FunASR Nano | encoder_adaptor, llm, embedding, tokenizer |
| `fire_red_asr` | FireRed ASR | encoder, decoder |
| `moonshine` | Moonshine | preprocess.onnx, encode.onnx, decode.onnx, tokens.txt |
| `dolphin` | Dolphin | model.onnx, tokens.txt |
| `canary` | Canary | encoder, decoder |
| `auto` | Auto-detect | Detects based on the files present |
API Reference
createSTT(options)
Creates an STT engine instance for offline transcription. (Source: `src/stt/index.ts`)
- **Model directory** — path configuration. Can be:
  - `{ type: 'asset', path: 'models/...' }` for bundled assets
  - `{ type: 'file', path: '/absolute/path' }` for filesystem paths
  - `{ type: 'auto', path: '...' }` to try asset, then file
- **`modelType`** — model type to use. Set to `'auto'` for automatic detection based on the files present.
- **`preferInt8`** — prefer int8 quantized models (faster, smaller) when available:
  - `true`: prefer int8 models
  - `false`: prefer full precision
  - `undefined`: try int8 first, fall back to full precision (default)
- **`numThreads`** — number of threads for inference.
- **Execution provider** — e.g., `'cpu'`, `'qnn'`, `'nnapi'`, `'xnnpack'`. See Execution Providers for details.
- **`hotwordsFile`** — path to a hotwords file for contextual biasing. Only supported for transducer models (`transducer`, `nemo_transducer`).
- **`hotwordsScore`** — hotwords boost score (only applies when `hotwordsFile` is set).
- **`debug`** — enable debug logging in the native layer.
- **Model-specific options** — only the block for the loaded model type is applied:
  - `whisper`: `{ language, task, tailPaddings, enableTokenTimestamps, enableSegmentTimestamps }`
  - `senseVoice`: `{ language, useItn }`
  - `canary`: `{ srcLang, tgtLang, usePnc }`
  - `funasrNano`: `{ systemPrompt, userPrompt, maxNewTokens, temperature, topP, seed, language, itn, hotwords }`
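Taken together, a full options object might look like the following sketch. Field names follow this page where they appear; `modelDir` and `provider` are assumed key names, and all paths and values are illustrative.

```typescript
// Illustrative options object for createSTT, based on the fields above.
// `modelDir` and `provider` are assumed key names; paths are placeholders.
const options = {
  modelDir: { type: 'asset', path: 'models/zipformer-en' }, // bundled asset
  modelType: 'transducer',
  preferInt8: true,                    // try int8 first
  numThreads: 2,
  provider: 'cpu',                     // execution provider
  hotwordsFile: '/data/hotwords.txt',  // transducer models only
  hotwordsScore: 1.5,
  debug: false,
};
```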
SttEngine: transcribeFile(filePath)
Transcribe a complete audio file. Expected audio input:

- Format: WAV (PCM)
- Sample rate: 16 kHz (recommended, model-dependent)
- Channels: Mono
- Bit depth: 16-bit
Returns an `SttRecognitionResult`.
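The exact shape of the result is not reproduced on this page; based on the features listed above (token timestamps, language detection), a plausible sketch looks like this. All field names here are assumptions, and availability is model-dependent.

```typescript
// Hypothetical sketch of SttRecognitionResult; field names are assumed
// and actual availability is model-dependent.
interface SttRecognitionResult {
  text: string;          // full transcript
  tokens?: string[];     // per-token output (model-dependent)
  timestamps?: number[]; // token start times in seconds (model-dependent)
  lang?: string;         // detected language (model-dependent)
}

const example: SttRecognitionResult = {
  text: 'hello world',
  tokens: ['hello', ' world'],
  timestamps: [0.0, 0.48],
  lang: 'en',
};
```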
SttEngine: transcribeSamples(samples, sampleRate)
Transcribe from raw PCM samples (e.g., from a microphone or decoder).

- `samples`: float PCM samples in the range [-1, 1], mono
- `sampleRate`: sample rate in Hz
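Microphone capture commonly yields 16-bit integer PCM; a small helper (not part of the SDK, shown only as a sketch) can convert it to the float range `transcribeSamples` expects:

```typescript
// Convert 16-bit integer PCM to float samples in [-1, 1].
// This helper is illustrative, not part of the SDK.
function int16ToFloat32(pcm: Int16Array): Float32Array {
  const out = new Float32Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) {
    out[i] = pcm[i] / 32768; // 2^15 maps [-32768, 32767] into [-1, 1)
  }
  return out;
}

// Usage with an engine instance:
// const samples = int16ToFloat32(recordedPcm);
// const result = await engine.transcribeSamples(samples, 16000);
```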
SttEngine: setConfig(config)
Update the recognizer configuration at runtime.

SttEngine: destroy()
Release native resources. Must be called when the engine is no longer needed.

Model-Specific Options
Whisper Models
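A sketch of Whisper-specific options, using the field names listed under `createSTT`; the values are illustrative.

```typescript
// Illustrative Whisper options; values are examples only.
const whisperConfig = {
  modelType: 'whisper',
  modelOptions: {
    whisper: {
      language: 'en',              // force a language, or omit for auto-detect
      task: 'transcribe',          // or 'translate' (Whisper translates to English)
      enableTokenTimestamps: true, // per-token timing, if the model supports it
    },
  },
};
```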
SenseVoice Models
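A sketch of SenseVoice-specific options, using the field names listed under `createSTT`; the `'auto'` language value is an assumption.

```typescript
// Illustrative SenseVoice options; values are examples only.
const senseVoiceConfig = {
  modelType: 'sense_voice',
  modelOptions: {
    senseVoice: {
      language: 'auto', // assumed value for auto language detection
      useItn: true,     // inverse text normalization (numbers, punctuation)
    },
  },
};
```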
Canary Models
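A sketch of Canary-specific options, using the field names listed under `createSTT`; the language codes are illustrative.

```typescript
// Illustrative Canary options; values are examples only.
const canaryConfig = {
  modelType: 'canary',
  modelOptions: {
    canary: {
      srcLang: 'en', // source audio language
      tgtLang: 'en', // output language
      usePnc: true,  // punctuation and capitalization
    },
  },
};
```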
Hotwords (Contextual Biasing)
Hotwords allow you to boost recognition of specific words or phrases.

Hotwords are only supported for transducer models (`transducer`, `nemo_transducer`). Check support before showing hotwords UI.

Hotwords File Format
Create a text file with one phrase per line, optionally with a boost factor.

Using Hotwords
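As an illustration, a hotwords file with a per-phrase boost might look like the following; the trailing `:score` syntax follows the common sherpa-onnx convention and should be treated as an assumption here.

```text
HELLO WORLD
SHERPA ONNX :2.0
```

Pass the file path via `hotwordsFile` and set a global `hotwordsScore` when creating the engine; phrases without an explicit per-line boost typically fall back to `hotwordsScore`.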
Model Detection
Detect the model type without initializing a full engine.

Performance Optimization
Quantization
Int8 quantized models are faster and use less memory.

Threading
Increase threads for faster processing on multi-core devices.

Hardware Acceleration
Use hardware acceleration when available. See Execution Providers for detailed information on hardware acceleration.
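The three tuning knobs above can be combined in the engine options; a sketch follows, where `provider` is an assumed key name and the values are illustrative.

```typescript
// Performance-tuning sketch: quantization + threads + execution provider.
// `provider` is an assumed key name; values are illustrative.
const perfOptions = {
  preferInt8: true,    // quantized weights: faster, smaller
  numThreads: 4,       // more threads on multi-core devices
  provider: 'xnnpack', // optimized execution provider, if available
};
```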
Common Use Cases
Transcribe with Language Detection
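Assuming the recognition result exposes a detected-language field (`lang` here, which is an assumed name), the pattern would be:

```typescript
// Hypothetical: format a result whose assumed `lang` field carries the
// detected language (model-dependent; see the features list above).
function describeResult(result: { text: string; lang?: string }): string {
  const lang = result.lang ?? 'unknown';
  return `[${lang}] ${result.text}`;
}
```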
Batch Processing Multiple Files
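An offline engine can be reused across files; below is a sketch of sequential batch processing. The engine interface is stubbed locally so the example is self-contained, and `destroy()` is called once the whole batch finishes.

```typescript
// Minimal engine interface matching the methods documented above.
interface SttEngine {
  transcribeFile(path: string): Promise<{ text: string }>;
  destroy(): void;
}

// Transcribe files one at a time, reusing a single engine, and always
// release native resources when the batch finishes.
async function transcribeBatch(engine: SttEngine, files: string[]): Promise<string[]> {
  const texts: string[] = [];
  try {
    for (const file of files) {
      const { text } = await engine.transcribeFile(file);
      texts.push(text);
    }
  } finally {
    engine.destroy();
  }
  return texts;
}
```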
Troubleshooting
Error: STT initialization failed
- Verify the model directory exists and contains all required files
- Check that model files match the expected structure for the model type
- Try `modelType: 'auto'` to let the SDK detect the type
- Enable `debug: true` to see detailed initialization logs
Poor transcription quality
- Ensure audio is 16 kHz mono WAV (most models)
- Check audio quality and noise levels
- Try a larger/better model
- Use `preferInt8: false` for full precision
Hotwords not working
- Verify the model type supports hotwords (`transducer`, `nemo_transducer` only)
- Check `modelingUnit` and `bpeVocab` are set correctly for BPE models
- Ensure the hotwords file format is correct (one phrase per line)
- Increase `hotwordsScore` for stronger boosting
Out of memory errors
- Use `preferInt8: true` for smaller models
- Reduce `numThreads`
- Process shorter audio segments
- Close other apps to free memory
Next Steps
Streaming STT
Real-time recognition with partial results
Model Setup
Learn how to bundle and load models
Execution Providers
Hardware acceleration (QNN, NNAPI, XNNPACK)
Text-to-Speech
Convert text to speech