# Zipformer & Transducer Models

Zipformer and Transducer models use an encoder-decoder-joiner architecture (also known as RNN-T) and provide the best balance of speed and accuracy for streaming speech recognition.

## Model Architecture
Transducer models consist of three components, plus a token vocabulary file:

- Encoder (`encoder.onnx`) – processes audio features
- Decoder (`decoder.onnx`) – language model component
- Joiner (`joiner.onnx`) – combines encoder and decoder outputs
- Tokens (`tokens.txt`) – token vocabulary
## Variants

### Zipformer (Standard)

Modern transformer-based transducer models:

- Excellent accuracy
- Fast inference
- Streaming capable
- Lower memory usage than LSTM variants
### LSTM Transducer

LSTM-based transducer models:

- Same encoder-decoder-joiner layout
- Good for streaming ASR
- Detected automatically as `transducer` type
- May have a lower memory footprint
## When to Use

### Real-Time Recognition

Live transcription from a microphone with low latency and partial results.

### Voice Assistants

Interactive voice interfaces with fast response times.

### Live Captions

Real-time subtitle generation for videos or meetings.

### Contextual Biasing

Supports hotwords for domain-specific vocabulary (see below).
Supported Languages
Available in many languages including:- English (multiple variants)
- Chinese (Mandarin, Cantonese)
- German, French, Spanish
- Russian, Japanese, Korean
- And many more
## Performance Characteristics
| Aspect | Rating | Notes |
|---|---|---|
| Streaming | ✅ Excellent | Native streaming support with low latency |
| Accuracy | ⭐⭐⭐⭐⭐ | Very high accuracy, especially for trained languages |
| Speed | ⭐⭐⭐⭐⭐ | Fast inference, real-time capable |
| Memory | ⭐⭐⭐⭐ | Moderate memory usage, int8 models available |
| Model Size | Medium | Typically 50-150 MB (varies by language/variant) |
## Download Links

### Zipformer Models

Browse and download pretrained Zipformer models.

### LSTM Transducer Models

Browse and download LSTM transducer models.
## Configuration Example
### Offline Transcription
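A minimal configuration sketch for offline transcription. The option names `modelType`, `numThreads`, and `preferInt8` appear elsewhere in these docs; `modelPath` and the `createSTT()` call are assumptions, so check your release for the exact names.

```typescript
// Sketch only: modelPath and createSTT() are assumed names; modelType,
// numThreads, and preferInt8 are option names used elsewhere in these docs.
const offlineConfig = {
  modelType: 'transducer',           // Zipformer/LSTM transducer models
  modelPath: 'models/zipformer-en',  // folder with encoder/decoder/joiner/tokens
  numThreads: 2,
  preferInt8: true,                  // load *.int8.onnx variants when present
};

// const stt = await createSTT(offlineConfig);
// const text = await stt.transcribeFile('audio.wav');
```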
### Streaming Recognition
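A streaming sketch along the same lines. `createStreamingSTT()` is named in these docs, but its exact signature and the callback API below are assumptions.

```typescript
// Sketch only: the folder name and the partial-result callback are
// illustrative; consult your release for the actual streaming API.
const streamingConfig = {
  modelType: 'transducer',
  modelPath: 'models/zipformer-en-streaming', // assumed folder name
  numThreads: 2,
};

// const stream = await createStreamingSTT(streamingConfig);
// stream.onPartialResult((text: string) => console.log('partial:', text));
```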
## Hotwords Support

Transducer models are the only model type that supports hotwords (contextual biasing) for boosting domain-specific vocabulary. Hotwords are supplied in a text file (`hotwords.txt`), one phrase per line with an optional boost value.
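A sketch of what such a file might look like. The phrases and boost values here are invented for illustration, and the trailing `:boost` syntax is an assumption; check your release for the exact format.

```text
SPEECH RECOGNITION
ZIPFORMER :2.0
ACOUSTIC MODEL :1.5
```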
## Runtime Configuration

You can update recognition parameters at runtime.

## Model Detection

The folder name should contain `zipformer` or `transducer` for auto-detection. LSTM models may contain `lstm` in the folder name.
Expected files:

- `encoder.onnx` (or `encoder.int8.onnx`)
- `decoder.onnx` (or `decoder.int8.onnx`)
- `joiner.onnx` (or `joiner.int8.onnx`)
- `tokens.txt`
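A small sketch that checks a model folder for the files listed above before loading, accepting the int8 variants where present. The function name is invented; this is not part of the library API.

```typescript
import * as fs from 'node:fs';

// Sketch: verify a transducer model folder contains the required files.
// Each component may be present as either the float or the int8 variant.
function hasTransducerFiles(dir: string): boolean {
  const required: string[][] = [
    ['encoder.onnx', 'encoder.int8.onnx'],
    ['decoder.onnx', 'decoder.int8.onnx'],
    ['joiner.onnx', 'joiner.int8.onnx'],
    ['tokens.txt'],
  ];
  // Every component must exist in at least one accepted variant.
  return required.every((alts) =>
    alts.some((f) => fs.existsSync(`${dir}/${f}`)),
  );
}
```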
## Performance Tips

### Use Quantized Models

Int8-quantized models are:
- 3-4x smaller
- 2-3x faster
- Minimal accuracy loss
### Optimize Thread Count

Increase `numThreads` when the device has multiple CPU cores.
### Use Hardware Acceleration

Enable a hardware execution provider such as `nnapi` or `xnnpack` where available.
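The three tips above can be combined in one configuration. The option names `numThreads`, `preferInt8`, and `provider` are used elsewhere in these docs; the specific values are illustrative.

```typescript
// Sketch combining the performance tips; values are illustrative.
const tunedConfig = {
  modelType: 'transducer',
  preferInt8: true,   // prefer *.int8.onnx: ~3-4x smaller, ~2-3x faster
  numThreads: 4,      // match the number of available CPU cores
  provider: 'nnapi',  // Android NNAPI; 'xnnpack' is a CPU alternative
};
```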
## Streaming Support

Streaming: ✅ Yes

Transducer models have native streaming support. Use `createStreamingSTT()` for real-time recognition.

## Common Issues
### Model not loading

- Ensure all three model files are present: `encoder.onnx`, `decoder.onnx`, `joiner.onnx`
- Check that `tokens.txt` exists
- Verify the folder name contains `zipformer` or `transducer` for auto-detection
### Hotwords not working

- Verify `modelType` is `'transducer'` (hotwords only work with transducer models)
- Check the hotwords file format (one phrase per line, optional boost value)
- Use `sttSupportsHotwords(modelType)` to verify compatibility
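The transducer-only rule can be restated as a tiny local helper. This is not the library's `sttSupportsHotwords()`, just an illustration of the documented behavior.

```typescript
// Local illustration of the documented rule: only the 'transducer'
// model type supports hotwords. Not the actual library function.
function supportsHotwordsLocal(modelType: string): boolean {
  return modelType === 'transducer';
}

console.log(supportsHotwordsLocal('transducer')); // true
console.log(supportsHotwordsLocal('whisper'));    // false
```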
### Poor streaming performance

- Increase `numThreads` if the device has multiple cores
- Use `preferInt8: true` to load int8 quantized models
- Enable hardware acceleration with `provider: 'nnapi'` or `provider: 'xnnpack'`
## Next Steps

- **Streaming STT** – Learn about real-time recognition
- **Hotwords** – Boost domain-specific vocabulary
- **Model Setup** – How to download and bundle models
- **Execution Providers** – Hardware acceleration options