Paraformer Models
Paraformer is a non-autoregressive speech recognition model that offers excellent speed and accuracy for both offline and streaming use cases.
Model Architecture
Paraformer uses a single-model architecture:
- Model (`model.onnx` or `model.int8.onnx`) – Single neural network
- Tokens (`tokens.txt`) – Token vocabulary
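Both files normally live together in a single model folder, for example (the folder name is illustrative):

```
models/paraformer-zh/
├── model.int8.onnx   (or model.onnx)
└── tokens.txt
```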
When to Use
Fast Batch Processing
Excellent for transcribing multiple audio files quickly
Chinese Speech
Outstanding accuracy for Mandarin Chinese
Streaming Recognition
Supports streaming mode for real-time transcription
Resource-Constrained Devices
Single-model architecture uses less memory
Supported Languages
Paraformer models are primarily available for:
- Chinese (Mandarin) – Excellent accuracy, widely used
- English – Some bilingual variants available
- Chinese + English – Bilingual models
Performance Characteristics
| Aspect | Rating | Notes |
|---|---|---|
| Streaming | ✅ Supported | Streaming-capable with good latency |
| Accuracy | ⭐⭐⭐⭐⭐ | Very high accuracy, especially for Chinese |
| Speed | ⭐⭐⭐⭐⭐ | Fast non-autoregressive inference |
| Memory | ⭐⭐⭐⭐⭐ | Low memory usage (single model) |
| Model Size | Small-Medium | Typically 50-200 MB depending on variant |
Download Links
Paraformer Models
Browse and download pretrained Paraformer models
Configuration Example
Offline Transcription
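The original code for this example did not survive extraction; below is a minimal sketch of an offline transcription setup. `modelType: 'paraformer'`, `preferInt8`, and `numThreads` appear elsewhere on this page; `modelDir` and the factory call in the trailing comments are assumed names, not a confirmed API.

```typescript
// Offline transcription options for a Paraformer model (sketch).
// 'modelType', 'preferInt8', and 'numThreads' appear elsewhere on this
// page; 'modelDir' is an assumed option name.
const offlineOptions = {
  modelType: 'paraformer' as const,
  modelDir: 'models/paraformer-zh', // folder with model.onnx + tokens.txt
  preferInt8: true,                 // prefer model.int8.onnx when present
  numThreads: 2,
};

// Hypothetical usage, shown as comments to keep this fragment self-contained:
// const stt = await createOfflineSTT(offlineOptions);
// const { text } = await stt.transcribeFile('audio/meeting.wav');
```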
Transcribe from Samples
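For the sample-based path, audio usually arrives as 16-bit PCM from a recorder and must be converted to normalized floats before decoding. A minimal sketch; the `transcribeSamples` call in the trailing comment is an assumed API, not confirmed by this page.

```typescript
// Convert 16-bit PCM (e.g. from a microphone recorder) to the
// Float32Array in [-1, 1) that sample-based STT APIs typically expect.
function int16ToFloat32(pcm: Int16Array): Float32Array {
  const out = new Float32Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) {
    out[i] = pcm[i] / 32768; // scale each sample to [-1, 1)
  }
  return out;
}

// Hypothetical usage:
// const samples = int16ToFloat32(recordedPcm);
// const { text } = await stt.transcribeSamples(samples, 16000);
```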
Streaming Recognition
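The streaming example code was also lost in extraction; a sketch follows. `createStreamingSTT()` and `modelType: 'paraformer'` are named later on this page; the stream methods (`acceptWaveform`, `getPartialResult`, `finish`), the options shape, and the folder name are assumptions for illustration.

```typescript
// Streaming recognition sketch. The declared signature below is an
// assumption about the binding's API, not a confirmed interface.
declare function createStreamingSTT(opts: {
  modelType: 'paraformer';
  modelDir: string;
  numThreads?: number;
}): Promise<{
  acceptWaveform(samples: Float32Array, sampleRate: number): void;
  getPartialResult(): string;
  finish(): Promise<string>;
}>;

async function liveCaption(chunks: Float32Array[]): Promise<string> {
  const stream = await createStreamingSTT({
    modelType: 'paraformer',
    modelDir: 'models/streaming-paraformer-zh', // assumed folder name
  });
  for (const chunk of chunks) {
    stream.acceptWaveform(chunk, 16000);    // feed audio as it arrives
    console.log(stream.getPartialResult()); // partial hypothesis so far
  }
  return stream.finish();                   // final transcript
}
```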
Model Detection
Paraformer models are detected automatically by the presence of `model.onnx` (or `model.int8.onnx`) and `tokens.txt`. No folder name pattern is required.
Expected files:
- `model.onnx` (or `model.int8.onnx`)
- `tokens.txt`
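The detection rule above can be sketched as a small predicate over a folder's file names (the function name is illustrative):

```typescript
// A folder is treated as a Paraformer model if it contains tokens.txt
// plus model.onnx or model.int8.onnx; no folder name pattern is needed.
function looksLikeParaformer(files: string[]): boolean {
  const hasModel =
    files.includes('model.onnx') || files.includes('model.int8.onnx');
  return hasModel && files.includes('tokens.txt');
}
```

For example, `looksLikeParaformer(['model.int8.onnx', 'tokens.txt'])` is true, while a folder missing `tokens.txt` is rejected.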
Performance Tips
Use Quantized Models
Int8 quantized Paraformer models offer excellent speed at a smaller file size, typically with only a minor accuracy cost.
Optimize for Batch Processing
For transcribing multiple files, create the recognizer once and reuse it across files instead of reloading the model for each one.
Hardware Acceleration
On supported devices, a hardware execution provider (for example NNAPI on Android) can offload inference from the CPU.
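The tips above can be combined in a single options object. A minimal sketch: `modelType: 'paraformer'`, `preferInt8`, `numThreads`, and `provider: 'nnapi'` appear elsewhere on this page, while `modelDir` is an assumed option name.

```typescript
// Sketch: performance-oriented options for a Paraformer recognizer.
// 'modelDir' is an assumed name; the other options appear on this page.
const fastOptions = {
  modelType: 'paraformer' as const,
  modelDir: 'models/paraformer-zh', // assumed folder name
  preferInt8: true,                 // load model.int8.onnx when present
  numThreads: 4,                    // scale to available CPU cores
  provider: 'nnapi' as const,       // Android hardware acceleration
};
```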
Streaming Support
Streaming: ✅ Yes
Paraformer supports streaming recognition. Use `createStreamingSTT()` with `modelType: 'paraformer'` for real-time transcription.
Advantages
- Fast Inference: Non-autoregressive decoding is faster than autoregressive models
- Simple Deployment: Single model file, no separate encoder/decoder/joiner
- Excellent for Chinese: State-of-the-art accuracy for Mandarin
- Low Memory: Single-model architecture uses less RAM
- Streaming Capable: Supports real-time recognition
Limitations
- Language Coverage: Primarily Chinese-focused, fewer English-only variants
- No Hotwords: Does not support contextual biasing (use transducer models for hotwords)
- Domain-Specific: Best suited for general Chinese speech (not specialized domains without fine-tuning)
Use Cases
Chinese Transcription
Transcribing Chinese audio files, podcasts, or videos
Real-Time Subtitles
Live Chinese captions for streaming or conferencing
Voice Input
Chinese voice input for apps and forms
Batch Processing
Transcribing large collections of Chinese audio
Common Issues
Model not loading
- Verify `model.onnx` and `tokens.txt` are present
- Check that the model path is correct
- Ensure sufficient device memory
Poor accuracy on non-Chinese audio
- Paraformer models are optimized for Chinese
- Use Whisper or transducer models for other languages
- Check if you’re using a bilingual (Chinese+English) variant
Slow performance
- Enable `preferInt8: true` to use quantized models
- Increase `numThreads` on multi-core devices
- Use hardware acceleration (`provider: 'nnapi'`)
Comparison with Other Models
| Feature | Paraformer | Transducer | Whisper |
|---|---|---|---|
| Speed | Very Fast | Fast | Medium |
| Chinese Accuracy | Excellent | Good | Good |
| Streaming | Yes | Yes | No |
| Hotwords | No | Yes | No |
| Multilingual | Limited | Varies | Excellent |
| Model Size | Small | Medium | Large |
Next Steps
STT API
Detailed API documentation
Streaming STT
Real-time recognition guide
Model Setup
How to download and bundle models
Execution Providers
Hardware acceleration options