Speech-to-Text Models Overview
react-native-sherpa-onnx supports a wide range of speech-to-text (STT) model architectures, from fast streaming transducers to multilingual models like Whisper. This guide helps you choose the right model for your use case.
Model Comparison
Zipformer/Transducer
Fast streaming recognition with excellent accuracy. Best for real-time use cases.
Paraformer
Non-autoregressive ASR. Fast batch processing with high accuracy.
Whisper
Multilingual, robust zero-shot recognition. Strong for diverse audio conditions.
NeMo CTC
Excellent for English and streaming. Good balance of speed and accuracy.
Other Models
WeNet, SenseVoice, FunASR, Moonshine, and specialized models.
Quick Comparison Table
| Model Type | Streaming | Multilingual | Speed | Use Case |
|---|---|---|---|---|
| Zipformer/Transducer | ✅ Yes | Depends | Fast | Real-time recognition, voice assistants |
| LSTM Transducer | ✅ Yes | Depends | Fast | Streaming ASR, mobile apps |
| Paraformer | ✅ Yes | Limited | Very Fast | Fast batch transcription |
| Whisper | ❌ No | ✅ Yes (90+ langs) | Medium | Multilingual transcription, diverse audio |
| NeMo CTC | ✅ Yes | Limited | Fast | English streaming, live captions |
| WeNet CTC | ❌ No | Limited | Fast | Compact deployment |
| SenseVoice | ❌ No | ✅ Yes | Medium | Emotion detection, punctuation |
| FunASR Nano | ❌ No | Limited | Medium | LLM-based ASR with prompts |
| Moonshine | ✅ Yes (v1 & v2) | Limited | Fast | Streaming-capable lightweight ASR |
| Fire Red ASR | ❌ No | Limited | Medium | Encoder-decoder ASR |
| Dolphin | ❌ No | Limited | Fast | Single-model CTC |
| Canary | ❌ No | ✅ Yes | Medium | Multilingual NeMo model |
| Omnilingual | ❌ No | ✅ Yes | Medium | Wide language coverage |
| Tone CTC | ✅ Yes | Limited | Very Fast | Lightweight streaming CTC |
Choosing a Model
For Real-Time Recognition (Streaming)
If you need live recognition from a microphone, choose one of these streaming-capable models:
- Zipformer/Transducer – Best overall for streaming, excellent accuracy
- NeMo CTC – Great for English streaming applications
- Tone CTC – Lightweight option for resource-constrained devices
- LSTM Transducer – LSTM-based streaming alternative
- Paraformer – Fast streaming with non-autoregressive approach
- Moonshine – Modern streaming-capable architecture
For Batch/Offline Transcription
If you're transcribing pre-recorded audio files:
- Whisper – Best for multilingual content, robust to noise
- Paraformer – Fastest for single-language batch processing
- SenseVoice – When you need emotion labels and punctuation
- Canary – Multilingual with good accuracy
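The streaming and batch lists above amount to filtering the Quick Comparison Table by capability. A minimal TypeScript sketch of that lookup follows. The rows copy the table's Streaming and Multilingual columns. `ModelRow`, `Needs`, and `findModels` are hypothetical names used for illustration and are not part of the react-native-sherpa-onnx API.

```typescript
// Values from the Multilingual column: "Yes", "Limited", or "Depends".
type Multilingual = 'yes' | 'limited' | 'depends';

// Illustrative row shape; not an SDK type.
interface ModelRow {
  name: string;
  streaming: boolean;
  multilingual: Multilingual;
}

// One entry per row of the Quick Comparison Table.
const MODEL_ROWS: ModelRow[] = [
  { name: 'Zipformer/Transducer', streaming: true, multilingual: 'depends' },
  { name: 'LSTM Transducer', streaming: true, multilingual: 'depends' },
  { name: 'Paraformer', streaming: true, multilingual: 'limited' },
  { name: 'Whisper', streaming: false, multilingual: 'yes' },
  { name: 'NeMo CTC', streaming: true, multilingual: 'limited' },
  { name: 'WeNet CTC', streaming: false, multilingual: 'limited' },
  { name: 'SenseVoice', streaming: false, multilingual: 'yes' },
  { name: 'FunASR Nano', streaming: false, multilingual: 'limited' },
  { name: 'Moonshine', streaming: true, multilingual: 'limited' },
  { name: 'Fire Red ASR', streaming: false, multilingual: 'limited' },
  { name: 'Dolphin', streaming: false, multilingual: 'limited' },
  { name: 'Canary', streaming: false, multilingual: 'yes' },
  { name: 'Omnilingual', streaming: false, multilingual: 'yes' },
  { name: 'Tone CTC', streaming: true, multilingual: 'limited' },
];

// Requirements to filter by; an omitted field means "no constraint".
interface Needs {
  streaming?: boolean;
  multilingual?: boolean;
}

// Return the names of all rows that satisfy every requested capability.
function findModels(needs: Needs): string[] {
  return MODEL_ROWS
    .filter((m) => !needs.streaming || m.streaming)
    .filter((m) => !needs.multilingual || m.multilingual === 'yes')
    .map((m) => m.name);
}

const liveCandidates = findModels({ streaming: true });
const multilingualCandidates = findModels({ multilingual: true });
```

Rows marked "Depends" (Zipformer/Transducer, LSTM Transducer) are treated here as not multilingual, since language coverage varies with the specific checkpoint you download.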
By Language Support
English Only:
- NeMo CTC (streaming)
- Tone CTC (streaming)
- Many Zipformer variants
Multilingual:
- Whisper (offline)
- Canary (offline)
- Omnilingual (offline)
- SenseVoice (5 languages + emotion)
Chinese:
- Paraformer (excellent for Mandarin)
- FunASR Nano (LLM-based with prompts)
- SenseVoice (Chinese + emotion)
By Device Constraints
Low-end devices / limited RAM:
- Tone CTC (lightweight streaming)
- Dolphin (compact single-model)
- WeNet CTC (compact deployment)
- Use `int8` quantized variants when available
High-end devices:
- Whisper (large models)
- Canary (multilingual)
- Full Zipformer models
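One way to apply the device guidance above in app code is to map a rough device tier to a shortlist of model families. The sketch below is an assumption-laden illustration. The 4096 MB cutoff, `tierForRam`, and `suggestForTier` are hypothetical and do not come from the SDK. Only the family names and the `preferInt8` option name are taken from this guide.

```typescript
type DeviceTier = 'low' | 'high';

// Illustrative result shape; not an SDK type.
interface Suggestion {
  families: string[];
  preferInt8: boolean;
}

// Hypothetical cutoff; tune it for the devices you actually target.
const LOW_RAM_THRESHOLD_MB = 4096;

// Classify a device by its total RAM in megabytes.
function tierForRam(totalRamMb: number): DeviceTier {
  return totalRamMb < LOW_RAM_THRESHOLD_MB ? 'low' : 'high';
}

// Shortlist the model families this guide names for each kind of device.
function suggestForTier(tier: DeviceTier): Suggestion {
  if (tier === 'low') {
    // Lightweight models, with quantized variants preferred.
    return { families: ['Tone CTC', 'Dolphin', 'WeNet CTC'], preferInt8: true };
  }
  // Larger models; leaving int8 off here is an assumption, not a guide rule.
  return {
    families: ['Whisper', 'Canary', 'Zipformer/Transducer'],
    preferInt8: false,
  };
}

const suggestion = suggestForTier(tierForRam(3072));
```

How you obtain the device's total RAM is left out on purpose, since it depends on the native module or library your app already uses.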
Model Detection
The SDK automatically detects model types based on folder name patterns and file layouts. You can also force a specific type.
Performance Tips
Use Quantized Models
Set `preferInt8: true` to automatically use int8 quantized models when available.
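To make the "when available" fallback concrete, here is a rough sketch of how a preference for quantized files could be resolved within a model folder. It is not the SDK's actual resolution logic. `pickOnnxFile` is a hypothetical helper, and the `.int8.onnx` suffix is assumed from the way many sherpa-onnx pretrained models name their quantized files.

```typescript
// Pick the .onnx file to load for one component (for example "encoder").
// Assumes quantized variants end in ".int8.onnx" and that file names start
// with the component name. Returns undefined when nothing matches.
function pickOnnxFile(
  files: string[],
  component: string,
  preferInt8: boolean,
): string | undefined {
  const candidates = files.filter(
    (f) => f.indexOf(component) === 0 && /\.onnx$/.test(f),
  );
  const int8 = candidates.filter((f) => /\.int8\.onnx$/.test(f));
  const full = candidates.filter((f) => !/\.int8\.onnx$/.test(f));
  // Put the preferred variant first and keep the other one as a fallback.
  const ordered = preferInt8 ? int8.concat(full) : full.concat(int8);
  return ordered[0];
}

// Example folder listing; the file names are made up for illustration.
const exampleFiles = [
  'encoder.onnx',
  'encoder.int8.onnx',
  'decoder.onnx',
  'tokens.txt',
];

const encoderFile = pickOnnxFile(exampleFiles, 'encoder', true);
const decoderFile = pickOnnxFile(exampleFiles, 'decoder', true);
```

In this example the encoder resolves to the quantized file, while the decoder falls back to the full-precision file because no int8 variant is present.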
Adjust Thread Count
Increase the thread count on multi-core devices.
Use Execution Providers
Leverage hardware acceleration where the device supports it.
Download Links
All model downloads are available from the sherpa-onnx pretrained models repository.
Next Steps
Model Setup Guide
Learn how to download and bundle models with your app
STT API Reference
Detailed API documentation for speech recognition
Streaming STT
Real-time recognition from microphone
Hotwords
Contextual biasing for improved accuracy