Whisper Models
Whisper is OpenAI’s multilingual speech recognition model with excellent zero-shot performance across 90+ languages. It’s robust to diverse audio conditions and accents.
Model Architecture
Whisper uses an encoder-decoder architecture (without a joiner):
- Encoder (encoder.onnx or encoder.int8.onnx) – Processes audio
- Decoder (decoder.onnx or decoder.int8.onnx) – Generates text tokens
- Tokens (tokens.txt) – Multilingual token vocabulary
When to Use
Multilingual Content
Transcribe audio in 90+ languages without language-specific models
Diverse Audio
Robust to accents, background noise, and varying audio quality
Translation
Built-in translation to English (set task: 'translate')
Zero-Shot Recognition
Good accuracy without language-specific fine-tuning
Supported Languages
Whisper supports 90+ languages including:
- English, Spanish, French, German, Italian, Portuguese
- Chinese (Mandarin, Cantonese), Japanese, Korean
- Arabic, Russian, Hindi, Bengali
- And many more…
Performance Characteristics
| Aspect | Rating | Notes |
|---|---|---|
| Streaming | ❌ Not Supported | Offline/batch only (encoder-decoder architecture) |
| Accuracy | ⭐⭐⭐⭐⭐ | Excellent multilingual accuracy |
| Speed | ⭐⭐⭐ | Slower than CTC/transducer, but acceptable |
| Memory | ⭐⭐⭐ | Larger models need significant RAM |
| Model Size | Large | Tiny: ~40 MB, Base: ~75 MB, Small: ~250 MB, Large: 1+ GB |
Download Links
Whisper Models
Browse and download pretrained Whisper models (Tiny, Base, Small, Medium, Large)
Configuration Example
Basic Transcription
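The code sample for this section did not survive extraction, so the snippet below is a hedged sketch: the `modelOptions.whisper` shape follows this page's option table, but the surrounding field names (`modelType`, `modelPath`) and the folder path are assumptions, not a confirmed API.

```typescript
// Hedged sketch of a basic transcription config.
// `modelType` and `modelPath` are assumed field names; the
// `modelOptions.whisper` shape follows the option table on this page.
const basicConfig = {
  modelType: 'whisper',             // assumed discriminator
  modelPath: 'models/whisper-tiny', // hypothetical model folder
  modelOptions: {
    whisper: {
      task: 'transcribe', // return text in the source language
      // no `language` set: Whisper auto-detects the spoken language
    },
  },
};
```

Omitting `language` leaves Whisper in auto-detection mode, which is the safest default for mixed-language input.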
With Language Selection
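As a sketch of pinning the language instead of auto-detecting (only the `modelOptions.whisper` shape comes from this page; use a code from `getWhisperLanguages()`):

```typescript
// Hedged sketch: fix the source language to German ('de').
// `language` must be a valid Whisper language code, never free text.
const germanConfig = {
  modelOptions: {
    whisper: {
      language: 'de',     // from getWhisperLanguages()
      task: 'transcribe', // keep output in German
    },
  },
};
```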
Translation to English
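A hedged sketch of the translation task (the option shape follows this page's table; the surrounding config shape is an assumption):

```typescript
// Hedged sketch: translate audio in any supported language to English.
const translateConfig = {
  modelOptions: {
    whisper: {
      task: 'translate', // output is English regardless of input language
      // `language` may still be set to hint the source language
    },
  },
};
```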
Model Options
Whisper supports several configuration options via modelOptions.whisper:
| Option | Type | Description |
|---|---|---|
| language | string | Language code (e.g. 'en', 'de', 'zh'). Use getWhisperLanguages() for valid codes. Omit for auto-detection. |
| task | 'transcribe' \| 'translate' | 'transcribe' returns text in the source language. 'translate' translates to English. |
| tailPaddings | number | Padding at the end of audio (default from model) |
| enableTokenTimestamps | boolean | Enable token-level timestamps (Android only) |
| enableSegmentTimestamps | boolean | Enable segment timestamps (Android only) |
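Putting the option table together, a fully specified object might look like this (a sketch: only the option names come from the table above; the values are illustrative):

```typescript
// All modelOptions.whisper fields from the table above, with
// illustrative values (tailPaddings normally comes from the model).
const whisperOptions = {
  language: 'en',                // omit for auto-detection
  task: 'transcribe' as const,   // or 'translate'
  tailPaddings: 1000,            // illustrative value, not a recommended default
  enableTokenTimestamps: true,   // Android only
  enableSegmentTimestamps: true, // Android only
};
```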
Language Helpers
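This section's code was lost in extraction. As a sketch, here is how getWhisperLanguages() might be used to validate user input before it reaches the language option; the stub list stands in for the real function's 90+ codes:

```typescript
// Stub standing in for the library's getWhisperLanguages();
// the real function returns the full list of 90+ codes.
function getWhisperLanguages(): string[] {
  return ['en', 'es', 'fr', 'de', 'it', 'pt', 'zh', 'ja', 'ko', 'ar', 'ru', 'hi'];
}

// Validate a user-selected code instead of passing free text through.
function toWhisperLanguage(code: string): string | undefined {
  return getWhisperLanguages().includes(code) ? code : undefined;
}
```

Returning undefined for an unknown code (rather than throwing) lets the caller fall back to auto-detection by simply omitting the language option.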
Model Variants
| Variant | Size | Speed | Accuracy | Use Case |
|---|---|---|---|---|
| Tiny | ~40 MB | Very Fast | Good | Mobile devices, quick transcription |
| Base | ~75 MB | Fast | Good | Balanced mobile performance |
| Small | ~250 MB | Medium | Very Good | High-quality mobile transcription |
| Medium | ~800 MB | Slow | Excellent | High-end devices, best quality |
| Large | 1+ GB | Very Slow | Best | Server-side, maximum accuracy |
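The variant table above can be turned into a simple selection rule. This helper is hypothetical (not part of the library); the MB thresholds are the approximate sizes from the table:

```typescript
// Hypothetical helper: pick a Whisper variant from a rough memory
// budget, using the approximate sizes in the variant table above.
type WhisperVariant = 'tiny' | 'base' | 'small' | 'medium' | 'large';

function pickVariant(memoryBudgetMb: number): WhisperVariant {
  if (memoryBudgetMb >= 1024) return 'large';  // 1+ GB
  if (memoryBudgetMb >= 800) return 'medium';  // ~800 MB
  if (memoryBudgetMb >= 250) return 'small';   // ~250 MB
  if (memoryBudgetMb >= 75) return 'base';     // ~75 MB
  return 'tiny';                               // ~40 MB
}
```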
Model Detection
Whisper models are detected by:
- Presence of encoder.onnx + decoder.onnx (no joiner.onnx)
- Optional folder name pattern (containing whisper)
Required files: encoder.onnx (or encoder.int8.onnx), decoder.onnx (or decoder.int8.onnx), and tokens.txt.
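The detection rule can be sketched as a file-list check (a hedged sketch, not the library's actual implementation):

```typescript
// Sketch of the detection rule above: encoder + decoder present
// (plain or int8) with no joiner, plus tokens.txt. The folder-name
// hint ("whisper" in the name) is optional and omitted here.
function looksLikeWhisper(files: string[]): boolean {
  const has = (base: string) =>
    files.includes(`${base}.onnx`) || files.includes(`${base}.int8.onnx`);
  return has('encoder') && has('decoder') && !has('joiner')
    && files.includes('tokens.txt');
}
```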
Performance Tips
Use Quantized Models
Int8 quantization significantly reduces model size and improves speed.
Choose the Right Variant
Balance size, speed, and accuracy for your target devices.
Optimize Thread Count
More threads speed up inference on multi-core devices, with diminishing returns beyond the number of performance cores.
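Combining these tips, a performance-oriented config might look like this (a sketch: `preferInt8` and `numThreads` appear in this page's troubleshooting tips, but the surrounding config shape is an assumption):

```typescript
// Hedged sketch: quantized model plus a higher thread count.
const fastConfig = {
  preferInt8: true, // load encoder.int8.onnx / decoder.int8.onnx
  numThreads: 4,    // raise on multi-core devices
  modelOptions: {
    whisper: { task: 'transcribe' },
  },
};
```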
Streaming Support
Whisper does not support streaming recognition; its encoder-decoder architecture requires the full audio up front. For real-time use cases, choose a Transducer or CTC model instead.
Advantages
- Multilingual: 90+ languages without separate models
- Robust: Handles accents, noise, and varying audio quality
- Translation: Built-in translation to English
- Zero-Shot: Good accuracy without fine-tuning
- Widely Used: Battle-tested, well-documented
Limitations
- No Streaming: Cannot be used for real-time recognition
- Slower: Encoder-decoder is slower than CTC models
- Larger Models: Bigger files and memory footprint
- No Hotwords: Does not support contextual biasing
Use Cases
Multilingual Apps
Apps serving users in multiple countries/languages
Content Transcription
Transcribing podcasts, interviews, or videos
Subtitle Generation
Creating subtitles for pre-recorded content
Translation
Translating audio from any language to English
Common Issues
App crashes with invalid language
- Use getWhisperLanguages() to get valid language codes
- Never use free-text input for the language option
- Omit language for auto-detection
Cannot use for streaming
- Whisper does not support streaming
- Use Transducer, NeMo CTC, or Tone CTC for real-time recognition
- Use getOnlineTypeOrNull(modelType) to check if a model supports streaming
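The streaming check can be sketched as follows; the stub stands in for the library's getOnlineTypeOrNull(), whose exact return values are not documented here:

```typescript
// Stub standing in for the library's getOnlineTypeOrNull(): it
// returns a streaming ("online") type for streaming-capable models
// and null for offline-only ones such as Whisper. The 'transducer'
// mapping here is illustrative, not the library's actual table.
function getOnlineTypeOrNull(modelType: string): string | null {
  return modelType === 'transducer' ? 'online-transducer' : null;
}

function supportsStreaming(modelType: string): boolean {
  return getOnlineTypeOrNull(modelType) !== null;
}
```

Guarding with a check like this avoids wiring a Whisper model into a real-time pipeline by mistake.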
Slow transcription
- Use smaller variants (Tiny or Base)
- Enable preferInt8: true for quantized models
- Increase numThreads on multi-core devices
- Consider using Paraformer or CTC models for faster batch processing
High memory usage
- Use Tiny or Base variants instead of Small/Medium/Large
- Enable int8 quantization
- Ensure no other heavy apps are running
Next Steps
STT API
Detailed API documentation
Model Setup
How to download and bundle models
Transducer Models
For streaming recognition
Execution Providers
Hardware acceleration options