Complete Model Comparison
Omnilingual ASR offers 25 models across four families: W2V (self-supervised, 4 models), CTC (parallel ASR, 8 models), LLM (autoregressive ASR, 12 models), and Zero-Shot (1 model). All VRAM and speed metrics were measured on an A100 GPU with BF16 precision, batch size 1, and 30-second audio unless noted otherwise.
Model Families Overview
W2V Models
Self-Supervised Learning (SSL): pre-trained audio encoders that produce contextualized embeddings. Useful as starting points for custom architectures.
- 4 sizes: 300M, 1B, 3B, 7B
- No direct transcription
- Foundation for CTC/LLM models
CTC Models
Parallel ASR: high-speed speech recognition with parallel generation. Ideal for production deployments that require throughput.
- 4 sizes × 2 versions = 8 models
- 16x-96x faster than real-time
- No language conditioning
LLM Models
Autoregressive ASR: state-of-the-art accuracy with language conditioning. Available in Standard and Unlimited-length variants.
- 4 sizes × 3 variants = 12 models
- Optional language conditioning
- Unlimited length support (v2)
Zero-Shot
In-Context Learning: transcribe unseen languages using 1-10 audio-text example pairs.
- 1 model (7B)
- Requires context examples
- Ideal for low-resource languages
Complete Specifications Table
W2V Models (Self-Supervised)
| Model | Parameters | Download | Features | Embedding Dim |
|---|---|---|---|---|
| omniASR_W2V_300M | 317,390,592 | 1.2 GiB | SSL | 1024 |
| omniASR_W2V_1B | 965,514,752 | 3.6 GiB | SSL | 1280 |
| omniASR_W2V_3B | 3,064,124,672 | 12.0 GiB | SSL | 2048 |
| omniASR_W2V_7B | 6,488,487,168 | 25.0 GiB | SSL | 2048 |
W2V Model Details
Input: raw audio waveform (16 kHz)
Output: contextualized audio embeddings
- 300M: 1024-dimensional vectors
- 1B: 1280-dimensional vectors
- 3B/7B: 2048-dimensional vectors
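To make the embedding dimensions concrete, here is a minimal sketch of the output shape you can expect per model. The ~320x downsampling (16 kHz audio to ~50 frames/second) comes from the Technical Details section below; the `expected_embedding_shape` helper is illustrative, not part of the actual API.

```python
# Embedding dimensions per W2V model, as listed in the table above.
W2V_EMBED_DIM = {
    "omniASR_W2V_300M": 1024,
    "omniASR_W2V_1B": 1280,
    "omniASR_W2V_3B": 2048,
    "omniASR_W2V_7B": 2048,
}

def expected_embedding_shape(model_name: str, audio_seconds: float) -> tuple[int, int]:
    """Rough output shape (frames, dim) for mono 16 kHz input.

    The encoder downsamples ~320x, so 16 kHz audio yields ~50 frames per
    second. This is an estimate for planning downstream layers, not an
    exact contract of the model's output length.
    """
    frames = int(audio_seconds * 50)  # 16000 / 320 = 50 frames per second
    return frames, W2V_EMBED_DIM[model_name]

print(expected_embedding_shape("omniASR_W2V_1B", 30.0))  # (1500, 1280)
```

A custom head on top of these encoders would take the second value as its input dimension.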
CTC Models (Parallel ASR)
Version 1 Models
| Model | Parameters | Download | VRAM | RTF | Speed | Vocab |
|---|---|---|---|---|---|---|
| omniASR_CTC_300M | 325,494,996 | 1.3 GiB | ~2 GiB | 0.001 | 96x | 9,812 |
| omniASR_CTC_1B | 975,065,300 | 3.7 GiB | ~3 GiB | 0.002 | 48x | 9,812 |
| omniASR_CTC_3B | 3,080,423,636 | 12.0 GiB | ~8 GiB | 0.003 | 32x | 9,812 |
| omniASR_CTC_7B | 6,504,786,132 | 25.0 GiB | ~15 GiB | 0.006 | 16x | 9,812 |
Version 2 Models (Improved CER)
| Model | Parameters | Download | VRAM | RTF | Speed | Vocab |
|---|---|---|---|---|---|---|
| omniASR_CTC_300M_v2 | 325,494,996 | 1.3 GiB | ~2 GiB | 0.001 | 96x | 10,288 |
| omniASR_CTC_1B_v2 | 975,065,300 | 3.7 GiB | ~3 GiB | 0.002 | 48x | 10,288 |
| omniASR_CTC_3B_v2 | 3,080,423,636 | 12.0 GiB | ~8 GiB | 0.003 | 32x | 10,288 |
| omniASR_CTC_7B_v2 | 6,504,786,132 | 25.0 GiB | ~15 GiB | 0.006 | 16x | 10,288 |
CTC Model Details
Features: parallel generation, non-autoregressive decoding
Tokenizers:
- v1 models: omniASR_tokenizer_v1 (9,812 tokens)
- v2 models: omniASR_tokenizer_written_v2 (10,288 tokens)
LLM Models (Autoregressive ASR)
Standard LLM - Version 1
| Model | Parameters | Download | VRAM | RTF | Vocab | Max Audio |
|---|---|---|---|---|---|---|
| omniASR_LLM_300M | 1,627,603,584 | 6.1 GiB | ~5 GiB | 0.090 | 9,812 | 40s |
| omniASR_LLM_1B | 2,275,710,592 | 8.5 GiB | ~6 GiB | 0.091 | 9,812 | 40s |
| omniASR_LLM_3B | 4,376,679,040 | 17.0 GiB | ~10 GiB | 0.093 | 9,812 | 40s |
| omniASR_LLM_7B | 7,801,041,536 | 30.0 GiB | ~17 GiB | 0.092 | 9,818 | 40s |
Standard LLM - Version 2 (Improved CER)
| Model | Parameters | Download | VRAM | RTF | Vocab | Max Audio |
|---|---|---|---|---|---|---|
| omniASR_LLM_300M_v2 | 1,627,603,584 | 6.1 GiB | ~5 GiB | 0.090 | 10,288 | 40s |
| omniASR_LLM_1B_v2 | 2,275,710,592 | 8.5 GiB | ~6 GiB | 0.091 | 10,288 | 40s |
| omniASR_LLM_3B_v2 | 4,376,679,040 | 17.0 GiB | ~10 GiB | 0.093 | 10,288 | 40s |
| omniASR_LLM_7B_v2 | 7,801,041,536 | 30.0 GiB | ~17 GiB | 0.092 | 10,288 | 40s |
Unlimited Length LLM - Version 2
| Model | Parameters | Download | VRAM | RTF (30s) | RTF (15min) | Max Audio |
|---|---|---|---|---|---|---|
| omniASR_LLM_Unlimited_300M_v2 | 1,627,603,584 | 6.1 GiB | ~5 GiB | 0.092 | 0.206 | Unlimited |
| omniASR_LLM_Unlimited_1B_v2 | 2,275,710,592 | 8.5 GiB | ~6 GiB | 0.097 | 0.207 | Unlimited |
| omniASR_LLM_Unlimited_3B_v2 | 4,376,679,040 | 17.0 GiB | ~10 GiB | 0.095 | 0.208 | Unlimited |
| omniASR_LLM_Unlimited_7B_v2 | 7,801,041,536 | 30.0 GiB | ~17 GiB | 0.097 | 0.208 | Unlimited |
LLM Model Details
Features:
- Optional language conditioning (80/20 training split with/without conditioning)
- Autoregressive beam search decoding
- State-of-the-art accuracy
Tokenizers:
- v1 models (300M/1B/3B): omniASR_tokenizer_v1
- v1 model (7B): omniASR_tokenizer_v1_variant7
- v2 models: omniASR_tokenizer_written_v2
Unlimited variants:
- Segment size: 15 seconds
- Context window: 1 previous segment
- Accuracy comparable to standard LLM models
- Fine-tuning not currently supported
- Can be extended for streaming applications
Choosing a variant:
- Standard: maximum accuracy, known languages, audio under 40s
- Unlimited: long-form content (podcasts, lectures, meetings)
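The Unlimited variants' segmentation scheme (fixed 15-second segments, each decoded with the previous segment as context) can be illustrated with a small planner. The `plan_segments` helper is hypothetical; the real pipeline performs this chunking internally.

```python
def plan_segments(total_seconds: float, segment: float = 15.0):
    """Sketch of how an Unlimited model chunks long audio: fixed 15 s
    segments, each carrying the previous segment as decoding context.
    Illustrative only -- not the pipeline's actual internal code.
    """
    plans = []
    start = 0.0
    prev = None
    while start < total_seconds:
        end = min(start + segment, total_seconds)
        plans.append({"segment": (start, end), "context": prev})
        prev = (start, end)
        start = end
    return plans

plans = plan_segments(40.0)
# Three segments: 0-15 s, 15-30 s, 30-40 s; the second and third carry
# the preceding segment as context, the first has none.
```

This also shows why long-form RTF roughly doubles: each step decodes its own segment plus a context segment.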
Zero-Shot Model
| Model | Parameters | Download | VRAM | RTF | Vocab | Context Required |
|---|---|---|---|---|---|---|
| omniASR_LLM_7B_ZS | 7,810,900,608 | 30.0 GiB | ~20 GiB | 0.194 | 9,812 | 1-10 examples |
Zero-Shot Model Details
Features: in-context learning with audio-text example pairs
Tokenizer: omniASR_tokenizer_v1
Context Examples:
- Minimum: 1 example (repeated to 10)
- Maximum: 10 examples
- Recommended: 5-10 diverse examples
- Max length per example: 30 seconds
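The context-example limits above can be enforced with a small validator. The `prepare_context` helper and its `(audio_seconds, transcript)` input shape are assumptions for illustration, not the library's actual interface.

```python
def prepare_context(examples, max_len_s=30.0):
    """Validate zero-shot context examples against the stated limits:
    1-10 examples, each at most 30 seconds; a single example is
    repeated to fill all 10 context slots.

    `examples`: list of (audio_seconds, transcript) pairs (hypothetical shape).
    """
    if not 1 <= len(examples) <= 10:
        raise ValueError("provide between 1 and 10 context examples")
    for seconds, _ in examples:
        if seconds > max_len_s:
            raise ValueError(f"context example exceeds {max_len_s:.0f} s limit")
    if len(examples) == 1:
        return examples * 10  # single example repeated to 10 slots
    return examples

ctx = prepare_context([(12.0, "habari ya asubuhi")])
print(len(ctx))  # 10
```

Using 5-10 diverse examples, as recommended above, gives the model more varied in-context evidence than a single repeated pair.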
Tokenizers
| Tokenizer | Size | Used By | Vocab Size |
|---|---|---|---|
| omniASR_tokenizer_v1 | 100 KiB | W2V, CTC v1, LLM v1 (300M/1B/3B), ZS | 9,812 |
| omniASR_tokenizer_v1_variant7 | 100 KiB | LLM v1 (7B only) | 9,818 |
| omniASR_tokenizer_written_v2 | 100 KiB | CTC v2, LLM v2, LLM Unlimited v2 | 10,288 |
Performance Metrics
Speed Comparison (Real-Time Factor)
RTF (Real-Time Factor): Time to process 1 second of audio. Lower is faster.
- RTF = 0.001: 1000x faster than real-time (1s audio in 0.001s)
- RTF = 1.0: Real-time processing (1s audio in 1s)
- RTF = 0.092: ~11x faster than real-time (1s audio in 0.092s)
| Model Family | RTF Range | Speed vs Real-Time | Relative to LLM 7B |
|---|---|---|---|
| CTC 300M | 0.001 | 1000x faster | 96x faster |
| CTC 1B | 0.002 | 500x faster | 48x faster |
| CTC 3B | 0.003 | 333x faster | 32x faster |
| CTC 7B | 0.006 | 167x faster | 16x faster |
| LLM (all) | 0.090-0.093 | ~11x faster | 1x (baseline) |
| LLM Unlimited (30s) | 0.092-0.097 | ~11x faster | ~1x |
| LLM Unlimited (15min) | 0.206-0.208 | ~5x faster | ~0.5x |
| Zero-Shot | 0.194 | ~5x faster | ~0.5x |
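The RTF arithmetic in the table above is straightforward to compute; a minimal sketch (the function names are illustrative):

```python
def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-Time Factor: time spent processing divided by audio duration.
    Lower is faster; RTF = 1.0 means exactly real-time."""
    return processing_seconds / audio_seconds

def speedup(rtf_value: float) -> float:
    """How many times faster than real-time (the reciprocal of RTF)."""
    return 1.0 / rtf_value

print(round(speedup(0.001)))  # 1000  (CTC 300M)
print(round(speedup(0.092)))  # 11    (LLM models on 30 s audio)
```

For example, transcribing a 60-second clip in 5.52 seconds gives RTF = 0.092, i.e. roughly 11x faster than real-time.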
VRAM Requirements (BF16)
| Size | CTC | LLM Standard | LLM Unlimited | Zero-Shot |
|---|---|---|---|---|
| 300M | 2 GiB | 5 GiB | 5 GiB | - |
| 1B | 3 GiB | 6 GiB | 6 GiB | - |
| 3B | 8 GiB | 10 GiB | 10 GiB | - |
| 7B | 15 GiB | 17 GiB | 17 GiB | 20 GiB |
Accuracy Performance
The 7B LLM model achieves:
- Character Error Rate (CER) < 10% for 78% of 1,600+ languages
- State-of-the-art multilingual ASR performance
- Improved results with language conditioning
Model Selection Guide
By Use Case
- Production deployment (high throughput): omniASR_CTC_7B_v2. 16x faster than real-time, best CTC accuracy, parallel processing.
- Edge / mobile: omniASR_CTC_300M_v2. Smallest footprint (~2 GiB VRAM).
- Balanced accuracy and resources: omniASR_LLM_1B_v2. Good accuracy, moderate VRAM (6 GiB), language conditioning.
- Maximum accuracy: omniASR_LLM_7B_v2. State-of-the-art CER, full language support, requires 17 GiB VRAM.
- Long-form content: the omniASR_LLM_Unlimited v2 models, for podcasts, lectures, and meetings.
- New languages: omniASR_LLM_7B_ZS, with in-context examples.
By Available Resources
| VRAM Available | Recommended Model | Use Case |
|---|---|---|
| < 4 GiB | omniASR_CTC_300M_v2 | Edge deployment |
| 4-8 GiB | omniASR_CTC_1B_v2 or omniASR_LLM_300M_v2 | Consumer GPUs |
| 8-12 GiB | omniASR_CTC_3B_v2 or omniASR_LLM_1B_v2 | Mid-range production |
| 12-16 GiB | omniASR_LLM_3B_v2 | High-quality production |
| 16-20 GiB | omniASR_LLM_7B_v2 or omniASR_LLM_Unlimited_7B_v2 | Maximum accuracy |
| 20+ GiB | omniASR_LLM_7B_ZS | Zero-shot learning |
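The guidance table above can be encoded as a simple lookup. This is a simplification (it picks one model per VRAM band and ignores the CTC alternatives); the `recommend_model` helper is illustrative, not an official API.

```python
def recommend_model(vram_gib: float, zero_shot: bool = False,
                    long_form: bool = False) -> str:
    """Pick a model from the VRAM guidance table above (simplified:
    prefers LLM v2 models when they fit, CTC 300M v2 otherwise)."""
    if zero_shot:
        if vram_gib < 20:
            raise ValueError("the zero-shot model needs ~20 GiB VRAM")
        return "omniASR_LLM_7B_ZS"
    if vram_gib >= 17:
        return "omniASR_LLM_Unlimited_7B_v2" if long_form else "omniASR_LLM_7B_v2"
    if vram_gib >= 10:
        return "omniASR_LLM_3B_v2"
    if vram_gib >= 6:
        return "omniASR_LLM_1B_v2"
    if vram_gib >= 5:
        return "omniASR_LLM_300M_v2"
    return "omniASR_CTC_300M_v2"

print(recommend_model(8))                    # omniASR_LLM_1B_v2
print(recommend_model(24, zero_shot=True))   # omniASR_LLM_7B_ZS
```

For throughput-bound deployments, substitute the CTC v2 model from the same VRAM band in the table.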
Model Download & Storage
Automatic Download
Storage Location
All models and tokenizers are cached locally on first download.
Manual Download
Direct download links are provided in the specification tables above.
Version History
December 2025 Update (v2 Models)
v2 Release Changes
New Models:
- CTC v2: improved character error rates
- LLM v2: better accuracy across all sizes
- LLM Unlimited v2: support for unlimited audio length
Improvements:
- Expanded vocabulary (10,288 tokens vs 9,812)
- New tokenizer: omniASR_tokenizer_written_v2
- Updated training data and procedures
- Segmented processing for long audio (Unlimited variants)
Known Limitations:
- Unlimited models: fine-tuning recipes not yet supported
- Unlimited models: not described in the original research paper
Original Release
- W2V models (4 sizes)
- CTC v1 models (4 sizes)
- LLM v1 models (4 sizes)
- Zero-Shot model (1 model)
Technical Details
Architecture Components
- Wav2Vec2 Encoder
- CTC Projection
- Llama Decoder
- Zero-Shot Context
Wav2Vec2 Encoder details:
Feature Extractor:
- CNN-based architecture
- Downsampling: ~320x (16 kHz waveform to ~50 Hz frames)
- Output: frame-level features
Transformer Encoder:
- Sizes: 12/24/36/48 layers (300M/1B/3B/7B)
- Dimensions: 1024/1280/2048/2048
- Self-attention with positional encoding
- Output: contextualized audio embeddings
Input Preprocessing
All models use the same preprocessing pipeline:
- Audio Decoding: WAV, FLAC, MP3, etc. → raw waveform
- Resampling: Any sample rate → 16kHz
- Channel Mixing: Stereo/multi-channel → mono
- Normalization: Amplitude normalization
- Length Validation: Check max length constraints
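Steps 3-5 of the pipeline above (channel mixing, normalization, length validation) are simple enough to sketch in pure Python. The `preprocess` helper is hypothetical and omits decoding and resampling, which in practice are handled by an audio library.

```python
def preprocess(samples, sample_rate, max_seconds=40.0):
    """Sketch of the later preprocessing steps: mono mixdown, peak
    normalization, and length validation. Decoding and resampling to
    16 kHz are omitted here.

    `samples`: list of frames, each frame a tuple of per-channel values.
    """
    # Channel mixing: average all channels down to mono
    mono = [sum(frame) / len(frame) for frame in samples]
    # Normalization: scale so the peak amplitude is 1.0
    peak = max(abs(s) for s in mono) or 1.0
    mono = [s / peak for s in mono]
    # Length validation against the model's maximum audio length
    if len(mono) / sample_rate > max_seconds:
        raise ValueError(f"audio exceeds {max_seconds:.0f} s limit")
    return mono

out = preprocess([(0.5, -0.5), (1.0, 0.0), (-1.0, 0.0)], sample_rate=16000)
print(out)  # [0.0, 1.0, -1.0]
```

The length check is why Standard models reject clips over 40 seconds; the Unlimited variants skip it in favor of segmented processing.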
Common Limitations
Audio Length Constraints
- CTC/LLM Standard: 40 seconds maximum
- LLM Unlimited: No limit (processes in 15s segments)
- Zero-Shot: 60 seconds max (30s recommended for context)
No Punctuation or Capitalization
Models output spoken-form text without punctuation or capitalization.
Workaround: use a third-party punctuation-restoration library such as deepmultilingualpunctuation.
Limited Script Support
While the models support 1,600+ languages, some rare scripts have limited training data.
Workaround: use the zero-shot model with examples in the target script.
No Real-Time Streaming (Current)
The Unlimited models support segmented processing, but the inference pipeline does not yet expose a streaming API.
Future: the underlying checkpoints can be extended for streaming applications.
Quick Reference
Model Naming Convention
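Inferring from the model names listed throughout this document, names follow the pattern `omniASR_{family}[_Unlimited]_{size}[_ZS][_v2]`. A sketch parser for that inferred pattern (the regex and `parse_model_name` helper are illustrative, not part of any official tooling):

```python
import re

# Pattern inferred from the model names in the tables above.
NAME_RE = re.compile(
    r"^omniASR_(?P<family>W2V|CTC|LLM)"
    r"(?:_(?P<unlimited>Unlimited))?"
    r"_(?P<size>300M|1B|3B|7B)"
    r"(?:_(?P<zs>ZS))?"
    r"(?:_(?P<version>v2))?$"
)

def parse_model_name(name: str) -> dict:
    """Split a model name into its family, size, and variant parts."""
    m = NAME_RE.match(name)
    if not m:
        raise ValueError(f"unrecognized model name: {name}")
    return {k: v for k, v in m.groupdict().items() if v}

print(parse_model_name("omniASR_LLM_Unlimited_7B_v2"))
# {'family': 'LLM', 'unlimited': 'Unlimited', 'size': '7B', 'version': 'v2'}
```

A missing `version` field implies a v1 model; tokenizer names (`omniASR_tokenizer_*`) follow a separate scheme.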
Key Metrics Summary
| Metric | CTC 300M | CTC 7B | LLM 1B | LLM 7B | LLM Unl. 7B | ZS 7B |
|---|---|---|---|---|---|---|
| Params | 325M | 6.5B | 2.3B | 7.8B | 7.8B | 7.8B |
| VRAM | 2 GiB | 15 GiB | 6 GiB | 17 GiB | 17 GiB | 20 GiB |
| Speed | 96x | 16x | 1x | 1x | 1x (0.5x long) | 0.5x |
| Max Audio | 40s | 40s | 40s | 40s | Unlimited | 60s |
| Lang Cond. | No | No | Yes | Yes | Yes | Via context |
Next Steps
CTC Models
Detailed guide to parallel ASR models
LLM Models
Autoregressive models with language conditioning
Zero-Shot
In-context learning for new languages
Inference Guide
Start transcribing with our models