Overview
The LLM-based ASR models combine a Wav2Vec2 encoder with a Llama decoder for autoregressive text generation. These models support optional language conditioning and offer the highest transcription accuracy across 1,600+ languages. The December 2025 update introduced “Unlimited” variants that can process audio of any length. LLM models achieve state-of-the-art performance, with character error rates (CER) below 10% for 78% of the 1,600+ supported languages when using the 7B variant.
Architecture
The LLM model family uses an encoder-decoder architecture.
Key Components
- Wav2Vec2 Encoder: Produces contextualized audio embeddings (1024/1280/2048-dim depending on model size)
- Linear Projection: Projects audio embeddings to match Llama decoder’s 4096-dimensional input space
- Llama Decoder: Autoregressive transformer decoder for text generation
- Final Projection: Maps decoder outputs to vocabulary logits
- Beam Search: Generates multiple hypotheses and selects the best transcription
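The flow through these components can be sketched with plain tensor shapes. This is a toy numpy sketch using the dimensions quoted in the specs below (1024-dim encoder output for the 300M variant, 4096-dim decoder input, 10,288-token v2 vocabulary); the variable names are illustrative and not part of the library:

```python
import numpy as np

T = 50          # encoder output frames for ~1 s of audio (illustrative)
ENC_DIM = 1024  # Wav2Vec2 embedding size (300M variant)
DEC_DIM = 4096  # Llama decoder input size
VOCAB = 10288   # v2 vocabulary size

rng = np.random.default_rng(0)
audio_embeddings = rng.standard_normal((T, ENC_DIM))      # encoder output
W_proj = rng.standard_normal((ENC_DIM, DEC_DIM)) * 0.01   # linear projection
decoder_inputs = audio_embeddings @ W_proj                # (50, 4096)

# The Llama decoder attends over these projected embeddings and emits one
# token at a time; a final projection maps each hidden state to vocab logits.
W_out = rng.standard_normal((DEC_DIM, VOCAB)) * 0.01
hidden_state = decoder_inputs.mean(axis=0)                # stand-in decoder state
logits = hidden_state @ W_out                             # (10288,)
print(decoder_inputs.shape, logits.shape)
```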
Model Variants
Standard LLM Models (with Language Conditioning)
- 300M
- 1B
- 3B
- 7B
omniASR_LLM_300M / omniASR_LLM_300M_v2
- Parameters: 1,627,603,584
- Download Size: 6.1 GiB (FP32)
- Inference VRAM: ~5 GiB (BF16, batch=1, 30s audio)
- Speed: ~1x real-time (RTF: 0.090)
- Audio Embedding: 1024-dim
- Decoder Dimension: 4096-dim
- Vocabulary Size: 9,812 (v1) / 10,288 (v2)
- Features: Optional language conditioning
Unlimited Length Models
Released in December 2025, these variants support transcription of unlimited audio length:
- 300M Unlimited
- 1B Unlimited
- 3B Unlimited
- 7B Unlimited
omniASR_LLM_Unlimited_300M_v2
- Parameters: 1,627,603,584
- Max Audio Length: Unlimited
- Segment Size: 15 seconds
- Context Window: 1 previous segment
- Speed (30s): RTF 0.092 (~1x)
- Speed (15min): RTF 0.206 (~0.5x)
- VRAM: ~5 GiB
Language Conditioning
LLM models support optional language identification to improve transcription quality.
Without Language Conditioning
The models were trained with an 80/20 split of samples with and without language IDs, enabling robust performance in both scenarios. However, providing language codes is recommended for best results.
With Language Conditioning (Recommended)
Language Code Format
Languages follow the format {language_code}_{script}:
- eng_Latn - English (Latin script)
- cmn_Hans - Mandarin Chinese (Simplified)
- cmn_Hant - Mandarin Chinese (Traditional)
- rus_Cyrl - Russian (Cyrillic script)
- ara_Arab - Arabic (Arabic script)
- hin_Deva - Hindi (Devanagari script)
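As a minimal sketch, a small helper (hypothetical, not part of the library) can check that a code matches the {language_code}_{script} shape before passing it to the model:

```python
# Illustrative helper: split a code like "cmn_Hans" into its 3-letter
# language part and 4-letter script part, rejecting malformed input.
def parse_lang_code(code: str) -> tuple[str, str]:
    lang, _, script = code.partition("_")
    if len(lang) != 3 or not lang.islower():
        raise ValueError(f"expected 3-letter lowercase language code, got {lang!r}")
    if len(script) != 4 or not script.istitle():
        raise ValueError(f"expected 4-letter title-case script code, got {script!r}")
    return lang, script

print(parse_lang_code("eng_Latn"))   # ('eng', 'Latn')
print(parse_lang_code("cmn_Hans"))   # ('cmn', 'Hans')
```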
Unlimited Length Models
How It Works
Unlimited models use a segmented approach with context:
- Segmentation: Audio split into 15-second segments
- Contextual Decoding: Each segment uses embeddings from the previous segment
- Iterative Processing: Segments decoded sequentially with rolling context
- Text Accumulation: Transcriptions concatenated to form complete output
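The four steps above can be sketched in pure Python. The function names here are illustrative stand-ins, not the library's API, and real decoding would carry over the previous segment's embeddings rather than its raw samples:

```python
# Sketch of the segmented decoding loop, assuming 16 kHz audio,
# 15 s segments, and a one-segment rolling context.
SAMPLE_RATE = 16_000
SEGMENT_SAMPLES = 15 * SAMPLE_RATE

def transcribe_unlimited(audio, decode_segment):
    """decode_segment(segment, context) -> text; context is the previous
    segment (a stand-in for its embeddings)."""
    texts, context = [], None
    for start in range(0, len(audio), SEGMENT_SAMPLES):
        segment = audio[start:start + SEGMENT_SAMPLES]
        texts.append(decode_segment(segment, context))  # contextual decoding
        context = segment                               # rolling context
    return " ".join(texts)                              # text accumulation

# Toy "decoder" that just reports each segment's duration:
audio = [0.0] * (40 * SAMPLE_RATE)  # 40 s of silence
out = transcribe_unlimited(audio, lambda seg, ctx: f"[{len(seg)//SAMPLE_RATE}s]")
print(out)   # [15s] [15s] [10s]
```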
Usage Example
Usage Patterns
Basic Transcription
Mixed Language Batch
HuggingFace Dataset Integration
Custom Beam Search Configuration
Autoregressive Generation
Unlike CTC models, LLM models generate text sequentially (token by token):
- Audio Encoding: Wav2Vec2 encoder processes full audio
- Projection: Audio embeddings projected to Llama space (4096-dim)
- Decoder Context: Optional language ID or previous segments added
- Beam Search: Generate multiple hypotheses autoregressively
- Selection: Best hypothesis selected based on beam search score
Compared to CTC decoding, this approach offers:
- Language modeling: Better fluency and grammar
- Context awareness: Uses previous tokens to inform generation
- Flexibility: Supports language conditioning and context examples
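The beam search in step 4 can be illustrated with a toy, self-contained sketch. The decoder is replaced by a fixed next-token distribution; nothing here is the library's actual implementation:

```python
import heapq
import math

def beam_search(step_logprobs, beam_size=2, max_len=3, eos=0):
    """Keep the `beam_size` best partial hypotheses at each step and
    return the highest-scoring one. `step_logprobs(prefix)` stands in
    for the decoder: it maps each next token to a log-probability."""
    beams = [(0.0, [])]  # (cumulative log-prob, tokens)
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            if tokens and tokens[-1] == eos:      # finished hypothesis
                candidates.append((score, tokens))
                continue
            for tok, lp in step_logprobs(tokens).items():
                candidates.append((score + lp, tokens + [tok]))
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return max(beams, key=lambda c: c[0])

# Toy distribution: token 1 is always most likely, 0 is EOS.
def step_logprobs(prefix):
    return {1: math.log(0.6), 2: math.log(0.3), 0: math.log(0.1)}

best_score, best_tokens = beam_search(step_logprobs)
print(best_tokens)   # [1, 1, 1]
```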
Performance Characteristics
Speed vs. Accuracy Trade-off
| Model Size | RTF (30s) | Accuracy | VRAM | Best Use Case |
|---|---|---|---|---|
| 300M | 0.090 | Good | 5 GiB | Edge deployment, cost-sensitive |
| 1B | 0.091 | Better | 6 GiB | Balanced production |
| 3B | 0.093 | Great | 10 GiB | High-quality production |
| 7B | 0.092 | Best | 17 GiB | Research, maximum accuracy |
RTF (Real-Time Factor): the ratio of processing time to audio duration. An RTF of ~0.09 means the model processes 1 second of audio in ~0.09 seconds.
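A quick sanity check of that definition, using the Unlimited 300M figure quoted above (RTF 0.206 for 15-minute audio):

```python
# RTF arithmetic: processing time = RTF x audio duration.
def processing_seconds(rtf: float, audio_seconds: float) -> float:
    return rtf * audio_seconds

# 15 minutes of audio at RTF 0.206 takes about 185 s to transcribe:
print(round(processing_seconds(0.206, 15 * 60), 1))   # 185.4
```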
CER Performance
The 7B LLM model achieves:
- CER < 10% for 78% of 1,600+ languages
- State-of-the-art results across diverse language families
- Improved performance with language conditioning
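For reference, CER is the character-level edit (Levenshtein) distance between hypothesis and reference, divided by the reference length. A minimal, illustrative implementation:

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate via a standard edit-distance DP."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution (or match)
        prev = cur
    return prev[-1] / len(ref)

print(round(cer("hello world", "helo world"), 4))   # 0.0909 (1 edit / 11 chars)
```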
Input Validation
The model performs validation at every forward pass to ensure correct inputs.
Input Validation Rules
- Standard LLM Models: Accept audio with optional language codes
- Zero-Shot Model: Requires exactly 10 context examples (see Zero-Shot page)
- Unlimited Models: No audio length restriction
- Batch Format: Uses fairseq2 Seq2SeqBatch with optional .example fields
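An illustrative check mirroring the rules above; the dict keys and the function itself are assumptions for the sketch, not the library's actual validation code:

```python
def validate_batch(batch: dict, model_kind: str) -> None:
    """Reject batches that violate the per-model input rules."""
    if model_kind == "zero_shot":
        examples = batch.get("examples", [])
        if len(examples) != 10:
            raise ValueError("zero-shot models require exactly 10 context examples")
    elif model_kind == "standard":
        # Language codes are optional for standard LLM models.
        lang = batch.get("lang")
        if lang is not None and "_" not in lang:
            raise ValueError("language codes use the {language_code}_{script} form")
    # Unlimited models impose no audio-length restriction: nothing to check.

validate_batch({"audio": [0.0], "lang": "eng_Latn"}, "standard")   # passes
validate_batch({"audio": [0.0]}, "standard")                       # lang is optional
```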
Model Selection Guide
Standard LLM
Use when:
- Audio is under 40 seconds
- Language is known or auto-detectable
- Need maximum accuracy
- Real-time processing acceptable
Recommended model: omniASR_LLM_7B_v2
Unlimited LLM
Use when:
- Audio is >40 seconds (podcasts, lectures)
- Processing long-form content
- Need streaming capability (custom integration)
- Accuracy comparable to standard models
Recommended model: omniASR_LLM_Unlimited_7B_v2
Smaller Models (300M/1B)
Use when:
- Limited GPU memory (under 8 GiB)
- Cost-sensitive deployment
- Faster processing preferred
- Moderate accuracy acceptable
CTC Models
Use when:
- Speed is critical (CTC models are 16x-96x faster)
- Language conditioning not needed
- On-device deployment
- Batch processing large volumes
Advanced Features
Custom Model Loading
Batch Size Optimization
Next Steps
Zero-Shot Models
Learn about in-context learning for unseen languages
Model Specifications
Detailed comparison of all model variants
CTC Models
Fast parallel generation for production
Inference Guide
Complete transcription workflows and examples