Architecture Overview
Omnilingual ASR models follow a hierarchical design with three main families:
- W2V (Encoder)
- CTC (Encoder + Projection)
- LLM (Encoder-Decoder)
Wav2Vec2 Self-Supervised Learning
Foundation encoder producing rich contextualized audio representations.
Model Families
W2V Models
Self-supervised learning models that serve as the foundation for all other architectures.
Feature Extractor
CNN layers downsample raw audio by ~320x:
- Input: 16kHz waveform
- Output: Frame-level features (~50 fps)
- Reduces sequence length for efficient processing
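The ~320x downsampling can be sketched with simple arithmetic. The stride values below follow the standard wav2vec2 convolutional front end (5 * 2^6 = 320) and are an assumption here, not taken from this repository:

```python
# Illustrative sketch of the wav2vec2 feature extractor's ~320x downsampling.
# Strides (5, 2, 2, 2, 2, 2, 2) are the standard wav2vec2 CNN config and are
# assumed here; kernel-size edge effects are ignored for simplicity.

STRIDES = [5, 2, 2, 2, 2, 2, 2]

def num_output_frames(num_samples: int) -> int:
    frames = num_samples
    for s in STRIDES:
        frames //= s  # each conv layer downsamples by its stride
    return frames

# One second of 16 kHz audio -> ~50 frames per second.
print(num_output_frames(16000))  # 50
```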
Key properties:
- Pre-trained with self-supervised learning
- Contextual audio embeddings
- Foundation for CTC and LLM variants
Typical use cases:
- Starting point for fine-tuning
- Building custom ASR architectures
- Audio feature extraction
- Transfer learning applications
CTC Models
Non-autoregressive models using Connectionist Temporal Classification for parallel prediction.
CTC Alignment
CTC allows the model to predict multiple frames per character, then collapses repeated predictions. Blank tokens (<blank>) separate repeated characters.
Vocabulary
- Size: 9812 or 10288 tokens depending on version
- Coverage: 1600+ languages
- Tokenization: Character-level or subword units
- Special tokens: <pad>, <unk>, <blank>
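The collapse rule described above can be sketched in a few lines. This is illustrative only; real CTC decoding operates on per-frame logits, not token strings:

```python
# Minimal sketch of CTC's collapse rule: merge consecutive repeated
# predictions, then drop <blank> tokens. <blank> between two identical
# characters is what allows genuine doubles like "ll" to survive.

BLANK = "<blank>"

def ctc_collapse(frames):
    out = []
    prev = None
    for tok in frames:
        if tok != prev:          # merge consecutive repeats
            if tok != BLANK:     # drop blanks after merging
                out.append(tok)
        prev = tok
    return "".join(out)

# The blank between the two l's keeps both: -> "hello"
print(ctc_collapse(["h", "h", "e", "e", "l", BLANK, "l", "l", "o"]))
```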
Strengths:
- Fast: Parallel prediction (no autoregression)
- Efficient: Single forward pass
- Lightweight: Simple linear projection
- On-device friendly: Lower memory footprint
Limitations:
- No language conditioning
- No context example support
- Lower accuracy than LLM models
Implementation:
- Code: fairseq2/models/wav2vec2
- Configs: Custom training data configurations
LLM Models
Encoder-decoder architecture with LLaMA-based autoregressive decoder.
Audio Encoder
Wav2Vec2 encoder produces audio embeddings
- Output dimensions: 1024 (300M), 1280 (1B), 2048 (3B/7B)
Projection Layer
Linear projection to LLaMA decoder space
- Projects to 4096 dimensions
- Aligns audio with text representations
LLaMA Decoder
Autoregressive transformer decoder
- Dimensions: 4096
- Generates text tokens sequentially
- Supports language conditioning and context
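The dimension flow through the three stages above can be sketched in plain Python. The dimensions come from this page; the function and table names are illustrative, not the fairseq2 implementation:

```python
# Sketch of the LLM family's dimension flow: variant-specific encoder
# embeddings are linearly projected into the LLaMA decoder's 4096-dim
# space. Encoder widths below are from the doc; names are illustrative.

ENCODER_DIM = {"300M": 1024, "1B": 1280, "3B": 2048, "7B": 2048}
DECODER_DIM = 4096

def projected_shape(variant: str, num_frames: int) -> tuple:
    if variant not in ENCODER_DIM:
        raise ValueError(f"unknown variant: {variant}")
    # A linear projection maps (frames, encoder_dim) -> (frames, 4096),
    # aligning audio embeddings with the decoder's text representations.
    return (num_frames, DECODER_DIM)

print(projected_shape("1B", 200))  # (200, 4096)
```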
- LLM+LID (Language Conditioning)
- LLM+LID Unlimited Length
- LLM+ZS (Zero-Shot)
Language-aware models with optional language ID conditioning.
Training Strategy:
- 80% of samples include language ID tokens
- 20% of samples omit the language ID token
- Robust performance in both scenarios
Available Models:
- omniASR_LLM_300M_v2
- omniASR_LLM_1B_v2
- omniASR_LLM_3B_v2
- omniASR_LLM_7B_v2
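The 80/20 conditioning strategy amounts to randomly dropping the language ID token during training. A minimal sketch, assuming a simple prepend-the-LID tokenization (the token name `<eng>` and the helper are illustrative, not the actual data pipeline):

```python
import random

# Sketch of LID-conditioning dropout: ~80% of samples keep their
# language ID token, ~20% drop it, so the model learns to transcribe
# both with and without conditioning. Token names are assumptions.

def build_target(lang_id: str, text_tokens: list, keep_prob: float = 0.8):
    if random.random() < keep_prob:
        return [lang_id] + text_tokens  # conditioned sample
    return list(text_tokens)            # unconditioned sample

random.seed(0)
print(build_target("<eng>", ["h", "i"]))  # seed 0 happens to drop the LID
```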
The Wav2Vec2LlamaModel implementation performs input validation at every forward pass; auxiliary inputs are carried in the .example field of Seq2SeqBatch for flexibility.
Vocabulary:
- Sizes: 9812 / 9818 / 10288 tokens (variant-dependent)
- Coverage: 1600+ languages
- Special tokens: Language ID tokens, padding, unknown, etc.
Model Size Comparison
300M Parameters
Encoder:
- Embedding dim: 1024
- Transformer layers: 12
- Attention heads: 16
1B Parameters
Encoder:
- Embedding dim: 1280
- Transformer layers: 24
- Attention heads: 16
3B Parameters
Encoder:
- Embedding dim: 2048
- Transformer layers: 36
- Attention heads: 32
7B Parameters
Encoder:
- Embedding dim: 2048
- Transformer layers: 48
- Attention heads: 32
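The encoder hyperparameters above can be collected into a lookup table for quick comparison. This is an illustrative summary of the figures on this page, not a fairseq2 asset card:

```python
# Encoder hyperparameters from the size comparison above.
ENCODER_CONFIGS = {
    "300M": {"embed_dim": 1024, "layers": 12, "heads": 16},
    "1B":   {"embed_dim": 1280, "layers": 24, "heads": 16},
    "3B":   {"embed_dim": 2048, "layers": 36, "heads": 32},
    "7B":   {"embed_dim": 2048, "layers": 48, "heads": 32},
}

def head_dim(variant: str) -> int:
    # Per-head width: embedding dim split evenly across attention heads.
    cfg = ENCODER_CONFIGS[variant]
    return cfg["embed_dim"] // cfg["heads"]

print(head_dim("300M"))  # 64
```

Note that 3B and 7B share the same embedding width (2048) and differ mainly in depth (36 vs. 48 layers).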
Implementation Details
fairseq2 Integration
All models leverage fairseq2’s configuration system:
- W2V/CTC: fairseq2/models/wav2vec2
- LLM: wav2vec2_llama/model.py
- Configs: model asset cards
Syntax and Grammar
LLM models use special syntax for different input combinations. The grammar is defined in create_syntax() functions:
Syntax Examples
Model Selection Guide
Determine Your Use Case
- High throughput? → CTC models
- Best accuracy? → LLM models (3B/7B)
- Unseen languages? → Zero-shot (7B ZS)
- Long audio? → LLM Unlimited models
Consider Resource Constraints
- Mobile/Edge? → 300M CTC
- Server/Cloud? → 1B/3B/7B LLM
- GPU memory? → Smaller models for limited memory
Evaluate Language Requirements
- Single language? → CTC sufficient
- Multiple languages? → LLM with language codes
- Rare languages? → Zero-shot model
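The three-step guide above can be encoded as a simple rule cascade. The helper and its rule ordering are illustrative, not an official API:

```python
# Illustrative helper mirroring the selection guide: resource
# constraints first, then language coverage, then accuracy/length
# needs, defaulting to fast CTC inference.

def pick_model(*, on_device=False, unseen_language=False,
               long_audio=False, need_best_accuracy=False) -> str:
    if on_device:
        return "CTC 300M"        # lowest memory footprint
    if unseen_language:
        return "LLM 7B ZS"       # zero-shot coverage
    if long_audio:
        return "LLM Unlimited"   # no length limit
    if need_best_accuracy:
        return "LLM 3B/7B"       # best accuracy
    return "CTC"                 # default: single forward pass, high throughput

print(pick_model(unseen_language=True))  # LLM 7B ZS
```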
Next Steps
Inference Guide
Learn how to use these models for transcription
Training Guide
Fine-tune models on your own data
Research Paper
Read the full technical details
GitHub Repository
Explore the source code