Explore the hierarchical architecture design of Omnilingual ASR models, built on the Wav2Vec2 encoder foundation.

Architecture Overview

Omnilingual ASR models follow a hierarchical design with three main families (W2V, CTC, and LLM), all built on a shared Wav2Vec2 self-supervised encoder:
Wav2Vec2 Self-Supervised Learning
[Audio 16kHz] → Feature Extractor → Wav2Vec2 Encoder → [Audio Embeddings]
                (CNN ~320x)         (Transformer)      (1024/1280/2048-dim)
Foundation encoder producing rich contextualized audio representations.

Model Families

W2V Models

Self-supervised learning models that serve as the foundation for all other architectures.
# Input
raw_audio: Tensor  # Shape: [batch_size, audio_length], 16kHz waveform

# Output  
audio_embeddings: Tensor  # Shape: [batch_size, time_steps, embed_dim]
                          # embed_dim: 1024 (300M), 1280 (1B), 2048 (3B/7B)
Architecture Components:
1. Feature Extractor

CNN layers downsample raw audio by ~320x:
  • Input: 16kHz waveform
  • Output: Frame-level features (~50 fps)
  • Reduces sequence length for efficient processing
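The ~320x downsampling fixes how many encoder frames an utterance yields. A rough back-of-envelope sketch (exact counts depend on the CNN kernel and stride configuration, so this is an approximation):

```python
def approx_num_frames(num_samples: int, downsample: int = 320) -> int:
    """Approximate encoder frame count for a raw 16 kHz waveform."""
    # 16,000 samples/s divided by ~320x downsampling ≈ 50 frames/s
    return num_samples // downsample

one_second = 16_000  # samples at 16 kHz
print(approx_num_frames(one_second))       # → 50
print(approx_num_frames(30 * one_second))  # → 1500
```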
2. Transformer Encoder

Contextual encoding via multi-head self-attention:
  • Layers: 12 (300M), 24 (1B), 36 (3B), 48 (7B)
  • Attention heads: 16 (300M/1B), 32 (3B/7B)
  • Creates rich audio representations
Key Features:
  • Pre-trained with self-supervised learning
  • Contextual audio embeddings
  • Foundation for CTC and LLM variants
  • Ideal for building custom architectures
Use Cases:
  • Starting point for fine-tuning
  • Building custom ASR architectures
  • Audio feature extraction
  • Transfer learning applications

CTC Models

Non-autoregressive models using Connectionist Temporal Classification for parallel prediction.
# Input
raw_audio: Tensor  # Shape: [batch_size, audio_length], 16kHz

# Output
vocab_logits: Tensor  # Shape: [batch_size, time_steps, vocab_size]
                      # vocab_size: 9812 or 10288 tokens
Architecture:
Wav2Vec2 Encoder Output (1024/1280/2048-dim)

Linear Projection Layer

Vocabulary Logits (9812/10288 tokens)

CTC Alignment & Decoding

Final Transcription
CTC allows the model to emit a prediction for every frame, then collapses consecutive repeats and removes blanks:
Raw predictions:  [h, h, e, e, l, <blank>, l, l, o, o]
After collapse:   [h, e, l, <blank>, l, o]
Final output:     "hello"
Blank tokens (<blank>) separate genuinely repeated characters so they survive the collapse.
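The collapse-then-remove-blanks rule can be sketched in a few lines (a generic greedy CTC decoding illustration, not the fairseq2 decoder itself):

```python
def ctc_collapse(frames, blank="<blank>"):
    """Greedy CTC decoding: merge consecutive repeats, then drop blank tokens."""
    out, prev = [], None
    for tok in frames:
        if tok != prev:  # keep only the first token of each run
            out.append(tok)
        prev = tok
    return "".join(t for t in out if t != blank)

print(ctc_collapse(["h", "h", "e", "e", "l", "<blank>", "l", "l", "o", "o"]))  # → hello
```

Note how the blank between the two runs of `l` is what preserves the double letter in "hello".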
Vocabulary:
  • Size: 9812 or 10288 tokens depending on version
  • Coverage: 1600+ languages
  • Tokenization: Character-level or subword units
  • Special tokens: <pad>, <unk>, <blank>
Advantages:
  • Fast: Parallel prediction (no autoregression)
  • Efficient: Single forward pass
  • Lightweight: Simple linear projection
  • On-device friendly: Lower memory footprint
Limitations:
  • No language conditioning
  • No context example support
  • Lower accuracy than LLM models
Implementation: CTC models reuse fairseq2’s existing Wav2Vec2 ASR implementation with updated configurations (see fairseq2 Integration below).

LLM Models

Encoder-decoder architecture with a LLaMA-based autoregressive decoder.
# Standard Input
raw_audio: Tensor              # [batch, audio_length], 16kHz
lang_codes: List[str] | None   # Optional: ["eng_Latn", "deu_Latn", ...]

# Zero-Shot Input
raw_audio: Tensor
context_audio: List[Tensor]    # 10 context examples
context_text: List[str]        # Corresponding transcriptions

# Output
transcription: str             # Autoregressive beam search output
Architecture Components:
1. Audio Encoder

Wav2Vec2 encoder produces audio embeddings
  • Output dimensions: 1024 (300M), 1280 (1B), 2048 (3B/7B)
2. Projection Layer

Linear projection to LLaMA decoder space
  • Projects to 4096 dimensions
  • Aligns audio with text representations
3. LLaMA Decoder

Autoregressive transformer decoder
  • Dimensions: 4096
  • Generates text tokens sequentially
  • Supports language conditioning and context
4. Beam Search

Decodes output logits to final transcription
  • Configurable beam width
  • Length normalization
  • Diverse beam search options
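A toy illustration of how beam search keeps the top-k partial hypotheses at each decoding step (a generic sketch, not the actual fairseq2 decoder; real decoding also handles end-of-sequence tokens and length normalization):

```python
import math

def beam_search(step_logprobs, beam_width=2):
    """Keep the top-`beam_width` partial hypotheses at every step.
    step_logprobs: one {token: log_prob} dict per decoding step."""
    beams = [([], 0.0)]  # (token sequence, cumulative log-probability)
    for dist in step_logprobs:
        candidates = [
            (seq + [tok], score + lp)
            for seq, score in beams
            for tok, lp in dist.items()
        ]
        # prune to the highest-scoring hypotheses
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams[0]  # best (sequence, score)

steps = [
    {"he": math.log(0.6), "the": math.log(0.4)},
    {"llo": math.log(0.7), "lp": math.log(0.3)},
]
seq, score = beam_search(steps)
print("".join(seq))  # → hello
```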
Model Variants:
Language-aware models with optional language ID conditioning.

Training Strategy:
  • 80% samples with language ID tokens
  • 20% samples without language ID
  • Robust performance in both scenarios
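That 80/20 split amounts to random language-token dropout when batches are built. A hypothetical sketch (the function name and batching details are illustrative, not the training code):

```python
import random

def maybe_prepend_lang_token(tokens, lang_code, drop_prob=0.2, rng=random):
    """Prepend <lang:...> to ~80% of samples; leave ~20% unconditioned."""
    if rng.random() < drop_prob:
        return list(tokens)  # ~20% of samples: no language ID token
    return [f"<lang:{lang_code}>"] + list(tokens)  # ~80%: conditioned

# drop_prob=0.0 forces conditioning; drop_prob=1.0 forces the unconditioned path
print(maybe_prepend_lang_token(["<audio>"], "eng_Latn", drop_prob=0.0))
```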
Input Format:
# With language conditioning
[<lang:eng_Latn>] + audio_embeddings → decoder → "hello world"

# Without language conditioning  
audio_embeddings → decoder → "hello world"
Available Models:
  • omniASR_LLM_300M_v2
  • omniASR_LLM_1B_v2
  • omniASR_LLM_3B_v2
  • omniASR_LLM_7B_v2
Input Validation: The Wav2Vec2LlamaModel implementation performs input validation at every forward pass:
# Validation via ensure_valid_forward_inputs()

if has_language_tokens and has_context:
    raise ValueError("Cannot use both language ID and context")
    
if has_context and len(context_examples) != 10:
    raise ValueError("Zero-shot requires exactly 10 context examples")
Additional inputs are encoded in the .example field of Seq2SeqBatch for flexibility.

Vocabulary:
  • Sizes: 9812 / 9818 / 10288 tokens (variant-dependent)
  • Coverage: 1600+ languages
  • Special tokens: Language ID tokens, padding, unknown, etc.

Model Size Comparison

300M Parameters

Encoder:
  • Embedding dim: 1024
  • Transformer layers: 12
  • Attention heads: 16
Best for: Resource-constrained environments, mobile, edge devices

1B Parameters

Encoder:
  • Embedding dim: 1280
  • Transformer layers: 24
  • Attention heads: 16
Best for: Balanced accuracy/efficiency, general-purpose ASR

3B Parameters

Encoder:
  • Embedding dim: 2048
  • Transformer layers: 36
  • Attention heads: 32
Best for: High-accuracy requirements, multilingual scenarios

7B Parameters

Encoder:
  • Embedding dim: 2048
  • Transformer layers: 48
  • Attention heads: 32
Best for: Maximum accuracy, research, zero-shot learning
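The four encoder configurations above can be collected into a small lookup table (values transcribed from this page), which is handy when choosing a checkpoint programmatically:

```python
ENCODER_CONFIGS = {
    "300M": {"embed_dim": 1024, "layers": 12, "heads": 16},
    "1B":   {"embed_dim": 1280, "layers": 24, "heads": 16},
    "3B":   {"embed_dim": 2048, "layers": 36, "heads": 32},
    "7B":   {"embed_dim": 2048, "layers": 48, "heads": 32},
}

def head_dim(size: str) -> int:
    """Per-head dimension: embedding dim split evenly across attention heads."""
    cfg = ENCODER_CONFIGS[size]
    return cfg["embed_dim"] // cfg["heads"]

print(head_dim("1B"))  # → 80
```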

Implementation Details

fairseq2 Integration

All models leverage fairseq2’s configuration system:
# Existing fairseq2 implementations
from fairseq2.models.wav2vec2 import Wav2Vec2Model
from fairseq2.models.wav2vec2.asr import Wav2Vec2AsrModel

# With custom configs for Omnilingual training
model = load_model("omniASR_CTC_1B_v2")

Syntax and Grammar

LLM models use special syntax for different input combinations. The grammar is defined in create_syntax() functions:
# Language-conditioned syntax
<lang:eng_Latn> <audio_tokens...> <text_tokens...>

# Zero-shot syntax with context
<context_1_audio> <context_1_text> 
<context_2_audio> <context_2_text>
# ... (10 context pairs)
<target_audio> <text_tokens...>
See model.py for full implementation.
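The two grammars can be pictured as a single assembly function. This is a hypothetical sketch: create_syntax() in model.py is the authoritative implementation, and real inputs are token IDs rather than strings.

```python
def build_decoder_input(audio_tokens, lang_code=None, context=None):
    """Assemble the decoder input for language-conditioned or zero-shot decoding.
    context: optional list of (context_audio_tokens, context_text_tokens) pairs."""
    if lang_code is not None and context is not None:
        raise ValueError("Cannot use both language ID and context")
    seq = []
    if context is not None:
        for ctx_audio, ctx_text in context:  # zero-shot mode: 10 context pairs
            seq += ctx_audio + ctx_text
    elif lang_code is not None:
        seq.append(f"<lang:{lang_code}>")  # language-conditioned mode
    return seq + audio_tokens

print(build_decoder_input(["<a1>", "<a2>"], lang_code="eng_Latn"))
```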

Model Selection Guide

1. Determine Your Use Case

  • High throughput? → CTC models
  • Best accuracy? → LLM models (3B/7B)
  • Unseen languages? → Zero-shot (7B ZS)
  • Long audio? → LLM Unlimited models
2. Consider Resource Constraints

  • Mobile/Edge? → 300M CTC
  • Server/Cloud? → 1B/3B/7B LLM
  • GPU memory? → Smaller models for limited memory
3. Evaluate Language Requirements

  • Single language? → CTC sufficient
  • Multiple languages? → LLM with language codes
  • Rare languages? → Zero-shot model
4. Test and Optimize

  • Start with 1B model as baseline
  • Compare CTC vs LLM accuracy
  • Scale up/down based on results
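The four steps above can be condensed into a toy selection helper. A sketch only: the checkpoint names follow the naming patterns on this page, but omniASR_LLM_7B_ZS and omniASR_CTC_300M_v2 are assumed IDs that should be verified against the actual model registry.

```python
def suggest_model(need_accuracy=False, unseen_language=False, on_device=False):
    """Condensed, simplified version of the model selection guide."""
    if unseen_language:
        return "omniASR_LLM_7B_ZS"    # zero-shot for unseen/rare languages (assumed ID)
    if on_device:
        return "omniASR_CTC_300M_v2"  # lightweight CTC for mobile/edge (assumed ID)
    if need_accuracy:
        return "omniASR_LLM_7B_v2"    # maximum accuracy
    return "omniASR_LLM_1B_v2"        # balanced baseline to start from
```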

Next Steps

  • Inference Guide: learn how to use these models for transcription
  • Training Guide: fine-tune models on your own data
  • Research Paper: read the full technical details
  • GitHub Repository: explore the source code
