
Complete Model Comparison

Omnilingual ASR offers 25 models across four families: W2V (self-supervised), CTC (parallel ASR), LLM (autoregressive ASR), and Zero-Shot models.
All VRAM and speed metrics measured on A100 GPU with BF16 precision, batch size 1, and 30-second audio (unless noted otherwise).

Model Families Overview

W2V Models

Self-Supervised Learning (SSL). Pre-trained audio encoders that produce contextualized embeddings, useful as starting points for custom architectures.
  • 4 sizes: 300M, 1B, 3B, 7B
  • No direct transcription
  • Foundation for CTC/LLM models

CTC Models

Parallel ASR. High-speed speech recognition with parallel generation; ideal for production deployments that require throughput.
  • 4 sizes × 2 versions = 8 models
  • 167x-1000x faster than real-time
  • No language conditioning

LLM Models

Autoregressive ASR. State-of-the-art accuracy with language conditioning; available in Standard and Unlimited length variants.
  • 4 sizes × 3 variants = 12 models
  • Optional language conditioning
  • Unlimited length support (v2)

Zero-Shot

In-Context Learning. Transcribes unseen languages using 1-10 audio-text example pairs.
  • 1 model (7B)
  • Requires context examples
  • Ideal for low-resource languages

Complete Specifications Table

W2V Models (Self-Supervised)

| Model | Parameters | Download | Features | Embedding Dim |
|---|---|---|---|---|
| omniASR_W2V_300M | 317,390,592 | 1.2 GiB | SSL | 1024 |
| omniASR_W2V_1B | 965,514,752 | 3.6 GiB | SSL | 1280 |
| omniASR_W2V_3B | 3,064,124,672 | 12.0 GiB | SSL | 2048 |
| omniASR_W2V_7B | 6,488,487,168 | 25.0 GiB | SSL | 2048 |
Input: Raw audio waveform (16kHz)
Output: Contextualized audio embeddings
  • 300M: 1024-dimensional vectors
  • 1B: 1280-dimensional vectors
  • 3B/7B: 2048-dimensional vectors
Use Case: Building custom architectures, transfer learning, feature extraction
Not Recommended For: Direct transcription (use CTC or LLM models instead)
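As a rough worked example of what to expect from these encoders: given the ~320x downsampling described under Technical Details (16 kHz input → ~50 Hz frame rate) and the embedding widths in the table above, the output shape of a clip can be estimated. This is an illustrative sketch, not library code:

```python
# Rough estimate of a W2V model's output shape: the encoder emits
# ~50 frames per second of audio (16 kHz input, ~320x downsampling),
# with the embedding width depending on model size (table above).
EMBED_DIM = {"300M": 1024, "1B": 1280, "3B": 2048, "7B": 2048}
FRAME_RATE_HZ = 50  # 16000 Hz / ~320x downsampling

def w2v_output_shape(duration_s: float, size: str) -> tuple[int, int]:
    """Approximate (num_frames, embedding_dim) for a W2V model."""
    return int(duration_s * FRAME_RATE_HZ), EMBED_DIM[size]

print(w2v_output_shape(30, "300M"))  # → (1500, 1024)
```

So a 30-second clip through the 300M encoder yields roughly 1500 frames of 1024-dimensional vectors.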

CTC Models (Parallel ASR)

Version 1 Models

| Model | Parameters | Download | VRAM | RTF | Speed | Vocab |
|---|---|---|---|---|---|---|
| omniASR_CTC_300M | 325,494,996 | 1.3 GiB | ~2 GiB | 0.001 | 96x | 9,812 |
| omniASR_CTC_1B | 975,065,300 | 3.7 GiB | ~3 GiB | 0.002 | 48x | 9,812 |
| omniASR_CTC_3B | 3,080,423,636 | 12.0 GiB | ~8 GiB | 0.003 | 32x | 9,812 |
| omniASR_CTC_7B | 6,504,786,132 | 25.0 GiB | ~15 GiB | 0.006 | 16x | 9,812 |

Version 2 Models (Improved CER)

| Model | Parameters | Download | VRAM | RTF | Speed | Vocab |
|---|---|---|---|---|---|---|
| omniASR_CTC_300M_v2 | 325,494,996 | 1.3 GiB | ~2 GiB | 0.001 | 96x | 10,288 |
| omniASR_CTC_1B_v2 | 975,065,300 | 3.7 GiB | ~3 GiB | 0.002 | 48x | 10,288 |
| omniASR_CTC_3B_v2 | 3,080,423,636 | 12.0 GiB | ~8 GiB | 0.003 | 32x | 10,288 |
| omniASR_CTC_7B_v2 | 6,504,786,132 | 25.0 GiB | ~15 GiB | 0.006 | 16x | 10,288 |
Features: Parallel generation, non-autoregressive decoding
Tokenizers:
  • v1 models: omniASR_tokenizer_v1 (9,812 tokens)
  • v2 models: omniASR_tokenizer_written_v2 (10,288 tokens)
Max Audio Length: 40 seconds
Language Conditioning: Not supported (parameter ignored)
Best For: High-throughput production, on-device deployment, batch processing
v2 Improvements: Better character error rates (CER), expanded vocabulary

LLM Models (Autoregressive ASR)

Standard LLM - Version 1

| Model | Parameters | Download | VRAM | RTF | Vocab | Max Audio |
|---|---|---|---|---|---|---|
| omniASR_LLM_300M | 1,627,603,584 | 6.1 GiB | ~5 GiB | 0.090 | 9,812 | 40s |
| omniASR_LLM_1B | 2,275,710,592 | 8.5 GiB | ~6 GiB | 0.091 | 9,812 | 40s |
| omniASR_LLM_3B | 4,376,679,040 | 17.0 GiB | ~10 GiB | 0.093 | 9,812 | 40s |
| omniASR_LLM_7B | 7,801,041,536 | 30.0 GiB | ~17 GiB | 0.092 | 9,818 | 40s |

Standard LLM - Version 2 (Improved CER)

| Model | Parameters | Download | VRAM | RTF | Vocab | Max Audio |
|---|---|---|---|---|---|---|
| omniASR_LLM_300M_v2 | 1,627,603,584 | 6.1 GiB | ~5 GiB | 0.090 | 10,288 | 40s |
| omniASR_LLM_1B_v2 | 2,275,710,592 | 8.5 GiB | ~6 GiB | 0.091 | 10,288 | 40s |
| omniASR_LLM_3B_v2 | 4,376,679,040 | 17.0 GiB | ~10 GiB | 0.093 | 10,288 | 40s |
| omniASR_LLM_7B_v2 | 7,801,041,536 | 30.0 GiB | ~17 GiB | 0.092 | 10,288 | 40s |

Unlimited Length LLM - Version 2

| Model | Parameters | Download | VRAM | RTF (30s) | RTF (15min) | Max Audio |
|---|---|---|---|---|---|---|
| omniASR_LLM_Unlimited_300M_v2 | 1,627,603,584 | 6.1 GiB | ~5 GiB | 0.092 | 0.206 | Unlimited |
| omniASR_LLM_Unlimited_1B_v2 | 2,275,710,592 | 8.5 GiB | ~6 GiB | 0.097 | 0.207 | Unlimited |
| omniASR_LLM_Unlimited_3B_v2 | 4,376,679,040 | 17.0 GiB | ~10 GiB | 0.095 | 0.208 | Unlimited |
| omniASR_LLM_Unlimited_7B_v2 | 7,801,041,536 | 30.0 GiB | ~17 GiB | 0.097 | 0.208 | Unlimited |
Features:
  • Optional language conditioning (80/20 training split with/without)
  • Autoregressive beam search decoding
  • State-of-the-art accuracy
Tokenizers:
  • v1 models (300M/1B/3B): omniASR_tokenizer_v1
  • v1 model (7B): omniASR_tokenizer_v1_variant7
  • v2 models: omniASR_tokenizer_written_v2
Unlimited Length Models:
  • Segment size: 15 seconds
  • Context window: 1 previous segment
  • Accuracy comparable to standard LLM models
  • Fine-tuning not currently supported
  • Can be extended for streaming applications
Best For:
  • Standard: Maximum accuracy, known languages, audio under 40s
  • Unlimited: Long-form content (podcasts, lectures, meetings)
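The segmented decoding scheme above (15-second segments, one previous segment as context) can be sketched in a few lines. `segment_with_context` is an illustrative helper, not part of the released pipeline:

```python
# Illustrative sketch of how the Unlimited variants chunk long audio:
# fixed 15 s segments, each decoded with the previous segment as context.
SEGMENT_S = 15
SAMPLE_RATE = 16_000

def segment_with_context(num_samples: int) -> list[tuple[int, int, int]]:
    """Return (context_start, segment_start, segment_end) sample indices."""
    seg_len = SEGMENT_S * SAMPLE_RATE
    spans = []
    for start in range(0, num_samples, seg_len):
        context_start = max(0, start - seg_len)  # 1 previous segment
        spans.append((context_start, start, min(start + seg_len, num_samples)))
    return spans

# A 40 s clip yields three segments: 0-15 s, 15-30 s, and 30-40 s,
# each carrying up to 15 s of preceding audio as context.
print(segment_with_context(40 * SAMPLE_RATE))
```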

Zero-Shot Model

| Model | Parameters | Download | VRAM | RTF | Vocab | Context Required |
|---|---|---|---|---|---|---|
| omniASR_LLM_7B_ZS | 7,810,900,608 | 30.0 GiB | ~20 GiB | 0.194 | 9,812 | 1-10 examples |
Features: In-context learning with audio-text example pairs
Tokenizer: omniASR_tokenizer_v1
Context Examples:
  • Minimum: 1 example (repeated to 10)
  • Maximum: 10 examples
  • Recommended: 5-10 diverse examples
  • Max length per example: 30 seconds
Target Audio: Up to 60 seconds (40s recommended)
Use Case: Unseen languages, low-resource scenarios, domain adaptation
Limitations: Higher VRAM usage, slower inference, requires context
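Following the constraints above (minimum 1 example, repeated to 10; maximum 10), preparing a context list might look like the following hypothetical helper; the repeat-to-10 behavior is only described for the single-example case:

```python
def prepare_context(examples: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Validate a list of (audio_path, transcript) context pairs.

    A single example is repeated to 10, per the model card above;
    more than 10 examples is an error. Illustrative helper only.
    """
    if not examples:
        raise ValueError("Zero-shot inference needs at least 1 context example")
    if len(examples) > 10:
        raise ValueError("At most 10 context examples are supported")
    if len(examples) == 1:
        return examples * 10
    return examples

print(len(prepare_context([("clip.wav", "hello")])))  # → 10
```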

Tokenizers

| Tokenizer | Size | Used By | Vocab Size |
|---|---|---|---|
| omniASR_tokenizer_v1 | 100 KiB | W2V, CTC v1, LLM v1 (300M/1B/3B), ZS | 9,812 |
| omniASR_tokenizer_v1_variant7 | 100 KiB | LLM v1 (7B only) | 9,818 |
| omniASR_tokenizer_written_v2 | 100 KiB | CTC v2, LLM v2, LLM Unlimited v2 | 10,288 |

Performance Metrics

Speed Comparison (Real-Time Factor)

RTF (Real-Time Factor): Time to process 1 second of audio. Lower is faster.
  • RTF = 0.001: 1000x faster than real-time (1s audio in 0.001s)
  • RTF = 1.0: Real-time processing (1s audio in 1s)
  • RTF = 0.092: ~11x faster than real-time (1s audio in 0.092s)
| Model Family | RTF Range | Speed vs Real-Time | Relative to LLM 7B |
|---|---|---|---|
| CTC 300M | 0.001 | 1000x faster | 96x faster |
| CTC 1B | 0.002 | 500x faster | 48x faster |
| CTC 3B | 0.003 | 333x faster | 32x faster |
| CTC 7B | 0.006 | 167x faster | 16x faster |
| LLM (all) | 0.090-0.093 | ~11x faster | 1x (baseline) |
| LLM Unlimited (30s) | 0.092-0.097 | ~11x faster | ~1x |
| LLM Unlimited (15min) | 0.206-0.208 | ~5x faster | ~0.5x |
| Zero-Shot | 0.194 | ~5x faster | ~0.5x |
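RTF converts directly to a real-time speed-up factor (speed = 1 / RTF), so the "Speed vs Real-Time" column can be reproduced from the RTF values:

```python
def rtf_to_speedup(rtf: float) -> float:
    """Speed relative to real-time: RTF 0.001 → 1000x faster."""
    return 1.0 / rtf

print(round(rtf_to_speedup(0.001)))  # → 1000 (CTC 300M)
print(round(rtf_to_speedup(0.006)))  # → 167  (CTC 7B)
print(round(rtf_to_speedup(0.092)))  # → 11   (LLM 7B)
```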

VRAM Requirements (BF16)

| Size | CTC | LLM Standard | LLM Unlimited | Zero-Shot |
|---|---|---|---|---|
| 300M | 2 GiB | 5 GiB | 5 GiB | - |
| 1B | 3 GiB | 6 GiB | 6 GiB | - |
| 3B | 8 GiB | 10 GiB | 10 GiB | - |
| 7B | 15 GiB | 17 GiB | 17 GiB | 20 GiB |
VRAM Scaling: These values are for batch_size=1 with 30-second audio. Larger batches and longer audio will require proportionally more memory.
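Taking the scaling note above at face value (memory growing roughly proportionally with batch size and audio length), a crude planning estimate, not a measured figure, would be:

```python
def estimate_vram_gib(base_gib: float, batch_size: int = 1,
                      audio_s: float = 30.0) -> float:
    """Very rough VRAM estimate, assuming linear scaling from the
    batch_size=1, 30-second baselines in the table above. Actual usage
    also depends on padding and framework overhead; measure to be sure."""
    return base_gib * batch_size * (audio_s / 30.0)

print(estimate_vram_gib(17, batch_size=2))  # → 34.0 (7B LLM, batch of 2)
```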

Accuracy Performance

The 7B LLM model achieves:
  • Character Error Rate (CER) < 10% for 78% of 1,600+ languages
  • State-of-the-art multilingual ASR performance
  • Improved results with language conditioning
See per-language CER results for detailed metrics.

Model Selection Guide

By Use Case

High Throughput: omniASR_CTC_7B_v2
  • ~167x faster than real-time (16x faster than the 7B LLM)
  • Best CTC accuracy
  • Parallel processing
Balanced: omniASR_LLM_1B_v2
  • Good accuracy
  • Moderate VRAM (6 GiB)
  • Language conditioning
Maximum Accuracy: omniASR_LLM_7B_v2
  • State-of-the-art CER
  • Full language support
  • Requires 17 GiB VRAM

By Available Resources

| VRAM Available | Recommended Model | Use Case |
|---|---|---|
| < 4 GiB | omniASR_CTC_300M_v2 | Edge deployment |
| 4-8 GiB | omniASR_CTC_1B_v2 or omniASR_LLM_300M_v2 | Consumer GPUs |
| 8-12 GiB | omniASR_CTC_3B_v2 or omniASR_LLM_1B_v2 | Mid-range production |
| 12-16 GiB | omniASR_LLM_3B_v2 | High-quality production |
| 16-20 GiB | omniASR_LLM_7B_v2 or omniASR_LLM_Unlimited_7B_v2 | Maximum accuracy |
| 20+ GiB | omniASR_LLM_7B_ZS | Zero-shot learning |
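The table above can be encoded as a simple lookup for scripted model selection. The thresholds mirror the table; the helper itself is illustrative, not part of the library:

```python
# Illustrative model picker mirroring the resource table (thresholds in GiB).
RECOMMENDATIONS = [
    (4,  ["omniASR_CTC_300M_v2"]),
    (8,  ["omniASR_CTC_1B_v2", "omniASR_LLM_300M_v2"]),
    (12, ["omniASR_CTC_3B_v2", "omniASR_LLM_1B_v2"]),
    (16, ["omniASR_LLM_3B_v2"]),
    (20, ["omniASR_LLM_7B_v2", "omniASR_LLM_Unlimited_7B_v2"]),
]

def recommend(vram_gib: float) -> list[str]:
    for ceiling, models in RECOMMENDATIONS:
        if vram_gib < ceiling:
            return models
    return ["omniASR_LLM_7B_ZS"]  # 20+ GiB: zero-shot also fits

print(recommend(10))  # → ['omniASR_CTC_3B_v2', 'omniASR_LLM_1B_v2']
```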

Model Download & Storage

Automatic Download

from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

# Models automatically downloaded on first use
pipeline = ASRInferencePipeline(model_card="omniASR_LLM_7B_v2")

Storage Location

All models and tokenizers are cached in:
~/.cache/fairseq2/assets/
See fairseq2 asset store documentation for details.

Manual Download

Direct download links provided in the specification tables above. Example:
# Download model
wget https://dl.fbaipublicfiles.com/mms/omniASR-LLM-7B-v2.pt

# Download tokenizer
wget https://dl.fbaipublicfiles.com/mms/omniASR_tokenizer_written_v2.model

Version History

December 2025 Update (v2 Models)

New Models:
  • CTC v2: Improved character error rates
  • LLM v2: Better accuracy across all sizes
  • LLM Unlimited v2: Support for unlimited audio length
Key Improvements:
  • Expanded vocabulary (10,288 tokens vs 9,812)
  • New tokenizer: omniASR_tokenizer_written_v2
  • Updated training data and procedures
  • Segmented processing for long audio (Unlimited variants)
Limitations:
  • Unlimited models: Fine-tuning recipes not yet supported
  • Unlimited models: Not described in original research paper

Original Release

  • W2V models (4 sizes)
  • CTC v1 models (4 sizes)
  • LLM v1 models (4 sizes)
  • Zero-Shot model (1 model)

Technical Details

Architecture Components

Feature Extractor:
  • CNN-based architecture
  • Downsampling: ~320x (16kHz → 50Hz)
  • Output: Frame-level features
Transformer Encoder:
  • Sizes: 12/24/36/48 layers (300M/1B/3B/7B)
  • Dimensions: 1024/1280/2048/2048
  • Self-attention with positional encoding
  • Output: Contextualized audio embeddings

Input Preprocessing

All models use the same preprocessing pipeline:
  1. Audio Decoding: WAV, FLAC, MP3, etc. → raw waveform
  2. Resampling: Any sample rate → 16kHz
  3. Channel Mixing: Stereo/multi-channel → mono
  4. Normalization: Amplitude normalization
  5. Length Validation: Check max length constraints
# From pipeline.py preprocessing
- Decode audio (if file/bytes)
- Resample to 16kHz
- Convert to mono
- Normalize amplitude
- Validate max length (40s for CTC/LLM, 60s for ZS)
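The steps above can be sketched in pure Python (decoding and resampling omitted; the real pipeline handles those with audio libraries). The peak-normalization choice and the `preprocess` helper are illustrative assumptions, and `MAX_LEN_S` here assumes a CTC/LLM model:

```python
# Minimal sketch of the shared preprocessing steps (illustrative only).
MAX_LEN_S = 40          # 40 s for CTC/LLM models, 60 s for zero-shot
SAMPLE_RATE = 16_000

def preprocess(channels: list[list[float]]) -> list[float]:
    """Mix to mono, normalize amplitude (sketched here as peak
    normalization), and validate the maximum length."""
    n = len(channels[0])
    mono = [sum(ch[i] for ch in channels) / len(channels) for i in range(n)]
    peak = max(abs(s) for s in mono) or 1.0
    mono = [s / peak for s in mono]
    if n > MAX_LEN_S * SAMPLE_RATE:
        raise ValueError(f"Audio longer than {MAX_LEN_S}s is not supported")
    return mono

stereo = [[0.5, -0.25, 0.0], [0.5, 0.25, 0.0]]
print(preprocess(stereo))  # → [1.0, 0.0, 0.0]
```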

Common Limitations

  • CTC/LLM Standard: 40 seconds maximum
  • LLM Unlimited: No limit (processes in 15s segments)
  • Zero-Shot: 60 seconds max (30s recommended for context)
Workaround: Split longer audio or use Unlimited models
Models output spoken-form text without punctuation or capitalization. Workaround: use third-party punctuation restoration libraries such as deepmultilingualpunctuation.
While 1,600+ languages are supported, some rare scripts may have limited training data. Workaround: use the zero-shot model with examples in the target script.
Unlimited models support segmented processing, but the inference pipeline does not expose a streaming API. Future: the underlying checkpoints can be extended for streaming applications.

Quick Reference

Model Naming Convention

omniASR_{FAMILY}_{SIZE}_{VARIANT}

FAMILY: W2V | CTC | LLM | LLM_Unlimited
SIZE:   300M | 1B | 3B | 7B
VARIANT: (none) | v2 | ZS

Examples:
- omniASR_CTC_3B_v2
- omniASR_LLM_Unlimited_7B_v2
- omniASR_LLM_7B_ZS
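The convention above is regular enough to parse mechanically. This snippet is illustrative and not part of the library:

```python
import re

# Parses names following the convention above into (family, size, variant).
NAME_RE = re.compile(r"^omniASR_(?P<family>.+)_(?P<size>300M|1B|3B|7B)"
                     r"(?:_(?P<variant>v2|ZS))?$")

def parse_model_name(name: str):
    m = NAME_RE.match(name)
    if not m:
        raise ValueError(f"Not an omniASR model name: {name}")
    return m["family"], m["size"], m["variant"]

print(parse_model_name("omniASR_LLM_Unlimited_7B_v2"))
# → ('LLM_Unlimited', '7B', 'v2')
```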

Key Metrics Summary

| Metric | CTC 300M | CTC 7B | LLM 1B | LLM 7B | LLM Unl. 7B | ZS 7B |
|---|---|---|---|---|---|---|
| Params | 325M | 6.5B | 2.3B | 7.8B | 7.8B | 7.8B |
| VRAM | 2 GiB | 15 GiB | 6 GiB | 17 GiB | 17 GiB | 20 GiB |
| Speed (vs LLM 7B) | 96x | 16x | 1x | 1x | 1x (0.5x long) | 0.5x |
| Max Audio | 40s | 40s | 40s | 40s | Unlimited | 60s |
| Lang Cond. | No | No | Yes | Yes | Yes | Via context |

Next Steps

CTC Models

Detailed guide to parallel ASR models

LLM Models

Autoregressive models with language conditioning

Zero-Shot

In-context learning for new languages

Inference Guide

Start transcribing with our models
