[Header image: collage of on-the-ground photos from the transcription gathering efforts in Pakistan and Liberia]

Welcome to Omnilingual ASR

Omnilingual ASR is an open-source speech recognition system supporting over 1,600 languages — including hundreds never previously covered by any ASR technology. Designed for broad accessibility, it enables new languages to be added with just a few paired examples without requiring specialized expertise or large datasets. By combining scalable zero-shot learning with a flexible model family, Omnilingual ASR aims to make speech technology more inclusive and adaptable for communities and researchers worldwide.

Quick Start

Transcribe audio in minutes with simple Python code

Installation

Install via pip or uv and set up your environment

Supported Languages

Explore the 1,600+ supported languages and the language ID format reference
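Language IDs pair a language code with a script tag (for example, eng_Latn for English in Latin script). A small validation helper might look like the sketch below; the exact ID shape is an assumption based on common multilingual-model conventions, so check the supported-languages reference for the authoritative list.

```python
import re

# Assumed shape: <iso639-3>_<Script>, e.g. "eng_Latn".
# This pattern is illustrative, not the library's own validator.
LANG_ID_RE = re.compile(r"^[a-z]{3}_[A-Z][a-z]{3}$")

def is_valid_lang_id(lang_id: str) -> bool:
    """Return True if lang_id matches the <code>_<Script> shape."""
    return bool(LANG_ID_RE.match(lang_id))

print(is_valid_lang_id("eng_Latn"))  # True
print(is_valid_lang_id("english"))   # False
```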

Model Architecture

Learn about the SSL (wav2vec2), CTC, and LLM model families

Key Features

State-of-the-art performance across 1,600+ languages with character error rates (CER) below 10% for 78% of those languages. Includes hundreds of languages never previously covered by ASR technology.
Choose from three model architectures:
  • CTC Models: Fast parallel generation (hundreds of times faster than real time)
  • LLM Models: Language-conditioned transcription with optional context
  • Zero-Shot Models: Transcribe new languages with just a few examples
Available in 300M, 1B, 3B, and 7B parameter variants to balance accuracy and computational resources.
Add support for new languages with just a few paired audio-text examples using the zero-shot model variant.
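The December 2025 release documents the naming pattern omniASR_LLM_Unlimited_{300M,1B,3B,7B}_v2 for unlimited-length LLM variants. A small helper can assemble such names from a family and size; note that names produced for other combinations are extrapolations from that one documented pattern, not a confirmed model list.

```python
# Sizes and families taken from this page; other generated names are
# assumptions extrapolated from omniASR_LLM_Unlimited_{size}_v2.
SIZES = {"300M", "1B", "3B", "7B"}
FAMILIES = {"CTC", "LLM"}

def model_name(family: str, size: str, unlimited: bool = False,
               version: int = 2) -> str:
    """Build a model card name from the documented naming pattern."""
    if family not in FAMILIES or size not in SIZES:
        raise ValueError(f"unknown family/size: {family}/{size}")
    parts = ["omniASR", family]
    if unlimited:
        parts.append("Unlimited")
    parts += [size, f"v{version}"]
    return "_".join(parts)

print(model_name("LLM", "7B", unlimited=True))
# omniASR_LLM_Unlimited_7B_v2
```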

Performance Highlights

Our 7B-LLM-ASR system achieves state-of-the-art performance across 1,600+ languages, with character error rates (CER) below 10% for 78% of those languages.
Per-language CER results and training hours can be found in the per_language_results_table_7B_llm_asr.csv file.
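The per-language CSV can be inspected with the standard library. The sketch below uses made-up illustrative rows and assumed column names ("language", "cer", "train_hours"); read the real file's header for the actual schema.

```python
import csv
import io

# Hypothetical rows mimicking per_language_results_table_7B_llm_asr.csv.
# Column names and values are illustrative assumptions only.
sample = io.StringIO(
    "language,cer,train_hours\n"
    "lang_a,3.2,1000.0\n"
    "lang_b,12.5,4.0\n"
)

rows = list(csv.DictReader(sample))
# Count languages under the 10% CER threshold the docs cite.
under_10 = [r["language"] for r in rows if float(r["cer"]) < 10.0]
print(under_10)  # ['lang_a']
```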

Model Architecture Overview

Omnilingual ASR provides three distinct model families:

SSL (Self-Supervised Learning) Models

Pre-trained wav2vec2 encoders that learn speech representations without transcriptions. These serve as the foundation for both CTC and LLM models.

CTC (Connectionist Temporal Classification) Models

Fast, parallel transcription models ideal for real-time applications:
  • Real-Time Factor: 0.001-0.006 (roughly 170x to 1000x faster than real time)
  • Use Case: High-throughput batch processing
  • Limitation: No language conditioning support
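Real-time factor (RTF) is processing time divided by audio duration, so values below 1.0 mean faster than real time and the speedup is simply the inverse. The arithmetic below makes that relationship concrete for the RTF range quoted above.

```python
def speedup_from_rtf(rtf: float) -> float:
    """RTF = processing_time / audio_duration; speedup is its inverse."""
    if rtf <= 0:
        raise ValueError("RTF must be positive")
    return 1.0 / rtf

# The CTC range quoted above:
print(round(speedup_from_rtf(0.006)))  # 167
print(round(speedup_from_rtf(0.001)))  # 1000
```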

LLM (Language Model) Models

Autoregressive models with optional language conditioning:
  • Real-Time Factor: ~0.09 (about 11x faster than real time)
  • Language Conditioning: Specify target language for better accuracy
  • Unlimited Variants: Support for audio of any length
  • Zero-Shot Variant: Learn new languages from context examples
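The zero-shot variant conditions on a handful of paired audio/transcription examples in the target language. The sketch below only illustrates the shape of that few-shot context; the class and field names are hypothetical, and the real pipeline's API may differ.

```python
from dataclasses import dataclass

# Hypothetical container for one paired example; not the library's API.
@dataclass
class PairedExample:
    audio_path: str
    transcription: str

# A few paired examples in the new language form the few-shot context.
context = [
    PairedExample("sample1.wav", "first reference transcription"),
    PairedExample("sample2.wav", "second reference transcription"),
]

# A zero-shot call would pass this context alongside the target audio,
# letting the model infer the language's writing conventions from it.
print(len(context))  # 2
```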

December 2025 Update

We released two major improvements:
  1. Improved v2 Models: Enhanced accuracy (CER) for both CTC and LLM-ASR models
  2. Unlimited Audio Length: New LLM variant supporting transcription of unlimited-length audio (omniASR_LLM_Unlimited_{300M,1B,3B,7B}_v2)
Unlimited audio length models have accuracy comparable to the limited-length models; however, fine-tuning recipes for these models are not currently supported.

Resources

Research Paper

Read the full technical paper

Blog Post

Learn about the research and development

HuggingFace Demo

Try the interactive demo

Dataset

Access the multilingual corpus

Next Steps

1. Install the Package: Follow the installation guide to set up Omnilingual ASR in your environment.
2. Run Your First Transcription: Try the quick start guide to transcribe your first audio file.
3. Explore Advanced Features

License

Omnilingual ASR code and models are released under the Apache 2.0 License.
