Learn how to transcribe audio with our multilingual ASR models, from quick start to advanced usage patterns.

Quick Start

Get started with Omnilingual ASR inference in just a few lines:
```python
from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

pipeline = ASRInferencePipeline(model_card="omniASR_CTC_1B_v2")
transcriptions = pipeline.transcribe(["/path/to/audio1.flac"], batch_size=1)
print(transcriptions[0])
```
The models were trained on audio clips of 30 seconds or less, so we recommend keeping samples under 30 seconds for optimal performance. Currently, the pipeline only accepts audio files shorter than 40 seconds.
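Since overlong files are rejected, it can help to screen inputs before calling the pipeline. Below is a minimal sketch using only the standard library's `wave` module, so it covers `.wav` files only (`.flac` would need a library such as `soundfile`); the 40-second limit from above is kept as a named constant:

```python
import wave

MAX_SECONDS = 40.0  # the pipeline currently rejects longer files


def wav_duration(path):
    """Return the duration of a .wav file in seconds."""
    with wave.open(path, "rb") as f:
        return f.getnframes() / f.getframerate()


def filter_by_duration(paths, max_seconds=MAX_SECONDS):
    """Split .wav paths into pipeline-safe files and files that are too long."""
    ok, too_long = [], []
    for p in paths:
        (ok if wav_duration(p) <= max_seconds else too_long).append(p)
    return ok, too_long
```

The accepted list can then be passed to `pipeline.transcribe` as usual, and the rejected files reported or chunked separately.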

Audio Input Formats

The inference pipeline accepts multiple input formats through the AudioInput type. The simplest approach is to provide paths to audio files:

```python
audio_files = ["/path/to/audio.wav", "/path/to/audio.flac"]
transcriptions = pipeline.transcribe(audio_files, batch_size=2)
```

Supported formats: .wav, .flac

Audio Preprocessing

All audio inputs undergo automatic preprocessing:
  1. Decode: encoded audio (.wav/.flac) is decoded to raw waveforms.
  2. Resample: audio is resampled to 16 kHz for model compatibility.
  3. Convert to mono: multi-channel audio is converted to mono-channel.
  4. Normalize: waveforms are normalized before model ingestion.
We recommend replicating this preprocessing pipeline when integrating the model into downstream applications.
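For reference, the four steps above can be sketched in plain NumPy. This is an illustration, not the pipeline's actual implementation: linear interpolation stands in for a proper resampler, and peak normalization is an assumption about the normalization step.

```python
import numpy as np

TARGET_SR = 16_000  # the models expect 16 kHz input


def preprocess(waveform, sample_rate):
    """Mirror the pipeline's steps: mono, resample to 16 kHz, normalize.

    `waveform` is float samples shaped [channels, time] or [time]
    (already decoded, i.e. step 1 is assumed done).
    """
    wav = np.asarray(waveform, dtype=np.float32)
    # Convert multi-channel audio to mono by averaging channels.
    if wav.ndim == 2:
        wav = wav.mean(axis=0)
    # Resample to 16 kHz (crude linear interpolation, for illustration only).
    if sample_rate != TARGET_SR:
        n_out = int(round(len(wav) * TARGET_SR / sample_rate))
        x_old = np.linspace(0.0, 1.0, num=len(wav), endpoint=False)
        x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
        wav = np.interp(x_new, x_old, wav).astype(np.float32)
    # Peak-normalize before model ingestion.
    peak = np.abs(wav).max()
    if peak > 0:
        wav = wav / peak
    return wav
```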

Batch Processing

Process multiple audio files efficiently with batching:
```python
from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

pipeline = ASRInferencePipeline(model_card="omniASR_CTC_1B_v2")

audio_files = ["/path/to/audio1.flac", "/path/to/audio2.wav"]
transcriptions = pipeline.transcribe(audio_files, batch_size=2)

for file, trans in zip(audio_files, transcriptions):
    print(f"{file}: {trans}")
```
Adjust batch_size based on your GPU memory. Larger batches increase throughput but require more memory.

Model Types

Omnilingual ASR offers three model families, each optimized for different use cases:

CTC Models

Parallel generation models optimized for speed and throughput.
```python
from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

pipeline = ASRInferencePipeline(
    model_card="omniASR_CTC_3B_v2",
    device=None
)

audio_files = ["/path/to/audio1.flac", "/path/to/audio2.wav"]
transcriptions = pipeline.transcribe(audio_files, batch_size=2)

for file_path, text in zip(audio_files, transcriptions):
    print(f"CTC transcription - {file_path}: {text}")
```
Key Features:
  • Fastest inference with parallel generation
  • No language conditioning support
  • No context example support
  • Ideal for on-device transcription
When to Use:
  • High-throughput scenarios
  • Real-time transcription needs
  • Resource-constrained environments
  • Single-language applications

LLM Models

Autoregressive models with language conditioning for enhanced accuracy.
```python
from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

pipeline = ASRInferencePipeline(model_card="omniASR_LLM_Unlimited_1B_v2")

audio_files = [
    "/path/to/russian_audio.wav",
    "/path/to/english_audio.flac",
    "/path/to/german_audio.wav"
]

transcriptions = pipeline.transcribe(
    audio_files,
    lang=["rus_Cyrl", "eng_Latn", "deu_Latn"],
    batch_size=3
)
```
Available Variants:
Language-conditioned models with optional language identification:
  • omniASR_LLM_300M_v2
  • omniASR_LLM_1B_v2
  • omniASR_LLM_3B_v2
  • omniASR_LLM_7B_v2
These models were trained on an 80/20 split of samples with and without language IDs, so the lang argument is optional.
Providing language codes improves transcription quality. Find language codes in lang_ids.py or the paper’s Appendix A.
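When wiring this into an application, it can be convenient to resolve language codes from human-readable names. The mapping below is a hypothetical illustration covering only the three codes used in the example above; the authoritative list lives in lang_ids.py.

```python
# Hypothetical name-to-code mapping for illustration only; see lang_ids.py
# for the full, authoritative list of language codes.
LANG_CODES = {
    "russian": "rus_Cyrl",
    "english": "eng_Latn",
    "german": "deu_Latn",
}


def to_lang_code(name):
    """Look up a pipeline language code by language name; None if unknown."""
    return LANG_CODES.get(name.strip().lower())
```

Returning None for unknown names leaves language identification to the model, rather than passing a wrong code.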

Zero-Shot Models

In-context learning models for unseen languages using audio-text example pairs.
```python
from omnilingual_asr.models.inference.pipeline import (
    ASRInferencePipeline,
    ContextExample
)

pipeline = ASRInferencePipeline(model_card="omniASR_LLM_7B_ZS")

# Provide 1-10 context examples
context_examples = [
    ContextExample("/path/to/context_audio1.wav", "Hello world"),
    ContextExample("/path/to/context_audio2.wav", "How are you today"),
    ContextExample("/path/to/context_audio3.flac", "Nice to meet you")
]

transcriptions = pipeline.transcribe_with_context(
    ["/path/to/test_audio.wav"],
    context_examples=[context_examples],
    batch_size=1
)

print(f"Transcription: {transcriptions[0]}")
```
The model uses exactly 10 context slots internally. If fewer than 10 examples are provided, they are duplicated sequentially to fill all slots; if more than 10 are provided, the list is cropped.
Context samples should be at most 30 seconds long. The model supports a maximum audio length of 60 seconds but performs suboptimally on longer samples.
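The slot-filling behavior described above can be sketched as follows, assuming "duplicated sequentially" means cycling through the provided examples in order until every slot is filled:

```python
N_SLOTS = 10  # the zero-shot model uses exactly 10 context slots internally


def fill_context_slots(examples, n_slots=N_SLOTS):
    """Duplicate examples sequentially to fill all slots; crop extras.

    Assumes "duplicated sequentially" means cycling through the list in
    order, so ["a", "b", "c"] becomes a-b-c-a-b-c-a-b-c-a.
    """
    if not examples:
        raise ValueError("at least one context example is required")
    return [examples[i % len(examples)] for i in range(n_slots)]
```

This makes it easy to see up front exactly which examples the model will condition on for a given input list.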
When to Use:
  • Transcribing rare or low-resource languages
  • Languages not in the training set
  • Domain-specific vocabulary or accents
  • Few-shot learning scenarios

Advanced Usage

Parquet Dataset Input

Use training-format parquet datasets directly for inference:
```python
import pyarrow.dataset as ds
from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

# Use the public pyarrow.dataset API to read only the first 10 samples
dataset = ds.dataset("/path/to/dataset/")
batch_data = dataset.head(10).to_pandas()
audio_bytes = batch_data["audio_bytes"].tolist()

pipeline = ASRInferencePipeline(model_card="omniASR_LLM_1B_v2")
transcriptions = pipeline.transcribe(audio_bytes, batch_size=4)

for i, text in enumerate(transcriptions):
    print(f"Sample {i+1}: {text}")
```
See the Data Preparation Guide for parquet schema details.

HuggingFace Datasets

Integrate with HuggingFace datasets seamlessly:
```python
from datasets import load_dataset
from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

# Load dataset
omni_dataset = load_dataset(
    "facebook/omnilingual-asr-corpus",
    "lij_Latn",
    split="train",
    streaming=True
)
batch = next(omni_dataset.iter(5))

# Convert to pipeline format
audio_data = [
    {"waveform": x["array"], "sample_rate": x["sampling_rate"]}
    for x in batch["audio"]
]

pipeline = ASRInferencePipeline(model_card="omniASR_LLM_1B_v2")
transcriptions = pipeline.transcribe(audio_data, batch_size=2)

for i, text in enumerate(transcriptions):
    print(f"Sample {i+1}: {text}")
```

Model Input Format Specification

For advanced users integrating models with fairseq2’s Seq2SeqBatch interface:
The behavior of .forward() varies depending on batch structure. Extra fields in batch.example may change model interpretation.
No special fields (plain transcription):

```python
from fairseq2.datasets.batch import Seq2SeqBatch

batch = Seq2SeqBatch(
    source_seqs=audio_tensor,        # [BS, T_audio, D_audio] - target audio
    source_seq_lens=audio_lengths,   # [BS] - actual audio lengths
    target_seqs=text_tensor,         # [BS, T_text] - target text tokens
    target_seq_lens=text_lengths,    # [BS] - actual text lengths
    example={}                       # empty dict - no special fields
)
```

With language conditioning:

```python
batch = Seq2SeqBatch(
    source_seqs=audio_tensor,        # [BS, T_audio, D_audio] - target audio
    source_seq_lens=audio_lengths,   # [BS] - actual audio lengths
    target_seqs=text_tensor,         # [BS, T_text] - target text tokens
    target_seq_lens=text_lengths,    # [BS] - actual text lengths
    example={
        "lang": ["mxs_Latn", ...]    # [BS] - language codes per sample
    }
)
```

Language codes must be from lang_ids.py.

With context examples (zero-shot):

```python
batch = Seq2SeqBatch(
    source_seqs=audio_tensor,        # [BS, T_audio, D_audio] - target audio
    source_seq_lens=audio_lengths,   # [BS] - actual audio lengths
    target_seqs=text_tensor,         # [BS, T_text] - target text tokens
    target_seq_lens=text_lengths,    # [BS] - actual text lengths
    example={
        "context_audio": [           # List[Dict] - BS context audio examples
            {"seqs": context_audio_1, "seq_lens": [audio_len_1]},
            # ... more context audio
            {"seqs": context_audio_BS, "seq_lens": [audio_len_BS]},
        ],
        "context_text": [            # List[Dict] - BS context text examples
            {"seqs": context_text_1, "seq_lens": [text_len_1]},
            # ... more context text
            {"seqs": context_text_BS, "seq_lens": [text_len_BS]},
        ]
    }
)
```
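Building the `source_seqs`/`source_seq_lens` pair means right-padding variable-length features into one tensor while recording each sample's true length. A NumPy sketch of that step (in practice these would be torch tensors):

```python
import numpy as np


def pad_batch(seqs):
    """Right-pad variable-length [T, D] feature arrays into [BS, T_max, D].

    Returns the padded array plus a [BS] vector of actual lengths,
    matching the source_seqs / source_seq_lens convention above.
    """
    bs = len(seqs)
    t_max = max(s.shape[0] for s in seqs)
    d = seqs[0].shape[1]
    out = np.zeros((bs, t_max, d), dtype=np.float32)
    lens = np.zeros(bs, dtype=np.int64)
    for i, s in enumerate(seqs):
        out[i, : s.shape[0]] = s
        lens[i] = s.shape[0]
    return out, lens
```

The same shape convention applies to `target_seqs`/`target_seq_lens`, with token IDs in place of audio features.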

Punctuation and Capitalization

Our models output transcripts in spoken form without punctuation or capitalization.
For written-form transcripts, pass outputs through a third-party punctuation restoration library like deepmultilingualpunctuation.
Most punctuation libraries only support a small subset of the 1600+ languages supported by Omnilingual ASR.

Performance Optimization

  • CTC models: Fastest, best for throughput
  • LLM models: Better accuracy, language conditioning
  • Zero-shot: For unseen languages only
```python
# Start with a small batch size and increase (e.g. 1, 2, 4, 8, 16)
# while monitoring GPU memory
pipeline = ASRInferencePipeline(
    model_card="omniASR_CTC_3B_v2",
    device="cuda:0"
)
transcriptions = pipeline.transcribe(audio_files, batch_size=8)
```

```python
# Explicit device selection
pipeline = ASRInferencePipeline(
    model_card="omniASR_CTC_1B_v2",
    device="cuda:0"  # or "cpu", "cuda:1", etc.
)
```
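One common pattern is to start with a large batch and halve it whenever memory runs out. A framework-agnostic sketch, where `transcribe_fn` stands in for `pipeline.transcribe` and `MemoryError` stands in for the framework's OOM exception (with PyTorch you would also catch `torch.cuda.OutOfMemoryError`):

```python
def transcribe_with_backoff(transcribe_fn, audio_files, start_batch_size=16):
    """Retry transcription with successively halved batch sizes on OOM."""
    batch_size = start_batch_size
    while batch_size >= 1:
        try:
            return transcribe_fn(audio_files, batch_size=batch_size)
        except MemoryError:
            # Too big for available memory: halve and retry.
            batch_size //= 2
    raise RuntimeError("transcription failed even with batch_size=1")
```

This trades a few wasted attempts at startup for not having to hand-tune the batch size per GPU.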

Next Steps

Model Architectures

Explore the technical details of W2V, CTC, and LLM model families

Training Guide

Learn how to fine-tune models on your own data

Data Preparation

Prepare datasets for training and evaluation

API Reference

Detailed API documentation for all components
