Learn how to transcribe audio with our multilingual ASR models, from quick start to advanced usage patterns.
Quick Start
Get started with Omnilingual ASR inference in just a few lines:
```python
from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

pipeline = ASRInferencePipeline(model_card="omniASR_CTC_1B_v2")
transcriptions = pipeline.transcribe(["/path/to/audio1.flac"], batch_size=1)
print(transcriptions[0])
```
The models were trained on audio clips of 30 seconds or less, so we recommend keeping samples under 30 seconds for optimal performance. Currently, only audio files shorter than 40 seconds are accepted for inference.
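Since longer clips are rejected, it can help to check durations before calling the pipeline. A minimal sketch using Python's standard wave module (works for .wav files only; `duration_seconds` is a hypothetical helper, not part of the library):

```python
import struct
import tempfile
import wave

def duration_seconds(path: str) -> float:
    """Return the duration of a .wav file in seconds."""
    with wave.open(path, "rb") as f:
        return f.getnframes() / f.getframerate()

# Demo: write one second of 16 kHz mono silence and measure it.
tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
with wave.open(tmp.name, "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)
    f.setframerate(16000)
    f.writeframes(struct.pack("<h", 0) * 16000)

print(duration_seconds(tmp.name))  # 1.0

# Keep only clips the pipeline will accept (shorter than 40 s).
accepted = [p for p in [tmp.name] if duration_seconds(p) < 40.0]
```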
The inference pipeline accepts multiple input formats through the AudioInput type:
File Paths
The simplest approach is to provide paths to audio files:

```python
audio_files = ["/path/to/audio.wav", "/path/to/audio.flac"]
transcriptions = pipeline.transcribe(audio_files, batch_size=2)
```

Supported formats: .wav, .flac

Binary Data
Pass encoded audio binary data directly in memory:

```python
# From a file handle
audio_bytes = [open("audio.wav", "rb").read()]

# From a numpy array (int8)
audio_array = [numpy_audio_array]

transcriptions = pipeline.transcribe(audio_bytes, batch_size=1)
```

Decoded Audio
Provide pre-decoded audio waveforms:

```python
audio_dict = [{
    "waveform": tensor,     # audio tensor
    "sample_rate": 16000,   # sample rate in Hz
}]
transcriptions = pipeline.transcribe(audio_dict, batch_size=1)
```
Audio Preprocessing
All audio inputs undergo automatic preprocessing:
1. Decode: encoded audio (.wav/.flac) is decoded to raw waveforms.
2. Resample: audio is resampled to 16 kHz for model compatibility.
3. Convert to mono: multi-channel audio is downmixed to a single channel.
4. Normalize: waveforms are normalized before model ingestion.

We recommend replicating this preprocessing pipeline when integrating the model into downstream applications.
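The steps above can be sketched in NumPy (an illustration only: the actual pipeline uses its own decoder and resampler, and `preprocess` here is a hypothetical helper):

```python
import numpy as np

def preprocess(waveform: np.ndarray, sample_rate: int, target_sr: int = 16000) -> np.ndarray:
    """Mimic the preprocessing steps: downmix, resample, normalize."""
    # Convert multi-channel [channels, samples] audio to mono
    if waveform.ndim == 2:
        waveform = waveform.mean(axis=0)
    # Naive linear-interpolation resample to 16 kHz (illustration only)
    if sample_rate != target_sr:
        n_out = int(len(waveform) * target_sr / sample_rate)
        x_old = np.linspace(0.0, 1.0, num=len(waveform))
        x_new = np.linspace(0.0, 1.0, num=n_out)
        waveform = np.interp(x_new, x_old, waveform)
    # Peak-normalize to [-1, 1]
    peak = np.max(np.abs(waveform))
    return waveform / peak if peak > 0 else waveform

# One second of constant stereo audio at 44.1 kHz
stereo = np.stack([np.full(44100, 0.5), np.full(44100, 0.25)])
mono_16k = preprocess(stereo, 44100)
```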
Batch Processing
Process multiple audio files efficiently with batching:
```python
from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

pipeline = ASRInferencePipeline(model_card="omniASR_CTC_1B_v2")

audio_files = ["/path/to/audio1.flac", "/path/to/audio2.wav"]
transcriptions = pipeline.transcribe(audio_files, batch_size=2)

for file_path, trans in zip(audio_files, transcriptions):
    print(f"{file_path}: {trans}")
```
Adjust batch_size based on your GPU memory. Larger batches increase throughput but require more memory.
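Large batches can hit out-of-memory errors on smaller GPUs. One pragmatic pattern is to halve the batch size and retry (a sketch; `transcribe_with_backoff` is a hypothetical wrapper, not a library function):

```python
def transcribe_with_backoff(pipeline, files, batch_size=16):
    """Retry transcription with a halved batch size on out-of-memory errors."""
    while True:
        try:
            return pipeline.transcribe(files, batch_size=batch_size)
        except RuntimeError as exc:
            # Re-raise anything that is not an OOM, or if we cannot shrink further
            if "out of memory" not in str(exc).lower() or batch_size == 1:
                raise
            batch_size = max(1, batch_size // 2)
```

This trades a few wasted attempts for not having to hand-tune the batch size per GPU.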
Model Types
Omnilingual ASR offers three model families, each optimized for different use cases:
CTC Models
Parallel generation models optimized for speed and throughput.
Basic Usage
```python
from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

pipeline = ASRInferencePipeline(
    model_card="omniASR_CTC_3B_v2",
    device=None
)

audio_files = ["/path/to/audio1.flac", "/path/to/audio2.wav"]
transcriptions = pipeline.transcribe(audio_files, batch_size=2)

for file_path, text in zip(audio_files, transcriptions):
    print(f"CTC transcription - {file_path}: {text}")
```
Key Features:
- Fastest inference with parallel generation
- No language conditioning support
- No context example support
- Ideal for on-device transcription

When to Use:
- High-throughput scenarios
- Real-time transcription needs
- Resource-constrained environments
- Single-language applications
LLM Models
Autoregressive models with language conditioning for enhanced accuracy.
With Language Codes

```python
from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

pipeline = ASRInferencePipeline(model_card="omniASR_LLM_Unlimited_1B_v2")

audio_files = [
    "/path/to/russian_audio.wav",
    "/path/to/english_audio.flac",
    "/path/to/german_audio.wav"
]
transcriptions = pipeline.transcribe(
    audio_files,
    lang=["rus_Cyrl", "eng_Latn", "deu_Latn"],
    batch_size=3
)
```

Without Language Codes
To transcribe without language conditioning, simply omit the `lang` argument, as in the Quick Start example.
Available Variants:

Standard LLM+LID
Language-conditioned models with optional language identification, trained with an 80/20 split of samples with/without language IDs:
- omniASR_LLM_300M_v2
- omniASR_LLM_1B_v2
- omniASR_LLM_3B_v2
- omniASR_LLM_7B_v2

Unlimited Length
Extended models for transcribing unlimited-length audio, using 15-second segments with context from the previous segment (M=1):
- omniASR_LLM_Unlimited_300M_v2
- omniASR_LLM_Unlimited_1B_v2
- omniASR_LLM_Unlimited_3B_v2
- omniASR_LLM_Unlimited_7B_v2
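Conceptually, the unlimited-length variants chunk the audio into fixed windows and decode each window with the previous one as context. A sketch of the windowing (`segment_indices` is illustrative, not the library's implementation):

```python
def segment_indices(total_s: float, seg_s: float = 15.0):
    """Yield (start, end) times for consecutive fixed-length windows."""
    start = 0.0
    while start < total_s:
        yield (start, min(start + seg_s, total_s))
        start += seg_s

# A 40-second clip becomes three windows; each window is decoded
# with the previous window as context (M=1).
print(list(segment_indices(40.0)))  # [(0.0, 15.0), (15.0, 30.0), (30.0, 40.0)]
```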
Zero-Shot Models
In-context learning models for unseen languages using audio-text example pairs.
```python
from omnilingual_asr.models.inference.pipeline import (
    ASRInferencePipeline,
    ContextExample
)

pipeline = ASRInferencePipeline(model_card="omniASR_LLM_7B_ZS")

# Provide 1-10 context examples
context_examples = [
    ContextExample("/path/to/context_audio1.wav", "Hello world"),
    ContextExample("/path/to/context_audio2.wav", "How are you today"),
    ContextExample("/path/to/context_audio3.flac", "Nice to meet you")
]

transcriptions = pipeline.transcribe_with_context(
    ["/path/to/test_audio.wav"],
    context_examples=[context_examples],
    batch_size=1
)
print(f"Transcription: {transcriptions[0]}")
```
The model uses exactly 10 context slots internally. If fewer than 10 examples are provided, they are duplicated sequentially to fill all slots; if more than 10 are provided, the list is truncated to the first 10.
Context samples should be up to 30 seconds in length. The model supports a maximum audio length of 60 seconds but performs suboptimally with longer samples.
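The slot-filling behavior can be illustrated with a small helper (`fill_context_slots` is illustrative and shows one plausible reading of "duplicated sequentially", cycling through the provided examples):

```python
def fill_context_slots(examples, n_slots=10):
    """Cycle through examples until all slots are filled; crop any extras."""
    if not examples:
        raise ValueError("at least one context example is required")
    if len(examples) >= n_slots:
        return examples[:n_slots]
    return [examples[i % len(examples)] for i in range(n_slots)]

print(fill_context_slots(["a", "b", "c"]))
# ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'a']
```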
When to Use:
Transcribing rare or low-resource languages
Languages not in the training set
Domain-specific vocabulary or accents
Few-shot learning scenarios
Advanced Usage
Parquet Datasets
Use training-format parquet datasets directly for inference:
```python
import pyarrow.parquet as pq
from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

ds = pq.ParquetDataset("/path/to/dataset/")
batch_data = ds._dataset.head(10).to_pandas()  # first 10 samples
audio_bytes = batch_data["audio_bytes"].tolist()

pipeline = ASRInferencePipeline(model_card="omniASR_LLM_1B_v2")
transcriptions = pipeline.transcribe(audio_bytes, batch_size=4)

for i, text in enumerate(transcriptions):
    print(f"Sample {i + 1}: {text}")
```
See the Data Preparation Guide for parquet schema details.
HuggingFace Datasets
Integrate with HuggingFace datasets seamlessly:
```python
from datasets import load_dataset
from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

# Load dataset in streaming mode
omni_dataset = load_dataset(
    "facebook/omnilingual-asr-corpus",
    "lij_Latn",
    split="train",
    streaming=True
)
batch = next(omni_dataset.iter(5))

# Convert to pipeline format
audio_data = [
    {"waveform": x["array"], "sample_rate": x["sampling_rate"]}
    for x in batch["audio"]
]

pipeline = ASRInferencePipeline(model_card="omniASR_LLM_1B_v2")
transcriptions = pipeline.transcribe(audio_data, batch_size=2)

for i, text in enumerate(transcriptions):
    print(f"Sample {i + 1}: {text}")
```
For advanced users integrating models with fairseq2’s Seq2SeqBatch interface:
The behavior of .forward() varies depending on batch structure. Extra fields in batch.example may change model interpretation.
Basic batch (no conditioning):

```python
from fairseq2.datasets.batch import Seq2SeqBatch

batch = Seq2SeqBatch(
    source_seqs=audio_tensor,       # [BS, T_audio, D_audio] - target audio
    source_seq_lens=audio_lengths,  # [BS] - actual audio lengths
    target_seqs=text_tensor,        # [BS, T_text] - target text tokens
    target_seq_lens=text_lengths,   # [BS] - actual text lengths
    example={}                      # empty dict - no special fields
)
```

With language conditioning (language codes must be from lang_ids.py):

```python
batch = Seq2SeqBatch(
    source_seqs=audio_tensor,
    source_seq_lens=audio_lengths,
    target_seqs=text_tensor,
    target_seq_lens=text_lengths,
    example={
        "lang": ["mxs_Latn", ...]  # [BS] - language codes per sample
    }
)
```

With context examples (zero-shot):

```python
batch = Seq2SeqBatch(
    source_seqs=audio_tensor,
    source_seq_lens=audio_lengths,
    target_seqs=text_tensor,
    target_seq_lens=text_lengths,
    example={
        "context_audio": [  # List[Dict] - BS context audio entries
            {"seqs": context_audio_1, "seq_lens": [audio_len_1]},
            # ... more context audio
            {"seqs": context_audio_BS, "seq_lens": [audio_len_BS]},
        ],
        "context_text": [  # List[Dict] - BS context text entries
            {"seqs": context_text_1, "seq_lens": [text_len_1]},
            # ... more context text
            {"seqs": context_text_BS, "seq_lens": [text_len_BS]},
        ]
    }
)
```
Punctuation and Capitalization
Our models output transcripts in spoken form without punctuation or capitalization.
Most punctuation libraries only support a small subset of the 1600+ languages supported by Omnilingual ASR.
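If downstream consumers need display-form text, a naive post-processing pass can at least capitalize and terminate Latin-script output (a sketch only; proper punctuation restoration requires a dedicated, language-aware model, and `naive_postprocess` is a hypothetical helper):

```python
def naive_postprocess(text: str) -> str:
    """Capitalize the first character and append a final period (naive)."""
    text = text.strip()
    if not text:
        return text
    out = text[0].upper() + text[1:]
    if out[-1] not in ".!?":
        out += "."
    return out

print(naive_postprocess("hello world"))  # Hello world.
```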
- CTC models: fastest, best for throughput
- LLM models: better accuracy, language conditioning
- Zero-shot: for unseen languages only

```python
# Start with a small batch size and increase
batch_sizes = [1, 2, 4, 8, 16]

# Monitor GPU memory and adjust
pipeline = ASRInferencePipeline(
    model_card="omniASR_CTC_3B_v2",
    device="cuda:0"
)
transcriptions = pipeline.transcribe(audio_files, batch_size=8)
```

```python
# Explicit GPU selection
pipeline = ASRInferencePipeline(
    model_card="omniASR_CTC_1B_v2",
    device="cuda:0"  # or "cpu", "cuda:1", etc.
)
```
Next Steps
- Model Architectures: explore the technical details of W2V, CTC, and LLM model families
- Training Guide: learn how to fine-tune models on your own data
- Data Preparation: prepare datasets for training and evaluation
- API Reference: detailed API documentation for all components