Get up and running with Omnilingual ASR to transcribe your first audio file in just a few steps.

Prerequisites

Omnilingual ASR installed (see Installation)
Python 3.10 or higher
Audio file ready for transcription

Your First Transcription

Step 1: Install Omnilingual ASR

Install the package using pip or uv:
pip install omnilingual-asr
# or, with uv:
uv pip install omnilingual-asr
Step 2: Create a Python Script

Create a new file transcribe.py with the following code:
transcribe.py
from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

# Initialize the pipeline with a model
pipeline = ASRInferencePipeline(model_card="omniASR_LLM_Unlimited_7B_v2")

# Transcribe an audio file
audio_files = ["/path/to/your/audio.wav"]
transcriptions = pipeline.transcribe(audio_files, batch_size=1)

# Print the result
print(f"Transcription: {transcriptions[0]}")
Step 3: Run the Script

Execute your script:
python transcribe.py
On first run, the model is downloaded automatically (~30 GiB for the 7B model) and cached in ~/.cache/fairseq2/assets/. Subsequent runs use the cached copy.
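To see what has been downloaded, you can inspect the cache directory directly (path taken from the note above; adjust it if you have configured fairseq2 differently):

```shell
CACHE_DIR="${HOME}/.cache/fairseq2/assets"
if [ -d "$CACHE_DIR" ]; then
  du -sh "$CACHE_DIR"   # total size of cached models
  ls "$CACHE_DIR"       # individual cached assets
else
  echo "No cached models yet"
fi
```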

Choose Your Model

Different models offer different trade-offs between speed, accuracy, and features:
CTC models — best for: high-throughput batch processing
from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

# Fast parallel generation
pipeline = ASRInferencePipeline(model_card="omniASR_CTC_1B_v2")

audio_files = ["/path/to/audio1.wav", "/path/to/audio2.wav"]
transcriptions = pipeline.transcribe(audio_files, batch_size=2)
  • Speed: 16x to 96x faster than real-time
  • VRAM: 2-15 GiB depending on model size
  • Limitation: No language conditioning
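For large jobs, it can help to feed the pipeline in fixed-size chunks so memory use stays bounded regardless of how many files you have. A minimal sketch, where the directory path and chunk size of 8 are placeholders, and the `transcribe` call matches the example above:

```python
from pathlib import Path

def batched(items, size):
    """Yield successive fixed-size chunks from a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Collect WAV files from a directory (placeholder path).
wav_files = sorted(str(p) for p in Path("/path/to/audio_dir").glob("*.wav"))

for chunk in batched(wav_files, 8):
    # With a pipeline initialized as shown earlier:
    # transcriptions = pipeline.transcribe(chunk, batch_size=8)
    pass
```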

Audio Input Formats

The pipeline accepts multiple audio input formats:
# Most common: provide file paths
audio_files = [
    "/path/to/audio1.flac",
    "/path/to/audio2.wav",
    "/path/to/audio3.mp3"
]
transcriptions = pipeline.transcribe(audio_files, batch_size=3)
Audio Length Constraint: Currently, only audio files shorter than 40 seconds are accepted for CTC and standard LLM models. Use omniASR_LLM_Unlimited_* models for longer audio.
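Since the 40-second limit applies per file, you can screen inputs up front instead of hitting an error mid-batch. A minimal sketch using the standard-library `wave` module (WAV only; for FLAC or MP3 you would need an audio library such as soundfile):

```python
import wave

MAX_SECONDS = 40.0  # limit for CTC and standard LLM models

def is_within_limit(path: str, max_seconds: float = MAX_SECONDS) -> bool:
    """Return True if a WAV file is short enough for the standard models."""
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
    return duration <= max_seconds
```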

Model Size Comparison

Choose a model size based on your available resources:
Model Size | Parameters    | VRAM (CTC) | VRAM (LLM) | Speed (CTC) | Speed (LLM)
300M       | 317-1,627M    | ~2 GiB     | ~5 GiB     | 96x RT      | ~1x RT
1B         | 965-2,275M    | ~3 GiB     | ~6 GiB     | 48x RT      | ~1x RT
3B         | 3,064-4,376M  | ~8 GiB     | ~10 GiB    | 32x RT      | ~1x RT
7B         | 6,488-7,801M  | ~15 GiB    | ~17 GiB    | 16x RT      | ~1x RT
RT = Real-Time. “96x RT” means the model processes audio 96 times faster than real-time.
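The RT factors make wall-clock estimates straightforward: processing time is audio duration divided by the RT factor. For example, 60 minutes of audio on the 7B CTC model at 16x RT takes about 3.75 minutes:

```python
def eta_minutes(audio_minutes: float, rt_factor: float) -> float:
    """Estimated processing time for a given real-time factor."""
    return audio_minutes / rt_factor

print(eta_minutes(60, 16))  # 7B CTC at 16x RT -> 3.75
print(eta_minutes(60, 96))  # 300M CTC at 96x RT -> 0.625
```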

Complete Example

Here’s a complete working example that transcribes multiple audio files with language conditioning:
complete_example.py
from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

# Initialize pipeline
pipeline = ASRInferencePipeline(model_card="omniASR_LLM_1B_v2")

# Prepare audio files and languages
audio_files = [
    "/path/to/english_speech.wav",
    "/path/to/french_speech.wav",
    "/path/to/mandarin_speech.wav",
]

languages = [
    "eng_Latn",  # English (Latin script)
    "fra_Latn",  # French (Latin script)
    "cmn_Hans",  # Mandarin Chinese (Simplified)
]

# Transcribe with language conditioning
transcriptions = pipeline.transcribe(
    audio_files,
    lang=languages,
    batch_size=2
)

# Print results
for audio, lang, text in zip(audio_files, languages, transcriptions):
    print(f"\nFile: {audio}")
    print(f"Language: {lang}")
    print(f"Transcription: {text}")
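The language tags in the example pair an ISO 639-3 code with a four-letter script subtag (e.g. eng_Latn, cmn_Hans). A small sanity check for that shape; the regex is an illustration for catching typos, not the library's own validation:

```python
import re

# "eng_Latn" = three-letter ISO 639-3 code + "_" + four-letter script subtag.
LANG_TAG = re.compile(r"^[a-z]{3}_[A-Z][a-z]{3}$")

for tag in ["eng_Latn", "fra_Latn", "cmn_Hans"]:
    assert LANG_TAG.match(tag), f"malformed language tag: {tag}"
```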

Next Steps

Explore Models

Learn about all available model variants and their specifications

Advanced Inference

Explore batch processing, context examples, and optimization

Language Support

Browse the full list of 1600+ supported languages

Training Guide

Fine-tune models on your own data

Troubleshooting

Slow first run: Models are large (1.2 GiB to 30 GiB), so the first download may take time depending on your internet connection. Models are cached in ~/.cache/fairseq2/assets/ for future use.
Out-of-memory errors: Try a smaller model (300M or 1B instead of 3B or 7B), reduce the batch size to 1, or use a GPU with more VRAM.
Audio loading fails: Install the system dependency:
Audio longer than 40 seconds is rejected: Use the unlimited-length models, omniASR_LLM_Unlimited_{300M,1B,3B,7B}_v2, which support audio of any length.
