Get up and running with Omnilingual ASR to transcribe your first audio file in just a few steps.

Prerequisites

Omnilingual ASR installed (see Installation)
Python 3.10 or higher
Audio file ready for transcription

Your First Transcription

Step 1: Install Omnilingual ASR

Install the package using pip or uv:
pip install omnilingual-asr
# or, with uv:
uv pip install omnilingual-asr
Step 2: Create a Python Script

Create a new file transcribe.py with the following code:
transcribe.py
from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

# Initialize the pipeline with a model
pipeline = ASRInferencePipeline(model_card="omniASR_LLM_Unlimited_7B_v2")

# Transcribe an audio file
audio_files = ["/path/to/your/audio.wav"]
transcriptions = pipeline.transcribe(audio_files, batch_size=1)

# Print the result
print(f"Transcription: {transcriptions[0]}")
Step 3: Run the Script

Execute your script:
python transcribe.py
On first run, the model is downloaded automatically (~30 GiB for the 7B model) and cached in ~/.cache/fairseq2/assets/. Subsequent runs use the cached copy.
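To see what has been downloaded, you can inspect the cache directory directly (path taken from the note above; adjust it if you have configured fairseq2 differently):

```shell
CACHE_DIR="${HOME}/.cache/fairseq2/assets"
if [ -d "$CACHE_DIR" ]; then
  du -sh "$CACHE_DIR"   # total size of cached models
  ls "$CACHE_DIR"       # individual cached assets
else
  echo "No cached models yet"
fi
```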

Choose Your Model

Different models offer different trade-offs between speed, accuracy, and features:
CTC models — best for: high-throughput batch processing
from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

# Fast parallel generation
pipeline = ASRInferencePipeline(model_card="omniASR_CTC_1B_v2")

audio_files = ["/path/to/audio1.wav", "/path/to/audio2.wav"]
transcriptions = pipeline.transcribe(audio_files, batch_size=2)
  • Speed: 16x to 96x faster than real-time
  • VRAM: 2-15 GiB depending on model size
  • Limitation: No language conditioning
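For large jobs, it can help to feed the pipeline in fixed-size chunks so memory use stays bounded regardless of how many files you have. A minimal sketch, where the directory path and chunk size of 8 are placeholders, and the `transcribe` call matches the example above:

```python
from pathlib import Path

def batched(items, size):
    """Yield successive fixed-size chunks from a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Collect WAV files from a directory (placeholder path).
wav_files = sorted(str(p) for p in Path("/path/to/audio_dir").glob("*.wav"))

for chunk in batched(wav_files, 8):
    # With a pipeline initialized as shown earlier:
    # transcriptions = pipeline.transcribe(chunk, batch_size=8)
    pass
```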

Audio Input Formats

The pipeline accepts multiple audio input formats:
# Most common: provide file paths
audio_files = [
    "/path/to/audio1.flac",
    "/path/to/audio2.wav",
    "/path/to/audio3.mp3"
]
transcriptions = pipeline.transcribe(audio_files, batch_size=3)
Audio Length Constraint: Currently, only audio files shorter than 40 seconds are accepted for CTC and standard LLM models. Use omniASR_LLM_Unlimited_* models for longer audio.
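Since the 40-second limit applies per file, you can screen inputs up front instead of hitting an error mid-batch. A minimal sketch using the standard-library `wave` module (WAV only; for FLAC or MP3 you would need an audio library such as soundfile):

```python
import wave

MAX_SECONDS = 40.0  # limit for CTC and standard LLM models

def is_within_limit(path: str, max_seconds: float = MAX_SECONDS) -> bool:
    """Return True if a WAV file is short enough for the standard models."""
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
    return duration <= max_seconds
```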

Model Size Comparison

Choose a model size based on your available resources:
Model Size | Parameters    | VRAM (CTC) | VRAM (LLM) | Speed (CTC) | Speed (LLM)
300M       | 317-1,627M    | ~2 GiB     | ~5 GiB     | 96x RT      | ~1x RT
1B         | 965-2,275M    | ~3 GiB     | ~6 GiB     | 48x RT      | ~1x RT
3B         | 3,064-4,376M  | ~8 GiB     | ~10 GiB    | 32x RT      | ~1x RT
7B         | 6,488-7,801M  | ~15 GiB    | ~17 GiB    | 16x RT      | ~1x RT
RT = Real-Time. “96x RT” means the model processes audio 96 times faster than real-time.
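The RT factors make wall-clock estimates straightforward: processing time is audio duration divided by the RT factor. For example, 60 minutes of audio on the 7B CTC model at 16x RT takes about 3.75 minutes:

```python
def eta_minutes(audio_minutes: float, rt_factor: float) -> float:
    """Estimated processing time for a given real-time factor."""
    return audio_minutes / rt_factor

print(eta_minutes(60, 16))  # 7B CTC at 16x RT -> 3.75
print(eta_minutes(60, 96))  # 300M CTC at 96x RT -> 0.625
```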

Complete Example

Here’s a complete working example that transcribes multiple audio files with language conditioning:
complete_example.py
from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

# Initialize pipeline
pipeline = ASRInferencePipeline(model_card="omniASR_LLM_1B_v2")

# Prepare audio files and languages
audio_files = [
    "/path/to/english_speech.wav",
    "/path/to/french_speech.wav",
    "/path/to/mandarin_speech.wav",
]

languages = [
    "eng_Latn",  # English (Latin script)
    "fra_Latn",  # French (Latin script)
    "cmn_Hans",  # Mandarin Chinese (Simplified)
]

# Transcribe with language conditioning
transcriptions = pipeline.transcribe(
    audio_files,
    lang=languages,
    batch_size=2
)

# Print results
for audio, lang, text in zip(audio_files, languages, transcriptions):
    print(f"\nFile: {audio}")
    print(f"Language: {lang}")
    print(f"Transcription: {text}")
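The language tags in the example pair an ISO 639-3 code with a four-letter script subtag (e.g. eng_Latn, cmn_Hans). A small sanity check for that shape; the regex is an illustration for catching typos, not the library's own validation:

```python
import re

# "eng_Latn" = three-letter ISO 639-3 code + "_" + four-letter script subtag.
LANG_TAG = re.compile(r"^[a-z]{3}_[A-Z][a-z]{3}$")

for tag in ["eng_Latn", "fra_Latn", "cmn_Hans"]:
    assert LANG_TAG.match(tag), f"malformed language tag: {tag}"
```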

Next Steps

Explore Models

Learn about all available model variants and their specifications

Advanced Inference

Explore batch processing, context examples, and optimization

Language Support

Browse the full list of 1600+ supported languages

Training Guide

Fine-tune models on your own data

Troubleshooting

Slow first run: Models are large (1.2 GiB to 30 GiB), so the first download may take time depending on your internet connection. Models are cached in ~/.cache/fairseq2/assets/ for future use.
Out-of-memory errors: Try a smaller model (300M or 1B instead of 3B or 7B), reduce the batch size to 1, or use a GPU with more VRAM.
Audio loading fails: Install the system dependency:
Audio longer than 40 seconds is rejected: Use the unlimited-length models, omniASR_LLM_Unlimited_{300M,1B,3B,7B}_v2, which support audio of any length.
