Overview
Omnilingual ASR supports multiple audio input formats for flexible integration. All formats are automatically preprocessed (decoded, resampled to 16kHz, converted to mono, and normalized) by the inference pipeline.
AudioInput is a type alias representing a list of audio samples in one of several supported formats:
AudioInput = (
    List[Path]
    | List[str]
    | List[str | Path]
    | List[bytes]
    | List[NDArray[np.int8]]
    | List[bytes | NDArray[np.int8]]
    | List[Dict[str, Any]]
)
1. File Paths (str or Path)
Provide paths to audio files on disk.
from pathlib import Path
from omnilingual_asr.models.inference import ASRInferencePipeline
pipeline = ASRInferencePipeline("omniASR_LLM_7B")
# Using strings
audio_files = ["audio1.wav", "audio2.mp3", "/path/to/audio3.flac"]
transcriptions = pipeline.transcribe(audio_files)
# Using Path objects
audio_files = [
    Path("audio1.wav"),
    Path("/absolute/path/audio2.wav"),
]
transcriptions = pipeline.transcribe(audio_files)
Supports common audio formats: WAV, MP3, FLAC, OGG, M4A, etc.
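Since the pipeline reads these files from disk, a quick pre-flight existence check can catch typos before a long batch run. A minimal sketch (the helper name and the stand-in file are hypothetical, not part of the library):

```python
from pathlib import Path

def existing_audio(paths):
    # Keep only paths that actually exist on disk
    return [p for p in paths if Path(p).is_file()]

# Stand-in file created just for demonstration
Path("exists.wav").write_bytes(b"RIFF")

kept = existing_audio(["exists.wav", "missing.wav"])
```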
2. Raw Audio Bytes
Provide audio data as raw bytes (e.g., from HTTP requests or in-memory buffers).
import requests
pipeline = ASRInferencePipeline("omniASR_LLM_7B")
# From HTTP request
response = requests.get("https://example.com/audio.wav")
audio_bytes = [response.content]
transcriptions = pipeline.transcribe(audio_bytes)
# From file reading
with open("audio.wav", "rb") as f:
    audio_data = f.read()

transcriptions = pipeline.transcribe([audio_data])
3. NumPy Arrays
Provide audio data as NumPy arrays (uint8 or int8 dtype).
import numpy as np
pipeline = ASRInferencePipeline("omniASR_LLM_7B")
# From encoded audio bytes (e.g., read from a file)
with open("audio.wav", "rb") as f:
    audio_bytes = f.read()
audio_array = np.frombuffer(audio_bytes, dtype=np.uint8)
transcriptions = pipeline.transcribe([audio_array])
Only uint8 and int8 dtypes are supported for NumPy arrays. Other dtypes will raise an assertion error.
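The dtype restriction can be verified before calling the pipeline. A small sketch mirroring the documented check (the helper name is mine, not part of the library):

```python
import numpy as np

def check_audio_dtype(arr: np.ndarray) -> None:
    # Only uint8/int8 arrays of encoded audio bytes are accepted
    assert arr.dtype in (np.uint8, np.int8), f"Unsupported dtype: {arr.dtype}"

check_audio_dtype(np.frombuffer(b"\x00\x01\x02\x03", dtype=np.uint8))  # passes
```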
4. Pre-decoded Audio Dictionaries
Provide already-decoded audio with waveform and sample rate. This is the most efficient format if you’ve already decoded the audio.
import torch
import torchaudio
pipeline = ASRInferencePipeline("omniASR_LLM_7B")
# Load and decode audio externally
waveform, sample_rate = torchaudio.load("audio.wav")
# Convert to dictionary format
audio_dict = {
    "waveform": waveform,
    "sample_rate": sample_rate,
}
transcriptions = pipeline.transcribe([audio_dict])
Dictionary Format Requirements:
- waveform: the audio waveform as a PyTorch tensor. Can be 1D (mono) or 2D (multi-channel); multi-channel audio is automatically converted to mono.
- sample_rate: the sample rate of the audio in Hz. Audio is automatically resampled to 16kHz if needed.
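Both requirements can be checked up front before handing dictionaries to the pipeline. A minimal validation sketch (the helper name is mine, and a plain list stands in for the tensor so the example stays self-contained):

```python
def validate_audio_dict(d: dict) -> None:
    # Both documented keys are required
    for key in ("waveform", "sample_rate"):
        assert key in d, f"missing key: {key}"
    # Sample rate must be a positive number of Hz
    assert isinstance(d["sample_rate"], int) and d["sample_rate"] > 0

validate_audio_dict({"waveform": [0.0, 0.1, -0.1], "sample_rate": 16000})  # passes
```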
Audio Preprocessing Pipeline
Regardless of input format, all audio goes through the following preprocessing:
- Decoding: Audio bytes/files are decoded to waveforms (skipped for pre-decoded format)
- Resampling: Waveforms are resampled to 16kHz
- Mono Conversion: Multi-channel audio is converted to mono by averaging channels
- Normalization: Audio is normalized to zero mean and unit variance
- Length Validation: Non-streaming models enforce a 40-second maximum length
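The mono-conversion and normalization steps above can be sketched with plain NumPy (decoding and resampling are omitted here; the function name is mine, not the pipeline's):

```python
import numpy as np

def to_mono_normalized(waveform: np.ndarray) -> np.ndarray:
    # Mono conversion: average across channels for 2D (channels, samples) input
    if waveform.ndim == 2:
        waveform = waveform.mean(axis=0)
    # Normalization: zero mean, unit variance
    waveform = waveform - waveform.mean()
    std = waveform.std()
    if std > 0:
        waveform = waveform / std
    return waveform

stereo = np.array([[1.0, 2.0, 3.0], [1.0, 2.0, 5.0]])
mono = to_mono_normalized(stereo)
```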
Length Constraints
Non-Streaming Models
Non-streaming models enforce a maximum audio length of 40 seconds; longer inputs raise a ValueError.
# This will raise ValueError
long_audio = ["60_second_audio.wav"] # Too long!
transcriptions = pipeline.transcribe(long_audio)
# ValueError: Max audio length is capped at 40s
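For WAV files, the 40-second cap can be checked cheaply from the file header before submitting a batch. A sketch using the standard-library wave module (the 2-second demo file is generated on the spot; this only works for WAV, not MP3/FLAC/etc.):

```python
import wave

MAX_LEN_S = 40.0  # non-streaming cap described above

def wav_duration_seconds(path: str) -> float:
    # Header-only read: no need to decode the whole file
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

# Generate a 2-second silent 16kHz mono WAV for demonstration
with wave.open("demo.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 32000)

fits = wav_duration_seconds("demo.wav") <= MAX_LEN_S
```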
Streaming Models
Streaming model variants (e.g., omniASR_LLM_7B_Unlimited) can handle audio of any length by processing it in segments.
Mixed Input Formats
You can mix different input formats in a single batch:
import torchaudio
from pathlib import Path

pipeline = ASRInferencePipeline("omniASR_LLM_7B")

# Decode one input externally for the pre-decoded entry
waveform, sample_rate = torchaudio.load("audio3.wav")

# Mix paths and pre-decoded audio
mixed_input = [
    "audio1.wav",                                        # File path as string
    Path("audio2.wav"),                                  # File path as Path
    {"waveform": waveform, "sample_rate": sample_rate},  # Pre-decoded
]
transcriptions = pipeline.transcribe(mixed_input)
All elements in the input list must be of compatible types. Don’t mix bytes with dictionaries, or file paths with numpy arrays in the same batch.
Best Practices
- Pre-decoded format: Use pre-decoded dictionaries to avoid decoding the same audio multiple times
- Batch size: Larger batches improve throughput but require more memory
- File paths: Most convenient but requires disk I/O
- Bytes: Good for streaming/HTTP scenarios
Batch Processing Example
import glob
from pathlib import Path
pipeline = ASRInferencePipeline("omniASR_LLM_7B")
# Process large directory in batches
audio_files = [Path(f) for f in glob.glob("audio_dir/*.wav")]
batch_size = 8
all_transcriptions = []
for i in range(0, len(audio_files), batch_size):
    batch = audio_files[i:i + batch_size]
    transcriptions = pipeline.transcribe(batch, batch_size=batch_size)
    all_transcriptions.extend(transcriptions)
print(f"Processed {len(all_transcriptions)} files")
Error Handling
try:
    transcriptions = pipeline.transcribe(audio_input)
except ValueError as e:
    if "Max audio length" in str(e):
        print("Audio too long for non-streaming model")
    elif "Unsupported input type" in str(e):
        print("Invalid audio format provided")
except AssertionError as e:
    if "Only uint8 numpy arrays" in str(e):
        print("Numpy array must be uint8 or int8 dtype")
Source Reference
See type definition at src/omnilingual_asr/models/inference/pipeline.py:51