Overview
ASRInferencePipeline provides a high-level interface for performing speech-to-text transcription using Omnilingual ASR models. It handles audio preprocessing, model inference, and beam search decoding.
Constructor
Parameters:

- model_card: Model card name to load from the hub (e.g., "omniASR_LLM_7B"). Mutually exclusive with the model and tokenizer parameters. Recommended for inference.
- model: Pre-loaded model instance. Mutually exclusive with model_card. Must be provided together with tokenizer.
- tokenizer: Pre-loaded tokenizer instance. Mutually exclusive with model_card. Must be provided together with model.
- device: Device to run inference on. Defaults to "cuda" if available, otherwise "cpu".
- dtype: Data type for model inference.
- Beam search configuration: Optional. If not provided, a default configuration with nbest=1 and length_norm=False is used.

Example
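A minimal construction sketch, assuming the import path shown in the source reference below; the device and dtype keyword names are assumptions based on the parameter descriptions above:

```python
import torch

from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

# Recommended path: load model and tokenizer from a hub model card.
pipeline = ASRInferencePipeline(
    model_card="omniASR_LLM_7B",
    device="cuda" if torch.cuda.is_available() else "cpu",  # matches the default behavior
    dtype=torch.float16,  # assumed keyword; pick a dtype your hardware supports
)
```

Passing a pre-loaded model requires passing its tokenizer as well (and vice versa), and neither may be combined with model_card.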
Methods
transcribe
Parameters:

- inp: Audio input in one of the following formats:
  - List[Path | str]: audio file paths
  - List[bytes]: raw audio data
  - List[np.ndarray]: audio data as uint8 numpy arrays
  - List[dict]: pre-decoded audio with 'waveform' and 'sample_rate' keys
- lang: Language codes for the input audios (e.g., 'eng_Latn', 'fra_Latn'). Must be the same length as inp. Ignored for CTC models; for LLM models, providing language codes improves transcription quality.
- batch_size: Number of audio samples to process in each batch.

Returns: Transcribed texts for each input audio.
Example
Maximum audio length is capped at 40 seconds per sample. For longer audio, use the streaming model variant.
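A usage sketch for the two most common input formats, assuming the lang and batch_size keyword names from the parameter list above (file names here are placeholders):

```python
import numpy as np

from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

pipeline = ASRInferencePipeline(model_card="omniASR_LLM_7B")

# File-path input: each clip must be at most 40 seconds long.
texts = pipeline.transcribe(
    ["clip_en.wav", "clip_fr.wav"],
    lang=["eng_Latn", "fra_Latn"],  # one language code per input audio
    batch_size=2,
)

# Pre-decoded input: dicts with 'waveform' and 'sample_rate' keys.
waveform = np.zeros(16_000, dtype=np.float32)  # 1 s of silence at 16 kHz (illustrative)
texts = pipeline.transcribe(
    [{"waveform": waveform, "sample_rate": 16_000}],
    lang=["eng_Latn"],
)
```

For CTC models the lang argument can be omitted, since language codes are ignored there.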
transcribe_with_context
Transcribes audio using in-context examples. Only supported by zero-shot models such as omniASR_LLM_7B_ZS.

Parameters:

- inp: Audio input (same formats as the transcribe method).
- Context examples: A list of context examples for each input audio. Each inner list contains audio-text pairs demonstrating the transcription style and language. At least one context example is required per input. If fewer than 10 examples are provided, they are replicated; if more than 10 are provided, only the first 10 are used.
- batch_size: Number of audio samples to process in each batch.

Returns: Transcribed texts for each input audio.
Example
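A sketch of zero-shot usage. The exact structure of an audio-text pair is not specified here, so the tuple form below and the positional argument order are assumptions; the one-inner-list-per-input shape follows the parameter description above:

```python
from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

# transcribe_with_context requires a zero-shot model variant.
pipeline = ASRInferencePipeline(model_card="omniASR_LLM_7B_ZS")

# One inner list of context examples per input audio. Between 1 and 10
# examples are used: fewer than 10 are replicated, extras are dropped.
context = [[
    ("example_1.wav", "transcript of the first example"),  # assumed pair format
    ("example_2.wav", "transcript of the second example"),
]]

texts = pipeline.transcribe_with_context(["target_clip.wav"], context, batch_size=1)
print(texts[0])
```

Calling transcribe_with_context() on a non-zero-shot model raises NotImplementedError, as noted in the Raises section.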
Raises
- ValueError: If both model_card and model/tokenizer are provided, or if only one of model/tokenizer is provided.
- ValueError: If audio exceeds 40 seconds (non-streaming models).
- NotImplementedError: If transcribe_with_context() is called on non-zero-shot models, or if transcribe() is called on zero-shot models.
Source Reference
See the implementation at src/omnilingual_asr/models/inference/pipeline.py:148.