Language conditioning allows LLM-based ASR models (wav2vec2_llama) to leverage language information during decoding, improving transcription accuracy. It is only supported by LLM-based models (omniASR_LLM_*); CTC models (omniASR_CTC_*) perform direct frame-level classification and ignore the language parameter.
Omnilingual ASR supports 1,682 languages with their script variants. The complete list is defined in /src/omnilingual_asr/models/wav2vec2_llama/lang_ids.py:9-1682.
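Language IDs pair a lowercase ISO 639-3 language code with a title-case ISO 15924 script tag, joined by an underscore (e.g. `eng_Latn`, `fra_Latn`). A minimal sketch of splitting and sanity-checking that format (the helper itself is illustrative, not part of the library):

```python
# Illustrative helper -- not part of omnilingual_asr.
# Language IDs follow the "<iso639_3>_<Script>" pattern, e.g. "eng_Latn".
def split_lang_id(lang_id: str) -> tuple[str, str]:
    """Split a language ID into (language code, script tag)."""
    code, _, script = lang_id.partition("_")
    if len(code) != 3 or not code.islower() or not script.istitle():
        raise ValueError(f"Unexpected language ID format: {lang_id!r}")
    return code, script

print(split_lang_id("eng_Latn"))  # ('eng', 'Latn')
```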
You can omit language IDs, but quality may degrade:
```python
# No language conditioning
transcriptions = pipeline.transcribe(
    inp=audio_files,
    batch_size=4
)
# Warning logged: "Using an LLM model without a `lang` code
# can lead to degraded transcription quality."
```
Language conditioning has no effect on CTC models:
```python
# Using CTC model
pipeline = ASRInferencePipeline(
    model_card="omniASR_CTC_300M_v2"  # CTC model
)
# Language parameter is ignored
transcriptions = pipeline.transcribe(
    inp=audios,
    lang=["eng_Latn"] * len(audios)  # ⚠️ Ignored!
)
# Info logged: "Found lang=... with a CTC model. Ignoring."
```
Code-Switching Audio
For audio that mixes multiple languages, conditioning on a single language may hurt accuracy:
```python
# Audio contains English and Spanish
# Don't use language conditioning
transcriptions = pipeline.transcribe(
    inp=[codeswitched_audio],
    lang=None  # Better without conditioning
)
```
Uncertain Language Labels
If language labels are unreliable, avoid conditioning:
```python
# Uncertain labels from weak classifier
if language_confidence < 0.8:
    lang = None
else:
    lang = detected_language

transcription = pipeline.transcribe(
    inp=[audio],
    lang=[lang]
)
```
During training, language IDs are automatically included from dataset metadata:
```python
# Dataset partition structure
Partition(lang="eng_Latn", corpus="librispeech")
Partition(lang="fra_Latn", corpus="common_voice")

# Language info flows through the pipeline
batch.example["lang"] = ["eng_Latn", "eng_Latn", "fra_Latn"]
```
```python
from omnilingual_asr.models.wav2vec2_llama.lang_ids import supported_langs

# Check if language is supported
lang_id = "eng_Latn"
if lang_id in supported_langs:
    print(f"{lang_id} is supported")
else:
    print(f"{lang_id} is NOT supported")

# List all supported languages
print(f"Total languages: {len(supported_langs)}")
for lang in supported_langs[:10]:
    print(lang)
```
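Because the supported list includes script variants, it can be useful to see which scripts a given base language is available in. A sketch of grouping IDs by language code, using a small stand-in list in place of the real `supported_langs` import (the specific entries here are illustrative assumptions):

```python
from collections import defaultdict

# Stand-in for omnilingual_asr's supported_langs -- illustration only.
supported_langs = ["eng_Latn", "fra_Latn", "srp_Latn", "srp_Cyrl"]

# Map each base language code to the scripts it appears with.
scripts_by_code = defaultdict(list)
for lang_id in supported_langs:
    code, _, script = lang_id.partition("_")
    scripts_by_code[code].append(script)

print(scripts_by_code["srp"])  # ['Latn', 'Cyrl']
```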
Language IDs can also be set per item, using None for entries where the language is unknown:

```python
languages = [
    "eng_Latn" if is_english(audio) else None
    for audio in audios
]
```
Validate Language IDs
Check against supported languages before use:
```python
from omnilingual_asr.models.wav2vec2_llama.lang_ids import supported_langs

def safe_get_lang(detected_lang):
    if detected_lang in supported_langs:
        return detected_lang
    print(f"Warning: {detected_lang} not supported")
    return None
```
Batch by Language When Possible
Group same-language audio for potentially better batching:
```python
# Group by language
from collections import defaultdict

by_lang = defaultdict(list)
for audio, lang in zip(audios, languages):
    by_lang[lang].append(audio)

# Process each language group
for lang, lang_audios in by_lang.items():
    transcriptions = pipeline.transcribe(
        inp=lang_audios,
        lang=[lang] * len(lang_audios),
        batch_size=8
    )
```
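One caveat with grouping: transcriptions come back per language group, so you need bookkeeping to map them back to the original input order. A pure-Python sketch, with a dummy stand-in for the actual `pipeline.transcribe` call:

```python
from collections import defaultdict

audios = ["a0", "a1", "a2", "a3"]  # placeholder audio items
languages = ["eng_Latn", "fra_Latn", "eng_Latn", None]

# Group indices (not just audios) by language so order can be restored.
by_lang = defaultdict(list)
for idx, lang in enumerate(languages):
    by_lang[lang].append(idx)

results = [None] * len(audios)
for lang, indices in by_lang.items():
    group = [audios[i] for i in indices]
    # Stand-in for: pipeline.transcribe(inp=group, lang=[lang] * len(group))
    group_transcriptions = [f"{item}:{lang}" for item in group]
    # Scatter group results back to their original positions.
    for i, text in zip(indices, group_transcriptions):
        results[i] = text

print(results)  # ['a0:eng_Latn', 'a1:fra_Latn', 'a2:eng_Latn', 'a3:None']
```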
```python
# Check if supported
from omnilingual_asr.models.wav2vec2_llama.lang_ids import supported_langs

if your_lang_id not in supported_langs:
    print(f"{your_lang_id} not supported")
    # Find similar
    similar = [l for l in supported_langs if l.startswith(your_lang_id[:3])]
    print(f"Similar languages: {similar}")
```
Wrong language conditioning
If you condition on the wrong language, transcription quality degrades silently. Always validate your language IDs:
```python
# Add validation
assert all(l in supported_langs or l is None for l in languages)
```
Length mismatch error
```
AssertionError: `lang` must be a list of the same length as `inp`
```
Fix:
```python
# Ensure same length
assert len(audios) == len(languages)

# Or use None for all
languages = [None] * len(audios)
```