## Overview

Omnilingual ASR uses several configuration classes to define model architecture, beam search, and streaming behavior.

## Wav2Vec2LlamaConfig

Top-level configuration combining the Wav2Vec2 encoder and Llama decoder settings.

### Structure
- Wav2Vec2 configuration for the encoder frontend and encoder.
- Llama configuration for the decoder.
- Beam search configuration for LLM-ASR decoding.
- Streaming configuration for transcriptions longer than 30 seconds.
- Number of audio embedding frames to stack before the decoder.
- Encoder freeze setting:
  - `0`: frozen
  - `1`: unfrozen
  - `N > 1`: unfrozen every N calls
- Probability of dropping language embeddings during training.
- Column name containing language information in batch metadata.
- Adapts the model input syntax for text-only context (instead of audio + text).
- Number of additional special tokens to allocate in the vocab-embedding mapping.
- High-level model type:
  - `ModelType.LLM_ASR`: standard ASR
  - `ModelType.LLM_ASR_LID`: ASR with language ID
  - `ModelType.ZERO_SHOT`: zero-shot with context
- Number of context examples for the zero-shot model.
### Vocabulary Parameters

- Index of the UNK token in the vocabulary.
- Index of the BOS (beginning-of-sequence) token.
- Index of the EOS (end-of-sequence) token.
- Index of the PAD token. Must align with Llama's `pad_idx`.

### Example
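The nested layout described above can be sketched as a set of dataclasses. This is a minimal stand-in, not the library's actual class definitions: field names such as `beam_search`, `streaming`, and `beam_size` are inferred from this page and should be verified against `config.py`.

```python
# Illustrative stand-ins for the real config classes; field names are
# inferred from this documentation page, not taken from config.py.
from dataclasses import dataclass, field

@dataclass
class BeamSearchCfg:            # stands in for Wav2Vec2LlamaBeamSearchConfig
    beam_size: int = 5          # number of hypotheses to maintain

@dataclass
class StreamingCfg:             # stands in for Wav2Vec2LlamaStreamingConfig
    enabled: bool = False       # streaming mode for audio longer than 30 s

@dataclass
class Wav2Vec2LlamaCfg:         # stands in for Wav2Vec2LlamaConfig
    beam_search: BeamSearchCfg = field(default_factory=BeamSearchCfg)
    streaming: StreamingCfg = field(default_factory=StreamingCfg)
    pad_idx: int = 0            # must align with the Llama decoder's pad_idx

cfg = Wav2Vec2LlamaCfg(streaming=StreamingCfg(enabled=True))
print(cfg.streaming.enabled, cfg.beam_search.beam_size)  # True 5
```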
## Wav2Vec2LlamaBeamSearchConfig

Configuration for beam search decoding.

### Parameters
- Size of the beam (number of hypotheses to maintain).
- Whether to apply length normalization when computing hypothesis scores.
- Window size for early-stopping detection. If the last N tokens compress at a ratio greater than `compression_threshold`, decoding stops.
- Compression ratio threshold for early stopping (used with `compression_window`).

### Example
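A hedged construction sketch, assuming a dataclass with fields named after the parameter descriptions above (the exact field names and defaults should be checked against `config.py`):

```python
# Hypothetical stand-in for Wav2Vec2LlamaBeamSearchConfig; field names
# are inferred from the parameter list above, defaults are assumptions.
from dataclasses import dataclass

@dataclass
class Wav2Vec2LlamaBeamSearchConfig:
    beam_size: int = 5                  # hypotheses maintained per step
    length_normalization: bool = True   # normalize scores by length
    compression_window: int = 50        # tokens checked for early stopping
    compression_threshold: float = 3.0  # stop if the window compresses above this

cfg = Wav2Vec2LlamaBeamSearchConfig(beam_size=8)
print(cfg.beam_size, cfg.compression_threshold)  # 8 3.0
```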
### Early Stopping

Early stopping prevents infinite loops on bad audio:

- During decoding, track the last `compression_window` tokens.
- Compute the compression ratio of that window (e.g., using zlib).
- If the ratio exceeds `compression_threshold`, stop decoding.
## Wav2Vec2LlamaStreamingConfig

Configuration for streaming mode (unlimited audio length).

### Parameters

- Enable streaming mode for audio longer than 30 seconds.
- Duration of each audio segment in seconds.
- Audio sample rate in Hz.
- Number of context segments to maintain (in addition to the current segment).
- Name of the text tokenizer for streaming mode.
- Minimum audio length in milliseconds. Shorter segments are dropped.

### Example
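A hedged construction sketch, assuming a dataclass whose fields mirror the parameter descriptions above; `min_audio_len_ms` and every default value here are hypothetical:

```python
# Hypothetical stand-in for Wav2Vec2LlamaStreamingConfig; field names and
# defaults are inferred from this page, not taken from config.py.
from dataclasses import dataclass

@dataclass
class Wav2Vec2LlamaStreamingConfig:
    enabled: bool = False          # streaming mode for audio longer than 30 s
    segment_secs: float = 30.0     # duration of each audio segment
    sample_rate: int = 16_000      # audio sample rate in Hz
    n_context_segments: int = 1    # previous segments kept as context
    min_audio_len_ms: int = 200    # hypothetical: shorter segments are dropped

cfg = Wav2Vec2LlamaStreamingConfig(enabled=True, segment_secs=20.0)
samples_per_segment = int(cfg.segment_secs * cfg.sample_rate)
print(samples_per_segment)  # 320000
```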
### Streaming Processing

- Segmentation: audio is split into `segment_secs` chunks.
- Context: the previous `n_context_segments` segments are maintained for continuity.
- Transcription: each segment is transcribed with its context.
- Concatenation: segment transcriptions are concatenated into the final output.
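The segmentation and context scheme above can be illustrated with a small generator over sample offsets. This is an assumed sketch of the described behavior, not the library's streaming loop:

```python
# Illustrative segmentation: yield (context_start, segment_start, segment_end)
# sample offsets for each chunk, keeping n_context_segments of left context.
def stream_segments(n_samples, sample_rate=16_000, segment_secs=30.0,
                    n_context_segments=1):
    seg_len = int(segment_secs * sample_rate)
    for start in range(0, n_samples, seg_len):
        ctx_start = max(0, start - n_context_segments * seg_len)
        yield ctx_start, start, min(start + seg_len, n_samples)

# 70 s of 16 kHz audio -> segments of 30 s, 30 s, and 10 s
segments = list(stream_segments(70 * 16_000))
for ctx, start, end in segments:
    print(ctx, start, end)
```

Each segment after the first is decoded with one preceding segment of audio as context, which is what keeps the per-segment transcriptions coherent when they are concatenated.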
## Wav2Vec2LlamaSpecialTokens

Special token indices allocated beyond the vocabulary size.

### Properties

All special tokens are allocated as `vocab_size + offset`:

#### Default/LID Syntax

- Language ID marker token: `vocab_size + 0`

#### Streaming Syntax

- Streaming language token: `vocab_size + 0`
- Last segment marker: `vocab_size + 1`
- Regular segment marker: `vocab_size + 2`

#### Context Syntax (Zero-Shot)

- Context start marker: `vocab_size + 0`
- Context end marker: `vocab_size + 1`
- Context example start marker: `vocab_size + 2`
- Context example end marker: `vocab_size + 3`
- Context BOS token: `vocab_size + 4`
- Context EOS token: `vocab_size + 5`

### Example
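The offset scheme above can be demonstrated with a small helper. The function name and the token-name keys are illustrative; the real class exposes these indices as properties:

```python
# Illustrative mapping of special-token names to indices past the vocabulary.
# Offsets follow the tables above; names are hypothetical stand-ins for the
# properties on Wav2Vec2LlamaSpecialTokens.
def special_tokens(vocab_size, syntax="default"):
    offsets = {
        "default":   {"lang_id": 0},
        "streaming": {"lang": 0, "last_segment": 1, "segment": 2},
        "context":   {"ctx_start": 0, "ctx_end": 1, "example_start": 2,
                      "example_end": 3, "ctx_bos": 4, "ctx_eos": 5},
    }
    return {name: vocab_size + off for name, off in offsets[syntax].items()}

print(special_tokens(32_000, "streaming"))
# {'lang': 32000, 'last_segment': 32001, 'segment': 32002}
```

Note that the three syntaxes reuse the same low offsets; a model only ever uses one syntax, so the indices never collide in practice.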
## ModelType Enum

Defines the high-level model variant.

### Usage
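A sketch of the enum and a typical comparison. The member names come from this page; the underlying values here are illustrative (`auto()`), not the library's actual values:

```python
# Member names from this page; values are illustrative, not from config.py.
from enum import Enum, auto

class ModelType(Enum):
    LLM_ASR = auto()      # standard ASR
    LLM_ASR_LID = auto()  # ASR with language ID
    ZERO_SHOT = auto()    # zero-shot with context examples

# Typical usage: branch on the variant when building the model.
model_type = ModelType.LLM_ASR_LID
print(model_type is ModelType.LLM_ASR_LID)  # True
```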
## Predefined Configurations

Omnilingual ASR provides predefined configurations:

| Config Name | Size | Type | Special Features |
|---|---|---|---|
| `300m` | 300M | LLM_ASR_LID | Language conditioning |
| `1b` | 1B | LLM_ASR_LID | Language conditioning |
| `3b` | 3B | LLM_ASR_LID | Language conditioning |
| `7b` | 7B | LLM_ASR_LID | Language conditioning |
| `7b_zs` | 7B | ZERO_SHOT | Zero-shot, 10 context examples |
| `7b_unlimited_v2` | 7B | LLM_ASR_LID | Streaming enabled |
### Loading Configs
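This page does not show the actual loading entry point, so the sketch below is only a generic name-to-config registry pattern. `ARCHS` and `load_config` are hypothetical names, not the library's API; consult the source for the real mechanism:

```python
# Hypothetical registry pattern for the predefined configs in the table
# above; ARCHS and load_config are illustrative names, not the real API.
ARCHS = {
    "300m":            {"model_type": "LLM_ASR_LID"},
    "1b":              {"model_type": "LLM_ASR_LID"},
    "3b":              {"model_type": "LLM_ASR_LID"},
    "7b":              {"model_type": "LLM_ASR_LID"},
    "7b_zs":           {"model_type": "ZERO_SHOT", "n_context_examples": 10},
    "7b_unlimited_v2": {"model_type": "LLM_ASR_LID", "streaming": True},
}

def load_config(name: str) -> dict:
    """Return a copy of the named predefined configuration."""
    if name not in ARCHS:
        raise KeyError(f"unknown config: {name!r}")
    return dict(ARCHS[name])

print(load_config("7b_zs"))
# {'model_type': 'ZERO_SHOT', 'n_context_examples': 10}
```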
## Configuration Validation

Configs are validated in `__post_init__`:
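A minimal sketch of dataclass validation in `__post_init__`. The specific checks here are assumptions chosen to match the streaming parameters documented above; the real checks live in `config.py`:

```python
# Illustrative __post_init__ validation; the specific checks are assumed.
from dataclasses import dataclass

@dataclass
class StreamingCfg:  # hypothetical stand-in
    segment_secs: float = 30.0
    sample_rate: int = 16_000
    n_context_segments: int = 1

    def __post_init__(self):
        # dataclasses call this automatically after __init__.
        if self.segment_secs <= 0:
            raise ValueError("segment_secs must be positive")
        if self.sample_rate <= 0:
            raise ValueError("sample_rate must be positive")
        if self.n_context_segments < 0:
            raise ValueError("n_context_segments must be >= 0")

try:
    StreamingCfg(segment_secs=-1.0)
except ValueError as e:
    print(e)  # segment_secs must be positive
```

Validating in `__post_init__` means an invalid config fails loudly at construction time rather than deep inside model building.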
## Complete Example
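The pieces above can be tied together in one self-contained sketch: a model-type enum, nested beam-search and streaming configs, and `__post_init__` validation. All class and field names here are inferred from this page, not verified against `config.py`:

```python
# End-to-end illustrative sketch; names are stand-ins for the real classes.
from dataclasses import dataclass, field
from enum import Enum, auto

class ModelType(Enum):
    LLM_ASR = auto()
    LLM_ASR_LID = auto()
    ZERO_SHOT = auto()

@dataclass
class BeamSearchCfg:
    beam_size: int = 5
    compression_window: int = 50
    compression_threshold: float = 3.0

@dataclass
class StreamingCfg:
    enabled: bool = False
    segment_secs: float = 30.0
    n_context_segments: int = 1

@dataclass
class Wav2Vec2LlamaCfg:
    model_type: ModelType = ModelType.LLM_ASR_LID
    beam_search: BeamSearchCfg = field(default_factory=BeamSearchCfg)
    streaming: StreamingCfg = field(default_factory=StreamingCfg)
    pad_idx: int = 0  # must align with the Llama decoder's pad_idx

    def __post_init__(self):
        if self.streaming.enabled and self.streaming.segment_secs <= 0:
            raise ValueError("segment_secs must be positive")

# A "7b_unlimited_v2"-style setup: LID model with streaming enabled.
cfg = Wav2Vec2LlamaCfg(streaming=StreamingCfg(enabled=True))
print(cfg.model_type.name, cfg.streaming.enabled)  # LLM_ASR_LID True
```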
## Source Reference

See the implementation at `src/omnilingual_asr/models/wav2vec2_llama/config.py`.