Overview
Wav2Vec2LlamaModel combines a Wav2Vec2 encoder with a Llama decoder for automatic speech recognition. It supports three model variants:
- LLM_ASR: Standard encoder-decoder ASR
- LLM_ASR_LID: ASR with language identification conditioning
- ZERO_SHOT: Zero-shot learning with context examples
Constructor
Core Parameters
Model variant, one of:
- ModelType.LLM_ASR: Standard ASR
- ModelType.LLM_ASR_LID: ASR with language identification
- ModelType.ZERO_SHOT: Zero-shot with context examples
- Model dimension of the transformer decoder.
- Wav2Vec2 encoder frontend for feature extraction.
- Wav2Vec2 encoder.
- Projection layer from encoder outputs to the decoder dimension.
- Text token embedding module.
- Llama decoder-only model.
- Final projection layer from decoder states to vocabulary logits.
- Vocabulary information, including size and special token indices.
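For orientation, the three variants behave like members of a small enum that the constructor branches on. A minimal sketch (member names are taken from this page; the helper function is illustrative, not the library's API):

```python
from enum import Enum, auto

class ModelType(Enum):
    """Sketch of the model-variant switch described above (names from this page)."""
    LLM_ASR = auto()      # standard encoder-decoder ASR
    LLM_ASR_LID = auto()  # ASR conditioned on a language-ID embedding
    ZERO_SHOT = auto()    # zero-shot ASR with in-context examples

def needs_lang_embeddings(model_type: ModelType) -> bool:
    # Per the parameter list below, only the LID variant requires
    # the language embedding module.
    return model_type is ModelType.LLM_ASR_LID
```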
Optional Parameters
- Feature masker for Wav2Vec2 (used during training).
- Maximum length of generated sequences in the decoder.
- Number of encoder frames to stack before feeding the decoder (for sequence compression).
- Probability of using language embeddings (for the LID model).
- Dropout probability during training.
- Name of the batch metadata field containing language information.
- Language embedding module (required for the LID model).
- Mapping from language codes to embedding indices.
- Whether to use text-only context (instead of audio+text).
- Beam search configuration for decoding.
- Streaming configuration for audio longer than 30 s.
- Text encoder for streaming mode.
- Number of context examples for the zero-shot model.
- Random seed for reproducibility.
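The frame-stacking option above shortens the encoder output before it reaches the decoder: every k consecutive encoder frames are concatenated along the feature axis, dividing the sequence length by k (and multiplying the frame dimension by k). A minimal sketch with plain Python lists (the real implementation operates on tensors, and may handle trailing frames differently):

```python
def stack_frames(frames, k):
    """Concatenate every k consecutive frames, reducing length by a factor of k.

    frames: list of feature vectors (lists of floats), length T
    returns: list of length T // k, each entry of dimension k * dim
    Trailing frames that do not fill a complete group are dropped in this
    sketch; the real model may pad instead.
    """
    stacked = []
    for i in range(0, len(frames) - k + 1, k):
        group = []
        for frame in frames[i:i + k]:
            group.extend(frame)
        stacked.append(group)
    return stacked

# Example: 6 frames of dimension 2, stacked in groups of 3 -> 2 frames of dimension 6
frames = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]]
out = stack_frames(frames, 3)
```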
Models are typically loaded using load_model("omniASR_LLM_7B") rather than constructed directly.
Forward Pass
- Input batch containing the source audio and target text.
- Whether to return logits along with the loss (for debugging).
- Whether to return decoder inputs for beam search (inference mode).
Return Values
The forward pass returns the training loss; depending on the flags above, it can also return the logits (for debugging) or the decoder inputs used for beam search.
Model Architectures
Standard LLM-ASR
Input syntax:
Zero-Shot Model
Input syntax:
Streaming Model
Input syntax:
- <regular_segment>: For intermediate segments
- <last_segment>: For the final segment
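The streaming model splits audio longer than 30 s into segments, tagging each with <regular_segment> or <last_segment> so the decoder knows whether more audio follows. A schematic sketch of that segmentation (the helper name and fixed 30 s chunking are illustrative assumptions, not the library's API):

```python
def segment_audio(samples, sample_rate=16_000, max_seconds=30):
    """Split a waveform into chunks of at most max_seconds and tag each chunk.

    Returns a list of (tag, chunk) pairs: every chunk except the last is
    tagged "<regular_segment>", and the final one "<last_segment>".
    """
    chunk = sample_rate * max_seconds
    pieces = [samples[i:i + chunk] for i in range(0, len(samples), chunk)] or [samples]
    tags = ["<regular_segment>"] * (len(pieces) - 1) + ["<last_segment>"]
    return list(zip(tags, pieces))

# 70 s of (silent) fake audio at 16 kHz -> two regular segments + one last segment
audio = [0.0] * (16_000 * 70)
segments = segment_audio(audio)
```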
Embedding Methods
embed_audio
Parameters:
- Audio waveforms [batch_size, time].
- Actual sequence lengths.
Returns:
- Embedded audio [batch_size, reduced_time, model_dim].
- Reduced sequence lengths after the encoder.
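The "reduced" time axis in embed_audio comes from the encoder's convolutional downsampling plus the optional frame stacking described earlier. A rough sketch of the length bookkeeping; the 320-sample stride (one frame per 20 ms of 16 kHz audio) is the usual wav2vec2 value and is stated here as an assumption:

```python
def reduced_length(num_samples, stride=320, stack=1):
    """Approximate output length after encoding.

    Assumes (see lead-in) conv downsampling of `stride` samples per frame,
    followed by optional stacking of `stack` consecutive frames.
    """
    frames = num_samples // stride
    return frames // stack

# 10 s of 16 kHz audio -> ~500 encoder frames, ~125 after stacking 4 frames
n = reduced_length(16_000 * 10, stack=4)
```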
embed_text
Parameters:
- Text token indices [batch_size, seq_len].
- Target dtype for the embeddings.
Returns:
- Text embeddings [batch_size, seq_len, model_dim].
Training Example
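The original training example is not reproduced here. As a schematic stand-in, the snippet below shows the quantity the forward pass computes conceptually: teacher-forced next-token cross-entropy over the target transcript. This is a pure-Python toy, not the library's API:

```python
import math

def cross_entropy(logits, target_ids):
    """Average negative log-likelihood of target_ids under per-step logits.

    logits: one list of vocabulary scores per target position
    target_ids: list of int token ids (the ground-truth transcript)
    """
    total = 0.0
    for scores, tgt in zip(logits, target_ids):
        # Numerically stable log-softmax over the vocabulary scores.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[tgt]
    return total / len(target_ids)

# Two decoding steps over a 3-token vocabulary; the target sequence is [0, 2].
# The model strongly prefers the correct token at each step, so the loss is small.
logits = [[5.0, 0.0, 0.0], [0.0, 0.0, 5.0]]
loss = cross_entropy(logits, [0, 2])
```

In real training, this loss is what backpropagation minimizes; the model's forward pass returns it directly.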
Inference Example
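Likewise, in place of the original inference example, here is a toy greedy decoding loop over per-step decoder scores: generation stops at the end-of-sequence token or at the maximum length. The step function and token ids are made up for illustration; real decoding uses the beam search configuration described above:

```python
def greedy_decode(step_fn, bos_id, eos_id, max_len):
    """Repeatedly pick the highest-scoring next token until EOS or max_len."""
    tokens = [bos_id]
    while len(tokens) < max_len:
        scores = step_fn(tokens)  # vocabulary scores for the next position
        next_id = max(range(len(scores)), key=scores.__getitem__)
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # drop the BOS token

# Toy "model" over a 4-token vocabulary: prefers token 3 for two steps,
# then prefers EOS (id 1).
def toy_step(tokens):
    if len(tokens) >= 3:
        return [0.0, 9.0, 0.0, 1.0]  # prefer EOS
    return [0.0, 0.0, 0.0, 9.0]      # prefer token 3

out = greedy_decode(toy_step, bos_id=2, eos_id=1, max_len=10)
```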
Model Variants
| Model Card | Type | Parameters | Features |
|---|---|---|---|
| omniASR_LLM_300M | LLM_ASR_LID | 300M | Language conditioning |
| omniASR_LLM_1B | LLM_ASR_LID | 1B | Language conditioning |
| omniASR_LLM_3B | LLM_ASR_LID | 3B | Language conditioning |
| omniASR_LLM_7B | LLM_ASR_LID | 7B | Language conditioning |
| omniASR_LLM_7B_ZS | ZERO_SHOT | 7B | Zero-shot learning |
| omniASR_LLM_7B_Unlimited | LLM_ASR_LID | 7B | Streaming (unlimited length) |
See Also
- ASRInferencePipeline - High-level inference API
- Wav2Vec2LlamaConfig - Model configuration
- Wav2Vec2LlamaBeamSearchConfig - Beam search settings
Source Reference
See implementation at src/omnilingual_asr/models/wav2vec2_llama/model.py:43