Overview
Omnilingual ASR provides pre-configured fairseq2 recipes that combine models, datasets, and hyperparameters into reproducible training workflows. A recipe can be run with a single command and handles data loading, model initialization, optimization, and checkpointing automatically.
Training Recipe
`wav2vec2.asr.recipe`: supports CTC and LLM training from scratch or fine-tuning from checkpoints.

Evaluation Recipe

`wav2vec2.asr.eval.recipe`: tests model performance on evaluation datasets.

Quick Start
Training Configurations
Four main configuration files are provided for different training scenarios:
- CTC from Encoder
- CTC Fine-tuning
- LLM from Encoder
- LLM Fine-tuning
Train a CTC model starting from a pre-trained W2V encoder checkpoint.

Config: `configs/ctc-from-encoder.yaml`
Use Case: building a new CTC model with custom data

Recommended Fine-tuning Settings
These settings target users with limited compute resources who want to fine-tune smaller CTC models on low-resource languages. Based on Section 5.7.5 of the research paper, fine-tuned smaller CTC models are competitive with 7B LLM models on specific languages.
Optimal Configuration
Config: `configs/ctc-finetune-recommendation.yaml`
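For orientation, the documented values can be gathered into a single config fragment. The nesting below is illustrative only; consult the shipped `configs/ctc-finetune-recommendation.yaml` for the exact structure.

```yaml
# Illustrative fragment -- key nesting may differ from the real
# configs/ctc-finetune-recommendation.yaml.
dataset:
  max_audio_len: 960_000        # 60 s at 16 kHz
  max_num_elements: 7_680_000   # ~8 full-length samples per batch
optimizer:
  lr: 1.0e-05                   # low lr avoids catastrophic forgetting
trainer:
  grad_accumulation: 1          # raise if you run out of memory
regime:
  num_steps: 5_000              # enough for per-language fine-tuning
```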
Key Parameters Explained
Audio Length Settings:
- `max_audio_len: 960_000`: maximum 60 seconds at 16kHz
- `max_num_elements: 7_680_000`: allows 8x 60s samples, or more shorter samples

Optimization Settings:
- `lr: 1e-05`: lower learning rate prevents catastrophic forgetting
- `grad_accumulation: 1`: increase if running out of memory
- `num_steps: 5_000`: sufficient for fine-tuning on specific languages; more steps may be needed for larger datasets
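The arithmetic behind the audio length settings is easy to verify, assuming 16 kHz audio (16,000 samples per second):

```python
# Sanity-check the audio length settings, assuming 16 kHz audio
# (16_000 samples per second).
SAMPLE_RATE = 16_000

max_audio_len = 960_000        # samples per example
max_num_elements = 7_680_000   # total samples per batch

max_seconds = max_audio_len / SAMPLE_RATE                  # 60.0 s
full_length_per_batch = max_num_elements // max_audio_len  # 8 examples

print(max_seconds, full_length_per_batch)  # -> 60.0 8
```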
GPU Requirements
Reference configurations from paper:
- 300M model: 32 GPUs
- 1B model: 64 GPUs
- 3B model: 96 GPUs
If you have fewer GPUs:
- Increase `grad_accumulation.num_batches` to compensate for fewer GPUs
- Reduce `max_num_elements` to fit in available memory
- Use gradient checkpointing for memory savings
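One way to apply the first tip is to scale gradient accumulation so the effective batch size stays roughly constant relative to the paper's reference GPU counts. This compensation rule is a common heuristic, not logic taken from the recipe itself:

```python
import math

def grad_accum_for(reference_gpus, available_gpus, base_accum=1):
    """Keep the effective batch size roughly constant when training on
    fewer GPUs than the paper's reference setup. A common heuristic,
    not code from the recipe."""
    return max(base_accum,
               math.ceil(base_accum * reference_gpus / available_gpus))

# 300M model: reference is 32 GPUs; on 8 GPUs, accumulate 4 batches.
print(grad_accum_for(32, 8))   # -> 4
print(grad_accum_for(96, 16))  # -> 6 (3B model on 16 GPUs)
```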
Configuration Parameters
Dataset Configuration
Sampling Configuration
Temperature Sampling:
- `beta_corpus`: controls the corpus mixture distribution
  - 0.5: balanced sampling
  - Higher: more uniform across corpora
  - Lower: biased toward larger corpora
- `beta_language`: controls the language mixture distribution
  - 0.5: balanced across languages
  - Higher: more uniform sampling
  - Lower: biased toward high-resource languages
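The exact sampling formula is not spelled out above. A temperature-style scheme consistent with the described behavior is sketched below; the `n ** (1 - beta)` form is an assumption chosen to match the documented semantics, not the recipe's actual implementation:

```python
def mixture_weights(sizes, beta):
    """Temperature-style sampling weights over corpora or languages.

    Assumes w_i proportional to n_i ** (1 - beta): low beta stays close
    to size-proportional sampling, high beta approaches uniform, and
    beta=0.5 sits in between."""
    raw = [n ** (1.0 - beta) for n in sizes]
    total = sum(raw)
    return [w / total for w in raw]

sizes = [1_000_000, 10_000, 100]   # utterances per corpus
proportional = mixture_weights(sizes, beta=0.1)  # favors large corpora
balanced = mixture_weights(sizes, beta=0.5)
uniform_ish = mixture_weights(sizes, beta=0.9)   # close to uniform
```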
Audio Constraints
- `min_audio_len`: minimum audio length (in samples at 16kHz)
  - Example: `32_000` = 2 seconds
- `max_audio_len`: maximum audio length
  - Example: `960_000` = 60 seconds
- `max_num_elements`: maximum total samples in a batch
  - Controls memory usage
  - Can be multiple short samples or fewer long samples
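A toy batcher illustrates how these three constraints interact; this greedy sketch is for intuition only, not the recipe's actual batching logic:

```python
def make_batches(lengths, max_num_elements,
                 min_audio_len=32_000, max_audio_len=960_000):
    """Greedy length-based batching sketch (not the recipe's actual
    batcher): drop clips outside the length bounds, then pack samples
    until adding one more would exceed max_num_elements."""
    batches, current, total = [], [], 0
    for n in lengths:
        if not (min_audio_len <= n <= max_audio_len):
            continue  # filtered by the audio constraints
        if current and total + n > max_num_elements:
            batches.append(current)
            current, total = [], 0
        current.append(n)
        total += n
    if current:
        batches.append(current)
    return batches

# The 10_000-sample clip (< min_audio_len) is dropped; the rest pack
# into one batch totalling exactly 2_400_000 elements.
batches = make_batches([960_000, 960_000, 10_000, 480_000],
                       max_num_elements=2_400_000)
print(batches)  # -> [[960000, 960000, 480000]]
```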
Shuffling
- `batch_shuffle_window`: number of batches to shuffle
  - 1: minimal shuffling
  - Higher: more randomization
- `example_shuffle_window`: number of examples to shuffle
  - 0: full-batch shuffling
  - 1: minimal shuffling
  - Higher: window-based shuffling
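Window-based shuffling trades randomness for memory: only a bounded buffer of items is held at a time. A toy sketch, not fairseq2's implementation:

```python
import random

def window_shuffle(items, window, seed=0):
    """Window-based shuffling sketch (an illustration, not fairseq2's
    actual implementation). Follows the documented semantics: 0 means
    shuffle everything, 1 stays close to the original order, and
    larger windows randomize more."""
    items = list(items)
    rng = random.Random(seed)
    if window == 0:  # full-batch shuffling
        rng.shuffle(items)
        return items
    buffer, out = [], []
    for item in items:
        buffer.append(item)
        if len(buffer) > window:
            out.append(buffer.pop(rng.randrange(len(buffer))))
    rng.shuffle(buffer)  # drain the remaining buffered items
    out.extend(buffer)
    return out

shuffled = window_shuffle(range(10), window=3)
```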
Trainer Configuration
- `freeze_encoder_for_n_steps`: freeze the encoder during the initial training steps
  - Useful when training a decoder from scratch
  - 0: no freezing
- `mixed_precision.dtype`: use mixed-precision training
  - `torch.bfloat16`: better numerical stability than fp16
  - Reduces memory and increases speed
- `grad_accumulation.num_batches`: accumulate gradients over N batches
  - Effectively increases batch size without more memory
  - Increase if running out of GPU memory
- `data_parallelism` (LLM only): distributed training strategy
  - `fsdp`: Fully Sharded Data Parallel for large models
  - Required for 1B+ parameter models
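As a sketch of how these options might sit together in a recipe config; the nesting is inferred from the dotted option names above, not taken from a real file:

```yaml
# Hypothetical nesting inferred from the dotted option names above.
trainer:
  freeze_encoder_for_n_steps: 0     # 0 = never freeze the encoder
  mixed_precision:
    dtype: torch.bfloat16           # more stable than fp16
  grad_accumulation:
    num_batches: 1                  # raise to simulate a larger batch
  data_parallelism: fsdp            # LLM recipes only; needed for 1B+
```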
Optimizer Configuration
Regime Configuration
Dataset Backends
The recipe system supports multiple storage and task backends.

Storage Backends
- Mixture Parquet
- Manifest Storage
Implementation: `MixtureParquetStorage`

Features: optimized for large-scale multilingual training with weighted sampling:
- Temperature-based sampling across corpora and languages
- Efficient streaming from partitioned parquet files
- Built-in statistics tracking
Task Backends
- ASR Task
- SSL Task
Recipe Structure
Running Evaluation
Test model performance using the evaluation recipe. It produces:
- Hypothesis transcriptions
- WER/CER metrics
- Per-language performance statistics
Custom Dataset Integration
To use your own dataset with the training recipes:

Prepare Dataset
Convert your data to parquet format following the Data Preparation Guide:
Create Asset Card
Define a dataset asset card at
`src/omnilingual_asr/cards/datasets/my_dataset.yaml`:

Monitoring Training
Training metrics are logged at intervals specified in the regime configuration.

Best Practices
Start Small, Scale Up
- Test with 300M model first
- Verify data pipeline works correctly
- Scale to larger models once validated
- Use gradient accumulation to simulate larger batch sizes
Fine-tuning Strategy
- Use a lower learning rate (1e-05 vs 5e-05)
- Train for fewer steps (5K vs 20K)
- Monitor validation loss for overfitting
- Consider freezing encoder initially
Memory Management
- Start with a small `max_num_elements`
- Increase `grad_accumulation` if OOM
- Use mixed precision (bfloat16)
- Enable gradient checkpointing for large models
Distributed Training
- Use FSDP for models > 1B parameters
- Adjust GPU count based on model size
- Scale learning rate with batch size
- Monitor communication overhead
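"Scale learning rate with batch size" usually refers to the linear scaling heuristic. A minimal sketch; this rule is a common convention, not one stated by the recipe:

```python
def scaled_lr(base_lr, batch_size, base_batch_size):
    """Linear learning-rate scaling sketch (a common heuristic, not a
    rule taken from this recipe): scale lr with the effective batch."""
    return base_lr * batch_size / base_batch_size

# Doubling the effective batch size doubles the learning rate.
print(scaled_lr(1e-05, 64, 32))  # -> 2e-05
```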
Troubleshooting
Out of Memory (OOM)
Solutions:
- Reduce `max_num_elements`
- Increase `grad_accumulation.num_batches`
- Reduce `max_audio_len`
- Enable gradient checkpointing
- Use a smaller model variant
Training Diverges
Solutions:
- Lower learning rate
- Increase warmup steps
- Check data quality and normalization
- Reduce batch size
- Verify gradient clipping is enabled
Slow Training
Solutions:
- Increase `batch_size` or `max_num_elements`
- Reduce validation frequency
- Enable data caching (`fragment_loading.cache: True`)
- Use faster storage (local SSD vs a network filesystem)
- Profile data loading bottlenecks
Next Steps
Data Preparation
Learn how to prepare datasets for training
Inference Guide
Use your trained models for transcription
Model Architectures
Understand the model internals
GitHub Examples
Explore more training examples