Comprehensive guide to training CTC and LLM models using fairseq2 recipes with real-world configurations and best practices.

Overview

Omnilingual ASR provides pre-configured fairseq2 recipes that combine models, datasets, and hyperparameters into reproducible training workflows.
Recipes are training workflows you can run with a single command. They handle data loading, model initialization, optimization, and checkpointing automatically.
Available Recipes:

Training Recipe

wav2vec2.asr.recipe: supports CTC and LLM training from scratch or fine-tuning from checkpoints.

Evaluation Recipe

wav2vec2.asr.eval.recipe: tests model performance on evaluation datasets.

Quick Start

1. Set Output Directory

Configure where training artifacts (checkpoints, logs) will be saved:
export OUTPUT_DIR="/path/to/artifact/directory"
2. Navigate to Repository

cd omnilingual_asr
3. Run Training Recipe

python -m workflows.recipes.wav2vec2.asr $OUTPUT_DIR \
  --config-file workflows/recipes/wav2vec2/asr/configs/ctc-finetune.yaml
The recipe will automatically handle distributed training, checkpointing, validation, and metrics logging.

Training Configurations

Four main configuration files are provided for different training scenarios:

CTC from Encoder

Train a CTC model starting from a pre-trained W2V encoder checkpoint. Config: configs/ctc-from-encoder.yaml
model:
  name: "omniASR_CTC_300M"

dataset:
  name: "example_dataset"
  train_split: "train"
  valid_split: "dev"
  storage_mode: "MIXTURE_PARQUET"
  task_mode: "ASR"
  mixture_parquet_storage_config:
    dataset_summary_path: "/path/to/dataset/language_distribution_0.tsv"
    beta_corpus: 0.5
    beta_language: 0.5
    fragment_loading:
      cache: True
  asr_task_config:
     min_audio_len: 32_000
     max_audio_len: 960_000
     max_num_elements: 960_000
     batch_shuffle_window: 1
     normalize_audio: true
     example_shuffle_window: 1

tokenizer:
  name: "omniASR_tokenizer_v1"

optimizer:
  config:
    lr: 5e-05

trainer:
  freeze_encoder_for_n_steps: 0
  mixed_precision:
    dtype: "torch.bfloat16"
  grad_accumulation:
    num_batches: 4

regime:
  num_steps: 20_000
  validate_after_n_steps: 0
  validate_every_n_steps: 1000
  checkpoint_every_n_steps: 1000
  publish_metrics_every_n_steps: 200
Use Case: Building a new CTC model with custom data
For users with limited compute resources who want to fine-tune smaller CTC models on low-resource languages:
According to Section 5.7.5 of the research paper, fine-tuned smaller CTC models can be competitive with 7B LLM models on specific languages.

Optimal Configuration

Config: configs/ctc-finetune-recommendation.yaml
model:
  name: "omniASR_CTC_300M"  # or omniASR_CTC_1B, omniASR_CTC_3B

dataset:
  name: "example_dataset"
  train_split: "train"
  valid_split: "dev"
  storage_mode: "MIXTURE_PARQUET"
  task_mode: "ASR"
  mixture_parquet_storage_config:
    dataset_summary_path: "/path/to/your/dataset/language_distribution_0.tsv"
    beta_corpus: 0.5
    beta_language: 0.5
    fragment_loading:
      cache: True
  asr_task_config:
     max_audio_len: 960_000       # 60s at 16kHz
     max_num_elements: 7_680_000  # Fits 8x 60s samples, or more shorter ones
     batch_shuffle_window: 1
     normalize_audio: true
     example_shuffle_window: 0    # Full-batch shuffling

tokenizer:
  name: "omniASR_tokenizer_v1"

optimizer:
  config:
    lr: 1e-05  # Lower learning rate for fine-tuning

trainer:
  freeze_encoder_for_n_steps: 0
  mixed_precision:
    dtype: "torch.bfloat16"
  grad_accumulation:
    num_batches: 1  # Tune if getting OOM errors

regime:
  num_steps: 5_000  # Shorter training for fine-tuning
  validate_every_n_steps: 500
  validate_after_n_steps: 500
  checkpoint_every_n_steps: 500
  checkpoint_after_n_steps: 500
  publish_metrics_every_n_steps: 500
  publish_metrics_after_n_steps: 500
Audio Length Settings:
  • max_audio_len: 960_000 - Maximum 60 seconds at 16kHz
  • max_num_elements: 7_680_000 - Allows 8x 60s samples or more shorter samples
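The max_num_elements budget can be thought of as a length-based batch packer: utterances are grouped until their summed audio length would exceed the budget. A minimal illustrative sketch (not the actual fairseq2 batching code):

```python
def pack_batches(lengths, max_num_elements):
    """Greedily group utterances so the summed audio length per
    batch stays within the element budget (illustrative only)."""
    batches, current, total = [], [], 0
    for n in lengths:
        if current and total + n > max_num_elements:
            batches.append(current)
            current, total = [], 0
        current.append(n)
        total += n
    if current:
        batches.append(current)
    return batches

# A 7_680_000-element budget fits eight 60s (960_000-sample) clips per batch:
print(len(pack_batches([960_000] * 16, 7_680_000)))  # 2 batches of 8
```

The same budget holds 24 clips of 20 seconds each, which is why shorter utterances yield larger batches.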
Optimization:
  • lr: 1e-05 - Lower learning rate prevents catastrophic forgetting
  • grad_accumulation: 1 - Increase if running out of memory
Training Regime:
  • num_steps: 5_000 - Sufficient for fine-tuning on specific languages
  • More steps may be needed for larger datasets
Reference configurations from paper:
  • 300M model: 32 GPUs
  • 1B model: 64 GPUs
  • 3B model: 96 GPUs
For smaller setups:
  • Increase grad_accumulation.num_batches to compensate for fewer GPUs
  • Reduce max_num_elements to fit in available memory
  • Use gradient checkpointing for memory savings
These hyperparameters served as a good starting point across languages, but optimal settings vary per language. Experiment with learning rate and number of steps.

Configuration Parameters

Dataset Configuration

dataset:
  name: "example_dataset"
  train_split: "train"
  valid_split: "dev"
  storage_mode: "MIXTURE_PARQUET"
  task_mode: "ASR"
  mixture_parquet_storage_config:
    dataset_summary_path: "/path/to/language_distribution_0.tsv"
    beta_corpus: 0.5      # Temperature for corpus sampling
    beta_language: 0.5    # Temperature for language sampling
    fragment_loading:
      cache: True
  asr_task_config:
     min_audio_len: 32_000       # Min 2s at 16kHz
     max_audio_len: 960_000      # Max 60s at 16kHz
     max_num_elements: 960_000   # Batch size limit
     batch_shuffle_window: 1
     normalize_audio: true
     example_shuffle_window: 1
Key Parameters:
Temperature Sampling:
  • beta_corpus: Controls corpus mixture distribution
    • 0.5: Balanced sampling
    • Higher: More uniform across corpora
    • Lower: Biased toward larger corpora
  • beta_language: Controls language mixture distribution
    • 0.5: Balanced across languages
    • Higher: More uniform sampling
    • Lower: Biased toward high-resource languages
  • min_audio_len: Minimum audio length (in samples at 16kHz)
    • Example: 32_000 = 2 seconds
  • max_audio_len: Maximum audio length
    • Example: 960_000 = 60 seconds
  • max_num_elements: Maximum total audio samples (at 16kHz) per batch
    • Controls memory usage
    • A batch can hold many short utterances or fewer long ones
  • batch_shuffle_window: Number of batches to shuffle
    • 1: Minimal shuffling
    • Higher: More randomization
  • example_shuffle_window: Number of examples to shuffle
    • 0: Full-batch shuffling
    • 1: Minimal shuffling
    • Higher: Window-based shuffling
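The beta parameters behave like temperatures on the empirical size distribution. A minimal sketch of one common convention, where the weight exponent is 1 - beta (the exact formula used by MixtureParquetStorage may differ; check the source):

```python
def sampling_probs(sizes, beta):
    """Temperature-style sampling weights: w_i = n_i ** (1 - beta).
    beta=0 reproduces size-proportional sampling; beta=1 is uniform;
    beta=0.5 is the balanced middle ground.
    (Illustrative convention only; the omnilingual_asr source is authoritative.)"""
    weights = [n ** (1.0 - beta) for n in sizes]
    total = sum(weights)
    return [w / total for w in weights]

# Two corpora, one 100x larger than the other:
print(sampling_probs([100_000, 1_000], beta=0.0))  # size-proportional
print(sampling_probs([100_000, 1_000], beta=0.5))  # balanced
print(sampling_probs([100_000, 1_000], beta=1.0))  # uniform
```

Raising beta flattens the distribution, giving low-resource languages and small corpora more weight per epoch.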

Trainer Configuration

trainer:
  freeze_encoder_for_n_steps: 0
  mixed_precision:
    dtype: "torch.bfloat16"
  grad_accumulation:
    num_batches: 4
Parameters:
  • freeze_encoder_for_n_steps: Freeze encoder during initial training steps
    • Useful when training decoder from scratch
    • 0: No freezing
  • mixed_precision.dtype: Use mixed precision training
    • torch.bfloat16: Better numerical stability than fp16
    • Reduces memory and increases speed
  • grad_accumulation.num_batches: Accumulate gradients over N batches
    • Effectively increases batch size without more memory
    • Increase if running out of GPU memory
  • data_parallelism (LLM only): Distributed training strategy
    • fsdp: Fully Sharded Data Parallel for large models
    • Required for 1B+ parameter models
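Gradient accumulation is numerically equivalent to a larger batch when each micro-batch gradient is averaged over num_batches. A framework-free sketch of why (the real averaging happens inside the fairseq2 trainer):

```python
def mean_grad(samples):
    """Toy 'gradient': the mean of the samples (stands in for a backward pass)."""
    return sum(samples) / len(samples)

def accumulated_grad(samples, num_batches):
    """Average per-micro-batch gradients over num_batches micro-batches,
    mimicking trainer.grad_accumulation.num_batches."""
    size = len(samples) // num_batches
    micro = [samples[i * size:(i + 1) * size] for i in range(num_batches)]
    return sum(mean_grad(m) for m in micro) / num_batches

data = [0.1, 0.4, 0.2, 0.3, 0.9, 0.5, 0.7, 0.1]
print(mean_grad(data))            # full-batch gradient
print(accumulated_grad(data, 4))  # same value from 4 micro-batches
```

This is why raising num_batches is the standard fix for OOM: memory scales with the micro-batch, while the effective batch size stays large.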

Optimizer Configuration

optimizer:
  config:
    lr: 5e-05  # Learning rate
    # Additional Adam parameters available
Start with 5e-05 for training from scratch and 1e-05 for fine-tuning.
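If you need to tune more than the learning rate, the usual AdamW hyperparameters can be set under the same config block. The field names below are an assumption about the fairseq2 AdamW schema; check default_config.py for the exact fields:

```yaml
optimizer:
  config:
    lr: 5e-05
    betas: [0.9, 0.98]    # assumed field name; verify against default_config.py
    weight_decay: 0.01    # assumed field name; verify against default_config.py
```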

Regime Configuration

regime:
  num_steps: 20_000                     # Total training steps
  validate_after_n_steps: 0             # When to start validation
  validate_every_n_steps: 1000          # Validation frequency
  checkpoint_every_n_steps: 1000        # Checkpoint frequency
  checkpoint_after_n_steps: 0           # When to start checkpointing
  publish_metrics_every_n_steps: 200    # Metrics logging frequency
  publish_metrics_after_n_steps: 0      # When to start logging
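Each every_n_steps / after_n_steps pair gates a periodic action. Assuming the natural semantics (fire on multiples of every_n_steps once after_n_steps is reached), the gating reduces to:

```python
def should_fire(step, every_n_steps, after_n_steps=0):
    """True when a periodic action (validation, checkpointing, metrics
    publishing) fires at this step, assuming 'every' multiples gated
    by 'after'. Illustrative; fairseq2's internal logic may differ."""
    return step >= after_n_steps and step % every_n_steps == 0

# With validate_every_n_steps=1000, validate_after_n_steps=0:
print([s for s in range(1, 3001) if should_fire(s, 1000)])  # [1000, 2000, 3000]
```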

Dataset Backends

The recipe system supports multiple storage and task backends:

Storage Backends

Implementation: MixtureParquetStorage
Optimized for large-scale multilingual training with weighted sampling:
from omnilingual_asr.datasets.storage.mixture_parquet_storage import MixtureParquetStorage

storage = MixtureParquetStorage(
    dataset_path="/path/to/parquet/dataset",
    dataset_summary_path="/path/to/language_distribution_0.tsv",
    beta_corpus=0.5,
    beta_language=0.5
)
Features:
  • Temperature-based sampling across corpora and languages
  • Efficient streaming from partitioned parquet files
  • Built-in statistics tracking
See: mixture_parquet_storage.py

Task Backends

Implementation: AsrTask
Returns a Seq2SeqBatch with audio and text:
# Output format
batch = Seq2SeqBatch(
    source_seqs=audio_features,      # [batch, time, features]
    source_seq_lens=audio_lengths,   # [batch]
    target_seqs=text_tokens,         # [batch, seq_len]
    target_seq_lens=text_lengths     # [batch]
)
See: asr_task.py

Recipe Structure

workflows/recipes/wav2vec2/asr/
├── eval/
│   ├── configs/              # Evaluation recipe configs
│   ├── default_config.py
│   └── recipe.py             # Evaluation logic
├── configs/                  # Training recipe configs
│   ├── ctc-from-encoder.yaml
│   ├── ctc-finetune.yaml
│   ├── ctc-finetune-recommendation.yaml
│   ├── llm-from-encoder.yaml
│   └── llm-finetune.yaml
├── criterion.py              # Loss computation
├── dataset_selector.py       # Backend switching logic
├── default_config.py         # Default parameters
├── recipe.py                 # Main training logic
└── wer_calculator.py         # WER metric computation

Running Evaluation

Test model performance using the evaluation recipe:
python -m workflows.recipes.wav2vec2.asr.eval $OUTPUT_DIR \
  --config-file workflows/recipes/wav2vec2/asr/eval/configs/eval.yaml
The evaluation recipe reuses dataset configurations from training recipes and generates:
  • Hypothesis transcriptions
  • WER/CER metrics
  • Per-language performance statistics
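At its core, the WER produced by wer_calculator.py is word-level edit distance divided by reference length. A self-contained sketch (the recipe's implementation may normalize or tokenize text differently):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance between word sequences,
    divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: dist[i][j] = edits to turn ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    return dist[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat"))  # 0.0
print(wer("the cat sat", "the bat sat"))  # one substitution -> 1/3
```

CER follows the same recipe at the character level, which is why it is the more informative metric for languages without whitespace word boundaries.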

Custom Dataset Integration

To use your own dataset with the training recipes:
1. Prepare Dataset

Convert your data to parquet format following the Data Preparation Guide:
python -m workflows.dataprep.hf_dataset_ingestion_example run_full /path/to/output
2. Create Asset Card

Define a dataset asset card at src/omnilingual_asr/cards/datasets/my_dataset.yaml:
name: my_dataset
dataset_family: mixture_parquet_asr_dataset
dataset_config:
  data: /path/to/the/dataset
tokenizer_ref: omniASR_tokenizer_v1
3. Reference in Config

Update your recipe YAML to reference the new dataset:
dataset:
  name: "my_dataset"  # Matches asset card name
  train_split: "train"
  valid_split: "dev"
  # ... rest of config
4. Run Training

python -m workflows.recipes.wav2vec2.asr $OUTPUT_DIR \
  --config-file your_custom_config.yaml

Monitoring Training

Training metrics are logged at intervals specified in the regime configuration:
# View metrics in TensorBoard
tensorboard --logdir=$OUTPUT_DIR/tensorboard

Best Practices

Training from Scratch:
  1. Test with the 300M model first
  2. Verify the data pipeline works correctly
  3. Scale to larger models once validated
  4. Use gradient accumulation to simulate larger batch sizes
Fine-Tuning:
  1. Use a lower learning rate (1e-05 vs 5e-05)
  2. Train for fewer steps (5K vs 20K)
  3. Monitor validation loss for overfitting
  4. Consider freezing the encoder initially
Memory Management:
  1. Start with a small max_num_elements
  2. Increase grad_accumulation if OOM
  3. Use mixed precision (bfloat16)
  4. Enable gradient checkpointing for large models
Distributed Training:
  1. Use FSDP for models > 1B parameters
  2. Adjust GPU count based on model size
  3. Scale learning rate with batch size
  4. Monitor communication overhead

Troubleshooting

Out-of-Memory Errors
Solutions:
  • Reduce max_num_elements
  • Increase grad_accumulation.num_batches
  • Reduce max_audio_len
  • Enable gradient checkpointing
  • Use a smaller model variant
Training Instability
Solutions:
  • Lower the learning rate
  • Increase warmup steps
  • Check data quality and normalization
  • Reduce batch size
  • Verify gradient clipping is enabled
Slow Training
Solutions:
  • Increase batch_size or max_num_elements
  • Reduce validation frequency
  • Enable data caching (fragment_loading.cache: True)
  • Use faster storage (local SSD vs network)
  • Profile data loading bottlenecks

Next Steps

Data Preparation

Learn how to prepare datasets for training

Inference Guide

Use your trained models for transcription

Model Architectures

Understand the model internals

GitHub Examples

Explore more training examples
