Overview
Efficient batch processing is critical for training and inference with ASR models. Omnilingual ASR provides two batching strategies and various optimization techniques to maximize throughput while managing memory constraints.

Batching Strategies
The framework supports two batching strategies, defined in /src/omnilingual_asr/datasets/utils/batching.py:13:
Static Batching
Fixed number of sequences per batch, regardless of sequence length.

Characteristics:
- Each batch contains exactly batch_size examples
- Simpler to reason about for debugging
- Can lead to memory spikes with long sequences
- Less efficient GPU utilization

Use cases:
- Small datasets with uniform audio lengths
- Debugging and development
- Inference with fixed batch sizes
Length-Based Batching (Recommended)
Dynamic batching where each batch has a maximum number of elements (audio samples).

Characteristics:
- Batches contain variable numbers of sequences
- Total elements per batch ≤ max_num_elements
- Number of sequences is a multiple of num_seqs_multiple_of
- More efficient memory usage
- Better GPU utilization

Use cases:
- Training on diverse datasets
- Production workloads
- Memory-constrained environments
Configuration Parameters
max_num_elements

Maximum total audio samples across all sequences in a batch.

Type: Integer
Default: 3,200,000
Applies to: LENGTH batching

Example calculation:
- Audio 1: 400,000 samples
- Audio 2: 300,000 samples
- Audio 3: 260,000 samples
- Total: 960,000 ≤ max_num_elements ✓
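The packing behavior behind this calculation can be sketched as a greedy loop: sequences are added to the current batch until adding another would exceed max_num_elements. This is an illustrative sketch, not the framework's actual implementation; the function name is hypothetical.

```python
def pack_by_length(lengths, max_num_elements):
    """Greedily group sequence lengths into batches whose total sample
    count stays at or below max_num_elements (illustrative sketch)."""
    batches, current, total = [], [], 0
    for n in lengths:
        # Start a new batch if this sequence would overflow the current one.
        if current and total + n > max_num_elements:
            batches.append(current)
            current, total = [], 0
        current.append(n)
        total += n
    if current:
        batches.append(current)
    return batches

# The three audios from the example above fit into a single batch:
print(pack_by_length([400_000, 300_000, 260_000], max_num_elements=3_200_000))
# -> [[400000, 300000, 260000]]
```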
If max_num_elements % max_audio_len != 0, max_num_elements will be rounded down automatically (see /src/omnilingual_asr/datasets/utils/batching.py:40-42).
num_seqs_multiple_of
Forces batch size to be a multiple of this value for hardware optimization.

Type: Integer
Default: 8
Applies to: LENGTH batching

Why it matters:
- Modern GPUs perform best with batch sizes that are multiples of 8 or 16
- Enables efficient tensor core utilization on NVIDIA GPUs
- Aligns with FSDP (Fully Sharded Data Parallel) requirements
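The alignment described above amounts to rounding a raw batch size down to the nearest multiple. A minimal sketch (illustrative, not the framework's code):

```python
def round_to_multiple(batch_size: int, multiple: int = 8) -> int:
    """Round batch_size down to the nearest multiple, but never below
    the multiple itself (so tiny buckets still form a valid batch)."""
    return max(multiple, (batch_size // multiple) * multiple)

print(round_to_multiple(29))  # -> 24
print(round_to_multiple(5))   # -> 8 (clamped up to the minimum multiple)
```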
max_bucket_size

Limits the maximum number of sequences in any single bucket.

Type: Integer or None
Default: None
Applies to: LENGTH batching

Behavior:
- Filters out buckets with more than max_bucket_size sequences
- Prioritizes shorter sequences (fairseq2 buckets shortest sequences first)
- Useful for preventing very small batches with long sequences

Implementation: see /src/omnilingual_asr/datasets/utils/batching.py:50-56
drop_remainder
Whether to drop the last incomplete batch.

Type: Boolean
Default: False
Applies to: both STATIC and LENGTH

Use cases:
- Set to True for distributed training, to ensure all workers have equal batches
- Set to False for inference, to process all data
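The effect of drop_remainder can be sketched with a simple fixed-size batcher (illustrative helper, not the framework's implementation):

```python
from itertools import islice

def batched(items, batch_size, drop_remainder=False):
    """Yield fixed-size batches; optionally drop the final short batch."""
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        if drop_remainder and len(batch) < batch_size:
            return  # last batch is incomplete: discard it
        yield batch

print(list(batched(range(10), 4)))
# -> [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(list(batched(range(10), 4, drop_remainder=True)))
# -> [[0, 1, 2, 3], [4, 5, 6, 7]]
```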
Memory Optimization
Audio Length Filtering
Filter audio by length before batching to optimize memory usage:

- Removes very short clips that don’t benefit training
- Prevents OOM errors from extremely long sequences
- Improves bucketing efficiency
Implementation: see /src/omnilingual_asr/datasets/tasks/asr_task.py:163-168
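A hedged sketch of such a filter (field and parameter names are illustrative, not the framework's actual schema):

```python
def filter_by_audio_len(examples, min_audio_len, max_audio_len):
    """Keep only examples whose audio length (in samples) falls within
    [min_audio_len, max_audio_len]. Illustrative helper."""
    return [
        ex for ex in examples
        if min_audio_len <= ex["audio_len"] <= max_audio_len
    ]

examples = [
    {"audio_len": 1_000},      # too short: dropped
    {"audio_len": 320_000},    # kept
    {"audio_len": 5_000_000},  # too long: dropped (OOM risk)
]
print(filter_by_audio_len(examples, min_audio_len=16_000, max_audio_len=480_000))
# -> [{'audio_len': 320000}]
```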
Bucket Size Calculation
The framework uses fairseq2.data.data_pipeline.create_bucket_sizes to automatically calculate optimal bucket sizes, subject to:

- bucket_size * seq_len ≤ max_num_elements
- bucket_size % num_seqs_multiple_of == 0
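These two constraints can be satisfied directly by integer division (a sketch of the arithmetic, not fairseq2's actual implementation):

```python
def bucket_size_for(seq_len, max_num_elements, num_seqs_multiple_of=8):
    """Largest bucket size such that bucket_size * seq_len <= max_num_elements
    and bucket_size % num_seqs_multiple_of == 0."""
    raw = max_num_elements // seq_len
    return (raw // num_seqs_multiple_of) * num_seqs_multiple_of

# With the default max_num_elements of 3,200,000: longer sequences
# get proportionally smaller buckets.
for seq_len in (100_000, 200_000, 400_000):
    print(seq_len, bucket_size_for(seq_len, max_num_elements=3_200_000))
# -> 100000 32
#    200000 16
#    400000 8
```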
Example Configuration
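A hedged sketch of how the parameters documented above might be combined; the key names mirror the documented parameter names, but the actual configuration schema may differ.

```python
# Hypothetical configuration illustrating the documented defaults.
batching_config = {
    "batching_strategy": "LENGTH",   # recommended over "STATIC" for training
    "max_num_elements": 3_200_000,   # total audio samples per batch
    "num_seqs_multiple_of": 8,       # align batch sizes for tensor cores
    "max_bucket_size": None,         # no cap on sequences per bucket
    "drop_remainder": False,         # keep the final short batch
}

for key, value in batching_config.items():
    print(f"{key} = {value}")
```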
Parallel Processing
Multi-Partition Loading
For training with mixture parquet datasets, partitions can be loaded in parallel:

- Each language-corpus partition is loaded by a separate thread
- Partitions are sampled according to their weights
- Examples are prefetched in background
Implementation: see /src/omnilingual_asr/datasets/storage/mixture_parquet_storage.py:486-545
Fragment Prefetching
Prefetch parquet fragments in the background while earlier fragments are being processed.

Batch Prefetching
Prefetch processed batches while the model trains:

- Higher values improve throughput but increase memory usage
- Recommended: 2-4 for training, 1 for inference
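Background prefetching can be sketched with a producer thread feeding a bounded queue (an illustrative pattern, not the framework's implementation; the function name is hypothetical):

```python
import queue
import threading

def prefetch(iterator, num_prefetch=2):
    """Consume `iterator` in a background thread, keeping up to
    num_prefetch items ready in a bounded queue."""
    q = queue.Queue(maxsize=num_prefetch)
    _END = object()  # sentinel marking the end of the stream

    def producer():
        for item in iterator:
            q.put(item)  # blocks once num_prefetch items are buffered
        q.put(_END)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is _END:
            return
        yield item

print(list(prefetch(iter(range(5)), num_prefetch=2)))  # -> [0, 1, 2, 3, 4]
```

The bounded queue is what keeps memory in check: the producer stalls once num_prefetch batches are waiting, so raising the value trades memory for throughput exactly as described above.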
Shuffling Strategies
Example-Level Shuffling
Shuffle individual examples before batching:

- example_shuffle_window: 1 disables shuffling
- example_shuffle_window: 0 loads the entire dataset and shuffles it (not recommended: OOM risk)
- example_shuffle_window: N shuffles within a sliding window of N examples
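The sliding-window behavior can be sketched as follows (an illustrative implementation of windowed shuffling, not the framework's code):

```python
import random

def window_shuffle(stream, window, seed=0):
    """Shuffle a stream within a sliding window: keep a buffer of up to
    `window` items and emit a random one as each new item arrives."""
    rng = random.Random(seed)
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) >= window:
            yield buf.pop(rng.randrange(len(buf)))
    # Drain whatever is left in the buffer at the end of the stream.
    rng.shuffle(buf)
    yield from buf

out = list(window_shuffle(range(10), window=4))
print(sorted(out) == list(range(10)))  # every example appears exactly once
```

Note that window=1 reduces to the identity (no shuffling), matching the semantics listed above, while a full-dataset window would require buffering everything in memory.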
Batch-Level Shuffling
Shuffle batches after bucketing:

- Increases diversity of sequence lengths within an epoch
- Prevents model from learning length-based patterns
- Improves gradient stability
Implementation: see /src/omnilingual_asr/datasets/tasks/asr_task.py:336-346
Data Pipeline Architecture
The complete pipeline for LENGTH batching chains the steps described above: loading, audio-length filtering, example shuffling, length-based bucketing, batch shuffling, and prefetching.

Performance Tips
Optimize for Your Hardware

Adjust max_num_elements based on available GPU memory.
Use Length Batching for Training
Always use BatchingStrategy.LENGTH for training: it provides better memory efficiency and GPU utilization.
Set num_seqs_multiple_of Correctly
- 8: Good default for most GPUs
- 16: Better for A100 with tensor cores
- Must be ≤ your minimum expected batch size
Monitor Batch Statistics
During training, log batch sizes to ensure efficient bucketing.
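A hedged sketch of what such logging could look like (helper name, fields, and the fill metric are illustrative, not part of the framework):

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("batch_stats")

def batch_stats(step, seq_lens):
    """Compute and log per-batch statistics. A batch of similar lengths
    has fill close to 1.0; a low fill means padding is wasting compute."""
    stats = {
        "num_seqs": len(seq_lens),
        "num_elements": sum(seq_lens),
        "fill": sum(seq_lens) / (len(seq_lens) * max(seq_lens)),
    }
    log.info("step=%d %s", step, stats)
    return stats

batch_stats(1, [400_000, 300_000, 260_000])  # fill = 0.80
```

Consistently low fill values suggest the bucketing is mixing very different lengths, which is a cue to revisit max_num_elements or the audio-length filters.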
Adjust Prefetch Conservatively
Start with low prefetch values and increase gradually.
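A conservative starting point, following the recommendations above (key names are illustrative; the actual configuration schema may differ):

```python
# Hypothetical prefetch settings: start low, then raise values only if
# GPU utilization shows the input pipeline is the bottleneck.
prefetch_settings = {
    "num_prefetch_batches": 2,   # 2-4 for training, 1 for inference
    "fragment_prefetch": 1,      # parquet fragments read ahead
}

for key, value in prefetch_settings.items():
    print(f"{key} = {value}")
```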