Overview
Omnilingual ASR uses a modular data pipeline architecture that separates storage (how data is read) from task (how data is processed). This makes it possible to mix and match storage backends with preprocessing pipelines.

MixtureParquetStorage

Parquet-based storage implementation with partition weighting and multilingual sampling.

Constructor
Path to parquet dataset directory with language/corpus partitions.
Storage configuration including fragment streaming, loading, and weighting parameters.
Configuration: MixtureParquetStorageConfig
Controls how parquet fragments (row groups) are streamed:
- seed: Random seed for shuffling
- fragment_shuffle_window: Window size for fragment shuffling (-1 for global)
- nb_epochs: Number of epochs (None for infinite)
Controls how fragments are loaded into memory:
- columns: Schema mapping (use LangASRSchema)
- nb_prefetch: Number of fragments to prefetch
- num_parallel_fragments: Parallel loading threads
- cache: Enable caching of decoded audio
Path to TSV file with corpus/language hour distribution for weighted sampling.
Beta parameter for corpus weighting: weight = (hours/total)^beta.
Beta parameter for language weighting within a corpus.
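The effect of beta can be sketched in plain Python (the function below is illustrative, not part of the library): beta=1 keeps sampling proportional to hours, beta=0 flattens it to uniform, and intermediate values upsample low-resource partitions.

```python
def partition_weights(hours: dict[str, float], beta: float) -> dict[str, float]:
    """Compute normalized sampling weights: w_i proportional to (hours_i / total) ** beta."""
    total = sum(hours.values())
    raw = {name: (h / total) ** beta for name, h in hours.items()}
    norm = sum(raw.values())
    return {name: w / norm for name, w in raw.items()}

hours = {"eng": 900.0, "swh": 90.0, "yor": 10.0}

print(partition_weights(hours, beta=1.0))  # proportional: eng dominates
print(partition_weights(hours, beta=0.5))  # low-resource languages upsampled
print(partition_weights(hours, beta=0.0))  # uniform over partitions
```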
Synchronization mode for distributed training:
- UNTIL_FIRST: Stop when the first worker finishes (training)
- UNTIL_LAST: Stop when the last worker finishes (validation)
Whether to synchronize batch sizes across workers.
Methods
create_raw_data_pipeline
Split name (e.g., “train”, “dev”, “test”). Can include corpus filter: “train_librispeech”.
Gang configuration for distributed reading.
Pipeline builder yielding dictionaries with audio bytes, text, language, and corpus.
Example
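A minimal sketch of constructing the storage layer. The field and argument names below are inferred from the configuration options documented above and may not match the exact signatures; treat this as illustrative rather than copy-paste ready.

```python
from omnilingual_asr.datasets.storage.mixture_parquet_storage import (
    MixtureParquetStorage,
    MixtureParquetStorageConfig,
)

# Hypothetical field names, mirroring the options documented above.
config = MixtureParquetStorageConfig(
    seed=42,
    fragment_shuffle_window=-1,   # global fragment shuffling
    nb_epochs=None,               # stream indefinitely
    nb_prefetch=2,
    num_parallel_fragments=4,
)

storage = MixtureParquetStorage("/path/to/parquet_dataset", config)

# `gang` is assumed to be a gang configuration created elsewhere.
builder = storage.create_raw_data_pipeline(split="train", gang=gang)
```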
AsrTask
ASR preprocessing pipeline including audio filtering, tokenization, and batching.

Constructor
Task configuration for preprocessing pipeline.
Configuration: AsrTaskConfig
Audio Processing
Minimum audio sequence length (in samples). Shorter audio is filtered out.
Maximum audio sequence length (~50s at 16kHz). Longer audio is filtered out.
Whether to normalize audio to zero mean and unit variance.
Whether to use filterbank features instead of raw waveforms.
SpecAugment
Probability of applying SpecAugment per sample.
Maximum frequency mask length for SpecAugment.
Maximum time mask length for SpecAugment.
Text Processing
Maximum text length in tokens. Longer sequences are filtered out.
Whether to remove unknown tokens from text in-place.
Minimum audio samples per character. Samples with fewer audio samples per character (i.e., implausibly fast speech for the transcript length) are filtered out.
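The samples-per-character check can be illustrated in plain Python (the threshold value and function name here are illustrative, not the library's defaults):

```python
MIN_SAMPLES_PER_CHAR = 320  # illustrative: 20 ms per character at 16 kHz

def passes_speech_rate_filter(num_samples: int, text: str) -> bool:
    """Reject clips whose audio is implausibly short for the transcript length."""
    if not text:
        return False
    return num_samples / len(text) >= MIN_SAMPLES_PER_CHAR

# 2 s of 16 kHz audio (32_000 samples) with a 40-character transcript:
print(passes_speech_rate_filter(32_000, "x" * 40))   # 800 samples/char -> True
print(passes_speech_rate_filter(32_000, "x" * 200))  # 160 samples/char -> False
```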
Batching
Batching strategy:
- LENGTH: Dynamic batching by total elements (recommended)
- STATIC: Fixed batch size
Batch size for the STATIC batching strategy.
Maximum total elements per batch for the LENGTH strategy.
Batch size must be a multiple of this value (for hardware optimization).
Whether to drop last incomplete batch.
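The LENGTH strategy can be sketched as greedy bucketing by total padded elements. This is a simplified stand-in for the library's bucketing logic (real bucketing also respects the batch-size multiple and shuffling windows):

```python
def bucket_by_length(lengths: list[int], max_elements: int) -> list[list[int]]:
    """Greedily group example indices so that (batch size) * (longest example)
    stays within max_elements, mimicking dynamic length-based batching."""
    batches: list[list[int]] = []
    current: list[int] = []
    longest = 0
    for idx, n in enumerate(lengths):
        new_longest = max(longest, n)
        if current and new_longest * (len(current) + 1) > max_elements:
            batches.append(current)     # flush the full batch
            current, longest = [], 0
            new_longest = n
        current.append(idx)
        longest = new_longest
    if current:
        batches.append(current)
    return batches

# Audio lengths in samples; a batch may hold many short clips or one long clip.
lengths = [1600, 1600, 3200, 8000, 1600]
print(bucket_by_length(lengths, max_elements=8000))  # -> [[0, 1], [2], [3], [4]]
```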
Pipeline Settings
Sliding window size for shuffling examples before batching.
Sliding window size for shuffling batches.
Number of batches to prefetch in background.
Number of parallel calls for data pipeline operations.
Methods
apply_processing_pipeline
Input pipeline builder (typically from storage layer).
Gang configuration.
Tokenizer for text encoding.
Data type for audio tensors.
Pipeline builder yielding Seq2SeqBatch objects.

Pipeline Stages
The ASR task pipeline processes data in the following order:

- Audio Filtering: Filter by length (min_audio_len, max_audio_len)
- Example Shuffling: Shuffle before batching (example_shuffle_window)
- Text Tokenization: Encode text with the tokenizer
- Text Filtering: Filter empty text, unknown sequences, and overlong text
- Batching: Bucket by audio length or use a static batch size
- Batch Shuffling: Shuffle batches (batch_shuffle_window)
- Audio Decoding: Decode audio bytes to waveforms
- Audio Processing: Normalize, convert to mono, optionally apply SpecAugment
- Feature Extraction: Extract fbank features (if use_fbank=True)
- Collation: Collate into padded batches
- Prefetching: Prefetch batches in the background
- Seq2SeqBatch Conversion: Convert to the final batch format
Example
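A minimal sketch of constructing the task layer and attaching it to a storage pipeline. The configuration field names and values below are inferred from the options documented above and may not match the exact signatures; treat this as illustrative.

```python
import torch

from omnilingual_asr.datasets.tasks.asr_task import AsrTask, AsrTaskConfig

# Hypothetical field names, mirroring the options documented above.
task_config = AsrTaskConfig(
    min_audio_len=1_600,        # 0.1 s at 16 kHz
    max_audio_len=800_000,      # ~50 s at 16 kHz
    spec_aug_p=0.5,
)

task = AsrTask(task_config)

# `builder` comes from the storage layer, e.g.
# MixtureParquetStorage.create_raw_data_pipeline(...); `gang` and
# `tokenizer` are assumed to be constructed elsewhere.
pipeline = task.apply_processing_pipeline(
    builder,
    gang=gang,
    tokenizer=tokenizer,
    dtype=torch.float32,
)
```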
Audio Preprocessing
Audio Decoding
Audio bytes are decoded using fairseq2's AudioDecoder:
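A sketch of the decoding step, assuming fairseq2's AudioDecoder API (the decoder takes a memory block of raw audio bytes and returns a dict with the waveform and sample rate; the file path here is a placeholder):

```python
import torch
from fairseq2.data.audio import AudioDecoder
from fairseq2.memory import MemoryBlock

decoder = AudioDecoder(dtype=torch.float32)

with open("utterance.flac", "rb") as f:
    decoded = decoder(MemoryBlock(f.read()))

waveform = decoded["waveform"]        # shape: (num_frames, num_channels)
sample_rate = decoded["sample_rate"]
```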
Normalization
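Zero-mean, unit-variance normalization of the waveform can be sketched with NumPy (a simplified stand-in for the library's implementation):

```python
import numpy as np

def normalize_waveform(waveform: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Scale the waveform to zero mean and unit variance."""
    return (waveform - waveform.mean()) / (waveform.std() + eps)

wav = np.array([0.1, 0.5, -0.3, 0.2], dtype=np.float32)
out = normalize_waveform(wav)
print(out.mean(), out.std())  # ~0.0 and ~1.0
```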
Mono Conversion
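Mono conversion averages the channels of a multi-channel waveform; a minimal NumPy sketch (not the library's exact code):

```python
import numpy as np

def to_mono(waveform: np.ndarray) -> np.ndarray:
    """Average channels: (num_frames, num_channels) -> (num_frames,)."""
    if waveform.ndim == 1:
        return waveform  # already mono
    return waveform.mean(axis=1)

stereo = np.array([[0.2, 0.4], [-0.6, 0.0]], dtype=np.float32)
print(to_mono(stereo))  # [ 0.3 -0.3]
```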
SpecAugment
Applied with probability spec_aug_p:
- Convert the waveform to a spectrogram
- Apply frequency masking (random mask of length up to spec_aug_freq_mask_param)
- Apply time masking (random mask of length up to spec_aug_time_mask_param)
- Convert back to a waveform
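The two masking steps can be sketched on a spectrogram with NumPy. This is a minimal single-mask version of steps 2 and 3 (the actual implementation may apply multiple masks and operates inside the waveform round-trip described above):

```python
import numpy as np

def spec_augment(spec: np.ndarray, freq_mask_param: int, time_mask_param: int,
                 rng: np.random.Generator) -> np.ndarray:
    """Zero out one random frequency band and one random time band."""
    spec = spec.copy()
    num_freq, num_time = spec.shape

    f = int(rng.integers(0, freq_mask_param + 1))   # mask length in [0, param]
    f0 = int(rng.integers(0, num_freq - f + 1))     # mask start
    spec[f0:f0 + f, :] = 0.0

    t = int(rng.integers(0, time_mask_param + 1))
    t0 = int(rng.integers(0, num_time - t + 1))
    spec[:, t0:t0 + t] = 0.0
    return spec

rng = np.random.default_rng(0)
spec = np.ones((80, 100), dtype=np.float32)  # (freq_bins, time_steps)
masked = spec_augment(spec, freq_mask_param=10, time_mask_param=20, rng=rng)
print(masked.shape)  # (80, 100)
```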
Filterbank Features
If use_fbank=True:
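A sketch of filterbank extraction, assuming fairseq2's WaveformToFbankConverter API (the parameter values here are illustrative, not necessarily this task's defaults; `waveform` is assumed to come from the decoding step above):

```python
import torch
from fairseq2.data.audio import WaveformToFbankConverter

convert_to_fbank = WaveformToFbankConverter(
    num_mel_bins=80,
    waveform_scale=2**15,
    standardize=True,
    dtype=torch.float32,
)

features = convert_to_fbank(
    {"waveform": waveform, "sample_rate": 16_000.0}
)["fbank"]  # shape: (num_frames, num_mel_bins)
```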
Schema Definitions
LangASRSchema
Column mapping for parquet datasets:

Complete Example: Training Pipeline
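A hypothetical end-to-end wiring of the two layers; exact constructor signatures and argument names may differ from the real API, and `gang` and `tokenizer` are assumed to be constructed elsewhere:

```python
import torch

from omnilingual_asr.datasets.storage.mixture_parquet_storage import (
    MixtureParquetStorage,
    MixtureParquetStorageConfig,
)
from omnilingual_asr.datasets.tasks.asr_task import AsrTask, AsrTaskConfig

# Storage layer: weighted multilingual sampling from parquet partitions.
storage = MixtureParquetStorage(
    "/path/to/parquet_dataset", MixtureParquetStorageConfig()
)

# Task layer: filtering, tokenization, SpecAugment, batching.
task = AsrTask(AsrTaskConfig())

builder = storage.create_raw_data_pipeline(split="train", gang=gang)
pipeline = task.apply_processing_pipeline(
    builder, gang=gang, tokenizer=tokenizer, dtype=torch.float32
)

for batch in pipeline.and_return():
    ...  # each batch is a Seq2SeqBatch with padded audio and target tokens
```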
Source References
- MixtureParquetStorage: src/omnilingual_asr/datasets/storage/mixture_parquet_storage.py:133
- AsrTask: src/omnilingual_asr/datasets/tasks/asr_task.py:140
- Audio utilities: src/omnilingual_asr/datasets/utils/audio.py