Overview
Omnilingual ASR uses the mixture_parquet_asr_dataset format for training and evaluation. This guide shows how to create custom datasets, define asset cards, and integrate them into training workflows.
Dataset Architecture
Mixture Parquet Format
The mixture parquet dataset organizes audio data with language and corpus partitioning:
Required Schema
Each parquet file must contain these columns (defined in /src/omnilingual_asr/datasets/storage/mixture_parquet_storage.py:42-51):
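As a quick sanity check on this requirement, the sketch below compares the column names of one parquet file against a required set. It assumes pyarrow is installed. Both helper names are hypothetical, and the required set is deliberately left to the caller: it should be taken from the definition referenced above rather than from this example.

```python
from typing import Iterable, List


def missing_columns(found: Iterable[str], required: Iterable[str]) -> List[str]:
    """Return the required column names that are absent, sorted for stable output."""
    return sorted(set(required) - set(found))


def parquet_column_names(path: str) -> List[str]:
    """Read only the schema (no row data) of a parquet file."""
    import pyarrow.parquet as pq  # imported lazily; assumes pyarrow is installed

    return pq.read_schema(path).names
```

A typical call would be missing_columns(parquet_column_names(some_file), required), where an empty result means every required column is present.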
Creating a Custom Dataset
Step 1: Prepare Your Data
Organize audio files and transcriptions:
Step 2: Convert to Parquet
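A minimal sketch of the conversion step, assuming pyarrow is installed. The partition order (corpus, then split, then language), the corpus= and language= directory keys, and both helper names are assumptions for illustration. The keys of each row dict must be the column names the required schema lists.

```python
from pathlib import Path
from typing import Dict, List


def partition_dir(root: str, corpus: str, split: str, language: str) -> Path:
    """Build a hive-style partition directory under version=0 (assumed layout)."""
    return (
        Path(root)
        / "version=0"
        / f"corpus={corpus}"
        / f"split={split}"
        / f"language={language}"
    )


def write_part(rows: List[Dict], out_dir: Path, part_index: int) -> Path:
    """Write one list of row dicts to part-<index>.parquet and return its path."""
    import pyarrow as pa  # imported lazily; assumes pyarrow is installed
    import pyarrow.parquet as pq

    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"part-{part_index}.parquet"
    # Column types are inferred from the row dicts; cast explicitly if the
    # loader expects specific types.
    pq.write_table(pa.Table.from_pylist(rows), str(out_path))
    return out_path
```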
Create parquet files with the required schema:
Step 3: Create Language Distribution File
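A language distribution file can be produced by totalling audio duration per (corpus, language) pair and writing the result with the standard csv module. The sketch below does that under stated assumptions: the header names (corpus, language, hours), the input record shape, and both function names are hypothetical, so match them to what the loader actually expects.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def hours_by_partition(
    records: Iterable[Tuple[str, str, float]]
) -> Dict[Tuple[str, str], float]:
    """Sum durations (seconds) per (corpus, language) and convert to hours."""
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for corpus, language, seconds in records:
        totals[(corpus, language)] += seconds
    return {key: secs / 3600.0 for key, secs in totals.items()}


def write_distribution_tsv(hours: Dict[Tuple[str, str], float], path: str) -> None:
    """Write one tab-separated row per (corpus, language) pair, sorted by key."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["corpus", "language", "hours"])  # hypothetical header
        for (corpus, language), h in sorted(hours.items()):
            writer.writerow([corpus, language, f"{h:.4f}"])
```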
For weighted sampling during training, create a TSV file with language/corpus statistics:
Defining the Dataset Asset Card
Create a YAML asset card for your dataset:
/src/omnilingual_asr/cards/datasets/my_dataset.yaml
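For illustration, the sketch below renders YAML text containing the four fields documented in this guide, using the example values given for each. Nesting data under dataset_config is inferred from the dotted field name dataset_config.data. The helper name is hypothetical, and any further keys a real card may need are not shown.

```python
def render_asset_card(name: str, data_dir: str, tokenizer_ref: str) -> str:
    """Return YAML text for a minimal mixture parquet dataset card."""
    return (
        f"name: {name}\n"
        "dataset_family: mixture_parquet_asr_dataset\n"
        "dataset_config:\n"
        f"  data: {data_dir}\n"
        f"tokenizer_ref: {tokenizer_ref}\n"
    )


card = render_asset_card(
    name="my_custom_dataset",
    data_dir="/data/datasets/my_dataset/version=0",
    tokenizer_ref="omniASR_tokenizer_written_v2",
)
print(card)
```

Saving the printed text to the card path shown above is left to the reader.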
Asset Card Fields
name
Unique identifier for loading the dataset.
Type: String
Example: my_custom_dataset
dataset_family
Dataset implementation type. Use mixture_parquet_asr_dataset for parquet-based datasets.
Type: String
Value: mixture_parquet_asr_dataset
dataset_config.data
Path to the dataset directory (should point to the version=0 directory).
Type: Path
Example: /data/datasets/my_dataset/version=0
tokenizer_ref
Reference to the tokenizer asset card.
Type: String
Example: omniASR_tokenizer_written_v2
Integrating with Training
Training Configuration
Reference your dataset in a training recipe:
configs/custom-training.yaml
Weighted Sampling Configuration
Control how different corpora and languages are sampled (see /src/omnilingual_asr/datasets/storage/mixture_parquet_storage.py:338-397):
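To make the idea concrete: a common way to rebalance a mixture is to raise each partition's amount of data to a power beta and renormalize, so that beta = 1 keeps the natural proportions and beta = 0 samples uniformly. The sketch below illustrates that general technique only. The parameter name beta and the function are assumptions, and the referenced source remains the authority on what the loader actually computes.

```python
from typing import Dict


def smoothed_weights(hours: Dict[str, float], beta: float) -> Dict[str, float]:
    """Return sampling probabilities proportional to hours ** beta."""
    if not hours:
        return {}
    scaled = {key: value ** beta for key, value in hours.items()}
    total = sum(scaled.values())
    return {key: value / total for key, value in scaled.items()}


hours = {"eng_Latn": 900.0, "swh_Latn": 100.0}
print(smoothed_weights(hours, beta=1.0))  # natural proportions
print(smoothed_weights(hours, beta=0.5))  # smaller language is upweighted
```

With beta = 0.5 the two languages above are sampled at roughly 75% and 25% instead of 90% and 10%.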
Advanced Dataset Features
Multiple Splits
Create train, dev, and test splits:
Multiple Corpora
Combine multiple corpora in one dataset:
Filtering by Corpus
Train on a specific corpus using split naming (see /src/omnilingual_asr/datasets/storage/mixture_parquet_storage.py:400-429):
Partition Filters
Filter specific languages during training:
Validation and Testing
Verify Dataset Structure
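A minimal sketch of a structure check, assuming the hive-style split=<name> directories mentioned under Troubleshooting. It only counts parquet files per split and does not open them. The function name is hypothetical.

```python
from collections import Counter
from pathlib import Path
from typing import Dict


def count_parquet_files_by_split(dataset_root: str) -> Dict[str, int]:
    """Count *.parquet files under each split=<name> directory below dataset_root."""
    root = Path(dataset_root)
    counts: Counter = Counter()
    for parquet_file in root.rglob("*.parquet"):
        # Look for a split=<name> component in the path relative to the root.
        for part in parquet_file.relative_to(root).parts:
            if part.startswith("split="):
                counts[part[len("split="):]] += 1
                break
    return dict(counts)
```

A split that is missing from the result has no parquet files at all, which is the condition behind the "No parquet files found for split" error.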
Test Loading
Best Practices
Audio Format Consistency
Ensure all audio is:
- 16 kHz sample rate
- Mono (single channel)
- 16-bit PCM (standard WAV format)
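These three properties can be checked with Python's standard wave module, as in the sketch below. It assumes the input is an uncompressed PCM WAV file (wave.open raises wave.Error for other encodings). The function name is hypothetical.

```python
import wave
from typing import List


def wav_format_problems(path: str) -> List[str]:
    """Return deviations from 16 kHz, mono, 16-bit PCM (empty list if none)."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != 16000:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected 16000")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels, expected 1")
        if wav.getsampwidth() != 2:  # sample width is in bytes; 2 bytes = 16 bits
            problems.append(f"{8 * wav.getsampwidth()}-bit samples, expected 16")
    return problems
```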
Partition Size
Keep partition sizes reasonable:
- Target: 1000-10000 examples per parquet file
- Max size: ~500MB per file
- Split large corpora into multiple part-N.parquet files
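The example-count target can be turned into a simple plan of file names and row ranges, sketched below. The function name and the default of 10,000 rows are illustrative choices within the range above. The sketch does not measure bytes, so the ~500MB limit still has to be checked separately.

```python
from typing import List, Tuple


def plan_parts(num_examples: int, max_per_file: int = 10_000) -> List[Tuple[str, int, int]]:
    """Split [0, num_examples) into consecutive ranges of at most max_per_file rows.

    Returns (file_name, start, stop) tuples, with stop exclusive.
    """
    if max_per_file <= 0:
        raise ValueError("max_per_file must be positive")
    return [
        (f"part-{index}.parquet", start, min(start + max_per_file, num_examples))
        for index, start in enumerate(range(0, num_examples, max_per_file))
    ]


print(plan_parts(25_000))
# [('part-0.parquet', 0, 10000), ('part-1.parquet', 10000, 20000), ('part-2.parquet', 20000, 25000)]
```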
Text Normalization
Normalize text before creating the dataset:
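A minimal sketch of one possible normalization pass, using only the standard library. The specific rules here (NFKC, lowercasing, whitespace collapsing) are assumptions for illustration. Lowercasing in particular may not suit every language or script, and the result should agree with what the tokenizer expects.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Apply NFKC normalization, lowercase, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()


print(normalize_text("  Hello\tWORLD \n"))  # hello world
```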
Language ID Validation
Verify language IDs are supported:
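One way to run this check is sketched below. The set of supported IDs is passed in by the caller and should come from the library itself. The ID shape assumed by the pattern (three lowercase letters, an underscore, and a four-letter script code, as in eng_Latn) and the function name are assumptions for illustration.

```python
import re
from typing import Iterable, List, Set

# Assumed shape: three lowercase letters, underscore, four-letter script code.
LANG_ID_PATTERN = re.compile(r"^[a-z]{3}_[A-Z][a-z]{3}$")


def invalid_language_ids(lang_ids: Iterable[str], supported: Set[str]) -> List[str]:
    """Return IDs that are malformed or missing from the supported set, sorted."""
    return sorted(
        lang_id
        for lang_id in set(lang_ids)
        if not LANG_ID_PATTERN.match(lang_id) or lang_id not in supported
    )
```

Running this over the distinct language values in a dataset before writing parquet files catches typos early; an empty list means every ID passed both checks.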
Troubleshooting
No parquet files found for split
Check that:
- Dataset path is correct in the asset card
- Split name matches directory structure: split=train
- Parquet files exist in the partition directories
Schema mismatch errors
Memory issues during creation
Process large datasets in chunks:
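A minimal sketch of chunked processing with a generator, so that only one batch of records is held in memory at a time. The function name is hypothetical, and the loop at the end only prints where each batch would go instead of writing parquet files.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield successive lists of up to batch_size items without materializing the input."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


for index, batch in enumerate(batched(range(7), 3)):
    print(f"part-{index}.parquet", batch)
```

Feeding batched() from a lazy source, such as a generator that reads one audio file at a time, keeps peak memory proportional to the batch size, not to the size of the corpus.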