Comprehensive guide to converting, formatting, and optimizing audio datasets for efficient multilingual ASR training.
Overview
The data preparation pipeline converts popular HuggingFace audio datasets into a standardized parquet format optimized for massively multilingual speech model training.
Parquet datasets support efficient streaming, weighted sampling, and partitioning by corpus, split, and language.
Installation
Install Core Dependencies
# Using pip
pip install omnilingual-asr[data]
# Or using uv
uv add "omnilingual-asr[data]"
This installs: pyarrow, polars, pandas
Install HuggingFace Dependencies
For data preparation with HuggingFace datasets:
ray: Distributed processing framework
datasets: HuggingFace datasets library
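If these are not already available in your environment, they can be installed directly; this is a generic example (check the project documentation for any pinned versions or dedicated extras):

# Using pip
pip install ray datasets

# Or using uv
uv add ray datasets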
Directory Structure
Datasets are organized by corpus, split, and language:
dataset_root_dir/version=0/
├── corpus=mls/
│ ├── split=train/
│ │ ├── language=deu_Latn/
│ │ │ └── part-*.parquet
│ │ └── language=fra_Latn/
│ │ └── part-*.parquet
│ └── split=dev/
│ └── ...
└── corpus=fleurs/
└── ...
Benefits:
Filtered Loading: Load only split=train for training (see the sketch below)
Weighted Sampling: Sample across corpus-language combinations
Language-Specific Eval: Evaluate on specific languages
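For example, partition-filtered loading works directly with PyArrow's dataset API. This is a minimal sketch with placeholder paths, not the project's own loader (MixtureParquetAsrDataset, shown later, is the intended loading path):

import pyarrow.dataset as ds

# Discover the hive-partitioned layout; corpus/split/language become columns.
dataset = ds.dataset(
    "/path/to/dataset_root_dir/version=0",
    format="parquet",
    partitioning="hive",
)

# Load only the German training partition.
table = dataset.to_table(
    filter=(ds.field("split") == "train") & (ds.field("language") == "deu_Latn")
)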
Schema
Each parquet file follows a minimal, optimized schema:
text: string
audio_bytes: list&lt;int8&gt;
audio_size: int64
corpus: dictionary&lt;values=string, indices=int32&gt;
split: dictionary&lt;values=string, indices=int32&gt;
language: dictionary&lt;values=string, indices=int32&gt;
text
Type: string. Contains the normalized text transcription of the audio sample. Normalization includes:
Lowercasing (configurable)
Punctuation removal
Number word removal
Language-specific processing

audio_bytes
Type: list&lt;int8&gt;. Contains compressed (FLAC/OGG) binary audio data as a list of bytes. Format:
All audio converted to 16kHz mono-channel
Uses pa.list_(pa.int8()) instead of pa.binary()
No additional copying when converting to pandas
Use binary_to_list_int8() from audio_tools.py for fast conversion.

audio_size
Type: int64. Size of the decoded audio waveform, in samples. Uses:
Filter samples that are too short or too long
Create length-matched batches
Compute duration: audio_size / 16_000 seconds
Temperature sampling statistics

corpus
Type: dictionary&lt;string&gt;. Corpus name where the data originates (e.g., "fleurs", "mls"). Uses:
Partition filtering
Weighted corpus sampling
Dataset mixture tracking

split
Type: dictionary&lt;string&gt;. Dataset split: "train", "dev", or "test". Uses:
Load training vs. validation data
Partition filtering during data loading

language
Type: dictionary&lt;string&gt;. Standardized language code (e.g., "deu_Latn" for German). Format: [ISO 639-3 language code]_[Script]. See: lang_ids.py
Parquet files are written with row_group_size=100 to reduce memory footprint during streaming and enable efficient shuffling.
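For reference, the schema above can be expressed in PyArrow roughly as follows. This is an illustrative sketch (the variable name is arbitrary), not the project's own schema definition:

import pyarrow as pa

# Target schema as described above; dictionary encoding keeps the
# corpus/split/language columns compact.
TARGET_SCHEMA = pa.schema([
    ("text", pa.string()),
    ("audio_bytes", pa.list_(pa.int8())),
    ("audio_size", pa.int64()),
    ("corpus", pa.dictionary(pa.int32(), pa.string())),
    ("split", pa.dictionary(pa.int32(), pa.string())),
    ("language", pa.dictionary(pa.int32(), pa.string())),
])

# A sample's duration in seconds can be recovered as audio_size / 16_000.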
Data Processing Pipeline
Supported Datasets
The example pipeline demonstrates preparation using:
FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech
MLS: A Large-Scale Multilingual Dataset for Speech Research
Processing Steps
The pipeline uses Ray for distributed processing with configurable shuffling:
Load from HuggingFace
from datasets import load_dataset

dataset = load_dataset(
    "google/fleurs",
    "en_us",
    split="train",
    streaming=True,
)
Text Processing
Operations:
Language-specific normalization
Punctuation removal
Lowercase conversion
Digit-only word removal
Language code remapping
from workflows.dataprep.text_tools import text_normalize

normalized = text_normalize(
    text,
    iso_code="en",
    lower_case=True,
    remove_numbers=True,
)
Audio Processing
Operations:
Binary conversion to byte lists
Validation and resampling to 16kHz
Audio size computation
Format standardization
from workflows.dataprep.audio_tools import AudioTableProcessor

processor = AudioTableProcessor()
processed_batch = processor(audio_batch)
Write Parquet
Configuration:
Partition by corpus/split/language
Row group size: 100
Shuffle window: 1k-10k samples
df.write_parquet(
    path,
    partition_by=["corpus", "split", "language"],
    row_group_size=100,
)
Quick Start Example
Use the provided ingestion script for automatic dataset generation:
Quick Test (5-10 minutes)

# Process 2 languages (en_us, fr_fr) from FLEURS
python workflows/dataprep/hf_dataset_ingestion_example.py \
    run_short /path/to/output/dir

The script also provides a full example run (~90 minutes) and an option to write versioned output directories.
Individual Dataset Processing

Process MLS Only
python workflows/dataprep/hf_dataset_ingestion_example.py \
    ingest_mls /path/to/output/dir

Process FLEURS Only
python workflows/dataprep/hf_dataset_ingestion_example.py \
    ingest_fleurs /path/to/output/dir

Compute Statistics
python workflows/dataprep/hf_dataset_ingestion_example.py \
    compute_stats \
    /path/to/parquet/dataset \
    /path/to/output/stats.tsv
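The compute_stats step produces a TSV summary that is later used for mixture weighting. As a hedged sketch of the kind of aggregation involved (column names and output format are illustrative, not the script's actual implementation):

import polars as pl

# Scan the hive-partitioned dataset; corpus/split/language come from the paths.
lf = pl.scan_parquet("/path/to/parquet/dataset/**/*.parquet", hive_partitioning=True)

stats = (
    lf.group_by(["corpus", "split", "language"])
      .agg(
          pl.len().alias("num_samples"),
          (pl.col("audio_size").sum() / 16_000 / 3600).alias("hours"),
      )
      .collect()
)
stats.write_csv("/path/to/output/stats.tsv", separator="\t")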
Processing Utilities
Text Processing
File: workflows/dataprep/text_tools.py
from workflows.dataprep.text_tools import text_normalize

normalized_text = text_normalize(
    text="Hello, World! 123",
    iso_code="en",           # 2-letter ISO code
    lower_case=True,         # Convert to lowercase
    remove_numbers=True,     # Remove digit-only words
    remove_brackets=False,   # Keep bracketed content
)
# Output: "hello world"
Parameters:
text: Input text to normalize
iso_code: 2-letter ISO language code (e.g., “en”, “de”, “fr”)
lower_case: Apply lowercasing (default: True)
remove_numbers: Remove words containing only digits (default: True)
remove_brackets: Remove bracketed content (default: False)
Features:
Language-specific punctuation handling
Unicode normalization
Whitespace normalization
Audio Processing
File: workflows/dataprep/audio_tools.py
AudioTableProcessor
Main class for processing audio data in PyArrow tables:

from workflows.dataprep.audio_tools import AudioTableProcessor

processor = AudioTableProcessor(
    target_sample_rate=16000,
    audio_format="flac",
)

# Process PyArrow table batch
processed_table = processor(audio_table)
Features:
Automatic resampling to target rate
Format conversion (WAV/FLAC/OGG)
Mono-channel conversion
Binary encoding to int8 lists
map_to_target_schema
Transform batches to the target parquet schema:

from workflows.dataprep.audio_tools import map_to_target_schema

schema_batch = map_to_target_schema(
    batch=processed_batch,
    split="train",
    corpus="fleurs",
)

Adds the corpus and split columns and ensures schema compliance.
binary_to_list_int8
Efficiently convert a PyArrow BinaryArray to a ListArray of int8:

from workflows.dataprep.audio_tools import binary_to_list_int8

# Convert audio bytes
list_array = binary_to_list_int8(binary_array)

Performance: Zero-copy conversion for fast processing.
bytes_to_tensor
Convert a numpy array of audio bytes to a waveform tensor:

from workflows.dataprep.audio_tools import bytes_to_tensor

waveform = bytes_to_tensor(
    audio_arr=audio_bytes,
    target_sample_rate=16_000,
)

Output: Tensor with shape [num_samples]
Example: Custom Dataset Preparation
From HuggingFace Dataset
import ray
from datasets import load_dataset
from workflows.dataprep.audio_tools import AudioTableProcessor, map_to_target_schema
from workflows.dataprep.text_tools import text_normalize

# Initialize Ray
ray.init()

# Load dataset
ds = load_dataset(
    "mozilla-foundation/common_voice_11_0",
    "en",
    split="train",
)

# Convert to Ray dataset
ray_ds = ray.data.from_huggingface(ds)

# Define processing pipeline
def process_batch(batch):
    # Text normalization
    batch["text"] = [
        text_normalize(text, iso_code="en")
        for text in batch["sentence"]
    ]

    # Audio processing
    audio_processor = AudioTableProcessor()
    batch = audio_processor(batch)

    # Map to target schema
    batch = map_to_target_schema(
        batch,
        split="train",
        corpus="common_voice",
    )

    # Add language column (one entry per row in the batch)
    batch["language"] = ["eng_Latn"] * len(batch["text"])
    return batch

# Process and write
ray_ds.map_batches(process_batch).write_parquet(
    "/path/to/output/version=0",
    partition_cols=["corpus", "split", "language"],
    row_group_size=100,
)
From Local Audio Files
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pathlib import Path
from workflows.dataprep.audio_tools import AudioTableProcessor
from workflows.dataprep.text_tools import text_normalize

# Prepare data
audio_files = list(Path("/path/to/audio").glob("*.wav"))
transcripts = [...]  # Load corresponding transcripts

# Create DataFrame
df = pd.DataFrame({
    "audio_path": audio_files,
    "transcript": transcripts,
})

# AudioTableProcessor can be used to resample and convert audio tables
# (e.g., to 16kHz FLAC) before writing.
processor = AudioTableProcessor()

def process_row(row):
    # Read raw audio bytes
    with open(row["audio_path"], "rb") as f:
        audio_bytes = list(f.read())

    # Normalize text
    text = text_normalize(row["transcript"], iso_code="en")

    return {
        "audio_bytes": audio_bytes,
        # Rough sample-count estimate for 16-bit PCM; for exact values,
        # decode the audio and count samples in the waveform.
        "audio_size": len(audio_bytes) // 2,
        "text": text,
        "corpus": "my_corpus",
        "split": "train",
        "language": "eng_Latn",
    }

processed_df = pd.DataFrame(df.apply(process_row, axis=1).tolist())

# Write parquet
# Note: for exact schema compliance (list<int8>, dictionary columns),
# cast the table to the target schema before writing.
table = pa.Table.from_pandas(processed_df)
pq.write_to_dataset(
    table,
    root_path="/path/to/output/version=0",
    partition_cols=["corpus", "split", "language"],
    row_group_size=100,
)
Dataset Loading for Training
Using MixtureParquetAsrDataset
After preparing the dataset, load it for training:
import torch

from omnilingual_asr.datasets.impl.mixture_parquet_asr_dataset import MixtureParquetAsrDataset
from fairseq2.models.tokenizers.hub import load_tokenizer

# Create dataset
dataset = MixtureParquetAsrDataset.from_path(
    path="/path/to/dataset/version=0",
    name="my_asr_dataset",
)

# Load tokenizer
tokenizer = load_tokenizer("omniASR_tokenizer_v1")

# Create reader (gangs, storage_config, and task_config come from your training recipe/setup)
reader = dataset.create_reader(
    split="train",
    tokenizer=tokenizer,
    gangs=gangs,
    dtype=torch.float32,
    storage_config=storage_config,
    task_config=task_config,
)

# Iterate through batches
for batches in reader:
    for batch in batches:
        # batch.source_seqs: audio features [batch, time, features]
        # batch.target_seqs: text tokens [batch, seq_len]
        # batch.source_seq_lens: audio sequence lengths
        # batch.target_seq_lens: text sequence lengths
        train_step(batch)
Dataloader Features
Weighted Sampling: Sample across languages/corpora with temperature control (see the sketch below)
Streaming: Efficient streaming with configurable buffering
Audio Processing: Automatic decoding, normalization, feature extraction
Dynamic Batching: Length-based batching for efficient GPU utilization
Text Tokenization: Automatic tokenization with filtering
SpecAugment: Built-in spectrogram augmentation
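To make the temperature-controlled weighting concrete, here is a hedged sketch of the standard p^beta re-weighting scheme; the dataloader's actual formula, and how the beta_corpus and beta_language parameters are combined, may differ in the implementation:

import numpy as np

def temperature_weights(num_samples_per_group, beta=0.5):
    """Turn per-group sample counts into sampling probabilities.

    beta=1.0 keeps the natural distribution; beta -> 0 approaches uniform
    sampling, which boosts low-resource groups.
    """
    counts = np.asarray(num_samples_per_group, dtype=np.float64)
    probs = counts / counts.sum()      # natural distribution
    weights = probs ** beta            # temperature re-weighting
    return weights / weights.sum()     # renormalize

# Example: three languages with very different amounts of data.
print(temperature_weights([1_000_000, 50_000, 5_000], beta=0.5))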
Verification
Verify dataset creation and data loading:
python -m workflows.dataprep.dataloader_example \
    --dataset_path="root_ds/all_asr/version=0" \
    --split="train" \
    --num_iterations=10
This will:
Load the tokenizer
Create the dataset (scan files, organize by corpus/language)
Create a data reader with task/storage configs
Iterate through batches and show statistics
Programmatic Verification
import pandas as pd
import pyarrow.parquet as pq

# Load dataset
ds = pq.ParquetDataset("/path/to/dataset/version=0")

# Check files; the paths encode the corpus/split/language partitions
print("Files:", len(ds.files))
print("Example file:", ds.files[0])

# Read sample
table = ds.read()
print("Schema:", table.schema)
print("Num rows:", table.num_rows)
print("Sample:", table.to_pandas().head())

# Verify statistics
stats = pd.read_csv("/path/to/language_distribution_0.tsv", sep="\t")
print("\nLanguage distribution:")
print(stats)
Integration with Training
To use your prepared dataset in training recipes:
Create Asset Card
Define the dataset at src/omnilingual_asr/cards/datasets/my_dataset.yaml:

name: my_dataset
dataset_family: mixture_parquet_asr_dataset
dataset_config:
  data: /path/to/the/dataset/version=0
tokenizer_ref: omniASR_tokenizer_v1
Reference in Recipe Config
Update your training YAML:

dataset:
  name: "my_dataset"  # Matches asset card
  train_split: "train"
  valid_split: "dev"
  storage_mode: "MIXTURE_PARQUET"
  task_mode: "ASR"
  mixture_parquet_storage_config:
    dataset_summary_path: "/path/to/dataset/language_distribution_0.tsv"
    beta_corpus: 0.5
    beta_language: 0.5
Run Training
export OUTPUT_DIR="/path/to/output"

python -m workflows.recipes.wav2vec2.asr $OUTPUT_DIR \
    --config-file your_config.yaml
Best Practices

Shuffle Window
Recommendation: 1k-10k samples
Too small: poor randomization
Too large: high memory usage
Files are consumed in row groups of 100.
Row Group Size
Default: 100 rows per group. Benefits:
Lower memory during streaming
Efficient shuffling
Faster partition filtering
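As an illustration of why small row groups help streaming, PyArrow can iterate a file one batch at a time instead of materializing it fully; the file path and downstream call here are placeholders:

import pyarrow.parquet as pq

# Stream roughly one row group (~100 rows) at a time.
pf = pq.ParquetFile("/path/to/part-0.parquet")
for batch in pf.iter_batches(batch_size=100):
    handle(batch)  # placeholder for downstream processing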
Partition Columns
Always partition by corpus, split, and language. Enables:
Fast train/dev/test splitting
Language-specific loading
Weighted corpus sampling
Parallel Processing
Use Ray for large datasets:

ray.init(num_cpus=16)  # Adjust based on hardware

ray_ds.map_batches(
    process_fn,
    batch_size=100,
    num_cpus=2,  # CPUs per task
).write_parquet(...)
Citations
If using this pipeline or supported datasets, please cite:
@article{conneau2022fleurs,
  title={FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech},
  author={Conneau, Alexis and Ma, Min and Khanuja, Simran and Zhang, Yu and Axelrod, Vera and Dalmia, Siddharth and Riesa, Jason and Rivera, Clara and Bapna, Ankur},
  journal={arXiv preprint arXiv:2205.12446},
  year={2022}
}

@article{pratap2020mls,
  title={MLS: A Large-Scale Multilingual Dataset for Speech Research},
  author={Pratap, Vineel and Xu, Qiantong and Sriram, Anuroop and Synnaeve, Gabriel and Collobert, Ronan},
  journal={arXiv preprint arXiv:2012.03411},
  year={2020}
}
Next Steps
Training Guide: Use your prepared dataset for model training
Inference Guide: Test your data with pre-trained models
GitHub Examples: Explore more data preparation examples
HuggingFace Datasets: Browse available ASR datasets