Comprehensive guide to converting, formatting, and optimizing audio datasets for efficient multilingual ASR training.

Overview

The data preparation pipeline converts popular HuggingFace audio datasets into a standardized parquet format optimized for massively multilingual speech model training.
Parquet datasets support efficient streaming, weighted sampling, and partitioning by corpus, split, and language.

Installation

1. Install Core Dependencies

# Using pip
pip install omnilingual-asr[data]

# Or using uv
uv add "omnilingual-asr[data]"
This installs: pyarrow, polars, pandas
2. Install HuggingFace Dependencies

For data preparation with HuggingFace datasets:
pip install ray datasets
  • ray: Distributed processing framework
  • datasets: HuggingFace datasets library

Parquet Dataset Format

Directory Structure

Datasets are organized by corpus, split, and language:
dataset_root_dir/version=0/
├── corpus=mls/
│   ├── split=train/
│   │   ├── language=deu_Latn/
│   │   │   └── part-*.parquet
│   │   └── language=fra_Latn/
│   │       └── part-*.parquet
│   └── split=dev/
│       └── ...
└── corpus=fleurs/
    └── ...
Benefits:
  • Filtered Loading: Load only split=train for training (see the sketch below)
  • Weighted Sampling: Sample across corpus-language combinations
  • Language-Specific Eval: Evaluate on specific languages
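For example, partition filters let a reader open only the files it needs (a sketch using pyarrow.dataset; the path is a placeholder):

import pyarrow.dataset as pds

# Hive-style directory names (split=train, language=deu_Latn, ...) are
# parsed into partition columns automatically.
ds = pds.dataset("dataset_root_dir/version=0", partitioning="hive")

# Only files under split=train/language=deu_Latn are read from disk.
table = ds.to_table(
    filter=(pds.field("split") == "train") & (pds.field("language") == "deu_Latn")
)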

Schema

Each parquet file follows a minimal, optimized schema:
text: string

audio_bytes: list<element: int8>
  child 0, element: int8

audio_size: int64

corpus: dictionary<values=string, indices=int32, ordered=0>

split: dictionary<values=string, indices=int32, ordered=0>

language: dictionary<values=string, indices=int32, ordered=0>
The text field (type: string) contains the normalized text transcription of the audio sample. Normalization includes:
  • Lowercasing (configurable)
  • Punctuation removal
  • Number word removal
  • Language-specific processing
Parquet files are written with row_group_size=100 to reduce memory footprint during streaming and enable efficient shuffling.
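For reference, the same schema can be declared directly in PyArrow (a minimal sketch; the TARGET_SCHEMA name is illustrative, not part of the library):

import pyarrow as pa

# Mirrors the schema above: corpus, split, and language are
# dictionary-encoded strings for compact storage.
TARGET_SCHEMA = pa.schema([
    ("text", pa.string()),
    ("audio_bytes", pa.list_(pa.int8())),
    ("audio_size", pa.int64()),
    ("corpus", pa.dictionary(pa.int32(), pa.string())),
    ("split", pa.dictionary(pa.int32(), pa.string())),
    ("language", pa.dictionary(pa.int32(), pa.string())),
])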

Data Processing Pipeline

Supported Datasets

The example pipeline demonstrates preparation using:
  • FLEURS (Few-shot Learning Evaluation of Universal Representations of Speech)
  • MLS (Multilingual LibriSpeech)

Processing Steps

The pipeline uses Ray for distributed processing with configurable shuffling:
1. Load from HuggingFace

from datasets import load_dataset

dataset = load_dataset(
    "google/fleurs",
    "en_us",
    split="train",
    streaming=True
)
2. Text Processing

Operations:
  • Language-specific normalization
  • Punctuation removal
  • Lowercase conversion
  • Digit-only word removal
  • Language code remapping
from workflows.dataprep.text_tools import text_normalize

normalized = text_normalize(
    text,
    iso_code="en",
    lower_case=True,
    remove_numbers=True
)
3. Audio Processing

Operations:
  • Binary conversion to byte lists
  • Validation and resampling to 16kHz
  • Audio size computation
  • Format standardization
from workflows.dataprep.audio_tools import AudioTableProcessor

processor = AudioTableProcessor()
processed_batch = processor(audio_batch)
4. Write Parquet

Configuration:
  • Partition by corpus/split/language
  • Row group size: 100
  • Shuffle window: 1k-10k samples
# df is a polars DataFrame holding the processed batch
df.write_parquet(
    path,
    partition_by=["corpus", "split", "language"],
    row_group_size=100
)

Quick Start Example

Use the provided ingestion script for automatic dataset generation:
# Process 2 languages (en_us, fr_fr) from FLEURS
python workflows/dataprep/hf_dataset_ingestion_example.py \
  run_short /path/to/output/dir

Individual Dataset Processing

# Process the MLS corpus only
python workflows/dataprep/hf_dataset_ingestion_example.py \
  ingest_mls /path/to/output/dir

Processing Utilities

Text Processing

File: workflows/dataprep/text_tools.py
from workflows.dataprep.text_tools import text_normalize

normalized_text = text_normalize(
    text="Hello, World! 123",
    iso_code="en",           # 2-letter ISO code
    lower_case=True,         # Convert to lowercase
    remove_numbers=True,     # Remove digit-only words
    remove_brackets=False    # Keep bracketed content
)
# Output: "hello world"
Parameters:
  • text: Input text to normalize
  • iso_code: 2-letter ISO language code (e.g., "en", "de", "fr")
  • lower_case: Apply lowercasing (default: True)
  • remove_numbers: Remove words containing only digits (default: True)
  • remove_brackets: Remove bracketed content (default: False)
Features (see the sketch below):
  • Language-specific punctuation handling
  • Unicode normalization
  • Whitespace normalization
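The exact rules are language-dependent and live in text_tools.py; purely as an illustration (not the actual implementation), the core steps resemble:

import re
import unicodedata

def rough_normalize(text: str) -> str:
    # Illustrative approximation only; real rules vary per language
    text = unicodedata.normalize("NFKC", text)        # Unicode normalization
    text = re.sub(r"[^\w\s]", " ", text)              # naive punctuation removal
    text = re.sub(r"\s+", " ", text).strip().lower()  # whitespace + lowercasing
    return text

print(rough_normalize("Hello,  World!"))  # hello world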

Audio Processing

File: workflows/dataprep/audio_tools.py
Main class for processing audio data in PyArrow tables:
from workflows.dataprep.audio_tools import AudioTableProcessor

processor = AudioTableProcessor(
    target_sample_rate=16000,
    audio_format="flac"
)

# Process PyArrow table batch
processed_table = processor(audio_table)
Features:
  • Automatic resampling to target rate
  • Format conversion (WAV/FLAC/OGG)
  • Mono-channel conversion
  • Binary encoding to int8 lists
map_to_target_schema transforms batches to the target parquet schema:
from workflows.dataprep.audio_tools import map_to_target_schema

schema_batch = map_to_target_schema(
    batch=processed_batch,
    split="train",
    corpus="fleurs"
)
This adds the corpus and split columns and ensures schema compliance.
binary_to_list_int8 efficiently converts a PyArrow BinaryArray to a ListArray of int8:
from workflows.dataprep.audio_tools import binary_to_list_int8

# Convert audio bytes
list_array = binary_to_list_int8(binary_array)
Performance: Zero-copy conversion for fast processing.
bytes_to_tensor converts a numpy array of audio bytes to a waveform tensor:
from workflows.dataprep.audio_tools import bytes_to_tensor

waveform = bytes_to_tensor(
    audio_arr=audio_bytes,
    target_sample_rate=16_000
)
Output: Tensor with shape [num_samples]

Example: Custom Dataset Preparation

From HuggingFace Dataset

import ray
from datasets import load_dataset
from workflows.dataprep.audio_tools import AudioTableProcessor, map_to_target_schema
from workflows.dataprep.text_tools import text_normalize

# Initialize Ray
ray.init()

# Load dataset
ds = load_dataset(
    "mozilla-foundation/common_voice_11_0",
    "en",
    split="train"
)

# Convert to Ray dataset
ray_ds = ray.data.from_huggingface(ds)

# Define processing pipeline
def process_batch(batch):
    # Text normalization
    batch["text"] = [
        text_normalize(text, iso_code="en")
        for text in batch["sentence"]
    ]
    
    # Audio processing
    audio_processor = AudioTableProcessor()
    batch = audio_processor(batch)
    
    # Map to target schema
    batch = map_to_target_schema(
        batch,
        split="train",
        corpus="common_voice"
    )
    
    # Add language column (one entry per row in the batch)
    batch["language"] = ["eng_Latn"] * len(batch["text"])
    
    return batch

# Process and write
ray_ds.map_batches(process_batch).write_parquet(
    "/path/to/output/version=0",
    partition_cols=["corpus", "split", "language"],
    row_group_size=100
)

From Local Audio Files

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pathlib import Path
from workflows.dataprep.text_tools import text_normalize

# Prepare data
audio_files = list(Path("/path/to/audio").glob("*.wav"))
transcripts = [...]  # Load corresponding transcripts

# Create DataFrame
df = pd.DataFrame({
    "audio_path": audio_files,
    "transcript": transcripts
})

# Process each row

def process_row(row):
    # Read audio and view as signed int8 to match the parquet schema
    with open(row["audio_path"], "rb") as f:
        audio_bytes = np.frombuffer(f.read(), dtype=np.int8).tolist()
    
    # Normalize text
    text = text_normalize(row["transcript"], iso_code="en")
    
    return {
        "audio_bytes": audio_bytes,
        "audio_size": len(audio_bytes),  # assumed: byte length of the encoded audio
        "text": text,
        "corpus": "my_corpus",
        "split": "train",
        "language": "eng_Latn"
    }

processed_df = pd.DataFrame(df.apply(process_row, axis=1).tolist())

# Write parquet
table = pa.Table.from_pandas(processed_df)
pq.write_to_dataset(
    table,
    root_path="/path/to/output/version=0",
    partition_cols=["corpus", "split", "language"],
    row_group_size=100
)

Dataset Loading for Training

Using MixtureParquetAsrDataset

After preparing the dataset, load it for training:
import torch

from omnilingual_asr.datasets.impl.mixture_parquet_asr_dataset import MixtureParquetAsrDataset
from fairseq2.models.tokenizers.hub import load_tokenizer

# Create dataset
dataset = MixtureParquetAsrDataset.from_path(
    path="/path/to/dataset/version=0",
    name="my_asr_dataset"
)

# Load tokenizer
tokenizer = load_tokenizer("omniASR_tokenizer_v1")

# Create reader (gangs, storage_config, and task_config come from your
# training recipe setup)
reader = dataset.create_reader(
    split="train",
    tokenizer=tokenizer,
    gangs=gangs,
    dtype=torch.float32,
    storage_config=storage_config,
    task_config=task_config,
)

# Iterate through batches
for batches in reader:
    for batch in batches:
        # batch.source_seqs: audio features [batch, time, features]
        # batch.target_seqs: text tokens [batch, seq_len]
        # batch.source_seq_lens: audio sequence lengths
        # batch.target_seq_lens: text sequence lengths
        train_step(batch)

Dataloader Features

  • Weighted Sampling: Sample across languages/corpora with temperature control (see the sketch below)
  • Streaming: Efficient streaming with configurable buffering
  • Audio Processing: Automatic decoding, normalization, and feature extraction
  • Dynamic Batching: Length-based batching for efficient GPU utilization
  • Text Tokenization: Automatic tokenization with filtering
  • SpecAugment: Built-in spectro-temporal augmentation
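The beta_corpus and beta_language values in the recipe config control this weighting; a hedged sketch of the usual temperature scheme, assuming beta acts as an exponent on per-partition sample counts (beta=1 keeps natural proportions, beta=0 is uniform):

import numpy as np

# Hypothetical per-language sample counts
counts = {"eng_Latn": 1_000_000, "deu_Latn": 100_000, "swh_Latn": 10_000}

beta = 0.5  # assumed role of beta_language in the recipe config
weights = np.array(list(counts.values()), dtype=np.float64) ** beta
probs = weights / weights.sum()

for lang, p in zip(counts, probs):
    print(f"{lang}: {p:.3f}")  # low-resource languages are upweighted vs. beta=1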

Verification

CLI Verification Tool

Verify dataset creation and data loading:
python -m workflows.dataprep.dataloader_example \
  --dataset_path="root_ds/all_asr/version=0" \
  --split="train" \
  --num_iterations=10
This will:
  1. Load the tokenizer
  2. Create the dataset (scan files, organize by corpus/language)
  3. Create a data reader with task/storage configs
  4. Iterate through batches and show statistics

Programmatic Verification

import pandas as pd
import pyarrow.parquet as pq

# Load dataset (partition columns are inferred from the directory layout)
ds = pq.ParquetDataset("/path/to/dataset/version=0")

# Check files
print("Files:", len(ds.files))

# Read sample
table = ds.read()
print("Schema:", table.schema)
print("Num rows:", len(table))
print("Sample:", table.to_pandas().head())

# Verify statistics
stats = pd.read_csv("/path/to/language_distribution_0.tsv", sep="\t")
print("\nLanguage distribution:")
print(stats)

Integration with Training

To use your prepared dataset in training recipes:
1. Create Asset Card

Define the dataset asset card at src/omnilingual_asr/cards/datasets/my_dataset.yaml:
name: my_dataset
dataset_family: mixture_parquet_asr_dataset
dataset_config:
  data: /path/to/the/dataset/version=0
tokenizer_ref: omniASR_tokenizer_v1
2. Reference in Recipe Config

Update your training YAML:
dataset:
  name: "my_dataset"  # Matches asset card
  train_split: "train"
  valid_split: "dev"
  storage_mode: "MIXTURE_PARQUET"
  task_mode: "ASR"
  mixture_parquet_storage_config:
    dataset_summary_path: "/path/to/dataset/language_distribution_0.tsv"
    beta_corpus: 0.5
    beta_language: 0.5
3. Run Training

export OUTPUT_DIR="/path/to/output"
python -m workflows.recipes.wav2vec2.asr $OUTPUT_DIR \
  --config-file your_config.yaml

Performance Optimization

Shuffle Window

Recommendation: 1k-10k samples
  • Too small: Poor randomization
  • Too large: High memory usage
  • Files are consumed in row groups of 100
Row Group Size

Default: 100 rows per group (see the streaming sketch below). Benefits:
  • Lower memory during streaming
  • Efficient shuffling
  • Faster partition filtering
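Because each row group is independently readable, a streaming reader can buffer and shuffle at row-group granularity (a sketch; process is a hypothetical placeholder, and the omniASR dataloader handles this internally):

import pyarrow.parquet as pq

pf = pq.ParquetFile("part-0.parquet")
for record_batch in pf.iter_batches(batch_size=100):
    process(record_batch)  # hypothetical downstream step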
Partitioning

Always partition by corpus, split, and language. This enables:
  • Fast train/dev/test splitting
  • Language-specific loading
  • Weighted corpus sampling
Parallel Processing

Use Ray for large datasets:
ray.init(num_cpus=16)  # Adjust based on hardware

ray_ds.map_batches(
    process_fn,
    batch_size=100,
    num_cpus=2  # CPUs per task
).write_parquet(...)

Citations

If using this pipeline or supported datasets, please cite:
@article{conneau2022fleurs,
  title={FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech},
  author={Conneau, Alexis and Ma, Min and Khanuja, Simran and Zhang, Yu and Axelrod, Vera and Dalmia, Siddharth and Riesa, Jason and Rivera, Clara and Bapna, Ankur},
  journal={arXiv preprint arXiv:2205.12446},
  year={2022}
}

@article{pratap2020mls,
  title={MLS: A Large-Scale Multilingual Dataset for Speech Research},
  author={Pratap, Vineel and Xu, Qiantong and Sriram, Anuroop and Synnaeve, Gabriel and Collobert, Ronan},
  journal={arXiv preprint arXiv:2012.03411},
  year={2020}
}

Next Steps

  • Training Guide: Use your prepared dataset for model training
  • Inference Guide: Test your data with pre-trained models
  • GitHub Examples: Explore more data preparation examples
  • HuggingFace Datasets: Browse available ASR datasets

Build docs developers (and LLMs) love