Data Statistics

Overview

The data statistics utility computes mean and standard deviation values for mel-spectrograms in your dataset. These statistics are essential for proper normalization during training and inference.

Command Line Interface

matcha-data-stats

matcha-data-stats -i <config_file> [-b <batch_size>] [-f]

Arguments

-i, --input-config

str

required

Name of the YAML configuration file under configs/data/. Example: ljspeech.yaml, vctk.yaml

-b, --batch-size

int

default: 256

Batch size for processing. Higher values are faster but use more memory.

-f, --force

flag

Force overwriting an existing statistics file.
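
As an illustration of how the three flags above fit together, here is a minimal argparse sketch. It is a hypothetical reconstruction based only on the argument table, not the project's actual parser; the `build_parser` name and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags (not the project's own code)."""
    parser = argparse.ArgumentParser(prog="matcha-data-stats")
    # Required: name of the YAML file under configs/data/
    parser.add_argument("-i", "--input-config", type=str, required=True,
                        help="YAML file name under configs/data/")
    # Optional: the documented default is 256
    parser.add_argument("-b", "--batch-size", type=int, default=256,
                        help="Batch size used while iterating the dataset")
    # Optional flag: present means overwrite
    parser.add_argument("-f", "--force", action="store_true",
                        help="Overwrite an existing statistics file")
    return parser


args = build_parser().parse_args(["-i", "ljspeech.yaml", "-b", "512"])
print(args.input_config, args.batch_size, args.force)
```

Omitting -b falls back to the default of 256, and omitting -f leaves `force` as False.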

Python API

compute_data_statistics()

Compute mean and standard deviation for mel-spectrograms.

from matcha.utils.generate_data_statistics import compute_data_statistics

stats = compute_data_statistics(
    data_loader=train_loader,
    out_channels=80
)

Parameters

data_loader

torch.utils.data.DataLoader

required

DataLoader containing mel-spectrograms in batch["y"]

out_channels

int

required

Number of mel-spectrogram channels (typically 80)

Returns

mel_mean

float

Mean value across all mel-spectrogram frames and channels

mel_std

float

Standard deviation across all mel-spectrogram frames and channels

Output Format

The command generates a JSON file named <config_name>.json:

{
  "mel_mean": -5.534512,
  "mel_std": 2.123456
}
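
A statistics file in this format can be read back with the standard library. The sketch below is one way to do it; the `load_stats` helper is hypothetical, and the demo writes a temporary file with the example values rather than assuming where the real command stores its output.

```python
import json
import tempfile
from pathlib import Path


def load_stats(path):
    """Read a statistics file and return (mel_mean, mel_std) as floats."""
    data = json.loads(Path(path).read_text())
    mel_mean = float(data["mel_mean"])
    mel_std = float(data["mel_std"])
    if mel_std <= 0:
        # A non-positive std would make normalization meaningless
        raise ValueError("mel_std must be positive")
    return mel_mean, mel_std


# Demo with a temporary file holding the example values shown above
with tempfile.TemporaryDirectory() as tmp:
    stats_path = Path(tmp) / "ljspeech.json"
    stats_path.write_text('{"mel_mean": -5.534512, "mel_std": 2.123456}')
    mel_mean, mel_std = load_stats(stats_path)

print(mel_mean, mel_std)
```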

Usage in Training

The computed statistics are used in the model configuration:

model:
  _target_: matcha.models.matcha_tts.MatchaTTS
  data_statistics:
    mel_mean: -5.534512
    mel_std: 2.123456
  # ... other parameters

Examples

Basic Usage

matcha-data-stats -i ljspeech.yaml

Output:

Dataloader loaded! Now computing stats...
{'mel_mean': -5.534512, 'mel_std': 2.123456}

High-Speed Processing

matcha-data-stats -i vctk.yaml -b 512

Force Overwrite

matcha-data-stats -i ljspeech.yaml -f

Python Script Example

import json
from matcha.data.text_mel_datamodule import TextMelDataModule
from matcha.utils.generate_data_statistics import compute_data_statistics

# Setup datamodule
datamodule = TextMelDataModule(
    name="ljspeech",
    train_filelist_path="data/ljspeech/train.txt",
    valid_filelist_path="data/ljspeech/val.txt",
    batch_size=256,
    n_feats=80,
    # ... other config
)

datamodule.setup()
train_loader = datamodule.train_dataloader()

# Compute statistics
stats = compute_data_statistics(
    data_loader=train_loader,
    out_channels=80
)

print(f"Mean: {stats['mel_mean']:.6f}")
print(f"Std: {stats['mel_std']:.6f}")

# Save to file
with open("ljspeech_stats.json", "w") as f:
    json.dump(stats, f, indent=2)

Mathematical Details

The statistics are computed as:

Mean:

mean = sum(all_mel_values) / (total_frames * n_channels)

Standard Deviation:

std = sqrt(
    sum(mel_values^2) / (total_frames * n_channels) - mean^2
)
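
The two formulas can be evaluated in a single pass by accumulating a running sum and a running sum of squares. The following is a minimal pure-Python sketch of that idea, not the library's implementation: the `running_mean_std` name and the nested-list input shape ([item][channel][frame]) are assumptions made for illustration, and every value is counted, so padding is not handled.

```python
import math


def running_mean_std(batches):
    """Accumulate sum and sum of squares over all batches, then derive mean and std."""
    total_sum = 0.0
    total_sq_sum = 0.0
    total_count = 0  # equals total_frames * n_channels
    for batch in batches:
        for mel in batch:
            for channel in mel:
                total_sum += sum(channel)
                total_sq_sum += sum(v * v for v in channel)
                total_count += len(channel)
    mean = total_sum / total_count
    # E[x^2] - mean^2; clamp at zero to guard against tiny negative rounding errors
    variance = max(total_sq_sum / total_count - mean ** 2, 0.0)
    return {"mel_mean": mean, "mel_std": math.sqrt(variance)}


batches = [
    [[[1.0, 2.0], [3.0, 4.0]]],  # one item: 2 channels x 2 frames
    [[[5.0, 6.0], [7.0, 8.0]]],
]
stats = running_mean_std(batches)
print(stats)
```

Because the divisor is the full element count, this is the population standard deviation rather than the sample (n - 1) estimate.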

Configuration Requirements

Your data config file must include:

name: "dataset_name"
train_filelist_path: "path/to/train.txt"
valid_filelist_path: "path/to/val.txt"
n_feats: 80
batch_size: 256
# ... other datamodule parameters

Filelist Format

The training filelist should contain:

audio_path|text|speaker_id
LJ001-0001.wav|This is the text.|0
LJ001-0002.wav|Another sentence.|0
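
Each record is a pipe-separated line, so it can be split into its three fields directly. The helper below is a hypothetical sketch for checking a filelist by hand; it assumes the text field never contains a "|" character and that every record carries a speaker ID.

```python
def parse_filelist_line(line):
    """Split one 'audio_path|text|speaker_id' record into typed fields."""
    audio_path, text, speaker_id = line.rstrip("\n").split("|")
    return audio_path, text, int(speaker_id)


records = [
    "LJ001-0001.wav|This is the text.|0",
    "LJ001-0002.wav|Another sentence.|0",
]
parsed = [parse_filelist_line(r) for r in records]
print(parsed[0])
```

A line with more or fewer than three fields raises a ValueError at the unpacking step, which makes malformed records easy to spot.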

Performance Considerations

Memory Usage

Batch size 256: ~8 GB GPU memory
Batch size 512: ~16 GB GPU memory
Batch size 128: ~4 GB GPU memory

Processing Time

For LJSpeech (~13,100 samples):

Batch size 256: ~2-3 minutes
Batch size 512: ~1-2 minutes
Batch size 128: ~4-5 minutes
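
To relate batch size to the amount of work, the number of batches per pass can be computed directly. This sketch assumes all 13,100 samples are iterated and that the final partial batch is kept; `num_batches` is a hypothetical helper, and the timings above are not derived from it.

```python
import math


def num_batches(num_samples, batch_size):
    """Number of batches when the last, partially filled batch is kept."""
    return math.ceil(num_samples / batch_size)


counts = {bs: num_batches(13_100, bs) for bs in (128, 256, 512)}
print(counts)
```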

Normalization in Model

The model uses these statistics for normalization:

# During training/inference
def normalize(mel, mean, std):
    return (mel - mean) / std

def denormalize(mel, mean, std):
    return mel * std + mean
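
A quick round trip shows that the two functions are inverses of each other. The sketch below applies them to a single scalar value using the example statistics from this page; the sample value -3.0 is arbitrary, and real usage would pass tensors instead of floats.

```python
def normalize(mel, mean, std):
    """Shift by the dataset mean and scale by the dataset std."""
    return (mel - mean) / std


def denormalize(mel, mean, std):
    """Invert normalize(): scale back by std, then add the mean."""
    return mel * std + mean


mel_mean, mel_std = -5.534512, 2.123456
frame_value = -3.0  # arbitrary log-mel value chosen for illustration

z = normalize(frame_value, mel_mean, mel_std)
restored = denormalize(z, mel_mean, mel_std)
print(z, restored)
```

A value equal to `mel_mean` normalizes to zero, and values above the mean normalize to positive numbers.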

Error Handling

File Already Exists

$ matcha-data-stats -i ljspeech.yaml
File already exists. Use -f to force overwrite

Solution: Add the -f flag or delete the existing file.

Config Not Found

$ matcha-data-stats -i nonexistent.yaml
Config file not found: configs/data/nonexistent.yaml

Solution: Check the config file name and confirm that the file exists under configs/data/.

Invalid Batch Size

$ matcha-data-stats -i ljspeech.yaml -b 0
Batch size must be greater than 0

Solution: Use a positive integer for the batch size.

Best Practices

Use training set only: Compute statistics on training data, not validation/test
Consistent preprocessing: Ensure mel-spectrogram extraction matches training
Sufficient data: Use full training set for accurate statistics
Save statistics: Store in version control with model configs
Recompute when needed: Recalculate if you change mel-spectrogram parameters

Related Commands

matcha-tts: Main synthesis command
matcha-tts-get-durations: Extract phoneme durations

Source Reference

Implementation: matcha/utils/generate_data_statistics.py:25
Entry Point: setup.py:45
