This guide shows you how to train Matcha-TTS on your own custom dataset beyond the standard LJSpeech or VCTK datasets.
## Overview
Training on a custom dataset involves:
- Preparing your audio data and transcriptions
- Creating a dataset configuration file
- Computing normalization statistics
- Training the model
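For single-speaker data, Matcha-TTS filelists follow the LJSpeech convention of one `audio_path|transcript` line per utterance. A minimal sketch of generating a train/validation split — the `metadata` list here is a hypothetical stand-in for wherever your transcriptions actually live:

```python
import random

# Hypothetical stand-in: replace with your own (audio_path, transcript) pairs.
metadata = [
    ("data/MyDataset/wavs/utt0001.wav", "First utterance text."),
    ("data/MyDataset/wavs/utt0002.wav", "Second utterance text."),
    ("data/MyDataset/wavs/utt0003.wav", "Third utterance text."),
    ("data/MyDataset/wavs/utt0004.wav", "Fourth utterance text."),
]

random.seed(1234)  # fixed seed for a reproducible split
random.shuffle(metadata)

val_size = max(1, len(metadata) // 10)  # hold out ~10% for validation
val, train = metadata[:val_size], metadata[val_size:]

def write_filelist(path, items):
    """Write one 'audio_path|text' line per utterance."""
    with open(path, "w", encoding="utf-8") as f:
        for wav, text in items:
            f.write(f"{wav}|{text}\n")

write_filelist("train.txt", train)
write_filelist("val.txt", val)
```

A fixed seed keeps the split reproducible if you ever regenerate the filelists.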
## Creating a Custom Dataset Config
### Create config file

Create a new YAML file in `configs/data/` for your dataset:

```bash
touch configs/data/my_dataset.yaml
```
### Define dataset configuration

Edit `configs/data/my_dataset.yaml` with your dataset parameters:

```yaml
_target_: matcha.data.text_mel_datamodule.TextMelDataModule
name: my_dataset
train_filelist_path: data/MyDataset/train.txt
valid_filelist_path: data/MyDataset/val.txt
batch_size: 32
num_workers: 8
pin_memory: True
cleaners: [english_cleaners2]  # Or your custom cleaner
add_blank: True
n_spks: 1  # Set to >1 for multi-speaker

# Audio parameters
n_fft: 1024
n_feats: 80
sample_rate: 22050  # Match your audio sample rate
hop_length: 256
win_length: 1024
f_min: 0
f_max: 8000

# Data statistics (computed in "Computing Dataset Statistics" below)
data_statistics:
  mel_mean: 0.0  # Replace with computed values
  mel_std: 1.0  # Replace with computed values

seed: ${seed}
load_durations: false
```
### Adjust audio parameters

The `sample_rate` must match your audio files. If your audio is 16 kHz, set `sample_rate: 16000` and adjust `f_max` accordingly (typically `sample_rate / 2`).
Common configurations:

For 22.05 kHz audio (LJSpeech standard):

```yaml
sample_rate: 22050
hop_length: 256
win_length: 1024
f_max: 8000
```

For 16 kHz audio:

```yaml
sample_rate: 16000
hop_length: 200
win_length: 800
f_max: 8000
```

For 24 kHz audio:

```yaml
sample_rate: 24000
hop_length: 300
win_length: 1200
f_max: 12000
```
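A pattern worth noting in the presets above: the hop and window lengths are chosen so that the hop stays near 12 ms and the window near 46-50 ms regardless of sample rate. A quick sanity check you can adapt when targeting a rate not listed here:

```python
# The presets keep the hop near ~12 ms and the window near ~46-50 ms
# across sample rates; values are (hop_length, win_length) per rate.
configs = {
    22050: (256, 1024),
    16000: (200, 800),
    24000: (300, 1200),
}

for sr, (hop, win) in configs.items():
    hop_ms = 1000 * hop / sr
    win_ms = 1000 * win / sr
    print(f"{sr} Hz: hop {hop_ms:.2f} ms, window {win_ms:.2f} ms")
    assert 11.0 <= hop_ms <= 13.0, "hop should stay near ~12 ms"
    assert 45.0 <= win_ms <= 51.0, "window should stay near ~46-50 ms"
```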
## Computing Dataset Statistics
Normalization statistics are critical for training stability.
### Run statistics computation

Use the `matcha-data-stats` command:

```bash
matcha-data-stats -i my_dataset.yaml
```
This computes the mean and standard deviation of mel-spectrograms across your entire training set.

### Update configuration

The command outputs statistics like:

```
{'mel_mean': -5.234567, 'mel_std': 2.345678}
```

Update `configs/data/my_dataset.yaml`:

```yaml
data_statistics:
  mel_mean: -5.234567
  mel_std: 2.345678
```
### Verify computation

The computed values typically fall in these ranges:

- `mel_mean`: between -7 and -4
- `mel_std`: between 1.5 and 3.0
If your values fall far outside these ranges, verify:
- Audio files are correctly formatted
- Sample rate matches configuration
- File paths in filelists are correct
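Conceptually, the statistics are one global mean and standard deviation aggregated over every mel value in the training set. A pure-Python sketch of that aggregation, with random Gaussian arrays standing in for real mel-spectrograms (so the exact numbers are illustrative only):

```python
import math
import random

# Stand-ins for per-utterance mel-spectrograms (flattened frame values);
# real values come from the actual training audio.
random.seed(0)
mels = [[random.gauss(-5.0, 2.0) for _ in range(500)] for _ in range(20)]

# One global aggregation over every value in every utterance.
total, total_sq, count = 0.0, 0.0, 0
for mel in mels:
    for v in mel:
        total += v
        total_sq += v * v
        count += 1

mel_mean = total / count
mel_std = math.sqrt(total_sq / count - mel_mean**2)
print(f"mel_mean: {mel_mean:.4f}, mel_std: {mel_std:.4f}")
```

Because the same two values normalize every utterance, mismatched or stale statistics destabilize training — which is why they must be recomputed for each new dataset.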
## Creating an Experiment Configuration
Create an experiment config to easily run training:
### Create experiment file

Create `configs/experiment/my_dataset.yaml`:

```yaml
# @package _global_

defaults:
  - override /data: my_dataset.yaml

tags: ["my_dataset", "custom"]
run_name: my_dataset_experiment

# Optional: override any parameters
# data:
#   batch_size: 16
#
# model:
#   out_size: 172  # For memory-constrained training
```
### Train with experiment config

```bash
python matcha/train.py experiment=my_dataset
```
## Multi-Speaker Custom Dataset
For a custom multi-speaker dataset:
### Prepare multi-speaker filelists

Format: `path|speaker_id|text`

```
data/MyDataset/speaker001/audio001.wav|0|First speaker text
data/MyDataset/speaker002/audio001.wav|1|Second speaker text
data/MyDataset/speaker001/audio002.wav|0|More from first speaker
```
Speaker IDs should be:
- Consecutive integers starting from 0
- Consistent across train and validation sets
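If your corpus labels speakers by name or by non-contiguous IDs, remap them to consecutive integers starting from 0 using one shared mapping across both filelists. A small sketch (the paths and speaker names are hypothetical):

```python
# Build one speaker-name -> integer-ID mapping shared by all filelists,
# so IDs are consecutive from 0 and consistent across train/validation.
def build_speaker_map(filelists):
    speakers = set()
    for filelist in filelists:
        for line in filelist:
            _, speaker, _ = line.split("|", 2)
            speakers.add(speaker)
    return {spk: i for i, spk in enumerate(sorted(speakers))}

# Hypothetical filelist lines with speaker names instead of IDs.
train = ["a.wav|alice|hi there", "b.wav|bob|hey"]
val = ["c.wav|alice|hello again"]

spk_map = build_speaker_map([train, val])
remapped = [
    "|".join((path, str(spk_map[spk]), text))
    for path, spk, text in (line.split("|", 2) for line in train + val)
]
print(spk_map)   # {'alice': 0, 'bob': 1}
print(remapped)  # ['a.wav|0|hi there', 'b.wav|1|hey', 'c.wav|0|hello again']
```

Building the map from the union of all filelists guarantees a speaker appearing only in validation still gets an ID below `n_spks`.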
### Update config with speaker count

In `configs/data/my_dataset.yaml`:

```yaml
n_spks: 10  # Replace with your actual speaker count
```
### Configure speaker embeddings

In your experiment config, you can adjust the speaker embedding size:

```yaml
model:
  spk_emb_dim: 64  # Higher for more speakers (e.g., 128 for 100+ speakers)
```
## Using Custom Text Cleaners
If your dataset uses a different language or requires special preprocessing:
### Create custom cleaner

Add a new cleaner function in `matcha/text/cleaners.py`:

```python
def my_custom_cleaner(text):
    """Custom text cleaning pipeline."""
    text = text.lower()
    # Add your custom preprocessing
    return text
```
### Register the cleaner

Make sure the function is importable from the cleaners module so it can be referenced by name in your config.
### Use in config

Update `configs/data/my_dataset.yaml`:

```yaml
cleaners: [my_custom_cleaner]
```
## Example: Non-English Dataset

For a German-language dataset:
```yaml
# configs/data/german_dataset.yaml
_target_: matcha.data.text_mel_datamodule.TextMelDataModule
name: german_tts
train_filelist_path: data/GermanTTS/train.txt
valid_filelist_path: data/GermanTTS/val.txt
batch_size: 32
num_workers: 8
pin_memory: True
cleaners: [basic_cleaners]  # Use basic or create a German-specific cleaner
add_blank: True
n_spks: 1
n_fft: 1024
n_feats: 80
sample_rate: 22050
hop_length: 256
win_length: 1024
f_min: 0
f_max: 8000
data_statistics:
  mel_mean: -5.123456  # Computed from your dataset
  mel_std: 2.234567
seed: ${seed}
load_durations: false
```
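If `basic_cleaners` (lowercasing plus whitespace collapsing) is not enough, a German-specific cleaner might also expand common abbreviations while leaving umlauts and ß intact for the tokenizer. A hypothetical sketch — `german_cleaners` is not part of Matcha-TTS; you would add it to `matcha/text/cleaners.py` as described above:

```python
import re

# Hypothetical German cleaner: lowercase, expand a few common abbreviations,
# and collapse whitespace. Umlauts and ß are kept as-is.
_ABBREVIATIONS = [
    (re.compile(rf"\b{abbr}"), expansion)
    for abbr, expansion in [
        (r"z\.b\.", "zum beispiel"),
        (r"dr\.", "doktor"),
        (r"usw\.", "und so weiter"),
    ]
]

def german_cleaners(text):
    text = text.lower()
    for pattern, expansion in _ABBREVIATIONS:
        text = pattern.sub(expansion, text)
    return re.sub(r"\s+", " ", text).strip()

print(german_cleaners("Dr.  Müller wohnt z.B. in Köln."))
# → doktor müller wohnt zum beispiel in köln.
```

The config would then reference it as `cleaners: [german_cleaners]`.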
## Dealing with Long Audio Files
Very long audio files (longer than roughly 10 seconds) can cause out-of-memory issues during training, so consider splitting them into shorter segments:
- Split long files into 2-10 second segments
- Update filelists to reference the segmented files
- Adjust batch size based on average audio length
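The splitting step can be sketched with a minimal fixed-length splitter using only the standard-library `wave` module — an illustration only; in practice you would cut on silences and re-align transcripts to the segments:

```python
import wave

def split_wav(path, out_prefix, segment_seconds=5.0):
    """Cut a WAV file into fixed-length segments; returns the output paths."""
    outputs = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_segment = int(params.framerate * segment_seconds)
        index = 0
        while True:
            frames = src.readframes(frames_per_segment)
            if not frames:
                break
            out_path = f"{out_prefix}_{index:03d}.wav"
            with wave.open(out_path, "wb") as dst:
                dst.setparams(params)  # header is patched to the true length on close
                dst.writeframes(frames)
            outputs.append(out_path)
            index += 1
    return outputs

# Demo: a 12-second silent mono file at 22.05 kHz splits into 5 s + 5 s + 2 s.
with wave.open("long.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(22050)
    w.writeframes(b"\x00\x00" * 22050 * 12)

print(split_wav("long.wav", "long_seg"))
```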
## Small Dataset Considerations
For small datasets (less than 1000 utterances):
- Reduce batch size to increase gradient updates per epoch
- Enable pre-computed durations to help the model learn alignments
- Consider augmentation techniques
- Monitor for overfitting
For example, in your experiment config:

```yaml
data:
  batch_size: 8  # Smaller batch for more frequent updates

trainer:
  max_epochs: 2000  # More epochs for small datasets
```
## Validation Tips
### Verify file paths

Check that all audio files referenced in the filelists exist:

```bash
# Quick validation script
while IFS='|' read -r path text; do
  if [ ! -f "$path" ]; then
    echo "Missing: $path"
  fi
done < data/MyDataset/train.txt
```
### Check audio format

Verify sample rate consistency:

```bash
# Check sample rates
soxi -r data/MyDataset/wavs/*.wav | sort -u
```

The output should contain exactly one sample rate.

### Test data loading
Before full training, test that data loads correctly:

```python
from hydra import compose, initialize

from matcha.data.text_mel_datamodule import TextMelDataModule

with initialize(version_base="1.3", config_path="../configs/data"):
    cfg = compose(config_name="my_dataset.yaml")

dm = TextMelDataModule(**cfg)
dm.setup()
batch = next(iter(dm.train_dataloader()))
print(f"Batch shape: {batch['x'].shape}, {batch['y'].shape}")
```
Complete Example Workflow
# 1. Prepare your dataset structure
mkdir -p data/MyDataset/wavs
# ... copy audio files and create filelists ...
# 2. Create dataset config
cp configs/data/ljspeech.yaml configs/data/my_dataset.yaml
# ... edit my_dataset.yaml with your paths and parameters ...
# 3. Compute statistics
matcha-data-stats -i my_dataset.yaml
# 4. Update config with statistics
# ... edit configs/data/my_dataset.yaml ...
# 5. Create experiment config
cp configs/experiment/ljspeech.yaml configs/experiment/my_dataset.yaml
# ... edit to reference your dataset ...
# 6. Start training
python matcha/train.py experiment=my_dataset
## Next Steps