This guide shows you how to train Matcha-TTS on your own custom dataset beyond the standard LJSpeech or VCTK datasets.
## Overview
Training on a custom dataset involves:
- Preparing your audio data and transcriptions
- Creating a dataset configuration file
- Computing normalization statistics
- Training the model
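For single-speaker data, Matcha-TTS filelists follow the LJSpeech convention of one `audio_path|transcript` line per utterance. A minimal sketch of generating a train/validation split — the `metadata` list here is a hypothetical stand-in for wherever your transcriptions actually live:

```python
import random

# Hypothetical stand-in: replace with your own (audio_path, transcript) pairs.
metadata = [
    ("data/MyDataset/wavs/utt0001.wav", "First utterance text."),
    ("data/MyDataset/wavs/utt0002.wav", "Second utterance text."),
    ("data/MyDataset/wavs/utt0003.wav", "Third utterance text."),
    ("data/MyDataset/wavs/utt0004.wav", "Fourth utterance text."),
]

random.seed(1234)  # fixed seed for a reproducible split
random.shuffle(metadata)

val_size = max(1, len(metadata) // 10)  # hold out ~10% for validation
val, train = metadata[:val_size], metadata[val_size:]

def write_filelist(path, items):
    """Write one 'audio_path|text' line per utterance."""
    with open(path, "w", encoding="utf-8") as f:
        for wav, text in items:
            f.write(f"{wav}|{text}\n")

write_filelist("train.txt", train)
write_filelist("val.txt", val)
```

A fixed seed keeps the split reproducible if you ever regenerate the filelists.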
## Creating a Custom Dataset Config
### Create config file

Create a new YAML file in `configs/data/` for your dataset:

```bash
touch configs/data/my_dataset.yaml
```
### Define dataset configuration

Edit `configs/data/my_dataset.yaml` with your dataset parameters:

```yaml
_target_: matcha.data.text_mel_datamodule.TextMelDataModule
name: my_dataset
train_filelist_path: data/MyDataset/train.txt
valid_filelist_path: data/MyDataset/val.txt
batch_size: 32
num_workers: 8
pin_memory: True
cleaners: [english_cleaners2]  # Or your custom cleaner
add_blank: True
n_spks: 1  # Set to >1 for multi-speaker

# Audio parameters
n_fft: 1024
n_feats: 80
sample_rate: 22050  # Match your audio sample rate
hop_length: 256
win_length: 1024
f_min: 0
f_max: 8000

# Data statistics (computed in "Computing Dataset Statistics" below)
data_statistics:
  mel_mean: 0.0  # Replace with computed values
  mel_std: 1.0  # Replace with computed values

seed: ${seed}
load_durations: false
```
### Adjust audio parameters

The `sample_rate` must match your audio files. If your audio is 16 kHz, set `sample_rate: 16000` and adjust `f_max` accordingly (typically `sample_rate / 2`).
Common configurations:

For 22.05 kHz audio (LJSpeech standard):

```yaml
sample_rate: 22050
hop_length: 256
win_length: 1024
f_max: 8000
```

For 16 kHz audio:

```yaml
sample_rate: 16000
hop_length: 200
win_length: 800
f_max: 8000
```

For 24 kHz audio:

```yaml
sample_rate: 24000
hop_length: 300
win_length: 1200
f_max: 12000
```
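A pattern worth noting in the presets above: the hop and window lengths are chosen so that the hop stays near 12 ms and the window near 46-50 ms regardless of sample rate. A quick sanity check you can adapt when targeting a rate not listed here:

```python
# The presets keep the hop near ~12 ms and the window near ~46-50 ms
# across sample rates; values are (hop_length, win_length) per rate.
configs = {
    22050: (256, 1024),
    16000: (200, 800),
    24000: (300, 1200),
}

for sr, (hop, win) in configs.items():
    hop_ms = 1000 * hop / sr
    win_ms = 1000 * win / sr
    print(f"{sr} Hz: hop {hop_ms:.2f} ms, window {win_ms:.2f} ms")
    assert 11.0 <= hop_ms <= 13.0, "hop should stay near ~12 ms"
    assert 45.0 <= win_ms <= 51.0, "window should stay near ~46-50 ms"
```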
## Computing Dataset Statistics
Normalization statistics are critical for training stability.
### Run statistics computation

Use the `matcha-data-stats` command:

```bash
matcha-data-stats -i my_dataset.yaml
```
This computes the mean and standard deviation of mel-spectrograms across your entire training set.

### Update configuration

The command outputs statistics like:

```
{'mel_mean': -5.234567, 'mel_std': 2.345678}
```

Update `configs/data/my_dataset.yaml`:

```yaml
data_statistics:
  mel_mean: -5.234567
  mel_std: 2.345678
```
### Verify computation

The computed values typically fall in these ranges:

- `mel_mean`: between -7 and -4
- `mel_std`: between 1.5 and 3.0
If your values fall far outside these ranges, verify:
- Audio files are correctly formatted
- Sample rate matches configuration
- File paths in filelists are correct
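Conceptually, the statistics are one global mean and standard deviation aggregated over every mel value in the training set. A pure-Python sketch of that aggregation, with random Gaussian arrays standing in for real mel-spectrograms (so the exact numbers are illustrative only):

```python
import math
import random

# Stand-ins for per-utterance mel-spectrograms (flattened frame values);
# real values come from the actual training audio.
random.seed(0)
mels = [[random.gauss(-5.0, 2.0) for _ in range(500)] for _ in range(20)]

# One global aggregation over every value in every utterance.
total, total_sq, count = 0.0, 0.0, 0
for mel in mels:
    for v in mel:
        total += v
        total_sq += v * v
        count += 1

mel_mean = total / count
mel_std = math.sqrt(total_sq / count - mel_mean**2)
print(f"mel_mean: {mel_mean:.4f}, mel_std: {mel_std:.4f}")
```

Because the same two values normalize every utterance, mismatched or stale statistics destabilize training — which is why they must be recomputed for each new dataset.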
## Creating an Experiment Configuration
Create an experiment config to easily run training:
### Create experiment file

Create `configs/experiment/my_dataset.yaml`:

```yaml
# @package _global_

defaults:
  - override /data: my_dataset.yaml

tags: ["my_dataset", "custom"]
run_name: my_dataset_experiment

# Optional: override any parameters
# data:
#   batch_size: 16
#
# model:
#   out_size: 172  # For memory-constrained training
```
### Train with experiment config

```bash
python matcha/train.py experiment=my_dataset
```
## Multi-Speaker Custom Dataset
For a custom multi-speaker dataset:
### Prepare multi-speaker filelists

Format: `path|speaker_id|text`

```
data/MyDataset/speaker001/audio001.wav|0|First speaker text
data/MyDataset/speaker002/audio001.wav|1|Second speaker text
data/MyDataset/speaker001/audio002.wav|0|More from first speaker
```
Speaker IDs should be:
- Consecutive integers starting from 0
- Consistent across train and validation sets
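If your corpus labels speakers by name or by non-contiguous IDs, remap them to consecutive integers starting from 0 using one shared mapping across both filelists. A small sketch (the paths and speaker names are hypothetical):

```python
# Build one speaker-name -> integer-ID mapping shared by all filelists,
# so IDs are consecutive from 0 and consistent across train/validation.
def build_speaker_map(filelists):
    speakers = set()
    for filelist in filelists:
        for line in filelist:
            _, speaker, _ = line.split("|", 2)
            speakers.add(speaker)
    return {spk: i for i, spk in enumerate(sorted(speakers))}

# Hypothetical filelist lines with speaker names instead of IDs.
train = ["a.wav|alice|hi there", "b.wav|bob|hey"]
val = ["c.wav|alice|hello again"]

spk_map = build_speaker_map([train, val])
remapped = [
    "|".join((path, str(spk_map[spk]), text))
    for path, spk, text in (line.split("|", 2) for line in train + val)
]
print(spk_map)   # {'alice': 0, 'bob': 1}
print(remapped)  # ['a.wav|0|hi there', 'b.wav|1|hey', 'c.wav|0|hello again']
```

Building the map from the union of all filelists guarantees a speaker appearing only in validation still gets an ID below `n_spks`.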
### Update config with speaker count

In `configs/data/my_dataset.yaml`:

```yaml
n_spks: 10  # Replace with your actual speaker count
```
### Configure speaker embeddings

In your experiment config, you can adjust the speaker embedding size:

```yaml
model:
  spk_emb_dim: 64  # Higher for more speakers (e.g., 128 for 100+ speakers)
```
## Using Custom Text Cleaners
If your dataset uses a different language or requires special preprocessing:
### Create custom cleaner

Add a new cleaner function in `matcha/text/cleaners.py`:

```python
def my_custom_cleaner(text):
    """Custom text cleaning pipeline."""
    text = text.lower()
    # Add your custom preprocessing
    return text
```
### Register the cleaner

Make sure the function is importable from the cleaners module so it can be referenced by name in your config.
### Use in config

Update `configs/data/my_dataset.yaml`:

```yaml
cleaners: [my_custom_cleaner]
```
## Example: Non-English Dataset

For a German-language dataset:
```yaml
# configs/data/german_dataset.yaml
_target_: matcha.data.text_mel_datamodule.TextMelDataModule
name: german_tts
train_filelist_path: data/GermanTTS/train.txt
valid_filelist_path: data/GermanTTS/val.txt
batch_size: 32
num_workers: 8
pin_memory: True
cleaners: [basic_cleaners]  # Use basic or create a German-specific cleaner
add_blank: True
n_spks: 1
n_fft: 1024
n_feats: 80
sample_rate: 22050
hop_length: 256
win_length: 1024
f_min: 0
f_max: 8000
data_statistics:
  mel_mean: -5.123456  # Computed from your dataset
  mel_std: 2.234567
seed: ${seed}
load_durations: false
```
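If `basic_cleaners` (lowercasing plus whitespace collapsing) is not enough, a German-specific cleaner might also expand common abbreviations while leaving umlauts and ß intact for the tokenizer. A hypothetical sketch — `german_cleaners` is not part of Matcha-TTS; you would add it to `matcha/text/cleaners.py` as described above:

```python
import re

# Hypothetical German cleaner: lowercase, expand a few common abbreviations,
# and collapse whitespace. Umlauts and ß are kept as-is.
_ABBREVIATIONS = [
    (re.compile(rf"\b{abbr}"), expansion)
    for abbr, expansion in [
        (r"z\.b\.", "zum beispiel"),
        (r"dr\.", "doktor"),
        (r"usw\.", "und so weiter"),
    ]
]

def german_cleaners(text):
    text = text.lower()
    for pattern, expansion in _ABBREVIATIONS:
        text = pattern.sub(expansion, text)
    return re.sub(r"\s+", " ", text).strip()

print(german_cleaners("Dr.  Müller wohnt z.B. in Köln."))
# → doktor müller wohnt zum beispiel in köln.
```

The config would then reference it as `cleaners: [german_cleaners]`.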
## Dealing with Long Audio Files
Very long audio files (longer than roughly 10 seconds) can cause out-of-memory issues during training, so consider splitting them into shorter segments:
- Split long files into 2-10 second segments
- Update filelists to reference the segmented files
- Adjust batch size based on average audio length
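The splitting step can be sketched with a minimal fixed-length splitter using only the standard-library `wave` module — an illustration only; in practice you would cut on silences and re-align transcripts to the segments:

```python
import wave

def split_wav(path, out_prefix, segment_seconds=5.0):
    """Cut a WAV file into fixed-length segments; returns the output paths."""
    outputs = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_segment = int(params.framerate * segment_seconds)
        index = 0
        while True:
            frames = src.readframes(frames_per_segment)
            if not frames:
                break
            out_path = f"{out_prefix}_{index:03d}.wav"
            with wave.open(out_path, "wb") as dst:
                dst.setparams(params)  # header is patched to the true length on close
                dst.writeframes(frames)
            outputs.append(out_path)
            index += 1
    return outputs

# Demo: a 12-second silent mono file at 22.05 kHz splits into 5 s + 5 s + 2 s.
with wave.open("long.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(22050)
    w.writeframes(b"\x00\x00" * 22050 * 12)

print(split_wav("long.wav", "long_seg"))
```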
## Small Dataset Considerations
For small datasets (less than 1000 utterances):
- Reduce batch size to increase gradient updates per epoch
- Enable pre-computed durations to help the model learn alignments
- Consider augmentation techniques
- Monitor for overfitting
For example, in your experiment config:

```yaml
data:
  batch_size: 8  # Smaller batch for more frequent updates

trainer:
  max_epochs: 2000  # More epochs for small datasets
```
## Validation Tips
### Verify file paths

Check that all audio files referenced in the filelists exist:

```bash
# Quick validation script
while IFS='|' read -r path text; do
  if [ ! -f "$path" ]; then
    echo "Missing: $path"
  fi
done < data/MyDataset/train.txt
```
### Check audio format

Verify sample rate consistency:

```bash
# Check sample rates
soxi -r data/MyDataset/wavs/*.wav | sort -u
```

The output should contain exactly one sample rate.

### Test data loading
Before full training, test that data loads correctly:

```python
from hydra import compose, initialize

from matcha.data.text_mel_datamodule import TextMelDataModule

with initialize(version_base="1.3", config_path="../configs/data"):
    cfg = compose(config_name="my_dataset.yaml")

dm = TextMelDataModule(**cfg)
dm.setup()
batch = next(iter(dm.train_dataloader()))
print(f"Batch shape: {batch['x'].shape}, {batch['y'].shape}")
```
Complete Example Workflow
# 1. Prepare your dataset structure
mkdir -p data/MyDataset/wavs
# ... copy audio files and create filelists ...
# 2. Create dataset config
cp configs/data/ljspeech.yaml configs/data/my_dataset.yaml
# ... edit my_dataset.yaml with your paths and parameters ...
# 3. Compute statistics
matcha-data-stats -i my_dataset.yaml
# 4. Update config with statistics
# ... edit configs/data/my_dataset.yaml ...
# 5. Create experiment config
cp configs/experiment/ljspeech.yaml configs/experiment/my_dataset.yaml
# ... edit to reference your dataset ...
# 6. Start training
python matcha/train.py experiment=my_dataset
## Next Steps