This guide walks you through the complete process of training a Matcha-TTS model from scratch.

Prerequisites

Before starting, ensure you have:
  • Prepared dataset in the correct format
  • Installed Matcha-TTS from source
  • GPU with sufficient VRAM (8GB minimum, 16GB+ recommended)

Installation for Training

1. Clone the repository

git clone https://github.com/shivammehta25/Matcha-TTS.git
cd Matcha-TTS

2. Create environment

conda create -n matcha-tts python=3.10 -y
conda activate matcha-tts

3. Install from source

pip install -e .

Training on LJSpeech

This example demonstrates training on the LJSpeech dataset.

1. Prepare dataset

Download and prepare LJSpeech as described in the Dataset Preparation guide. Your structure should be:
data/LJSpeech-1.1/
├── train.txt
├── val.txt
└── wavs/
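Before training, it can help to sanity-check the filelists. A minimal sketch, assuming the standard pipe-delimited `path|text` layout (the `validate_filelist` helper is illustrative, not part of Matcha-TTS):

```python
from pathlib import Path

def validate_filelist(path: str, n_fields: int = 2) -> list[str]:
    """Return descriptions of malformed lines in a pipe-delimited filelist."""
    bad = []
    for i, line in enumerate(Path(path).read_text(encoding="utf-8").splitlines(), start=1):
        parts = line.split("|")
        if len(parts) != n_fields or not parts[0].endswith(".wav"):
            bad.append(f"line {i}: {line!r}")
    return bad
```

For multi-speaker filelists (which add a speaker-ID field), pass n_fields=3.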

2. Update dataset configuration

Edit configs/data/ljspeech.yaml to point to your filelist paths:
train_filelist_path: data/LJSpeech-1.1/train.txt
valid_filelist_path: data/LJSpeech-1.1/val.txt

3. Generate normalization statistics

Compute mel-spectrogram mean and standard deviation for your dataset:
matcha-data-stats -i ljspeech.yaml
This outputs:
{'mel_mean': -5.53662231756592, 'mel_std': 2.1161014277038574}
This step is crucial for training stability. The statistics are used to normalize mel-spectrograms during training.
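The normalization itself is a simple shift-and-scale. The sketch below is illustrative only (Matcha-TTS applies this internally from the data_statistics values in your config):

```python
import numpy as np

# Values produced by matcha-data-stats for LJSpeech.
MEL_MEAN = -5.536622
MEL_STD = 2.116101

def normalize(mel: np.ndarray) -> np.ndarray:
    """Shift and scale mel frames toward zero mean and unit variance."""
    return (mel - MEL_MEAN) / MEL_STD

def denormalize(mel_norm: np.ndarray) -> np.ndarray:
    """Invert the normalization, e.g. before vocoding."""
    return mel_norm * MEL_STD + MEL_MEAN
```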

4. Update statistics in config

Add the computed statistics to configs/data/ljspeech.yaml:
data_statistics:
  mel_mean: -5.536622
  mel_std: 2.116101

5. Start training

Run training with the LJSpeech experiment configuration:
python matcha/train.py experiment=ljspeech
Or using the Makefile:
make train-ljspeech

Training Commands

Basic Training

Train with default settings:
python matcha/train.py experiment=ljspeech

Multi-GPU Training

Train on multiple GPUs (e.g., GPUs 0 and 1):
python matcha/train.py experiment=ljspeech trainer.devices=[0,1]

Minimum Memory Mode

For systems with limited GPU memory:
python matcha/train.py experiment=ljspeech_min_memory
This configuration reduces GPU memory use by training the decoder on truncated mel segments (out_size: 172), so it fits on GPUs with less VRAM.

Resume from Checkpoint

Resume training from a saved checkpoint:
python matcha/train.py experiment=ljspeech ckpt_path=/path/to/checkpoint.ckpt

Override Configuration

You can override any configuration parameter from the command line:
# Change batch size
python matcha/train.py experiment=ljspeech data.batch_size=16

# Change learning rate
python matcha/train.py experiment=ljspeech model.optimizer.lr=0.0001

# Multiple overrides
python matcha/train.py experiment=ljspeech \
  data.batch_size=16 \
  trainer.max_epochs=1000 \
  trainer.devices=[0]

Training Configuration

Key Hyperparameters

Important parameters in the training configuration:
# Data configuration (configs/data/ljspeech.yaml)
batch_size: 32          # Batch size per GPU
num_workers: 20         # Data loading workers

# Audio parameters
n_fft: 1024            # FFT size
n_feats: 80            # Mel channels
sample_rate: 22050     # Audio sample rate
hop_length: 256        # Hop length for STFT
win_length: 1024       # Window length for STFT

# Model configuration (configs/model/matcha.yaml)
n_vocab: 178           # Vocabulary size
n_spks: 1              # Number of speakers (1 for single-speaker)
spk_emb_dim: 64        # Speaker embedding dimension
out_size: null         # Decoder training segment length (null = full spectrogram)
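The audio parameters above determine how many mel frames the model must generate per second of audio, which in turn drives sequence lengths and memory use. A quick back-of-the-envelope check:

```python
sample_rate = 22050  # Hz
hop_length = 256     # samples advanced per mel frame

frames_per_second = sample_rate / hop_length    # ~86.1 frames/s
ms_per_frame = 1000 * hop_length / sample_rate  # ~11.6 ms/frame

# A 10-second clip therefore yields roughly 861 mel frames of 80 channels each.
```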

Training Duration

Training typically requires:
  • LJSpeech (single-speaker): ~200-300k steps for good quality
  • VCTK (multi-speaker): ~500k+ steps
  • Time: Several days on a single GPU

Monitoring Training

TensorBoard

By default, training logs are saved with TensorBoard:
tensorboard --logdir logs/
Monitor:
  • Training/validation loss
  • Mel-spectrogram predictions
  • Attention alignments
  • Duration predictions

Checkpoints

Model checkpoints are saved in logs/train/runs/<timestamp>/checkpoints/:
  • last.ckpt - Most recent checkpoint
  • epoch_*.ckpt - Periodic checkpoints
  • The best checkpoint, selected by validation loss

Multi-Speaker Training

To train a multi-speaker model (e.g., VCTK):

1. Prepare multi-speaker dataset

Ensure your filelist includes speaker IDs:
data/vctk/wav48/p225/p225_001.wav|0|Transcription here
data/vctk/wav48/p226/p226_001.wav|1|Transcription here
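Each line follows a `wav_path|speaker_id|text` layout, with integer speaker IDs (0-indexed in the example above). A minimal parser sketch, illustrative only; Matcha-TTS's own data loader handles this:

```python
def parse_multispeaker_line(line: str) -> tuple[str, int, str]:
    """Split a 'wav_path|speaker_id|text' filelist line into its fields.

    maxsplit=2 keeps any '|' characters inside the transcription intact.
    """
    wav_path, spk_id, text = line.strip().split("|", maxsplit=2)
    return wav_path, int(spk_id), text
```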

2. Update dataset config

Set the number of speakers in configs/data/vctk.yaml:
n_spks: 109  # Total number of unique speakers

3. Generate statistics

matcha-data-stats -i vctk.yaml

4. Train

python matcha/train.py experiment=multispeaker

Training with Pre-computed Durations

For faster convergence, you can train using pre-extracted phoneme durations:

1. Extract durations

First, train a base model, then extract durations as described in the Duration Extraction guide.

2. Enable duration loading

Set load_durations: True in your experiment config:
data:
  load_durations: True
  batch_size: 64  # Can use larger batch size

3. Train

python matcha/train.py experiment=ljspeech_from_durations

Troubleshooting

Out of Memory (OOM) Errors

If you encounter OOM errors:
  1. Reduce batch size:
    python matcha/train.py experiment=ljspeech data.batch_size=16
    
  2. Use minimum memory configuration:
    python matcha/train.py experiment=ljspeech_min_memory
    
  3. Reduce number of workers:
    python matcha/train.py experiment=ljspeech data.num_workers=4
    

Slow Training

  • Increase num_workers for faster data loading
  • Enable mixed precision training (enabled by default with Lightning)
  • Use multiple GPUs

Poor Audio Quality

  • Verify dataset statistics are correct
  • Check audio preprocessing (sample rate, normalization)
  • Train for more steps
  • Validate data quality and transcription accuracy

Synthesis from Trained Model

Once training is complete, synthesize speech from your model:
matcha-tts --text "Hello, this is a test." \
  --checkpoint_path logs/train/runs/<timestamp>/checkpoints/epoch_100.ckpt
For more synthesis options, see the main Inference documentation.

Next Steps
