## Prerequisites
Before starting, ensure you have:

- Prepared dataset in the correct format
- Installed Matcha-TTS from source
- GPU with sufficient VRAM (8GB minimum, 16GB+ recommended)
## Installation for Training
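A from-source install follows the usual clone-and-editable-install pattern (repository URL from the upstream Matcha-TTS project; adjust if you work from a fork):

```shell
# Clone the repository and install in editable mode so local changes take effect
git clone https://github.com/shivammehta25/Matcha-TTS.git
cd Matcha-TTS
pip install -e .
```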
## Training on LJSpeech
This example demonstrates training on the LJSpeech dataset.

### Prepare dataset
Download and prepare LJSpeech as described in the Dataset Preparation guide. Your structure should be:
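Assuming the standard LJSpeech 1.1 release, the layout looks roughly like this (the parent directory and any filelists depend on the Dataset Preparation guide):

```
data/LJSpeech-1.1/
├── metadata.csv      # transcript metadata
└── wavs/
    ├── LJ001-0001.wav
    ├── LJ001-0002.wav
    └── ...
```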
### Generate normalization statistics
Compute the mel-spectrogram mean and standard deviation for your dataset. These statistics are used to normalize mel-spectrograms during training.
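A sketch using the `matcha-data-stats` utility that ships with Matcha-TTS (the config filename is assumed to match your data configuration):

```shell
# Compute mel mean/std over the training set defined in the data config
matcha-data-stats -i ljspeech.yaml
# Prints a dictionary containing 'mel_mean' and 'mel_std'; copy these values
# into the data_statistics section of your data configuration
```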
## Training Commands

### Basic Training
Train with the default settings for your experiment.

### Multi-GPU Training
Train on multiple GPUs (e.g., GPUs 0 and 1) by specifying the device list.

### Minimum Memory Mode
For systems with limited GPU memory, use the minimum memory configuration, which reduces the training segment size (`out_size: 172`) to fit in smaller GPUs.
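Typical invocations for the three modes above, assuming the repository's Hydra-based `matcha/train.py` entry point and the experiment names used in the upstream README:

```shell
# Basic training with the LJSpeech experiment configuration
python matcha/train.py experiment=ljspeech

# Multi-GPU training on GPUs 0 and 1
python matcha/train.py experiment=ljspeech trainer.devices=[0,1]

# Minimum-memory variant (reduced segment size, out_size: 172)
python matcha/train.py experiment=ljspeech_min_memory
```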
### Resume from Checkpoint
Training can be resumed from a saved checkpoint.

### Override Configuration
You can override any configuration parameter from the command line.

## Training Configuration
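Both patterns from the two subsections above use Hydra's `key=value` override syntax; a sketch (the checkpoint path and parameter names are assumptions):

```shell
# Resume from a saved checkpoint
python matcha/train.py experiment=ljspeech ckpt_path=logs/train/runs/<timestamp>/checkpoints/last.ckpt

# Override arbitrary configuration values
python matcha/train.py experiment=ljspeech trainer.max_epochs=2000 data.batch_size=16
```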
### Key Hyperparameters
The most important hyperparameters are defined in the experiment, model, and data configuration files.

### Training Duration
Training typically requires:

- LJSpeech (single-speaker): ~200-300k steps for good quality
- VCTK (multi-speaker): ~500k+ steps
- Time: Several days on a single GPU
## Monitoring Training

### TensorBoard
By default, training logs are saved with TensorBoard, including:

- Training/validation loss
- Mel-spectrogram predictions
- Attention alignments
- Duration predictions
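To view them, point TensorBoard at the run directory (path assumed to match the checkpoint layout described below):

```shell
tensorboard --logdir logs/train/runs/
```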
### Checkpoints
Model checkpoints are saved in `logs/train/runs/<timestamp>/checkpoints/`:
- `last.ckpt` - most recent checkpoint
- `epoch_*.ckpt` - periodic checkpoints
- Best checkpoint based on validation loss
## Multi-Speaker Training
To train a multi-speaker model, use a multi-speaker dataset (e.g., VCTK) with its corresponding experiment configuration.

## Training with Pre-computed Durations
For faster convergence, you can train using pre-extracted phoneme durations.

### Extract durations
First, train a base model, then extract durations as described in the Duration Extraction guide.
## Troubleshooting

### Out of Memory (OOM) Errors
If you encounter OOM errors:

- Reduce the batch size
- Use the minimum memory configuration
- Reduce the number of workers
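Each mitigation can be expressed as a Hydra override (the `data.batch_size` and `data.num_workers` paths are assumptions based on the project's data configuration):

```shell
# Smaller batches
python matcha/train.py experiment=ljspeech data.batch_size=16

# Minimum-memory experiment configuration
python matcha/train.py experiment=ljspeech_min_memory

# Fewer data-loading workers
python matcha/train.py experiment=ljspeech data.num_workers=2
```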
### Slow Training
- Increase `num_workers` for faster data loading
- Enable mixed precision training (enabled by default with Lightning)
- Use multiple GPUs
### Poor Audio Quality
- Verify dataset statistics are correct
- Check audio preprocessing (sample rate, normalization)
- Train for more steps
- Validate data quality and transcription accuracy
## Synthesis from Trained Model
Once training is complete, you can synthesize speech from the saved checkpoint.

## Next Steps
- Extract phoneme durations for improved training
- Configure custom datasets for your specific needs
- Learn about configuration options in detail