## Prerequisites
Before starting, ensure you have:

- Prepared dataset in the correct format
- Installed Matcha-TTS from source
- GPU with sufficient VRAM (8GB minimum, 16GB+ recommended)
## Installation for Training
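A from-source install follows the usual clone-and-editable-install pattern (repository URL from the upstream Matcha-TTS project; adjust if you work from a fork):

```shell
# Clone the repository and install in editable mode so local changes take effect
git clone https://github.com/shivammehta25/Matcha-TTS.git
cd Matcha-TTS
pip install -e .
```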
## Training on LJSpeech
This example demonstrates training on the LJSpeech dataset.

### Prepare dataset
Download and prepare LJSpeech as described in the Dataset Preparation guide. Your structure should be:
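Assuming the standard LJSpeech 1.1 release, the layout looks roughly like this (the parent directory and any filelists depend on the Dataset Preparation guide):

```
data/LJSpeech-1.1/
├── metadata.csv      # transcript metadata
└── wavs/
    ├── LJ001-0001.wav
    ├── LJ001-0002.wav
    └── ...
```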
### Generate normalization statistics
Compute the mel-spectrogram mean and standard deviation for your dataset. These statistics are used to normalize mel-spectrograms during training.
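A sketch using the `matcha-data-stats` utility that ships with Matcha-TTS (the config filename is assumed to match your data configuration):

```shell
# Compute mel mean/std over the training set defined in the data config
matcha-data-stats -i ljspeech.yaml
# Prints a dictionary containing 'mel_mean' and 'mel_std'; copy these values
# into the data_statistics section of your data configuration
```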
## Training Commands

### Basic Training
Train with the default settings for your experiment.

### Multi-GPU Training
Train on multiple GPUs (e.g., GPUs 0 and 1) by specifying the device list.

### Minimum Memory Mode
For systems with limited GPU memory, use the minimum memory configuration, which reduces the training segment size (`out_size: 172`) to fit in smaller GPUs.
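Typical invocations for the three modes above, assuming the repository's Hydra-based `matcha/train.py` entry point and the experiment names used in the upstream README:

```shell
# Basic training with the LJSpeech experiment configuration
python matcha/train.py experiment=ljspeech

# Multi-GPU training on GPUs 0 and 1
python matcha/train.py experiment=ljspeech trainer.devices=[0,1]

# Minimum-memory variant (reduced segment size, out_size: 172)
python matcha/train.py experiment=ljspeech_min_memory
```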
### Resume from Checkpoint
Training can be resumed from a saved checkpoint.

### Override Configuration
You can override any configuration parameter from the command line.

## Training Configuration
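Both patterns from the two subsections above use Hydra's `key=value` override syntax; a sketch (the checkpoint path and parameter names are assumptions):

```shell
# Resume from a saved checkpoint
python matcha/train.py experiment=ljspeech ckpt_path=logs/train/runs/<timestamp>/checkpoints/last.ckpt

# Override arbitrary configuration values
python matcha/train.py experiment=ljspeech trainer.max_epochs=2000 data.batch_size=16
```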
### Key Hyperparameters
The most important hyperparameters are defined in the experiment, model, and data configuration files.

### Training Duration
Training typically requires:

- LJSpeech (single-speaker): ~200-300k steps for good quality
- VCTK (multi-speaker): ~500k+ steps
- Time: Several days on a single GPU
## Monitoring Training

### TensorBoard
By default, training logs are saved with TensorBoard, including:

- Training/validation loss
- Mel-spectrogram predictions
- Attention alignments
- Duration predictions
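To view them, point TensorBoard at the run directory (path assumed to match the checkpoint layout described below):

```shell
tensorboard --logdir logs/train/runs/
```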
### Checkpoints
Model checkpoints are saved in `logs/train/runs/<timestamp>/checkpoints/`:
- `last.ckpt` - most recent checkpoint
- `epoch_*.ckpt` - periodic checkpoints
- Best checkpoint based on validation loss
## Multi-Speaker Training
To train a multi-speaker model, use a multi-speaker dataset (e.g., VCTK) with its corresponding experiment configuration.

## Training with Pre-computed Durations
For faster convergence, you can train using pre-extracted phoneme durations.

### Extract durations
First, train a base model, then extract durations as described in the Duration Extraction guide.
## Troubleshooting

### Out of Memory (OOM) Errors
If you encounter OOM errors:

- Reduce the batch size
- Use the minimum memory configuration
- Reduce the number of workers
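Each mitigation can be expressed as a Hydra override (the `data.batch_size` and `data.num_workers` paths are assumptions based on the project's data configuration):

```shell
# Smaller batches
python matcha/train.py experiment=ljspeech data.batch_size=16

# Minimum-memory experiment configuration
python matcha/train.py experiment=ljspeech_min_memory

# Fewer data-loading workers
python matcha/train.py experiment=ljspeech data.num_workers=2
```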
### Slow Training
- Increase `num_workers` for faster data loading
- Enable mixed precision training (enabled by default with Lightning)
- Use multiple GPUs
### Poor Audio Quality
- Verify dataset statistics are correct
- Check audio preprocessing (sample rate, normalization)
- Train for more steps
- Validate data quality and transcription accuracy
## Synthesis from Trained Model
Once training is complete, you can synthesize speech from the saved checkpoint.

## Next Steps
- Extract phoneme durations for improved training
- Configure custom datasets for your specific needs
- Learn about configuration options in detail