Dataset Structure
Matcha-TTS follows the LJSpeech dataset format: a directory of audio files plus filelists that map each audio file to its transcription.
Filelist Format
Single-Speaker Datasets
For single-speaker datasets (like LJSpeech), create filelist files (for example, train.txt) in which each line contains:
- Audio file path (relative or absolute)
- Pipe separator (|)
- Text transcription
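As a rough illustration of the single-speaker layout, the sketch below parses one filelist line into its two fields. The `Utterance` type, the function name, and the example path and sentence are hypothetical placeholders, not part of Matcha-TTS.

```python
from typing import NamedTuple


class Utterance(NamedTuple):
    audio_path: str
    text: str


def parse_single_speaker_line(line: str) -> Utterance:
    """Split one filelist line of the form 'path|text'."""
    # Split on the first pipe only, so a stray pipe in the text stays in the text.
    audio_path, sep, text = line.rstrip("\n").partition("|")
    if not sep or not audio_path.strip() or not text.strip():
        raise ValueError(f"Malformed filelist line: {line!r}")
    return Utterance(audio_path.strip(), text.strip())


# Hypothetical entry; the path and sentence are placeholders.
example = "data/my_dataset/wavs/utt_0001.wav|Hello world, this is a test."
print(parse_single_speaker_line(example))
```

Rejecting lines without a separator or with an empty field catches the most common filelist mistakes before training starts.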
Multi-Speaker Datasets
For multi-speaker datasets (like VCTK), add a speaker ID between the path and text:
- Audio file path
- Pipe separator (|)
- Speaker ID (integer)
- Pipe separator (|)
- Text transcription
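The three-field layout can be sketched the same way. This is a minimal example under the assumption that speaker IDs are plain integers as stated above; the `SpeakerUtterance` type, the function name, and the sample line are hypothetical.

```python
from typing import NamedTuple


class SpeakerUtterance(NamedTuple):
    audio_path: str
    speaker_id: int
    text: str


def parse_multi_speaker_line(line: str) -> SpeakerUtterance:
    """Split one filelist line of the form 'path|speaker_id|text'."""
    # maxsplit=2 keeps any further pipes inside the transcription.
    parts = line.rstrip("\n").split("|", 2)
    if len(parts) != 3:
        raise ValueError(f"Expected 3 pipe-separated fields: {line!r}")
    audio_path, speaker, text = (p.strip() for p in parts)
    # int() raises ValueError if the speaker ID is not an integer.
    return SpeakerUtterance(audio_path, int(speaker), text)


# Hypothetical entry; the path, speaker ID, and sentence are placeholders.
example = "data/my_corpus/wavs/spk3_0001.wav|3|Please call Stella."
print(parse_multi_speaker_line(example))
```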
Audio Requirements
Your audio files must meet these specifications:
- Format: WAV (uncompressed PCM)
- Sample Rate: 22050 Hz (configurable)
- Channels: Mono (single channel)
- Bit Depth: 16-bit recommended
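These requirements can be checked with Python's standard `wave` module. The sketch below is one possible check, not a Matcha-TTS utility; the function name `check_wav` and the default of 22050 Hz are assumptions taken from the list above, so adjust the rate if your configuration differs.

```python
import wave


def check_wav(path, expected_rate=22050):
    """Return a list of problems found in one WAV file (empty if none)."""
    problems = []
    try:
        with wave.open(str(path), "rb") as wav:
            if wav.getframerate() != expected_rate:
                problems.append(
                    f"sample rate is {wav.getframerate()} Hz, expected {expected_rate} Hz"
                )
            if wav.getnchannels() != 1:
                problems.append(f"{wav.getnchannels()} channels, expected mono")
            # 16-bit is a recommendation rather than a hard requirement.
            if wav.getsampwidth() != 2:
                problems.append(
                    f"{8 * wav.getsampwidth()}-bit samples, 16-bit recommended"
                )
    except (wave.Error, EOFError) as exc:
        # The wave module only reads uncompressed PCM, so other encodings land here.
        problems.append(f"not a readable PCM WAV file: {exc}")
    return problems
```

Calling `check_wav` on every path in a filelist and printing the non-empty results gives a quick overview of which files need conversion.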
Preparing LJSpeech Dataset
Download the dataset
Download LJSpeech from the official website.
Extract the archive
Extract the archive into your data directory. This creates data/LJSpeech-1.1/ with the required structure.
Create train/val splits
Split metadata.csv into training and validation sets. The LJSpeech dataset contains 13,100 audio clips. You can use the NVIDIA Tacotron 2 repository’s file lists as described in item 5 of their setup guide, or create your own.
Preparing Custom Datasets
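If you build your own filelists from an LJSpeech-style metadata.csv (pipe-separated clip ID, raw text, normalized text, with no header row), the split might be sketched as follows. The function name `split_metadata`, the validation size of 100 clips, and the seed are illustrative assumptions, not values taken from Matcha-TTS.

```python
import random


def split_metadata(rows, wav_dir, val_size=100, seed=1234):
    """Turn metadata rows into shuffled (train, val) lists of filelist lines."""
    entries = []
    for row in rows:
        fields = row.rstrip("\n").split("|")
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or malformed rows
        # The last field is the normalized transcription in LJSpeech metadata.
        clip_id, text = fields[0], fields[-1]
        entries.append(f"{wav_dir}/{clip_id}.wav|{text}")
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(entries)
    return entries[val_size:], entries[:val_size]


# Usage sketch (paths are assumptions about your layout):
# with open("data/LJSpeech-1.1/metadata.csv", encoding="utf-8") as f:
#     train, val = split_metadata(f, "data/LJSpeech-1.1/wavs")
# with open("train.txt", "w", encoding="utf-8") as f:
#     f.write("\n".join(train) + "\n")
# with open("val.txt", "w", encoding="utf-8") as f:
#     f.write("\n".join(val) + "\n")
```

Because both lists are slices of one shuffled list, no clip can appear in both the training and validation sets.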
Create transcriptions
Create filelist files (train.txt, val.txt) with audio paths and transcriptions in the format described above.
Verify audio format
Ensure all audio files are:
- WAV format
- Same sample rate (e.g., 22050 Hz)
- Mono channel
Use sox or ffmpeg to convert any files that do not meet these requirements.
Text Preprocessing
Matcha-TTS includes several text cleaners for preprocessing:
- english_cleaners - Basic English text cleaning
- english_cleaners2 - Advanced English cleaning (recommended for LJSpeech)
The cleaners handle steps such as:
- Number expansion (“123” → “one hundred twenty three”)
- Abbreviation expansion (“Dr.” → “Doctor”)
- Lowercase conversion
- Punctuation normalization
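To make these steps concrete, here is a small standalone sketch of abbreviation expansion, lowercasing, and punctuation normalization. It is not the Matcha-TTS cleaner implementation: the function name `simple_clean` and the tiny abbreviation table are hypothetical, and number expansion is left out because it usually relies on a dedicated library.

```python
import re

# Small illustrative table; a real cleaner would cover many more abbreviations.
_ABBREVIATIONS = [
    (re.compile(r"\b%s\." % abbr, re.IGNORECASE), full)
    for abbr, full in [("dr", "doctor"), ("mr", "mister"), ("st", "saint")]
]

# Map curly quotes to their plain ASCII equivalents.
_PUNCT_MAP = str.maketrans(
    {"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"}
)


def simple_clean(text: str) -> str:
    """Expand a few abbreviations, lowercase, and normalize punctuation."""
    for pattern, replacement in _ABBREVIATIONS:
        text = pattern.sub(replacement, text)
    text = text.lower()
    text = text.translate(_PUNCT_MAP)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


print(simple_clean("Dr.  Smith met Mr. Jones."))
```

Abbreviations are expanded before lowercasing only for readability; the patterns are case-insensitive, so the order of those two steps does not affect the result here.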
Validation Checklist
Before training, verify:
- All audio files are in WAV format with the correct sample rate
- Filelist format matches your speaker configuration (single/multi-speaker)
- No missing audio files referenced in filelists
- Train/validation splits are created
- No overlap between train and validation sets
- All transcriptions are accurate
- File paths in filelists are correct (relative or absolute)
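Several items on this checklist can be automated. The following sketch checks field counts, missing audio files, and train/validation overlap. The function name `validate_filelists` and its parameters are hypothetical, and transcription accuracy still needs manual review.

```python
from pathlib import Path


def validate_filelists(train_path, val_path, n_fields=2, root="."):
    """Return a list of problems found in the two filelists.

    n_fields is 2 for single-speaker and 3 for multi-speaker filelists.
    Relative audio paths are resolved against root.
    """
    problems = []
    audio_sets = []
    for path in (train_path, val_path):
        audio_paths = set()
        with open(path, encoding="utf-8") as f:
            for lineno, line in enumerate(f, start=1):
                if not line.strip():
                    continue  # ignore blank lines
                fields = line.rstrip("\n").split("|")
                if len(fields) != n_fields:
                    problems.append(
                        f"{path}:{lineno}: expected {n_fields} fields, got {len(fields)}"
                    )
                    continue
                audio = fields[0]
                if not (Path(root) / audio).is_file():
                    problems.append(f"{path}:{lineno}: missing audio file {audio}")
                audio_paths.add(audio)
        audio_sets.append(audio_paths)
    # Any audio path listed in both files leaks validation data into training.
    for audio in sorted(audio_sets[0] & audio_sets[1]):
        problems.append(f"{audio} appears in both train and validation filelists")
    return problems
```

An empty return value means the automated checks passed; each string in a non-empty result points to a file and line to inspect.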
Next Steps
Once your dataset is prepared:
- Configure your dataset for training
- Follow the Training Guide to start training