Overview
Matcha-TTS provides a powerful command-line interface for text-to-speech synthesis. The CLI supports both single-speaker and multi-speaker models, batch processing, and extensive customization of synthesis parameters.

Basic Usage
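A minimal invocation might look like the following. This assumes the package installs a `matcha-tts` entry point (the command name is not stated in this section, so treat it as an assumption):

```shell
# Synthesize a single sentence with the default model and vocoder
matcha-tts --text "The quick brown fox jumps over the lazy dog"
```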
Available Models
Matcha-TTS comes with two pre-trained models:
- matcha_ljspeech: Single-speaker model trained on the LJ Speech dataset
- matcha_vctk: Multi-speaker model trained on VCTK dataset (108 speakers, IDs 0-107)
CLI Arguments
Model Selection
Model to use for synthesis. Choices: matcha_ljspeech, matcha_vctk. The model determines the voice quality and available speakers.

Path to a custom model checkpoint. Use this to load your own trained models. When using a custom checkpoint, consider using --vocoder hifigan_univ_v1 for best compatibility.

Vocoder to use for waveform generation. Choices: hifigan_T2_v1, hifigan_univ_v1. The default vocoder is selected automatically based on the model:
- matcha_ljspeech → hifigan_T2_v1
- matcha_vctk → hifigan_univ_v1
Input
Text to synthesize. Either --text or --file must be provided.

Example: --text "The quick brown fox jumps over the lazy dog"

Path to a text file with sentences to synthesize (one per line). When using --file, consider using --batched for faster processing.

Speaker Control
Speaker ID for multi-speaker models.
- For matcha_ljspeech: Not applicable (single speaker)
- For matcha_vctk: Valid range is 0-107

Example: --spk 42

Synthesis Parameters
Variance of the noise distribution (controls prosody variation).
- Higher values (e.g., 1.0): More expressive but less stable
- Lower values (e.g., 0.3): More stable but less expressive
- Must be ≥ 0
Controls speech pace. Higher values produce slower speech. Default values:
- matcha_ljspeech: 0.95
- matcha_vctk: 0.85
Number of ODE (Ordinary Differential Equation) steps for synthesis.
- More steps: Higher quality, slower synthesis
- Fewer steps: Faster synthesis, slight quality reduction
- Recommended range: 4-50
- Must be > 0
Example: --steps 20 for higher quality

Vocoder Options
Strength of the vocoder bias denoiser. Higher values reduce background noise but may affect audio quality.
Output Options
Directory where generated audio files will be saved.

Example: --output_folder ./outputs

Performance Options
Enable batched inference for processing multiple texts. Use this flag when processing files with multiple sentences for a significant speedup.
Batch size for batched inference (only used with --batched). Larger batches are faster but require more memory.

Force CPU inference even if a GPU is available. By default, the GPU is used if one is available.
Examples
Single Speaker Synthesis
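A sketch of single-speaker synthesis, assuming a `matcha-tts` entry point and a `--model` flag for the model-selection argument described above (neither name is spelled out in this section):

```shell
# Single-speaker synthesis with the LJ Speech model,
# explicitly selecting its matching vocoder
matcha-tts --text "Hello world" --model matcha_ljspeech --vocoder hifigan_T2_v1
```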
Multi-Speaker Synthesis
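For the multi-speaker VCTK model, a speaker ID in the range 0-107 is passed with --spk. The `matcha-tts` command name and `--model` flag are assumptions, as above:

```shell
# Multi-speaker synthesis with the VCTK model, speaker ID 42
matcha-tts --text "Hello world" --model matcha_vctk --spk 42
```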
Batch Processing
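A sketch of batch processing from a file, combining --file with --batched for the speedup noted above. The `matcha-tts` command name, the `--batch_size` flag name, and the `sentences.txt` filename are assumptions for illustration:

```shell
# Synthesize every line of sentences.txt using batched inference
matcha-tts --file sentences.txt --batched --batch_size 32 --output_folder ./outputs
```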
Custom Models
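Loading a custom checkpoint might look like this. The `--checkpoint_path` flag name and the checkpoint path are assumptions (this section only says a checkpoint path can be supplied); the hifigan_univ_v1 vocoder is recommended above for custom checkpoints:

```shell
# Load a custom trained checkpoint with the universal vocoder
matcha-tts --text "Hello world" \
  --checkpoint_path checkpoints/my_model.ckpt \
  --vocoder hifigan_univ_v1
```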
Output Files
The CLI generates three files per utterance:
- .wav: Audio file (22050 Hz, PCM_24)
- .npy: Mel-spectrogram (NumPy format)
- .png: Mel-spectrogram visualization
Output files are named sequentially:
- Single speaker: utterance_001.wav, utterance_002.wav, …
- Multi-speaker: utterance_001_speaker_042.wav, …
Performance Metrics
The CLI reports Real-Time Factor (RTF) metrics:
- Matcha-TTS RTF: Mel-spectrogram generation speed
- Matcha-TTS + VOCODER RTF: End-to-end synthesis speed
Tips
- Start with default parameters and adjust based on your needs
- Use --batched for files with multiple sentences (10x+ speedup)
- Try different --steps values (2, 4, 10, 50) to balance quality and speed
- Use --temperature 0.667 as a good default for natural prosody
- For custom models, the hifigan_univ_v1 vocoder usually works best