## Overview
Matcha-TTS provides several command-line tools for text-to-speech synthesis, data preprocessing, and model utilities. All commands are available after installing the package.
## Commands

### matcha-tts
Main command for text-to-speech synthesis.

#### Options
- `--model`: Model to use for synthesis. Choices: `matcha_ljspeech`, `matcha_vctk`
- `--checkpoint_path`: Path to a custom model checkpoint (overrides `--model`)
- `--vocoder`: Vocoder to use (defaults to the recommended vocoder for the model). Choices: `hifigan_T2_v1`, `hifigan_univ_v1`
- `--text`: Text to synthesize (single utterance)
- `--file`: Path to a text file with utterances (one per line)
- `--spk`: Speaker ID for multi-speaker models (0-107 for `matcha_vctk`)
- `--temperature`: Variance of the noise; higher values give more diverse output. Range: 0.0-2.0
- `--speaking_rate`: Speaking-rate control; higher values give slower speech. Default: 1.0 for custom models, 0.95 for `matcha_ljspeech`, 0.85 for `matcha_vctk`
- `--steps`: Number of ODE solver steps; more steps improve quality at the cost of speed. Typical range: 4-20
- `--cpu`: Force CPU inference (default: use the GPU if available)
- `--denoiser_strength`: Strength of the vocoder bias denoiser
- `--output_folder`: Output folder for synthesized audio and mel-spectrograms
- `--batched`: Enable batched inference for processing multiple utterances
- `--batch_size`: Batch size for batched inference
#### Examples
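Typical invocations might look like the following. This is a sketch: the flag names are assumptions based on the option descriptions above, so verify them with `matcha-tts --help`.

```shell
# Single utterance with the default model (assumed flags)
matcha-tts --text "The quick brown fox jumped over the lazy dog."

# Multi-speaker model: speaker 10, utterances read from a file,
# batched inference enabled (assumed flags)
matcha-tts --model matcha_vctk --spk 10 --file utterances.txt --batched
```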
Single-speaker synthesis: pass the text to synthesize directly on the command line.

### matcha-data-stats
Compute mel-spectrogram statistics for dataset normalization.

#### Options
- Name of the YAML config file under `configs/data/`
- Batch size for computation (higher is faster)
- Force overwrite of an existing output file
#### Output

Creates a JSON file with mel-spectrogram statistics.

#### Example
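A sketch of a typical run. The flag names and JSON keys here are assumptions; the documentation above only guarantees that mel-spectrogram statistics are written as JSON.

```shell
# Compute statistics for an LJ Speech-style config (assumed flags)
matcha-data-stats -i ljspeech.yaml -b 256

# The resulting JSON holds the normalization statistics, e.g.:
# {"mel_mean": <float>, "mel_std": <float>}
```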
### matcha-tts-get-durations

Extract phoneme durations from a trained model using Monotonic Alignment Search.

#### Options
- Name of the YAML config file under `configs/data/`
- Path to the trained model checkpoint
- Batch size for processing
- Output folder for durations (defaults to `data_path/durations/`)
- Force overwrite of existing duration files
- Use CPU instead of GPU (not recommended)
#### Output

For each audio file, generates:

- `filename.npy`: NumPy array of phoneme durations
- `filename.json`: JSON with phoneme-duration pairs
#### Example
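A sketch of an invocation; the short flags and the checkpoint path are assumptions and placeholders, not confirmed by this page.

```shell
# Extract durations using a trained checkpoint (assumed flags;
# the checkpoint path is a placeholder)
matcha-tts-get-durations -i ljspeech.yaml -c path/to/checkpoint.ckpt -b 32
```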
### matcha-tts-app

Launch an interactive Gradio web interface for synthesis. Features:

- Select pre-trained models
- Choose speakers (for multi-speaker models)
- Adjust synthesis parameters
- Type or paste text
- Generate and play audio
#### Example
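The app takes no required arguments; Gradio prints a local URL to open in a browser (Gradio's default port is 7860).

```shell
# Launch the interactive web interface
matcha-tts-app
```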
## Output Files

### Audio Files

Generated as `utterance_XXX_speaker_YYY.wav` (or `utterance_XXX.wav` for single-speaker models):
- Format: WAV
- Sample rate: 22050 Hz
- Bit depth: 24-bit PCM
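Assuming `XXX` and `YYY` are zero-padded three-digit indices (an assumption based on the placeholder width), the naming scheme corresponds to:

```shell
# Build the output filename for utterance 3 spoken by speaker 7
utt=3
spk=7
printf 'utterance_%03d_speaker_%03d.wav\n' "$utt" "$spk"
# prints: utterance_003_speaker_007.wav
```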
### Mel-Spectrogram Files

Saved alongside the audio as:

- `utterance_XXX.npy`: NumPy array of the mel-spectrogram
- `utterance_XXX.png`: Visualization of the mel-spectrogram
## Environment Variables

Matcha-TTS uses the following directories:

- Model cache: `~/.local/share/matcha_tts/` (Linux/macOS) or `%LOCALAPPDATA%\matcha_tts\` (Windows). Downloaded models are cached here automatically.
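To resolve the cache location on Linux/macOS: the `XDG_DATA_HOME` fallback below is an assumption; the documented default is `~/.local/share/matcha_tts/`.

```shell
# Resolve the model cache directory (Linux/macOS).
# XDG_DATA_HOME handling is an assumption, not confirmed by this page.
cache_dir="${XDG_DATA_HOME:-$HOME/.local/share}/matcha_tts"
echo "$cache_dir"
```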
## Performance Tips

### Speed Optimization

- Reduce ODE steps: use `--steps 4` for faster synthesis (slight quality loss)
- Enable batched inference: use `--batched` for multiple utterances
- GPU acceleration: ensure CUDA is available (much faster than CPU)
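Combining these tips in one command (`--steps` and `--batched` are referenced on this page; `--file` and `--batch_size` are assumed flag names):

```shell
# Fast bulk synthesis: few ODE steps plus batched inference
matcha-tts --file utterances.txt --batched --batch_size 32 --steps 4
```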
### Quality Optimization

- Increase ODE steps: use `--steps 15` to `--steps 20` for better quality
- Adjust temperature: lower values (0.3-0.5) give more consistent output
- Fine-tune the speaking rate: adjust `--speaking_rate` for natural pacing
## Pretrained Models

### matcha_ljspeech
- Dataset: LJ Speech (single female speaker)
- Recommended vocoder: `hifigan_T2_v1`
- Default speaking rate: 0.95
- Language: English
### matcha_vctk
- Dataset: VCTK (108 speakers)
- Recommended vocoder: `hifigan_univ_v1`
- Default speaking rate: 0.85
- Speaker IDs: 0-107
- Language: English (various accents)
## Error Handling

### Common Issues
“Either text or file must be provided”: raised when `matcha-tts` is given neither a text string nor an input file; pass one or the other.

## Source Reference
- CLI implementation: `matcha/cli.py:208`
- Entry points: `setup.py:44`