Overview

Matcha-TTS provides several command-line tools for text-to-speech synthesis, data preprocessing, and model utilities. All commands are available after installing the package.

Installation

pip install matcha-tts

Commands

matcha-tts

Main command for text-to-speech synthesis.
matcha-tts [OPTIONS]

Options

--model
str
default:"matcha_ljspeech"
Model to use for synthesis. Choices: matcha_ljspeech, matcha_vctk
--checkpoint_path
str
default:"None"
Path to a custom model checkpoint (overrides --model)
--vocoder
str
default:"None"
Vocoder to use (defaults to the recommended vocoder for the model). Choices: hifigan_T2_v1, hifigan_univ_v1
--text
str
default:"None"
Text to synthesize (single utterance)
--file
str
default:"None"
Path to text file with utterances (one per line)
--spk
int
default:"None"
Speaker ID for multi-speaker models (0-107 for matcha_vctk)
--temperature
float
default:"0.667"
Sampling temperature (variance of the noise). Higher values give more diverse output. Range: 0.0-2.0
--speaking_rate
float
default:"None"
Speaking rate control. Higher = slower speech. Default: 1.0 for custom models, 0.95 for ljspeech, 0.85 for vctk
--steps
int
default:"10"
Number of ODE solver steps. More steps = better quality but slower. Typical range: 4-20
--cpu
flag
Force CPU inference (default: use GPU if available)
--denoiser_strength
float
default:"0.00025"
Strength of the vocoder bias denoiser
--output_folder
str
default:"."
Output folder to save synthesized audio and mel-spectrograms
--batched
flag
Enable batched inference for processing multiple utterances
--batch_size
int
default:"32"
Batch size for batched inference

Examples

Single speaker synthesis:
matcha-tts --text "Hello, this is a test of Matcha TTS." \
  --model matcha_ljspeech \
  --steps 10 \
  --output_folder ./outputs
Multi-speaker synthesis:
matcha-tts --text "Hello, how are you today?" \
  --model matcha_vctk \
  --spk 5 \
  --temperature 0.667 \
  --steps 10
Batch synthesis from file:
matcha-tts --file utterances.txt \
  --model matcha_ljspeech \
  --batched \
  --batch_size 16 \
  --output_folder ./batch_outputs
Custom model with specific settings:
matcha-tts --checkpoint_path ./my_model.ckpt \
  --text "Testing custom model." \
  --vocoder hifigan_univ_v1 \
  --speaking_rate 0.9 \
  --temperature 0.5 \
  --steps 15
CPU inference:
matcha-tts --text "Running on CPU." \
  --model matcha_ljspeech \
  --cpu

matcha-data-stats

Compute mel-spectrogram statistics for dataset normalization.
matcha-data-stats [OPTIONS]

Options

-i, --input-config
str
default:"vctk.yaml"
Name of the YAML config file under configs/data/
-b, --batch-size
int
default:"256"
Batch size for computation (higher = faster)
-f, --force
flag
Force overwrite existing output file

Output

Creates a JSON file with mel-spectrogram statistics:
{
  "mel_mean": -5.5345,
  "mel_std": 2.1234
}
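The statistics file can be consumed directly for normalization. A minimal sketch, assuming the JSON layout shown above and a hypothetical file name (the real path depends on your data config), applying the usual standard-score normalization:

```python
import json
import os
import tempfile

# Hypothetical path for illustration; matcha-data-stats writes the file
# for the dataset named in your config.
stats_path = os.path.join(tempfile.gettempdir(), "ljspeech_stats.json")

# Write the example payload from above so this sketch is self-contained.
with open(stats_path, "w") as f:
    json.dump({"mel_mean": -5.5345, "mel_std": 2.1234}, f)

with open(stats_path) as f:
    stats = json.load(f)

def normalize(mel_value: float) -> float:
    """Standard-score normalization of a mel-spectrogram value."""
    return (mel_value - stats["mel_mean"]) / stats["mel_std"]

print(round(normalize(-5.5345), 4))  # the mean normalizes to 0.0
```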

Example

matcha-data-stats -i ljspeech.yaml -b 512

matcha-tts-get-durations

Extract phoneme durations from a trained model using Monotonic Alignment Search.
matcha-tts-get-durations [OPTIONS]

Options

-i, --input-config
str
default:"ljspeech.yaml"
Name of the YAML config file under configs/data/
-c, --checkpoint_path
str
required
Path to the trained model checkpoint
-b, --batch-size
int
default:"32"
Batch size for processing
-o, --output-folder
str
default:"None"
Output folder for durations (defaults to data_path/durations/)
-f, --force
flag
Force overwrite existing duration files
--cpu
flag
Use CPU instead of GPU (not recommended)

Output

For each audio file, generates:
  • filename.npy: NumPy array of phoneme durations
  • filename.json: JSON with phoneme-duration pairs
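The duration files are straightforward to post-process. A sketch under the assumption that the JSON stores phoneme-and-frame-count pairs; the exact schema is not documented here, so inspect a real filename.json from your run and treat the structure below as illustrative:

```python
import json

# Hypothetical schema with made-up values: a list of [phoneme, frame_count] pairs.
example = [["HH", 4], ["AH0", 3], ["L", 5], ["OW1", 9]]
payload = json.loads(json.dumps(example))  # round-trip as it would come from disk

# Total number of mel frames the utterance spans.
total_frames = sum(frames for _, frames in payload)
print(total_frames)  # 21
```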

Example

matcha-tts-get-durations \
  -i ljspeech.yaml \
  -c checkpoints/best_model.ckpt \
  -o ./durations \
  -b 64

matcha-tts-app

Launch an interactive Gradio web interface for synthesis.
matcha-tts-app
This command starts a web interface where you can:
  • Select pre-trained models
  • Choose speakers (for multi-speaker models)
  • Adjust synthesis parameters
  • Type or paste text
  • Generate and play audio

Example

matcha-tts-app
# Opens web interface at http://localhost:7860

Output Files

Audio Files

Generated as utterance_XXX_speaker_YYY.wav (or utterance_XXX.wav for single-speaker):
  • Format: WAV
  • Sample rate: 22050 Hz
  • Bit depth: 24-bit PCM

Mel-Spectrogram Files

Saved alongside audio as:
  • utterance_XXX.npy: NumPy array of mel-spectrogram
  • utterance_XXX.png: Visualization of mel-spectrogram
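The frame count of a saved mel-spectrogram maps directly to audio duration. A sketch using the 22050 Hz sample rate stated above and a hop length of 256 samples, which is an assumption; confirm the hop length against your feature-extraction config:

```python
SAMPLE_RATE = 22050  # matches the WAV output described above
HOP_LENGTH = 256     # assumption: check your mel-spectrogram settings

def frames_to_seconds(n_frames: int) -> float:
    """Convert a mel-spectrogram frame count to seconds of audio."""
    return n_frames * HOP_LENGTH / SAMPLE_RATE

print(round(frames_to_seconds(172), 2))  # roughly 2 seconds of audio
```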

Environment Variables

Matcha-TTS uses the following directories:
  • Model cache: ~/.local/share/matcha_tts/ (Linux/Mac) or %LOCALAPPDATA%\matcha_tts\ (Windows)
  • Downloaded models are cached here automatically
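The cache locations above can be resolved programmatically. A sketch that mirrors the paths listed here; it is not the package's own resolution code:

```python
import os
import sys
from pathlib import Path

def matcha_cache_dir() -> Path:
    """Return the model cache directory as documented above (illustrative)."""
    if sys.platform == "win32":
        base = Path(os.environ.get("LOCALAPPDATA", str(Path.home() / "AppData" / "Local")))
    else:
        base = Path.home() / ".local" / "share"
    return base / "matcha_tts"

print(matcha_cache_dir())
```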

Performance Tips

Speed Optimization

  1. Reduce ODE steps: Use --steps 4 for faster synthesis (slight quality loss)
  2. Enable batched inference: Use --batched for multiple utterances
  3. GPU acceleration: Ensure CUDA is available (much faster than CPU)

Quality Optimization

  1. Increase ODE steps: Use --steps 15-20 for better quality
  2. Adjust temperature: Lower values (0.3-0.5) for more consistent output
  3. Fine-tune speaking rate: Adjust --speaking_rate for natural pacing

Pretrained Models

matcha_ljspeech

  • Dataset: LJ Speech (single female speaker)
  • Recommended vocoder: hifigan_T2_v1
  • Default speaking rate: 0.95
  • Language: English

matcha_vctk

  • Dataset: VCTK (108 speakers)
  • Recommended vocoder: hifigan_univ_v1
  • Default speaking rate: 0.85
  • Speaker IDs: 0-107
  • Language: English (various accents)
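The model-to-vocoder and speaking-rate pairings above can be captured in a small lookup table; a sketch using only the values from this page:

```python
# Recommended settings per pretrained model, taken from the tables above.
PRETRAINED = {
    "matcha_ljspeech": {"vocoder": "hifigan_T2_v1", "speaking_rate": 0.95, "speakers": 1},
    "matcha_vctk": {"vocoder": "hifigan_univ_v1", "speaking_rate": 0.85, "speakers": 108},
}

def defaults_for(model: str) -> dict:
    """Look up the recommended vocoder and speaking rate for a pretrained model."""
    return PRETRAINED[model]

print(defaults_for("matcha_vctk")["vocoder"])  # hifigan_univ_v1
```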

Error Handling

Common Issues

“Either text or file must be provided”
# Must specify --text OR --file
matcha-tts --text "Hello world"
“Sampling temperature cannot be negative”
# Temperature must be >= 0
matcha-tts --text "Test" --temperature 0.667
“Speaker ID must be between 0 and 107”
# For matcha_vctk, use valid speaker ID
matcha-tts --model matcha_vctk --spk 5 --text "Test"

Source Reference

  • CLI implementation: matcha/cli.py:208
  • Entry points: setup.py:44
