
Overview

Matcha-TTS provides a powerful command-line interface for text-to-speech synthesis. The CLI supports both single-speaker and multi-speaker models, batch processing, and extensive customization of synthesis parameters.

Basic Usage

matcha-tts --text "Hello, world! This is Matcha-TTS."

Available Models

Matcha-TTS comes with two pre-trained models:
  • matcha_ljspeech: Single-speaker model trained on LJ Speech dataset
  • matcha_vctk: Multi-speaker model trained on VCTK dataset (108 speakers, IDs 0-107)

CLI Arguments

Model Selection

--model
string
default:"matcha_ljspeech"
Model to use for synthesis. Choices: matcha_ljspeech, matcha_vctk. The model determines the voice quality and the set of available speakers.
--checkpoint_path
string
default:"None"
Path to a custom model checkpoint. Use this to load your own trained models. When using a custom checkpoint, consider using --vocoder hifigan_univ_v1 for best compatibility.
--vocoder
string
default:"auto"
Vocoder to use for waveform generation. Choices: hifigan_T2_v1, hifigan_univ_v1. The default vocoder is selected automatically based on the model:
  • matcha_ljspeech → hifigan_T2_v1
  • matcha_vctk → hifigan_univ_v1

Input

--text
string
required (unless --file is given)
Text to synthesize. Either --text or --file must be provided. Example: --text "The quick brown fox jumps over the lazy dog"
--file
string
required (unless --text is given)
Path to a text file with sentences to synthesize (one per line). When using --file, consider adding --batched for faster processing.
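A batch input file is plain text with one sentence per line; a minimal sketch (the filename sentences.txt is illustrative):

```shell
# Write one sentence per line to a batch input file
cat > sentences.txt <<'EOF'
The first sentence to synthesize.
A second, independent sentence.
EOF

# Then synthesize every line in a single run (not executed here):
# matcha-tts --file sentences.txt --batched
```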

Speaker Control

--spk
integer
default:"0"
Speaker ID for multi-speaker models.
  • For matcha_ljspeech: Not applicable (single speaker)
  • For matcha_vctk: Valid range is 0-107
Example: --spk 42

Synthesis Parameters

--temperature
float
default:"0.667"
Variance of the noise distribution (controls prosody variation).
  • Higher values (e.g., 1.0): More expressive but less stable
  • Lower values (e.g., 0.3): More stable but less expressive
  • Must be ≥ 0
See Parameters for detailed guidance.
--speaking_rate
float
default:"auto"
Controls speech pace; higher values produce slower speech. Default values:
  • matcha_ljspeech: 0.95
  • matcha_vctk: 0.85
Typical range: 0.5 (fast) to 1.5 (slow). Must be > 0.
--steps
integer
default:"10"
Number of ODE (Ordinary Differential Equation) steps for synthesis.
  • More steps: Higher quality, slower synthesis
  • Fewer steps: Faster synthesis, slight quality reduction
  • Recommended range: 4-50
  • Must be > 0
Example: --steps 20 for higher quality
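To find the right quality/speed trade-off for your hardware, it can help to sweep several step counts. A minimal sketch that only generates the commands for review (the output folder names and sweep_steps.sh are illustrative):

```shell
# Generate a script that sweeps several ODE step counts; run it with: sh sweep_steps.sh
for steps in 2 4 10 50; do
  echo "matcha-tts --text 'Step count test.' --steps $steps --output_folder ./out_steps_$steps"
done > sweep_steps.sh
cat sweep_steps.sh
```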

Vocoder Options

--denoiser_strength
float
default:"0.00025"
Strength of the vocoder bias denoiser. Higher values reduce background noise but may affect audio quality.

Output Options

--output_folder
string
default:"current directory"
Directory where generated audio files will be saved. Example: --output_folder ./outputs

Performance Options

--batched
boolean
default:"false"
Enable batched inference for processing multiple texts. Use this flag when processing files with many sentences for a significant speedup.
--batch_size
integer
default:"32"
Batch size for batched inference (only used with --batched). Larger batches are faster but require more memory.
--cpu
boolean
default:"false"
Force CPU inference even if a GPU is available. By default, the GPU is used when available.

Examples

Single Speaker Synthesis

# Basic synthesis with LJ Speech model
matcha-tts --text "Matcha-TTS is a fast TTS architecture with conditional flow matching."

# Higher quality with more steps
matcha-tts --text "High quality synthesis." --steps 50 --temperature 0.667

# Adjust speaking rate
matcha-tts --text "Speaking slowly." --speaking_rate 1.2

Multi-Speaker Synthesis

# Use specific speaker from VCTK
matcha-tts --model matcha_vctk --spk 16 --text "Hello from speaker 16!"

# Different speakers with adjusted parameters
matcha-tts --model matcha_vctk --spk 44 --speaking_rate 0.85 --temperature 0.667 \
  --text "This is speaker 44 speaking."

Batch Processing

# Process multiple sentences from file
matcha-tts --file sentences.txt --batched --batch_size 16

# Batch processing with custom output folder
matcha-tts --file inputs.txt --batched --output_folder ./audio_outputs

Custom Models

# Use your own trained model
matcha-tts --checkpoint_path ./my_model.ckpt \
  --vocoder hifigan_univ_v1 \
  --text "Testing custom model."

Output Files

The CLI generates three files per utterance:
  1. .wav: Audio file (22050 Hz, PCM_24)
  2. .npy: Mel-spectrogram (NumPy format)
  3. .png: Mel-spectrogram visualization
Filename format:
  • Single speaker: utterance_001.wav, utterance_002.wav, …
  • Multi-speaker: utterance_001_speaker_042.wav, …
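The zero-padded numbering above follows a three-digit pattern, e.g. for utterance 1 by speaker 42:

```shell
# Three-digit zero-padding, matching the CLI's output naming scheme
printf 'utterance_%03d_speaker_%03d.wav\n' 1 42
```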

Performance Metrics

The CLI reports Real-Time Factor (RTF) metrics:
  • Matcha-TTS RTF: Mel-spectrogram generation speed
  • Matcha-TTS + VOCODER RTF: End-to-end synthesis speed
RTF < 1.0 means faster than real-time synthesis.
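RTF is simply wall-clock synthesis time divided by the duration of the generated audio; with hypothetical numbers (0.42 s spent producing 4.2 s of audio):

```shell
# RTF = synthesis time / audio duration; values below 1.00 are faster than real time
rtf=$(awk 'BEGIN { printf "%.2f", 0.42 / 4.2 }')
echo "RTF: $rtf"
```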

Tips

  • Start with default parameters and adjust based on your needs
  • Use --batched for files with multiple sentences (10x+ speedup)
  • Try different --steps values (2, 4, 10, 50) to balance quality/speed
  • Use --temperature 0.667 as a good default for natural prosody
  • For custom models, hifigan_univ_v1 vocoder usually works best
