
Quick Start

This guide will get you synthesizing speech with Matcha-TTS in minutes. We'll cover the command-line interface (CLI), the Python API, the Gradio web interface, and the bundled Jupyter notebook.
Make sure you’ve installed Matcha-TTS before proceeding. Pre-trained models will be automatically downloaded on first use.

CLI Usage

The command-line interface is the fastest way to synthesize speech from text.

Basic Synthesis

Synthesize a single utterance:
matcha-tts --text "Hello, this is Matcha TTS speaking."
This generates utterance_001.wav in your current directory.

Synthesize from File

Create a text file with one sentence per line:
matcha-tts --file sentences.txt
Each line will be synthesized as a separate audio file.

Batch Processing

For faster processing of multiple sentences, use batch mode:
matcha-tts --file sentences.txt --batched --batch_size 32
Batched processing is significantly faster when synthesizing many sentences, especially on GPU.
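Under the hood, batched synthesis pads each tokenized sentence to the longest sequence in the batch and tracks the true lengths. A minimal sketch of that padding step (illustrative only, not Matcha-TTS internals):

```python
def pad_batch(sequences, pad_value=0):
    """Pad variable-length token sequences to a rectangular batch.

    Returns the padded batch and the original length of each sequence,
    mirroring the (x, x_lengths) pair that batched synthesis consumes.
    """
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    lengths = [len(seq) for seq in sequences]
    return padded, lengths

batch, lengths = pad_batch([[5, 3, 9], [7, 1], [2, 4, 6, 8]])
# batch   -> [[5, 3, 9, 0], [7, 1, 0, 0], [2, 4, 6, 8]]
# lengths -> [3, 2, 4]
```

Padding lets the model process all sentences in one forward pass, which is where the batched speedup comes from.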

CLI Parameters

  • --text (string): Text to synthesize (alternative to --file)
  • --file (string): Path to a text file with one sentence per line
  • --model (string, default "matcha_ljspeech"): Model to use: matcha_ljspeech (single speaker) or matcha_vctk (multi-speaker)
  • --checkpoint_path (string, optional): Path to a custom model checkpoint
  • --vocoder (string): Vocoder to use: hifigan_T2_v1 or hifigan_univ_v1 (auto-selected based on model)
  • --speaking_rate (float, default 0.95): Speaking rate control (higher = slower). Default: 0.95 for LJSpeech, 0.85 for VCTK
  • --temperature (float, default 0.667): Sampling temperature for variation (higher = more variation)
  • --steps (int, default 10): Number of ODE solver steps (2-100). Fewer steps = faster but potentially lower quality
  • --spk (int): Speaker ID for multi-speaker models (0-107 for VCTK)
  • --output_folder (string): Directory to save output files (default: current directory)
  • --cpu (boolean): Force CPU inference (default: use GPU if available)
  • --batched (boolean): Enable batch processing mode
  • --batch_size (int, default 32): Batch size for batch mode

Advanced CLI Examples

# Slower speech (durations stretched by a factor of 1.2)
matcha-tts --text "Speak slowly and clearly." --speaking_rate 1.2

# Faster speech (durations compressed to 0.8x)
matcha-tts --text "Speak quickly!" --speaking_rate 0.8
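Because --speaking_rate acts as a length scale on predicted durations, output length grows roughly in proportion to it. A back-of-the-envelope helper (illustrative, not part of the CLI):

```python
def estimated_duration(base_seconds, speaking_rate):
    """Approximate output duration when phoneme durations scale with speaking_rate.

    Higher speaking_rate -> longer (slower) speech; lower -> shorter (faster).
    """
    return base_seconds * speaking_rate

print(estimated_duration(2.0, 1.2))  # slower: about 2.4 s
print(estimated_duration(2.0, 0.8))  # faster: about 1.6 s
```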

Python API

Use Matcha-TTS directly in your Python code for more control.

Basic Python Example

import torch
import soundfile as sf
from matcha.models.matcha_tts import MatchaTTS
from matcha.text import text_to_sequence
from matcha.utils.utils import intersperse
from matcha.hifigan.config import v1
from matcha.hifigan.models import Generator as HiFiGAN
from matcha.hifigan.env import AttrDict
from matcha.hifigan.denoiser import Denoiser

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load Matcha-TTS model
model = MatchaTTS.load_from_checkpoint(
    "path/to/matcha_ljspeech.ckpt", 
    map_location=device
)
model.eval()

# Load vocoder (HiFi-GAN)
h = AttrDict(v1)
vocoder = HiFiGAN(h).to(device)
vocoder.load_state_dict(
    torch.load("path/to/hifigan_T2_v1", map_location=device)["generator"]
)
vocoder.eval()
vocoder.remove_weight_norm()
denoiser = Denoiser(vocoder, mode="zeros")

# Prepare text
text = "Hello, this is Matcha TTS."
x = torch.tensor(
    intersperse(text_to_sequence(text, ["english_cleaners2"])[0], 0),
    dtype=torch.long,
    device=device
)[None]
x_lengths = torch.tensor([x.shape[-1]], dtype=torch.long, device=device)

# Synthesize
with torch.inference_mode():
    output = model.synthesise(
        x,
        x_lengths,
        n_timesteps=10,
        temperature=0.667,
        spks=None,
        length_scale=1.0
    )
    
    # Generate waveform
    audio = vocoder(output["mel"]).clamp(-1, 1)
    audio = denoiser(audio.squeeze(), strength=0.00025).cpu().squeeze()

# Save audio
sf.write("output.wav", audio.numpy(), 22050, "PCM_24")

print(f"Real-time factor: {output['rtf']:.4f}")
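The intersperse call in the text-preparation step interleaves a blank token (ID 0) between the phoneme IDs, which the model expects. A standalone sketch matching the behavior of matcha.utils.utils.intersperse:

```python
def intersperse(lst, item):
    """Insert `item` between consecutive elements and at both ends of `lst`."""
    result = [item] * (len(lst) * 2 + 1)
    result[1::2] = lst  # original elements land at the odd indices
    return result

print(intersperse([5, 3, 9], 0))  # [0, 5, 0, 3, 0, 9, 0]
```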

Synthesis Function Parameters

The synthesise() method accepts the following parameters:
  • x (torch.Tensor): Batch of phoneme sequences. Shape: (batch_size, max_text_length)
  • x_lengths (torch.Tensor): Lengths of each sequence in the batch. Shape: (batch_size,)
  • n_timesteps (int): Number of ODE solver steps (2-100)
  • temperature (float, default 1.0): Controls the variance of the terminal distribution
  • spks (torch.Tensor): Speaker IDs for multi-speaker models. Shape: (batch_size,)
  • length_scale (float, default 1.0): Controls speech pace (higher = slower)
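To make the shapes concrete, here is a toy batch of two padded sequences, using numpy arrays as stand-ins for the torch tensors (the token values are illustrative):

```python
import numpy as np

# Two phoneme sequences padded to the longer one: batch_size = 2, max_text_length = 5
x = np.array([[12, 0, 7, 0, 0],
              [4, 0, 9, 0, 3]])
x_lengths = np.array([3, 5])  # true (unpadded) length of each row

print(x.shape, x_lengths.shape)  # (2, 5) (2,)
```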

Helper Functions

@torch.inference_mode()
def process_text(text: str, device):
    """Convert text to phoneme tensor."""
    x = torch.tensor(
        intersperse(text_to_sequence(text, ["english_cleaners2"])[0], 0),
        dtype=torch.long,
        device=device
    )[None]
    x_lengths = torch.tensor([x.shape[-1]], dtype=torch.long, device=device)
    return x, x_lengths

@torch.inference_mode()
def to_waveform(mel, vocoder, denoiser, strength=0.00025):
    """Convert mel-spectrogram to waveform."""
    audio = vocoder(mel).clamp(-1, 1)
    audio = denoiser(audio.squeeze(), strength=strength).cpu().squeeze()
    return audio

Multi-Speaker Example

# Load VCTK multi-speaker model
model = MatchaTTS.load_from_checkpoint(
    "path/to/matcha_vctk.ckpt", 
    map_location=device
)
model.eval()

# Prepare text
x, x_lengths = process_text("Hello from speaker zero.", device)

# Speaker ID (0-107 for VCTK)
spk = torch.tensor([0], device=device, dtype=torch.long)

# Synthesize with specific speaker
with torch.inference_mode():
    output = model.synthesise(
        x,
        x_lengths,
        n_timesteps=10,
        temperature=0.667,
        spks=spk,
        length_scale=0.85  # VCTK default
    )

Gradio Web Interface

Launch an interactive web interface for experimenting with Matcha-TTS:
matcha-tts-app
This starts a Gradio interface where you can:
  • Enter text and synthesize instantly
  • Switch between single-speaker and multi-speaker models
  • Adjust hyperparameters in real-time
  • Select different speakers (for VCTK model)
  • Listen to pre-cached examples
The Gradio app automatically downloads required models on first launch. The interface will be available at http://localhost:7860 by default.

Gradio Interface Code

A simplified excerpt of the Gradio app implementation from matcha/app.py:
import gradio as gr
import torch
import soundfile as sf
from matcha.cli import (
    load_matcha,
    load_vocoder,
    process_text,
    to_waveform,
)

# Load models
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = load_matcha("matcha_vctk", "path/to/model.ckpt", device)
vocoder, denoiser = load_vocoder("hifigan_univ_v1", "path/to/vocoder", device)

@torch.inference_mode()
def synthesise_mel(text, text_length, n_timesteps, temperature, length_scale, spk):
    spk = torch.tensor([spk], device=device, dtype=torch.long) if spk >= 0 else None
    output = model.synthesise(
        text,
        text_length,
        n_timesteps=n_timesteps,
        temperature=temperature,
        spks=spk,
        length_scale=length_scale,
    )
    output["waveform"] = to_waveform(output["mel"], vocoder, denoiser)
    return output["waveform"], output["mel"]

Jupyter Notebook

Matcha-TTS includes a Jupyter notebook (synthesis.ipynb) for interactive experimentation:
# From synthesis.ipynb
import datetime as dt
import IPython.display as ipd
import numpy as np
from tqdm.auto import tqdm

# Configuration
n_timesteps = 10
length_scale = 1.0
temperature = 0.667

# Synthesize and display
texts = [
    "The Secret Service believed that it was very doubtful that any "
    "President would ride regularly in a vehicle with a fixed top, "
    "even though transparent."
]

for i, text in enumerate(tqdm(texts)):
    output = synthesise(text)
    output['waveform'] = to_waveform(output['mel'], vocoder)
    
    # Calculate RTF
    t = (dt.datetime.now() - output['start_t']).total_seconds()
    rtf_w = t * 22050 / (output['waveform'].shape[-1])
    
    print(f"RTF: {output['rtf']:.6f}")
    print(f"RTF Waveform: {rtf_w:.6f}")
    
    # Display audio in notebook
    ipd.display(ipd.Audio(output['waveform'], rate=22050))
    
    # Save to file
    save_to_folder(i, output, "synth_output")
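The real-time factor (RTF) compares synthesis time to audio duration, so values below 1 mean faster than real time. A standalone version of the notebook's waveform-RTF calculation:

```python
def rtf(synthesis_seconds, num_samples, sample_rate=22050):
    """Real-time factor: time spent synthesizing divided by audio duration.

    RTF < 1 is faster than real time (e.g. 0.02 means 50x real time).
    """
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds

# 0.1 s to synthesize 5 s of audio (5 * 22050 samples) -> RTF 0.02
print(rtf(0.1, 5 * 22050))
```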

Performance Tips

Tune the number of ODE solver steps (--steps) to trade speed for quality:
  • 2-4 steps: Ultra-fast, slight quality reduction
  • 10 steps (default): Good balance of speed and quality
  • 50+ steps: Highest quality, with diminishing returns beyond 50
GPU is highly recommended:
  • GPU: RTF ~0.02 (50x real-time)
  • CPU: RTF ~0.5-1.0 (1-2x real-time)
Use --cpu flag only if GPU is unavailable.
For many utterances, use --batched mode:
matcha-tts --file large_file.txt --batched --batch_size 32
This can be 3-5x faster than processing individually.
Tune --temperature to control output variation:
  • 0.333: Less variation, more consistent
  • 0.667 (default): Natural variation
  • 1.0+: More variation, potentially less stable
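Temperature scales the random noise the ODE solver starts from, so lower values keep samples closer to the predicted mean. A sketch of the idea (illustrative only, not Matcha-TTS internals):

```python
import random

def sample_initial_noise(n, temperature, seed=0):
    """Draw standard-normal noise scaled by temperature (higher = more spread)."""
    rng = random.Random(seed)
    return [temperature * rng.gauss(0.0, 1.0) for _ in range(n)]

low = sample_initial_noise(1000, 0.333)   # tighter around zero -> more consistent output
high = sample_initial_noise(1000, 1.0)    # full spread -> more variation
```

With the same seed, each low-temperature sample is exactly 0.333x its high-temperature counterpart; only the spread changes, not the draws.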

Output Format

Matcha-TTS generates:
  • Audio files: .wav format, 22050 Hz, PCM_24
  • Mel-spectrograms: .npy files (NumPy arrays)
  • Visualizations: .png spectrogram plots (when using CLI)
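If you post-process outputs yourself, these formats are easy to reproduce with standard tooling. A sketch that writes a 16-bit WAV via the standard-library wave module and a mel array as .npy (the CLI itself writes PCM_24 via soundfile; file names here are examples):

```python
import wave
import numpy as np

def save_outputs(audio, mel, stem, sample_rate=22050):
    """Save a float waveform in [-1, 1] as a 16-bit mono WAV and a mel array as .npy."""
    pcm16 = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)
    with wave.open(f"{stem}.wav", "wb") as wf:
        wf.setnchannels(1)            # mono
        wf.setsampwidth(2)            # 16-bit samples
        wf.setframerate(sample_rate)  # 22050 Hz, matching Matcha-TTS output
        wf.writeframes(pcm16.tobytes())
    np.save(f"{stem}.npy", mel)

# Example: half a second of a 440 Hz tone plus a dummy 80-bin mel-spectrogram
t = np.linspace(0, 0.5, int(22050 * 0.5), endpoint=False)
save_outputs(0.5 * np.sin(2 * np.pi * 440 * t), np.zeros((80, 100)), "demo")
```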

Next Steps

Training Custom Models

Learn how to train Matcha-TTS on your own dataset

ONNX Export

Export models to ONNX for deployment

API Reference

Detailed API documentation

Examples

More advanced usage examples
