Quick Start
This guide will help you start synthesizing speech with Matcha-TTS in minutes. We'll cover four ways to use Matcha-TTS: the command-line interface (CLI), the Python API, the Gradio web interface, and a Jupyter notebook.
Make sure you’ve installed Matcha-TTS before proceeding. Pre-trained models will be automatically downloaded on first use.
CLI Usage
The command-line interface is the fastest way to synthesize speech from text.
Basic Synthesis
Synthesize a single utterance:
matcha-tts --text "Hello, this is Matcha TTS speaking."
This generates utterance_001.wav in your current directory.
Synthesize from File
Create a text file with one sentence per line:
matcha-tts --file sentences.txt
Each line will be synthesized as a separate audio file.
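For example, you can create the input file with a heredoc and synthesize it in one go (the sentences are just placeholders):

```shell
cat > sentences.txt <<'EOF'
Matcha TTS is a fast acoustic model.
It uses conditional flow matching to generate speech.
EOF
matcha-tts --file sentences.txt
```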
Batch Processing
For faster processing of multiple sentences, use batch mode:
matcha-tts --file sentences.txt --batched --batch_size 32
Batched processing is significantly faster when synthesizing many sentences, especially on GPU.
CLI Parameters
--text: Text to synthesize (alternative to --file)
--file: Path to text file with one sentence per line
--model (string, default: "matcha_ljspeech"): Model to use: matcha_ljspeech (single speaker) or matcha_vctk (multi-speaker)
--checkpoint_path: Path to custom model checkpoint (optional)
--vocoder: Vocoder to use: hifigan_T2_v1 or hifigan_univ_v1 (auto-selected based on model)
--speaking_rate: Speaking rate control (higher = slower). Default: 0.95 for LJSpeech, 0.85 for VCTK
--temperature: Sampling temperature for variation (higher = more variation)
--steps: Number of ODE solver steps (2-100). Fewer steps = faster but potentially lower quality
--spk: Speaker ID for multi-speaker models (0-107 for VCTK)
--output_folder: Directory to save output files (default: current directory)
--cpu: Force CPU inference (default: use GPU if available)
--batched: Enable batch processing mode
--batch_size: Batch size for batch mode
Advanced CLI Examples
Adjust Speaking Rate
# Slower speech (speaking rate 1.2 stretches durations to 1.2x)
matcha-tts --text "Speak slowly and clearly." --speaking_rate 1.2
# Faster speech (speaking rate 0.8 shortens durations to 0.8x)
matcha-tts --text "Speak quickly!" --speaking_rate 0.8
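Control Quality vs Speed: vary the ODE step count with the --steps flag from the parameter table above; fewer steps are faster, more steps give higher quality:

```shell
# Fast draft quality (2 ODE steps)
matcha-tts --text "Quick draft." --steps 2
# Higher quality (50 ODE steps)
matcha-tts --text "Final render." --steps 50
```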
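Multi-Speaker Synthesis: with the VCTK model, pick a speaker ID between 0 and 107 via the --spk flag (speaker 42 below is an arbitrary choice):

```shell
matcha-tts --text "Hello from a VCTK voice." --model matcha_vctk --spk 42
```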
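Custom Output Location: write files somewhere other than the current directory with the --output_folder flag (the directory name below is an example):

```shell
matcha-tts --text "Saved elsewhere." --output_folder ./tts_outputs
```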
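Custom Model: synthesize with your own trained checkpoint via the --checkpoint_path flag (the path below is a placeholder):

```shell
matcha-tts --text "Custom model test." --checkpoint_path path/to/your_model.ckpt
```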
Python API
Use Matcha-TTS directly in your Python code for more control.
Basic Python Example
import torch
import soundfile as sf

from matcha.models.matcha_tts import MatchaTTS
from matcha.text import text_to_sequence
from matcha.utils.utils import intersperse
from matcha.hifigan.config import v1
from matcha.hifigan.models import Generator as HiFiGAN
from matcha.hifigan.env import AttrDict
from matcha.hifigan.denoiser import Denoiser

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load Matcha-TTS model
model = MatchaTTS.load_from_checkpoint(
    "path/to/matcha_ljspeech.ckpt",
    map_location=device,
)
model.eval()

# Load vocoder (HiFi-GAN)
h = AttrDict(v1)
vocoder = HiFiGAN(h).to(device)
vocoder.load_state_dict(
    torch.load("path/to/hifigan_T2_v1", map_location=device)["generator"]
)
vocoder.eval()
vocoder.remove_weight_norm()
denoiser = Denoiser(vocoder, mode="zeros")

# Prepare text
text = "Hello, this is Matcha TTS."
x = torch.tensor(
    intersperse(text_to_sequence(text, ["english_cleaners2"])[0], 0),
    dtype=torch.long,
    device=device,
)[None]
x_lengths = torch.tensor([x.shape[-1]], dtype=torch.long, device=device)

# Synthesize
with torch.inference_mode():
    output = model.synthesise(
        x,
        x_lengths,
        n_timesteps=10,
        temperature=0.667,
        spks=None,
        length_scale=1.0,
    )

# Generate waveform
audio = vocoder(output["mel"]).clamp(-1, 1)
audio = denoiser(audio.squeeze(), strength=0.00025).cpu().squeeze()

# Save audio
sf.write("output.wav", audio.numpy(), 22050, "PCM_24")
print(f"Real-time factor: {output['rtf']:.4f}")
Synthesis Function Parameters
The synthesise() method accepts the following parameters:
x: Batch of phoneme sequences. Shape: (batch_size, max_text_length)
x_lengths: Lengths of each sequence in the batch. Shape: (batch_size,)
n_timesteps: Number of ODE solver steps (2-100)
temperature: Controls variance of the terminal distribution
spks: Speaker IDs for multi-speaker models. Shape: (batch_size,)
length_scale: Controls speech pace (higher = slower)
Helper Functions
@torch.inference_mode()
def process_text(text: str, device):
    """Convert text to phoneme tensor."""
    x = torch.tensor(
        intersperse(text_to_sequence(text, ["english_cleaners2"])[0], 0),
        dtype=torch.long,
        device=device,
    )[None]
    x_lengths = torch.tensor([x.shape[-1]], dtype=torch.long, device=device)
    return x, x_lengths


@torch.inference_mode()
def to_waveform(mel, vocoder, denoiser, strength=0.00025):
    """Convert mel-spectrogram to waveform."""
    audio = vocoder(mel).clamp(-1, 1)
    audio = denoiser(audio.squeeze(), strength=strength).cpu().squeeze()
    return audio
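The intersperse call inside process_text pads the phoneme sequence with a blank token (ID 0) before, between, and after every symbol, which the model expects at inference. A minimal sketch of its behavior (illustrative, not necessarily the library's exact source):

```python
def intersperse(lst, item):
    """Return lst with `item` inserted before, between, and after every element."""
    result = [item] * (len(lst) * 2 + 1)
    result[1::2] = lst
    return result

print(intersperse([5, 9, 2], 0))  # [0, 5, 0, 9, 0, 2, 0]
```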
Multi-Speaker Example
# Load VCTK multi-speaker model
model = MatchaTTS.load_from_checkpoint(
    "path/to/matcha_vctk.ckpt",
    map_location=device,
)
model.eval()

# Prepare text
x, x_lengths = process_text("Hello from speaker zero.", device)

# Speaker ID (0-107 for VCTK)
spk = torch.tensor([0], device=device, dtype=torch.long)

# Synthesize with specific speaker
with torch.inference_mode():
    output = model.synthesise(
        x,
        x_lengths,
        n_timesteps=10,
        temperature=0.667,
        spks=spk,
        length_scale=0.85,  # VCTK default
    )
Gradio Web Interface
Launch an interactive web interface for experimenting with Matcha-TTS:
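Assuming the package's console scripts are installed, the app can be started from the shell (matcha-tts-app is the entry-point name in the Matcha-TTS repository; verify against your installed version):

```shell
matcha-tts-app
```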
This starts a Gradio interface where you can:
Enter text and synthesize instantly
Switch between single-speaker and multi-speaker models
Adjust hyperparameters in real-time
Select different speakers (for VCTK model)
Listen to pre-cached examples
The Gradio app automatically downloads required models on first launch. The interface will be available at http://localhost:7860 by default.
Gradio Interface Code
The Gradio app implementation from matcha/app.py:
import gradio as gr
import torch
import soundfile as sf

from matcha.cli import (
    load_matcha,
    load_vocoder,
    process_text,
    to_waveform,
)

# Load models
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = load_matcha("matcha_vctk", "path/to/model.ckpt", device)
vocoder, denoiser = load_vocoder("hifigan_univ_v1", "path/to/vocoder", device)


@torch.inference_mode()
def synthesise_mel(text, text_length, n_timesteps, temperature, length_scale, spk):
    spk = torch.tensor([spk], device=device, dtype=torch.long) if spk >= 0 else None
    output = model.synthesise(
        text,
        text_length,
        n_timesteps=n_timesteps,
        temperature=temperature,
        spks=spk,
        length_scale=length_scale,
    )
    output["waveform"] = to_waveform(output["mel"], vocoder, denoiser)
    return output["waveform"], output["mel"]
Jupyter Notebook
Matcha-TTS includes a Jupyter notebook (synthesis.ipynb) for interactive experimentation:
# From synthesis.ipynb
import datetime as dt

import IPython.display as ipd
import numpy as np
from tqdm.auto import tqdm

# Configuration
n_timesteps = 10
length_scale = 1.0
temperature = 0.667

# Synthesize and display
texts = [
    "The Secret Service believed that it was very doubtful that any "
    "President would ride regularly in a vehicle with a fixed top, "
    "even though transparent."
]

for i, text in enumerate(tqdm(texts)):
    output = synthesise(text)
    output["waveform"] = to_waveform(output["mel"], vocoder)

    # Calculate RTF
    t = (dt.datetime.now() - output["start_t"]).total_seconds()
    rtf_w = t * 22050 / (output["waveform"].shape[-1])
    print(f"RTF: {output['rtf']:.6f}")
    print(f"RTF Waveform: {rtf_w:.6f}")

    # Display audio in notebook
    ipd.display(ipd.Audio(output["waveform"], rate=22050))

    # Save to file
    save_to_folder(i, output, "synth_output")
Choosing the right number of steps
2-4 steps: Ultra-fast, slight quality reduction
10 steps (default): Good balance of speed and quality
50+ steps: Highest quality, diminishing returns beyond 50
GPU is highly recommended:
GPU: RTF ~0.02 (50x real-time)
CPU: RTF ~0.5-1.0 (1-2x real-time)
Use --cpu flag only if GPU is unavailable.
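The real-time factor (RTF) above is synthesis time divided by the duration of the generated audio, so lower is faster: RTF 0.02 means one second of audio takes 0.02 seconds to generate. A small illustrative helper (the function name is ours, not part of the Matcha-TTS API):

```python
def real_time_factor(synthesis_seconds: float, num_samples: int, sample_rate: int = 22050) -> float:
    """Seconds of compute per second of audio; values below 1.0 are faster than real time."""
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds

# 0.1 s of compute for 5 s of audio at 22.05 kHz -> RTF 0.02 (50x real time)
print(round(real_time_factor(0.1, 5 * 22050), 6))  # 0.02
```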
For many utterances, use --batched mode: matcha-tts --file large_file.txt --batched --batch_size 32
This can be 3-5x faster than processing individually.
Temperature and variation
0.333: Less variation, more consistent
0.667 (default): Natural variation
1.0+: More variation, potentially less stable
Matcha-TTS generates:
Audio files: .wav format, 22050 Hz, PCM_24
Mel-spectrograms: .npy files (NumPy arrays)
Visualizations: .png spectrogram plots (when using CLI)
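The .npy mel files can be reloaded with NumPy for further processing. A sketch of the round trip, using a zero-filled stand-in array (the 80-bin shape is illustrative; verify against your model's mel configuration):

```python
import numpy as np

# Stand-in for a mel file written by the CLI: 80 mel bins x 200 frames
mel = np.zeros((80, 200), dtype=np.float32)
np.save("utterance_001.npy", mel)

loaded = np.load("utterance_001.npy")
print(loaded.shape, loaded.dtype)  # (80, 200) float32
```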
Next Steps
Training Custom Models Learn how to train Matcha-TTS on your own dataset
ONNX Export Export models to ONNX for deployment
API Reference Detailed API documentation
Examples More advanced usage examples