
Overview

The MatchaTTS class is the core model that implements fast text-to-speech synthesis using conditional flow matching. It combines a text encoder, duration predictor, and flow matching decoder to generate high-quality mel-spectrograms from text.

Class Definition

MatchaTTS

from matcha.models.matcha_tts import MatchaTTS

model = MatchaTTS(
    n_vocab=148,
    n_spks=1,
    spk_emb_dim=128,
    n_feats=80,
    encoder=encoder_config,
    decoder=decoder_config,
    cfm=cfm_config,
    data_statistics={"mel_mean": 0.0, "mel_std": 1.0},
    out_size=None,
    optimizer=None,
    scheduler=None,
    prior_loss=True,
    use_precomputed_durations=False
)

Constructor Parameters

n_vocab
int
required
Number of symbols in the vocabulary (phoneme set)
n_spks
int
required
Number of speakers. Set to 1 for single-speaker models, >1 for multi-speaker models
spk_emb_dim
int
required
Dimension of speaker embeddings
n_feats
int
required
Number of mel-spectrogram channels (typically 80)
encoder
object
required
Text encoder configuration containing:
  • encoder_type: Type of encoder
  • encoder_params: Encoder architecture parameters
  • duration_predictor_params: Duration predictor parameters
decoder
object
required
Decoder (U-Net) configuration parameters
cfm
object
required
Conditional Flow Matching parameters including solver configuration
data_statistics
dict
required
Statistics for mel-spectrogram normalization:
  • mel_mean: Mean value for normalization
  • mel_std: Standard deviation for normalization
out_size
int
default:"None"
Segment length (in frames) cut from each mel-spectrogram for decoder training; None trains on full-length spectrograms
optimizer
object
default:"None"
Optimizer configuration
scheduler
object
default:"None"
Learning rate scheduler configuration
prior_loss
bool
default:"True"
Whether to compute prior loss during training
use_precomputed_durations
bool
default:"False"
Use precomputed durations instead of MAS alignment
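The data_statistics values are dataset-dependent: target mels are normalized as (mel - mel_mean) / mel_std. A minimal sketch of computing them over pooled training mel values (the function name and plain-list input are illustrative, not part of the library):

```python
import math

def mel_statistics(values):
    # Pool every mel bin value across the training set, then take the
    # global mean and standard deviation used for normalization.
    values = list(values)
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return {"mel_mean": mean, "mel_std": math.sqrt(variance)}
```

In practice these statistics are computed once over the training split; hard-coding {"mel_mean": 0.0, "mel_std": 1.0} effectively disables normalization.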

Methods

synthesise()

Generates mel-spectrogram from text input.
@torch.inference_mode()
def synthesise(
    x: torch.Tensor,
    x_lengths: torch.Tensor,
    n_timesteps: int,
    temperature: float = 1.0,
    spks: torch.Tensor = None,
    length_scale: float = 1.0
) -> dict

Parameters

x
torch.Tensor
required
Batch of texts converted to phoneme embedding IDs. Shape: (batch_size, max_text_length)
x_lengths
torch.Tensor
required
Lengths of texts in batch. Shape: (batch_size,)
n_timesteps
int
required
Number of ODE solver steps for flow matching. Higher values produce better quality but slower synthesis. Typical range: 4-20
temperature
float
default:"1.0"
Controls variance of the terminal distribution. Higher values increase diversity but may reduce quality. Range: 0.0-2.0
spks
torch.Tensor
default:"None"
Speaker IDs for multi-speaker models. Shape: (batch_size,)
length_scale
float
default:"1.0"
Controls speech pace. Values greater than 1.0 slow speech down; values less than 1.0 speed it up
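The pace control acts on predicted durations: each phoneme's predicted frame count is scaled before the alignment is expanded into frames. A standalone sketch of the idea (the model does this internally on tensors, and the exact order of scaling and rounding in the implementation may differ):

```python
import math

def scale_durations(frame_counts, length_scale):
    # Multiply each phoneme's predicted duration (in mel frames) by
    # length_scale and round up to whole frames; values > 1.0 slow
    # speech down, values < 1.0 speed it up.
    return [math.ceil(d * length_scale) for d in frame_counts]
```

For example, scale_durations([2, 3, 5], 1.5) stretches a 10-frame utterance to 16 frames.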

Returns

encoder_outputs
torch.Tensor
Average mel-spectrogram generated by the encoder. Shape: (batch_size, n_feats, max_mel_length)
decoder_outputs
torch.Tensor
Refined mel-spectrogram improved by the CFM decoder. Shape: (batch_size, n_feats, max_mel_length)
attn
torch.Tensor
Alignment map between text and mel-spectrogram. Shape: (batch_size, max_text_length, max_mel_length)
mel
torch.Tensor
Denormalized mel-spectrogram ready for the vocoder. Shape: (batch_size, n_feats, max_mel_length)
mel_lengths
torch.Tensor
Actual lengths of the generated mel-spectrograms. Shape: (batch_size,)
rtf
float
Real-time factor (lower is faster). RTF < 1.0 means faster than real-time
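The rtf value divides wall-clock synthesis time by the duration of the generated audio. A sketch of that computation (the 22050 Hz sample rate and 256-sample hop length are LJ Speech-style assumptions, not values read from the model):

```python
def real_time_factor(synthesis_seconds, n_mel_frames,
                     sample_rate=22050, hop_length=256):
    # Each mel frame covers hop_length waveform samples, so the implied
    # audio duration is n_mel_frames * hop_length / sample_rate seconds.
    audio_seconds = n_mel_frames * hop_length / sample_rate
    return synthesis_seconds / audio_seconds
```

For example, synthesizing roughly 10 seconds of audio (861 frames) in 0.5 s gives an RTF of about 0.05, i.e. about 20x faster than real time.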

Example

import torch
from matcha.models.matcha_tts import MatchaTTS
from matcha.text import text_to_sequence
from matcha.utils.utils import intersperse

# Load pretrained model
model = MatchaTTS.load_from_checkpoint("matcha_ljspeech.ckpt")
model.eval()

# Prepare text input
text = "Hello, this is a test."
sequence = text_to_sequence(text, ["english_cleaners2"])[0]
x = torch.LongTensor(intersperse(sequence, 0)).unsqueeze(0)
x_lengths = torch.LongTensor([x.shape[1]])

# Generate mel-spectrogram
output = model.synthesise(
    x=x,
    x_lengths=x_lengths,
    n_timesteps=10,
    temperature=0.667,
    length_scale=1.0
)

print(f"Generated mel shape: {output['mel'].shape}")
print(f"Real-time factor: {output['rtf']:.4f}")
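The intersperse(sequence, 0) call above inserts the blank token (ID 0) between consecutive phoneme IDs, which the encoder expects. A standalone sketch of what it does, mirroring matcha.utils.utils.intersperse:

```python
def intersperse(lst, item):
    # Place `item` at every even index so the result alternates
    # item, token, item, token, ..., item (length 2 * len(lst) + 1).
    result = [item] * (len(lst) * 2 + 1)
    result[1::2] = lst
    return result
```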

forward()

Training forward pass that computes losses.
def forward(
    x: torch.Tensor,
    x_lengths: torch.Tensor,
    y: torch.Tensor,
    y_lengths: torch.Tensor,
    spks: torch.Tensor = None,
    out_size: int = None,
    cond: torch.Tensor = None,
    durations: torch.Tensor = None
) -> tuple

Parameters

x
torch.Tensor
required
Batch of texts converted to phoneme embedding IDs. Shape: (batch_size, max_text_length)
x_lengths
torch.Tensor
required
Lengths of texts in batch. Shape: (batch_size,)
y
torch.Tensor
required
Batch of corresponding mel-spectrograms. Shape: (batch_size, n_feats, max_mel_length)
y_lengths
torch.Tensor
required
Lengths of mel-spectrograms in batch. Shape: (batch_size,)
spks
torch.Tensor
default:"None"
Speaker IDs for multi-speaker models. Shape: (batch_size,)
out_size
int
default:"None"
Length of segment to cut for decoder training. Should be divisible by 2^(num_downsamplings)
cond
torch.Tensor
default:"None"
Additional conditioning (reserved for future use)
durations
torch.Tensor
default:"None"
Precomputed durations when use_precomputed_durations=True
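The divisibility requirement on out_size comes from the U-Net decoder's downsampling path: each downsampling step halves the time axis, so the segment length must divide cleanly. A helper that rounds a length up to the next compatible value might look like this (the function name and the default of two downsamplings are assumptions for illustration):

```python
import math

def fix_length(length, num_downsamplings=2):
    # Round up to the next multiple of 2**num_downsamplings so every
    # downsampling step in the U-Net halves the time axis without remainder.
    factor = 2 ** num_downsamplings
    return math.ceil(length / factor) * factor
```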

Returns

dur_loss
torch.Tensor
Duration prediction loss
prior_loss
torch.Tensor
Prior loss between encoder outputs and target mel-spectrogram
diff_loss
torch.Tensor
Flow matching loss from the decoder
attn
torch.Tensor
Attention alignment computed by Monotonic Alignment Search. Shape: (batch_size, max_text_length, max_mel_length)

Loading Pretrained Models

# Load from checkpoint
model = MatchaTTS.load_from_checkpoint(
    "matcha_ljspeech.ckpt",
    map_location="cuda"
)
model.eval()

Source Reference

Implementation: matcha/models/matcha_tts.py:23
