Overview
The MatchaTTS class is the core model that implements fast text-to-speech synthesis using conditional flow matching. It combines a text encoder, a duration predictor, and a flow matching decoder to generate high-quality mel-spectrograms from text.
Class Definition
MatchaTTS
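A signature sketch of the constructor, reconstructed from the parameter list below; the parameter names and defaults follow the upstream Matcha-TTS repository but should be treated as assumptions rather than a verbatim copy:

```python
# Constructor signature sketch (reconstructed, not copied verbatim).
# The real class subclasses a PyTorch Lightning module.
class MatchaTTS:
    def __init__(
        self,
        n_vocab,                          # vocabulary (phoneme set) size
        n_spks,                           # number of speakers; 1 = single-speaker
        spk_emb_dim,                      # speaker embedding dimension
        n_feats,                          # mel-spectrogram channels, typically 80
        encoder,                          # text encoder + duration predictor config
        decoder,                          # U-Net decoder config
        cfm,                              # conditional flow matching config
        data_statistics,                  # mel_mean / mel_std for normalization
        out_size,                         # training segment size
        optimizer=None,
        scheduler=None,
        prior_loss=True,                  # compute prior loss during training
        use_precomputed_durations=False,  # bypass MAS with given durations
    ):
        self.n_vocab = n_vocab
        self.n_spks = n_spks
        self.n_feats = n_feats
        self.prior_loss = prior_loss
        self.use_precomputed_durations = use_precomputed_durations
```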
Constructor Parameters
n_vocab: Number of symbols in the vocabulary (phoneme set)
n_spks: Number of speakers. Set to 1 for single-speaker models, >1 for multi-speaker models
spk_emb_dim: Dimension of the speaker embeddings
n_feats: Number of mel-spectrogram channels (typically 80)
encoder: Text encoder configuration containing:
  encoder_type: Type of encoder
  encoder_params: Encoder architecture parameters
  duration_predictor_params: Duration predictor parameters
decoder: Decoder (U-Net) configuration parameters
cfm: Conditional flow matching parameters, including solver configuration
data_statistics: Statistics for mel-spectrogram normalization:
  mel_mean: Mean value for normalization
  mel_std: Standard deviation for normalization
out_size: Output segment size for training (enables segment-based training)
optimizer: Optimizer configuration
scheduler: Learning rate scheduler configuration
prior_loss: Whether to compute the prior loss during training
use_precomputed_durations: Use precomputed durations instead of MAS alignment
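In the upstream project these nested configurations come from YAML config files; the dictionaries below are only an illustrative sketch of the structure described above, with assumed placeholder values:

```python
# Illustrative sketch of the nested encoder and data_statistics configuration.
# All concrete values are placeholders, not taken from the source.
encoder_cfg = {
    "encoder_type": "RoPE Encoder",     # type of encoder (assumed name)
    "encoder_params": {                 # encoder architecture parameters
        "n_channels": 192,
        "n_heads": 2,
        "n_layers": 6,
    },
    "duration_predictor_params": {      # duration predictor parameters
        "filter_channels_dp": 256,
        "kernel_size": 3,
        "p_dropout": 0.1,
    },
}

data_statistics = {
    "mel_mean": -5.52,  # placeholder mean used to normalize mel-spectrograms
    "mel_std": 2.06,    # placeholder standard deviation
}
```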
Methods
synthesise()
Generates a mel-spectrogram from text input.
Parameters
x: Batch of texts converted to phoneme embedding IDs. Shape: (batch_size, max_text_length)
x_lengths: Lengths of the texts in the batch. Shape: (batch_size,)
n_timesteps: Number of ODE solver steps for flow matching. Higher values produce better quality but slower synthesis. Typical range: 4-20
temperature: Controls the variance of the terminal distribution. Higher values increase diversity but may reduce quality. Range: 0.0-2.0
spks: Speaker IDs for multi-speaker models. Shape: (batch_size,)
length_scale: Controls speech pace. Values greater than 1.0 slow speech down; values less than 1.0 speed it up
Returns
encoder_outputs: Average mel-spectrogram generated by the encoder. Shape: (batch_size, n_feats, max_mel_length)
decoder_outputs: Refined mel-spectrogram improved by the CFM decoder. Shape: (batch_size, n_feats, max_mel_length)
attn: Alignment map between text and mel-spectrogram. Shape: (batch_size, max_text_length, max_mel_length)
mel: Denormalized mel-spectrogram, ready for the vocoder. Shape: (batch_size, n_feats, max_mel_length)
mel_lengths: Actual lengths of the generated mel-spectrograms. Shape: (batch_size,)
rtf: Real-time factor (lower is faster); RTF < 1.0 means faster than real time
Example
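A usage sketch for synthesise(). In practice, x and x_lengths are torch LongTensors and model is a trained MatchaTTS instance; the stub below only mimics the documented interface so the call shape and return keys are concrete:

```python
# Stand-in that mimics the documented synthesise() interface; a real run
# would use a trained MatchaTTS model and torch tensors instead of lists.
class StubMatchaTTS:
    N_FEATS = 80  # mel channels (typical value)

    def synthesise(self, x, x_lengths, n_timesteps,
                   temperature=1.0, spks=None, length_scale=1.0):
        frames = int(len(x[0]) * 8 * length_scale)  # fake duration expansion
        zeros = [[0.0] * frames for _ in range(self.N_FEATS)]
        return {
            "encoder_outputs": [zeros],   # (batch, n_feats, max_mel_length)
            "decoder_outputs": [zeros],
            "attn": None,
            "mel": [zeros],               # denormalized mel for the vocoder
            "mel_lengths": [frames],      # (batch,)
            "rtf": 0.05,                  # real-time factor (illustrative)
        }

model = StubMatchaTTS()
x = [[12, 47, 3, 91, 5]]      # phoneme IDs, shape (batch, max_text_length)
x_lengths = [5]               # shape (batch,)

out = model.synthesise(
    x, x_lengths,
    n_timesteps=10,           # ODE steps: quality vs. speed trade-off
    temperature=0.667,        # common choice below 1.0 for stability
    length_scale=1.0,         # >1.0 slows speech, <1.0 speeds it up
)
mel = out["mel"]              # feed this to a vocoder (e.g. HiFi-GAN)
```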
forward()
Training forward pass that computes losses.
Parameters
x: Batch of texts converted to phoneme embedding IDs. Shape: (batch_size, max_text_length)
x_lengths: Lengths of the texts in the batch. Shape: (batch_size,)
y: Batch of corresponding mel-spectrograms. Shape: (batch_size, n_feats, max_mel_length)
y_lengths: Lengths of the mel-spectrograms in the batch. Shape: (batch_size,)
spks: Speaker IDs for multi-speaker models. Shape: (batch_size,)
out_size: Length of the segment cut out for decoder training. Should be divisible by 2^(num_downsamplings)
cond: Additional conditioning (reserved for future use)
durations: Precomputed durations, used when use_precomputed_durations=True
Returns
dur_loss: Duration prediction loss
prior_loss: Prior loss between the encoder outputs and the target mel-spectrogram
diff_loss: Flow matching loss from the decoder
attn: Attention alignment computed by Monotonic Alignment Search. Shape: (batch_size, max_text_length, max_mel_length)
Loading Pretrained Models
Trained checkpoints are standard PyTorch Lightning checkpoints, so a pretrained model can typically be restored with MatchaTTS.load_from_checkpoint(checkpoint_path).
Source Reference
Implementation: matcha/models/matcha_tts.py:23