Matcha-TTS Architecture
Matcha-TTS is a fast, probabilistic, non-autoregressive text-to-speech model that uses Conditional Flow Matching (CFM) for mel-spectrogram generation. The architecture consists of three main components:

- Text Encoder - Encodes phoneme sequences and predicts durations
- Duration Aligner - Uses Monotonic Alignment Search (MAS) for text-to-mel alignment
- Conditional Flow Matching Decoder - Generates high-quality mel-spectrograms
High-Level Architecture
Core Components
MatchaTTS Class
The main model class, defined in matcha_tts.py:23, brings together all components:
- n_vocab: Size of the phoneme vocabulary
- n_spks: Number of speakers (1 for single-speaker models)
- spk_emb_dim: Dimension of speaker embeddings (default: 64)
- n_feats: Number of mel-spectrogram features (typically 80)
- out_size: Segment size for training (enables larger batch sizes)
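As an illustration, these hyperparameters can be collected in a small configuration object. This is a hypothetical container, not part of the library; the field names mirror the constructor arguments listed above, and the default values shown are examples only.

```python
from dataclasses import dataclass

@dataclass
class MatchaTTSConfig:
    # Hypothetical container mirroring the MatchaTTS constructor arguments.
    n_vocab: int = 178     # phoneme vocabulary size (example value)
    n_spks: int = 1        # 1 for single-speaker models
    spk_emb_dim: int = 64  # speaker embedding dimension
    n_feats: int = 80      # mel-spectrogram channels
    out_size: int = 172    # training segment length in frames (example value)

cfg = MatchaTTSConfig(n_spks=4)  # e.g. a hypothetical four-speaker model
```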
Model Initialization
The model initializes three key components in matcha_tts.py:55-71:
1. Speaker Embedding (Multi-speaker only)
The decoder input has 2 * n_feats channels because it concatenates the encoder output mu_y with the noisy sample during flow matching.

Training Process
The training forward pass (matcha_tts.py:153) computes three losses:
1. Duration Loss
Compares predicted durations with those extracted by Monotonic Alignment Search (model.py:44).
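A minimal NumPy sketch of such a loss (the actual implementation uses PyTorch; the function and variable names here are illustrative): an MSE in the log-duration domain, normalized by the number of text tokens.

```python
import numpy as np

def duration_loss(log_dur_pred, log_dur_mas, x_lengths):
    # MSE between predicted log-durations and the log of the
    # MAS-extracted durations, normalized by the total token count.
    return np.sum((log_dur_pred - log_dur_mas) ** 2) / np.sum(x_lengths)

log_pred = np.log(np.array([[2.0, 3.0, 5.0]]))  # predicted durations
log_mas = np.log(np.array([[2.0, 4.0, 5.0]]))   # durations from MAS
loss = duration_loss(log_pred, log_mas, np.array([3]))
```

Working in the log domain keeps the loss well-behaved across the wide range of phoneme durations.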
2. Prior Loss
Measures the distance between encoder outputs and target mel-spectrograms (matcha_tts.py:239-242):
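A sketch of this idea, assuming the prior loss is the negative log-likelihood of the target mel y under a unit-variance Gaussian centred at the aligned encoder output mu_y (NumPy stand-in for the PyTorch code; names are illustrative):

```python
import math
import numpy as np

def prior_loss(y, mu_y, y_mask, n_feats):
    # Per-element Gaussian NLL (unit variance), masked over valid
    # frames and averaged over features and frames.
    nll = 0.5 * ((y - mu_y) ** 2 + math.log(2 * math.pi)) * y_mask
    return nll.sum() / (y_mask.sum() * n_feats)

n_feats, frames = 80, 50
y = np.random.randn(1, n_feats, frames)   # target mel-spectrogram
mu_y = np.zeros_like(y)                   # aligned encoder output
mask = np.ones((1, 1, frames))            # all frames valid
loss = prior_loss(y, mu_y, mask, n_feats)
```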
3. Flow Matching Loss
The main reconstruction loss from the CFM decoder (matcha_tts.py:237):
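A hedged sketch of the optimal-transport flow-matching objective this loss is based on: sample a point x_t on a straight-line path from noise x0 to data x1, and regress the decoder's predicted vector field onto the path's constant velocity. The sigma_min noise floor follows the standard OT-CFM formulation; names are illustrative.

```python
import numpy as np

def cfm_training_pair(x0, x1, t, sigma_min=1e-4):
    # Straight-line path from noise x0 to data x1, with a small
    # noise floor sigma_min near the data end.
    x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1 - sigma_min) * x0  # target vector field along the path
    return x_t, u_t

# The CFM loss is then the (masked) MSE between the decoder's
# predicted field v(x_t, t | mu_y) and the target u_t.
x0 = np.random.randn(1, 80, 100)  # Gaussian noise sample
x1 = np.random.randn(1, 80, 100)  # target mel-spectrogram
x_t, u_t = cfm_training_pair(x0, x1, t=0.3)
```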
The total training loss is:

Total Loss = Duration Loss + Prior Loss + Flow Matching Loss
Inference Process
The synthesise method (matcha_tts.py:76) generates mel-spectrograms from text:
Step 1: Encode text and predict durations
Key Inference Parameters:
- n_timesteps: Number of ODE solver steps (10-50; higher = better quality but slower)
- temperature: Controls sampling variance (1.0 = normal; above 1.0 = more diverse; below 1.0 = more deterministic)
- length_scale: Speech-rate control (above 1.0 = slower; below 1.0 = faster)
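The roles of these parameters can be seen in a minimal fixed-step Euler sketch of the sampling loop. The real model integrates the learned decoder vector field; vector_field here is a stand-in, and all names are illustrative.

```python
import numpy as np

def synthesise_sketch(mu_y, vector_field, n_timesteps=10, temperature=1.0):
    # Start from temperature-scaled Gaussian noise and integrate the
    # vector field from t=0 to t=1 with n_timesteps Euler steps.
    x = temperature * np.random.randn(*mu_y.shape)
    dt = 1.0 / n_timesteps
    t = 0.0
    for _ in range(n_timesteps):
        x = x + dt * vector_field(x, t, mu_y)
        t += dt
    return x

mu_y = np.zeros((1, 80, 100))  # aligned encoder output (dummy)
# Toy field that pulls the sample toward mu_y, for illustration only.
mel = synthesise_sketch(mu_y, lambda x, t, mu: mu - x, n_timesteps=20)
```

More steps trace the ODE trajectory more accurately, which is why raising n_timesteps trades speed for quality.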
Monotonic Alignment Search (MAS)
MAS finds the optimal alignment between text and mel-spectrogram during training (matcha_tts.py:189-198):
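A simplified single-example sketch of the dynamic program behind MAS (the actual implementation is an optimized batched kernel; this version takes one log-likelihood matrix of shape (n_text, n_mel) and assumes n_mel >= n_text):

```python
import numpy as np

def monotonic_alignment_search(value):
    # value[i, j]: log-likelihood of mel frame j under text token i.
    # Each frame keeps the previous frame's token or advances by one.
    n_text, n_mel = value.shape
    score = np.full((n_text, n_mel), -np.inf)
    score[0, 0] = value[0, 0]
    for j in range(1, n_mel):
        for i in range(min(j + 1, n_text)):
            stay = score[i, j - 1]
            advance = score[i - 1, j - 1] if i > 0 else -np.inf
            score[i, j] = value[i, j] + max(stay, advance)
    # Backtrack to recover the hard monotonic alignment path.
    path = np.zeros((n_text, n_mel), dtype=np.int64)
    i = n_text - 1
    for j in range(n_mel - 1, -1, -1):
        path[i, j] = 1
        if i > 0 and (j == i or score[i - 1, j - 1] > score[i, j - 1]):
            i -= 1
    return path

# Toy example: likelihoods favour token 0 on frames 0-1,
# token 1 on frames 2-3, and token 2 on frame 4.
value = np.full((3, 5), -1.0)
value[0, 0] = value[0, 1] = 1.0
value[1, 2] = value[1, 3] = 1.0
value[2, 4] = 1.0
durations = monotonic_alignment_search(value).sum(axis=1)
```

Summing the path over the frame axis yields per-token durations, which is how the targets for the duration loss are extracted.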
Segment-based Training
To enable larger batch sizes, the model can train on random segments (matcha_tts.py:208-230):
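The idea can be sketched as slicing a random out_size-frame window from each spectrogram so that memory per example is bounded (NumPy stand-in; names are illustrative):

```python
import numpy as np

def random_segment(mel, out_size, rng=None):
    # Slice a random out_size-frame window from a (n_feats, n_frames)
    # mel; spectrograms shorter than out_size are returned unchanged.
    rng = rng or np.random.default_rng()
    n_frames = mel.shape[-1]
    if n_frames <= out_size:
        return mel
    start = rng.integers(0, n_frames - out_size + 1)
    return mel[..., start:start + out_size]

mel = np.random.randn(80, 500)
segment = random_segment(mel, out_size=172)
```

In training, the matching window of the aligned encoder output would be sliced with the same offset so the decoder's condition stays in sync.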
Output Format
The synthesise method returns a dictionary with:
Real-Time Factor (RTF) measures inference speed; RTF < 1.0 means faster than real-time. It is calculated as:

RTF = inference_time × 22050 / (n_mel_frames × 256)

where 22050 is the sample rate and 256 is the hop length.
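The computation can be written out directly, using the sample rate and hop length stated above (function name is illustrative):

```python
def real_time_factor(inference_seconds, n_mel_frames,
                     sample_rate=22050, hop_length=256):
    # Duration of the synthesized audio implied by the mel frame count.
    audio_seconds = n_mel_frames * hop_length / sample_rate
    return inference_seconds / audio_seconds

# e.g. 0.1 s to generate 172 frames (~2 s of audio) gives RTF ~ 0.05
rtf = real_time_factor(0.1, 172)
```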
Multi-Speaker Support
For multi-speaker models, speaker embeddings are added to both the encoder and decoder.

Related Components
- Text Encoder - Detailed encoder architecture
- Flow Matching - Conditional flow matching algorithm
- Decoder - U-Net based decoder architecture