Overview
The Conditional Flow Matching (CFM) module implements the core generative component of Matcha-TTS. It learns to transform noise into high-quality mel-spectrograms conditioned on encoder outputs, using optimal transport conditional flow matching.
Class Definition
CFM
Constructor Parameters
Input channels to the decoder (typically 2 * n_feats for concatenating noise and mean)
Output channels (number of mel-spectrogram features, typically 80)
CFM configuration parameters:
solver: ODE solver type ("euler")
sigma_min: Minimum noise level (default: 1e-4)
Decoder network parameters (U-Net architecture)
Number of speakers
Speaker embedding dimension
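As a rough illustration of how the constructor arguments above relate to each other, the sketch below builds the documented values in plain Python. The class name `CFMParamsSketch` and the variable names are hypothetical stand-ins chosen for this example, not the library's own configuration objects; only `solver`, `sigma_min`, and the `2 * n_feats` relationship come from the descriptions above.

```python
from dataclasses import dataclass


@dataclass
class CFMParamsSketch:
    """Hypothetical stand-in for the CFM configuration object."""
    solver: str = "euler"     # ODE solver type named in the parameter list
    sigma_min: float = 1e-4   # minimum noise level (documented default)


n_feats = 80                  # typical number of mel-spectrogram features
in_channels = 2 * n_feats     # noise and encoder mean concatenated along the channel axis
out_channels = n_feats        # the decoder predicts one value per mel feature
cfm_params = CFMParamsSketch()
```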
Methods
forward()
Generates a mel-spectrogram using the flow matching process (inference mode).
Parameters
Output of the text encoder (mean latent representation)
Shape: (batch_size, n_feats, mel_timesteps)
Output mask for valid frames
Shape: (batch_size, 1, mel_timesteps)
Number of ODE solver steps. More steps generally improve quality at the cost of speed. Typical range: 4-20
Temperature for scaling the initial noise. Higher values give more diverse output
Speaker embeddings for multi-speaker models
Shape: (batch_size, spk_emb_dim)
Additional conditioning (reserved for future use)
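The step count and temperature described above determine the two inputs the ODE solver starts from: a scaled noise sample and a time grid. A minimal NumPy sketch of that setup follows. NumPy arrays stand in for PyTorch tensors, and the variable names and the temperature value are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, n_feats, mel_timesteps = 2, 80, 50
n_timesteps = 10      # number of ODE solver steps
temperature = 0.667   # illustrative value; lower values shrink the initial noise

# Stand-in for the encoder output (mean latent representation).
mu = rng.standard_normal((batch_size, n_feats, mel_timesteps))

# Initial noise has the same shape as the encoder output and is scaled by the temperature.
x0 = rng.standard_normal(mu.shape) * temperature

# n_timesteps solver steps need n_timesteps + 1 time points between 0 and 1.
t_span = np.linspace(0.0, 1.0, n_timesteps + 1)
```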
Returns
Generated mel-spectrogram
Shape: (batch_size, n_feats, mel_timesteps)
compute_loss()
Computes the conditional flow matching loss during training.
Parameters
Target mel-spectrogram (ground truth)
Shape: (batch_size, n_feats, mel_timesteps)
Target mask for valid frames
Shape: (batch_size, 1, mel_timesteps)
Output of the encoder (mean)
Shape: (batch_size, n_feats, mel_timesteps)
Speaker embeddings
Shape: (batch_size, spk_emb_dim)
Additional conditioning (reserved for future use)
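The training computation can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the module's code: the function name `cfm_loss_sketch`, the `estimator` callable, and the choice to mask the error and normalise by the number of valid frames times `n_feats` are all assumptions. Speaker embeddings and additional conditioning are omitted for brevity.

```python
import numpy as np


def cfm_loss_sketch(x1, mask, mu, estimator, sigma_min=1e-4, rng=None):
    """Illustrative optimal-transport CFM loss (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    batch_size, n_feats, _ = x1.shape

    t = rng.uniform(size=(batch_size, 1, 1))   # one random timestep per batch item
    z = rng.standard_normal(x1.shape)          # noise sample x_0

    # Intermediate noisy sample on the straight path from noise to target.
    y = (1.0 - (1.0 - sigma_min) * t) * z + t * x1
    # Target flow: the time derivative of y along that path.
    u = x1 - (1.0 - sigma_min) * z

    pred = estimator(y, mask, mu, t)
    sq_err = ((pred - u) * mask) ** 2          # padded frames contribute nothing
    loss = sq_err.sum() / (mask.sum() * n_feats)
    return loss, y


def zero_estimator(y, mask, mu, t):
    """Toy estimator that always predicts zero flow."""
    return np.zeros_like(y)


rng = np.random.default_rng(0)
x1 = rng.standard_normal((2, 80, 50))          # stand-in target mel-spectrogram
mu = np.zeros_like(x1)                         # stand-in encoder output
mask = np.ones((2, 1, 50))
loss, y = cfm_loss_sketch(x1, mask, mu, zero_estimator, rng=rng)
```

The two return values mirror the ones documented for this method: a scalar loss and the intermediate noisy sample.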
Returns
Conditional flow matching loss (MSE between predicted and target flow)
Intermediate noisy sample at a random timestep
Shape: (batch_size, n_feats, mel_timesteps)
solve_euler()
Euler method ODE solver for the probability flow.
Parameters
Initial noise sample
Shape: (batch_size, n_feats, mel_timesteps)
Time steps for the ODE solver, from 0 to 1
Shape: (n_timesteps + 1,)
Encoder output (conditioning)
Shape: (batch_size, n_feats, mel_timesteps)
Output mask
Shape: (batch_size, 1, mel_timesteps)
Speaker embeddings
Additional conditioning
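A fixed-step Euler integration over these inputs can be sketched as below. The function name `solve_euler_sketch` and the `estimator` callable are hypothetical, NumPy arrays stand in for tensors, and speaker embeddings and additional conditioning are left out to keep the example short.

```python
import numpy as np


def solve_euler_sketch(x, t_span, mu, mask, estimator):
    """Integrate dx/dt = estimator(x, mask, mu, t) along t_span with Euler steps."""
    t = t_span[0]
    for t_next in t_span[1:]:
        dt = t_next - t
        dphi_dt = estimator(x, mask, mu, t)   # predicted flow at the current time
        x = x + dt * dphi_dt                  # one explicit Euler update
        t = t_next
    return x


def constant_field(x, mask, mu, t):
    """Toy estimator: a constant flow equal to the masked conditioning."""
    return mu * mask


x_init = np.zeros((1, 80, 20))
mu = np.ones((1, 80, 20))
mask = np.ones((1, 1, 20))
t_span = np.linspace(0.0, 1.0, 5)             # 4 steps need 5 time points
x_final = solve_euler_sketch(x_init, t_span, mu, mask, constant_field)
```

With a constant flow the Euler update is exact, so `x_final` equals `x_init + mu` up to floating-point error. A learned estimator has no such guarantee, which is why the step count trades speed against quality.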
Returns
Final generated mel-spectrogram after solving the ODE
Shape: (batch_size, n_feats, mel_timesteps)
Base Class
BASECFM
Abstract base class for conditional flow matching implementations.
Parameters
Number of mel-spectrogram features
CFM parameters including solver configuration
Number of speakers
Speaker embedding dimension
Flow Matching Details
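The conditional flow matching loss is computed as:
$$\mathcal{L}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1} \left\| v_\theta\big(\phi_t(x_0),\, t \mid \mu\big) - u_t \right\|^2, \qquad \phi_t(x_0) = \big(1 - (1 - \sigma_{\min})\,t\big)\,x_0 + t\,x_1, \qquad u_t = x_1 - (1 - \sigma_{\min})\,x_0$$
This is the standard optimal-transport form. Here $x_1$ is the target mel-spectrogram, $x_0 \sim \mathcal{N}(0, I)$ is the noise sample, $t \sim \mathcal{U}[0, 1]$ is the random timestep, $\mu$ is the encoder output, $\phi_t(x_0)$ is the intermediate noisy sample, $u_t$ is the target flow, and $v_\theta$ is the decoder network.
Example Usage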
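The following toy sketch runs without a trained model. It assumes the optimal-transport path `phi(t) = (1 - (1 - sigma_min) * t) * x0 + t * x1`, whose time derivative `u = x1 - (1 - sigma_min) * x0` is the flow the decoder is trained to predict. Feeding that ideal flow to an Euler loop shows where sampling ends up. All names are illustrative and NumPy stands in for PyTorch.

```python
import numpy as np

sigma_min = 1e-4
rng = np.random.default_rng(0)
x0 = rng.standard_normal((1, 80, 20))   # noise sample
x1 = rng.standard_normal((1, 80, 20))   # stand-in target mel-spectrogram


def phi(t):
    """Point on the assumed optimal-transport path at time t."""
    return (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1


# Ideal target flow: constant in t because the path is a straight line.
u = x1 - (1.0 - sigma_min) * x0

# Euler integration from t = 0 to t = 1 using the ideal flow.
n_timesteps = 4
t_span = np.linspace(0.0, 1.0, n_timesteps + 1)
x = x0.copy()
for t, t_next in zip(t_span[:-1], t_span[1:]):
    x = x + (t_next - t) * u

# The endpoint is the target plus a residual of size sigma_min * x0.
residual = x - x1
```

Because the ideal flow is constant, even a few Euler steps land on `phi(1.0)`, which differs from `x1` only by `sigma_min * x0`. In practice the decoder only approximates this flow, so more steps can still help.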
Source Reference
Implementation: matcha/models/components/flow_matching.py:121