Overview
The Decoder is a 1D U-Net that serves as the estimator network for the conditional flow matching process. It predicts the velocity field needed to transform noise into mel-spectrograms.
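To make the estimator's role concrete, here is a minimal sketch of how a predicted velocity field can be integrated with fixed-step Euler updates. The `toy_velocity` function is a hypothetical stand-in for the trained Decoder, and the straight-line path it assumes is an illustration only, not a description of the real model or its solver.

```python
def toy_velocity(x, t, target):
    """Hypothetical stand-in for the Decoder: velocity of a straight path to `target`."""
    return [(tgt - xi) / (1.0 - t) for xi, tgt in zip(x, target)]


def euler_integrate(x0, target, n_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler updates."""
    x = list(x0)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt  # t stays strictly below 1, so 1 - t is never zero
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise = [0.5, -1.2, 0.3]      # starting sample (stands in for Gaussian noise)
mel_frame = [1.0, 2.0, -0.5]  # made-up target values, not real mel features
result = euler_integrate(noise, mel_frame)
```

Because the toy velocity is constant along each trajectory, the Euler steps reach the target up to floating-point error. A learned estimator generally has no such guarantee, so the number of steps trades speed against accuracy.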
Class Definition
Decoder
Constructor Parameters
Number of input channels (concatenated noise, mean, and optionally speaker embeddings)
Number of output channels (mel-spectrogram features, typically 80)
Hidden channel dimensions for each level of the U-Net. Length determines depth
Dropout probability in transformer blocks
Dimension of each attention head
Number of transformer/conformer blocks at each level
Number of middle blocks (bottleneck)
Number of attention heads
Activation function: “snake”, “swish”, “mish”, “gelu”
Type of blocks in downsampling path: “transformer” or “conformer”
Type of blocks in middle (bottleneck): “transformer” or “conformer”
Type of blocks in upsampling path: “transformer” or “conformer”
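For reference, the four activation names listed above correspond to well-known functions. The sketch below gives their standard scalar definitions using only the standard library. It is an illustration, not the decoder's own code: in particular, Snake is shown with a fixed `alpha`, which is an assumption (implementations often make it learnable).

```python
import math


def swish(x):
    """Swish / SiLU: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def mish(x):
    """Mish: x * tanh(softplus(x))."""
    return x * math.tanh(math.log1p(math.exp(x)))


def gelu(x):
    """GELU (exact form): x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def snake(x, alpha=1.0):
    """Snake: x + sin^2(alpha * x) / alpha; alpha is fixed here for illustration."""
    return x + math.sin(alpha * x) ** 2 / alpha


# Lookup keyed by the same strings the activation parameter accepts.
ACTIVATIONS = {"snake": snake, "swish": swish, "mish": mish, "gelu": gelu}
values = {name: fn(1.0) for name, fn in ACTIVATIONS.items()}
```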
Methods
forward()
Forward pass through the U-Net decoder.
Parameters
Noisy mel-spectrogram at timestep t. Shape: (batch_size, in_channels, time)
Mask for valid time steps. Shape: (batch_size, 1, time)
Mean from encoder (conditioning). Shape: (batch_size, n_feats, time)
Current timestep (0 to 1) for flow matching. Shape: (batch_size,)
Speaker embeddings for multi-speaker models. Shape: (batch_size, spk_emb_dim)
Additional conditioning (reserved for future use)
Returns
Predicted velocity field (flow direction). Shape: (batch_size, out_channels, time)
Architecture Components
ResnetBlock1D
Residual block with time embedding.
Downsample1D
Downsampling layer using strided convolution.
Upsample1D
Upsampling layer using transposed convolution or interpolation.
Parameters
Number of input channels
Use convolution after interpolation
Use transposed convolution for upsampling
Number of output channels (defaults to input channels)
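The effect of strided and transposed convolutions on the time axis can be checked with the standard output-length formulas (dilation 1, no output padding). The kernel sizes, strides, and paddings below are assumed values chosen for illustration; they are not taken from this page.

```python
def conv1d_out_len(length, kernel_size, stride, padding):
    """Output length of a 1D convolution (dilation 1)."""
    return (length + 2 * padding - kernel_size) // stride + 1


def conv_transpose1d_out_len(length, kernel_size, stride, padding):
    """Output length of a 1D transposed convolution (dilation 1, no output padding)."""
    return (length - 1) * stride - 2 * padding + kernel_size


# Assumed, illustrative hyperparameters: kernel 3 / stride 2 / padding 1 going down,
# kernel 4 / stride 2 / padding 1 going up.
for time in (100, 101):
    down = conv1d_out_len(time, kernel_size=3, stride=2, padding=1)
    up = conv_transpose1d_out_len(down, kernel_size=4, stride=2, padding=1)
    print(time, down, up)  # prints "100 50 100", then "101 51 102"
```

With these assumed settings an odd input length comes back one frame longer after upsampling, so a U-Net typically has to crop or pad the upsampled tensor before merging it with the matching skip connection.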
SinusoidalPosEmb
Sinusoidal positional embeddings for time encoding.
Parameters
Embedding dimension (must be even)
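The even-dimension requirement follows from how such embeddings are usually built: half of the entries are sines and half are cosines of the same frequencies. The sketch below uses only the standard library. The frequency spacing and the `max_period` parameter are assumptions borrowed from the common Transformer-style formulation, and the actual class may scale or space the frequencies differently.

```python
import math


def sinusoidal_embedding(t, dim, max_period=10000.0):
    """Embed a scalar timestep t as [sin(t*f_0..), cos(t*f_0..)] with dim entries."""
    if dim % 2 != 0:
        raise ValueError("dim must be even: half the entries are sines, half cosines")
    half = dim // 2
    # Geometrically spaced frequencies from 1 down towards 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    sines = [math.sin(t * f) for f in freqs]
    cosines = [math.cos(t * f) for f in freqs]
    return sines + cosines


emb = sinusoidal_embedding(0.25, dim=8)  # list of 8 floats
```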
TimestepEmbedding
MLP for processing time embeddings.
Parameters
Input dimension from sinusoidal embeddings
Output embedding dimension
Activation function
Optional different output dimension
Optional activation after second linear layer
Optional conditioning projection dimension
ConformerWrapper
Wrapper for Conformer blocks (alternative to Transformer).
U-Net Architecture
The decoder follows a U-Net structure:
- Time Embedding: Sinusoidal embeddings + MLP
- Input Processing: Concatenate x, mu, and optionally speaker embeddings
- Downsampling Path: ResNet blocks + Transformer/Conformer + Downsample
- Middle Blocks: Multiple ResNet + Transformer/Conformer blocks
- Upsampling Path: ResNet blocks + Transformer/Conformer + Upsample with skip connections
- Output: Final convolution to output channels
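The channel bookkeeping implied by the structure above can be traced without any deep-learning library. The following is a generic U-Net sketch under stated assumptions: `trace_unet` is a hypothetical helper, each up level is assumed to concatenate exactly one skip tensor along the channel axis, and the example values (160 input channels from 80 noise plus 80 mean channels, no speaker embedding) are illustrative. The real decoder's wiring may differ in detail.

```python
def trace_unet(channels, in_channels, out_channels):
    """Return (stage, input_channels, output_channels) for each stage of a generic U-Net."""
    plan = []
    skips = []  # LIFO stack: the last skip saved is the first one consumed
    prev = in_channels
    for ch in channels:
        plan.append(("down", prev, ch))
        skips.append(ch)  # remember this level's output for the up path
        prev = ch
    plan.append(("mid", prev, prev))
    for ch in reversed(channels):
        skip = skips.pop()
        # Concatenating the skip tensor adds its channels to the input.
        plan.append(("up", prev + skip, ch))
        prev = ch
    plan.append(("out", prev, out_channels))
    return plan


plan = trace_unet(channels=(256, 256), in_channels=160, out_channels=80)
for stage, c_in, c_out in plan:
    print(f"{stage:>4}: {c_in} -> {c_out}")
```

The point to notice is that every up stage sees more input channels than the previous stage produced, because the skip connection is concatenated before the ResNet block.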
Example Usage
Configuration Examples
Lightweight Configuration
Heavy Configuration
Conformer-based Configuration
Source Reference
Implementation: matcha/models/components/decoder.py:200