Overview
The TextEncoder encodes input text (phoneme sequences) into latent representations and predicts phoneme durations. It uses a transformer-based architecture with rotary positional embeddings, and includes an optional convolutional prenet and a duration predictor.
Class Definition
TextEncoder
Constructor Parameters
encoder_type: Type of encoder architecture (typically "transformer")
encoder_params: Encoder architecture parameters:
  n_feats: Number of mel features
  n_channels: Hidden dimension size
  n_heads: Number of attention heads
  n_layers: Number of transformer layers
  filter_channels: Feedforward network dimension
  kernel_size: Convolution kernel size
  p_dropout: Dropout probability
  prenet: Whether to use a convolutional prenet
duration_predictor_params: Duration predictor parameters:
  filter_channels_dp: Filter channels for the duration predictor
  kernel_size: Convolution kernel size
  p_dropout: Dropout probability
n_vocab: Vocabulary size (number of phoneme symbols)
n_spks: Number of speakers
spk_emb_dim: Speaker embedding dimension
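The constructor parameters above can be sketched as plain dictionaries. The field names follow the parameter list above; the values are illustrative placeholders, not the project's shipped defaults.

```python
# Illustrative hyperparameter dictionaries for a TextEncoder configuration.
# Values are placeholders, not Matcha-TTS's actual defaults.
encoder_params = {
    "n_feats": 80,           # number of mel features
    "n_channels": 192,       # hidden dimension size
    "n_heads": 2,            # attention heads
    "n_layers": 6,           # transformer layers
    "filter_channels": 768,  # feedforward network dimension
    "kernel_size": 3,        # convolution kernel size
    "p_dropout": 0.1,        # dropout probability
    "prenet": True,          # use the convolutional prenet
}

duration_predictor_params = {
    "filter_channels_dp": 256,  # filter channels for the duration predictor
    "kernel_size": 3,
    "p_dropout": 0.1,
}
```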
Methods
forward()
Runs the forward pass through the encoder and duration predictor.

Parameters
x: Text input as phoneme IDs. Shape: (batch_size, max_text_length)
x_lengths: Text input lengths. Shape: (batch_size,)
spks: Speaker IDs for multi-speaker models. Shape: (batch_size,)

Returns
mu: Average encoder output (mean latent representation). Shape: (batch_size, n_feats, max_text_length)
logw: Log-scaled duration predictions. Shape: (batch_size, 1, max_text_length)
x_mask: Mask for the text input. Shape: (batch_size, 1, max_text_length)

Components
Encoder
Transformer encoder with multi-head attention and feedforward networks.

Parameters
hidden_channels: Hidden dimension size
filter_channels: Feedforward network dimension
n_heads: Number of attention heads
n_layers: Number of transformer layers
kernel_size: Convolution kernel size in the FFN
p_dropout: Dropout probability
DurationPredictor
Predicts phoneme durations using convolutional layers.

Parameters
in_channels: Input dimension
filter_channels: Number of filter channels
kernel_size: Convolution kernel size
p_dropout: Dropout probability
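A duration predictor of this kind can be sketched as a small stack of 1D convolutions followed by a single-channel projection. This is an illustrative sketch of the pattern described above (conv, ReLU, LayerNorm, dropout, twice, then a 1×1 projection), not Matcha-TTS's exact implementation.

```python
import torch
import torch.nn as nn

class DurationPredictorSketch(nn.Module):
    """Illustrative convolutional duration predictor (not the exact source)."""

    def __init__(self, in_channels, filter_channels, kernel_size, p_dropout):
        super().__init__()
        padding = kernel_size // 2  # keep the time dimension unchanged
        self.conv1 = nn.Conv1d(in_channels, filter_channels, kernel_size, padding=padding)
        self.conv2 = nn.Conv1d(filter_channels, filter_channels, kernel_size, padding=padding)
        self.norm1 = nn.LayerNorm(filter_channels)
        self.norm2 = nn.LayerNorm(filter_channels)
        self.drop = nn.Dropout(p_dropout)
        self.proj = nn.Conv1d(filter_channels, 1, 1)  # one log-duration per frame

    def forward(self, x, x_mask):
        # x: (batch, in_channels, time); x_mask: (batch, 1, time)
        x = torch.relu(self.conv1(x * x_mask))
        x = self.drop(self.norm1(x.transpose(1, 2)).transpose(1, 2))
        x = torch.relu(self.conv2(x * x_mask))
        x = self.drop(self.norm2(x.transpose(1, 2)).transpose(1, 2))
        # (batch, 1, time), matching the logw shape documented above
        return self.proj(x * x_mask) * x_mask
```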
MultiHeadAttention
Multi-head attention with rotary positional embeddings.

Parameters
channels: Input channel dimension
out_channels: Output channel dimension
n_heads: Number of attention heads
heads_share: Whether attention heads share parameters
p_dropout: Dropout probability
proximal_bias: Whether to use a proximal bias in attention
proximal_init: Whether to initialize the query and key projections with the same weights
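The attention core can be sketched as follows. Rotary embeddings, head sharing, and the proximal bias are omitted for brevity; the parameter names follow the list above, but this is an illustrative sketch, not Matcha-TTS's exact implementation.

```python
import torch
import torch.nn as nn

class MultiHeadAttentionSketch(nn.Module):
    """Illustrative multi-head self-attention over (batch, channels, time)."""

    def __init__(self, channels, out_channels, n_heads, p_dropout=0.0):
        super().__init__()
        assert channels % n_heads == 0
        self.n_heads = n_heads
        self.k_channels = channels // n_heads
        # 1x1 convolutions act as per-timestep linear projections
        self.conv_q = nn.Conv1d(channels, channels, 1)
        self.conv_k = nn.Conv1d(channels, channels, 1)
        self.conv_v = nn.Conv1d(channels, channels, 1)
        self.conv_o = nn.Conv1d(channels, out_channels, 1)
        self.drop = nn.Dropout(p_dropout)

    def forward(self, x, attn_mask=None):
        b, _, t = x.shape
        # reshape to (batch, heads, time, k_channels)
        q = self.conv_q(x).view(b, self.n_heads, self.k_channels, t).transpose(2, 3)
        k = self.conv_k(x).view(b, self.n_heads, self.k_channels, t).transpose(2, 3)
        v = self.conv_v(x).view(b, self.n_heads, self.k_channels, t).transpose(2, 3)
        scores = torch.matmul(q, k.transpose(-2, -1)) / self.k_channels**0.5
        if attn_mask is not None:
            scores = scores.masked_fill(attn_mask == 0, -1e4)
        attn = self.drop(torch.softmax(scores, dim=-1))
        out = torch.matmul(attn, v).transpose(2, 3).contiguous().view(b, -1, t)
        return self.conv_o(out)  # (batch, out_channels, time)
```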
RotaryPositionalEmbeddings
Rotary positional embeddings (RoPE) for improved position encoding.

Parameters
d: Number of features to rotate (should be even); half of the actual feature dimension
base: Base value for computing the rotation angles
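The core RoPE operation rotates each pair of features (x_2i, x_2i+1) by an angle that grows with the position, theta_i = pos * base^(-2i/d). The sketch below shows only that rotation in plain Python; the actual tensor layout in Matcha-TTS may differ.

```python
import math

def rope(x, pos, base=10_000.0):
    """Rotate consecutive feature pairs of x by position-dependent angles.

    Illustrative sketch of rotary positional embeddings; not the exact
    Matcha-TTS implementation. x is a flat list with an even length.
    """
    d = len(x)  # must be even
    out = [0.0] * d
    for i in range(d // 2):
        theta = pos * base ** (-2 * i / d)  # rotation angle for pair i
        c, s = math.cos(theta), math.sin(theta)
        out[2 * i] = x[2 * i] * c - x[2 * i + 1] * s
        out[2 * i + 1] = x[2 * i] * s + x[2 * i + 1] * c
    return out
```

Because the transform is a pure rotation, position 0 leaves the input unchanged and the vector norm is preserved at every position, which is what makes RoPE encode relative offsets inside attention dot products.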
Example Usage
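The sketch below prepares inputs matching the shapes documented in forward() and builds the padding mask the encoder returns. The TextEncoder itself is not constructed here; the commented call at the end shows the assumed input/output contract only.

```python
import torch

# Hypothetical input preparation for TextEncoder.forward(), following the
# shapes documented above: x is (batch_size, max_text_length) phoneme IDs,
# x_lengths is (batch_size,).
batch_size, max_text_length, n_vocab = 2, 7, 100
x = torch.randint(0, n_vocab, (batch_size, max_text_length))  # phoneme IDs
x_lengths = torch.tensor([7, 4])                              # valid lengths

# The mask forward() returns: 1 for real tokens, 0 for padding,
# shape (batch_size, 1, max_text_length).
x_mask = (torch.arange(max_text_length)[None, :] < x_lengths[:, None]).unsqueeze(1).float()

# Assumed call signature (see forward() above); uncomment with a real encoder:
# mu, logw, x_mask = encoder(x, x_lengths, spks=None)
# mu:   (batch_size, n_feats, max_text_length)
# logw: (batch_size, 1, max_text_length)
```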
Source Reference
Implementation: matcha/models/components/text_encoder.py:328