
Overview

The TextEncoder encodes input text (phoneme sequences) into latent representations and predicts phoneme durations. It uses a transformer-based architecture with rotary positional embeddings and includes an optional prenet and a duration predictor.

Class Definition

TextEncoder

from matcha.models.components.text_encoder import TextEncoder

encoder = TextEncoder(
    encoder_type="transformer",
    encoder_params=encoder_config,
    duration_predictor_params=duration_config,
    n_vocab=148,
    n_spks=1,
    spk_emb_dim=128
)

Constructor Parameters

encoder_type
str
required
Type of encoder architecture (typically “transformer”)
encoder_params
object
required
Encoder architecture parameters:
  • n_feats: Number of mel features
  • n_channels: Hidden dimension size
  • n_heads: Number of attention heads
  • n_layers: Number of transformer layers
  • filter_channels: Feedforward network dimension
  • kernel_size: Convolution kernel size
  • p_dropout: Dropout probability
  • prenet: Whether to use convolutional prenet
duration_predictor_params
object
required
Duration predictor parameters:
  • filter_channels_dp: Filter channels for duration predictor
  • kernel_size: Convolution kernel size
  • p_dropout: Dropout probability
n_vocab
int
required
Vocabulary size (number of phoneme symbols)
n_spks
int
default:"1"
Number of speakers
spk_emb_dim
int
default:"128"
Speaker embedding dimension

Methods

forward()

Runs the forward pass through the encoder and duration predictor.
def forward(
    x: torch.Tensor,
    x_lengths: torch.Tensor,
    spks: torch.Tensor = None
) -> tuple

Parameters

x
torch.Tensor
required
Text input as phoneme IDs. Shape: (batch_size, max_text_length)
x_lengths
torch.Tensor
required
Text input lengths. Shape: (batch_size,)
spks
torch.Tensor
default:"None"
Speaker IDs for multi-speaker models. Shape: (batch_size,)

Returns

mu
torch.Tensor
Predicted mean of the latent representation (μ). Shape: (batch_size, n_feats, max_text_length)
logw
torch.Tensor
Log-scaled duration predictions. Shape: (batch_size, 1, max_text_length)
x_mask
torch.Tensor
Binary padding mask for the text input. Shape: (batch_size, 1, max_text_length)
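The mask marks valid (non-padded) positions derived from x_lengths; it is broadcast against mu and logw so padded positions contribute nothing. A minimal pure-Python sketch of the idea (this sequence_mask is illustrative, not the library helper):

```python
def sequence_mask(lengths, max_len=None):
    """Build a binary padding mask from per-example lengths.

    Entry [i][t] is 1 while t < lengths[i], else 0 -- the same idea
    behind x_mask, which additionally carries a channel dim of 1.
    """
    max_len = max_len if max_len is not None else max(lengths)
    return [[1 if t < n else 0 for t in range(max_len)] for n in lengths]
```

For example, sequence_mask([50, 35]) yields a 2×50 mask whose second row has 35 ones followed by 15 zeros.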

Components

Encoder

Transformer encoder with multi-head attention and feedforward networks.
from matcha.models.components.text_encoder import Encoder

encoder = Encoder(
    hidden_channels=192,
    filter_channels=768,
    n_heads=2,
    n_layers=6,
    kernel_size=3,
    p_dropout=0.1
)

Parameters

hidden_channels
int
required
Hidden dimension size
filter_channels
int
required
Feedforward network dimension
n_heads
int
required
Number of attention heads
n_layers
int
required
Number of transformer layers
kernel_size
int
default:"1"
Convolution kernel size in FFN
p_dropout
float
default:"0.0"
Dropout probability

DurationPredictor

Predicts phoneme durations using convolutional layers.
from matcha.models.components.text_encoder import DurationPredictor

duration_predictor = DurationPredictor(
    in_channels=192,
    filter_channels=256,
    kernel_size=3,
    p_dropout=0.5
)

Parameters

in_channels
int
required
Input dimension
filter_channels
int
required
Number of filter channels
kernel_size
int
required
Convolution kernel size
p_dropout
float
required
Dropout probability

MultiHeadAttention

Multi-head attention with rotary positional embeddings.
from matcha.models.components.text_encoder import MultiHeadAttention

attn = MultiHeadAttention(
    channels=192,
    out_channels=192,
    n_heads=2,
    heads_share=True,
    p_dropout=0.1,
    proximal_bias=False,
    proximal_init=False
)

Parameters

channels
int
required
Input channel dimension
out_channels
int
required
Output channel dimension
n_heads
int
required
Number of attention heads
heads_share
bool
default:"True"
Whether attention heads share parameters
p_dropout
float
default:"0.0"
Dropout probability
proximal_bias
bool
default:"False"
Use proximal bias in attention
proximal_init
bool
default:"False"
Initialize query and key projections similarly
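When proximal_bias is enabled, the self-attention logits receive an additive bias that favors nearby positions; the commonly used form (as in Glow-TTS) is b[i][j] = −log(1 + |i − j|). A small sketch, assuming that formula:

```python
import math

def proximal_bias(n):
    """Bias added to self-attention logits: 0 on the diagonal,
    increasingly negative as the distance |i - j| grows."""
    return [[-math.log1p(abs(i - j)) for j in range(n)] for i in range(n)]
```

This nudges each phoneme to attend mostly to its neighbors, which suits the largely monotonic structure of text-to-speech alignment.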

RotaryPositionalEmbeddings

Rotary positional embeddings (RoPE) for improved position encoding.
from matcha.models.components.text_encoder import RotaryPositionalEmbeddings

rope = RotaryPositionalEmbeddings(
    d=64,  # half of the feature dimension
    base=10000
)

Parameters

d
int
required
Number of rotary features (must be even); half of the per-head feature dimension
base
int
default:"10000"
Base value for calculating rotation angles
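RoPE encodes position by rotating each consecutive feature pair by a position-dependent angle. A simplified single-pair sketch, assuming the standard angle schedule θ = pos · base^(−2i/d) for pair index i:

```python
import math

def rotate_pair(x0, x1, pos, i, d, base=10000):
    """Rotate one feature pair by theta = pos * base**(-2*i/d).

    At pos=0 every pair is unchanged; later positions rotate further.
    Because rotations compose, dot products between rotated queries
    and keys depend only on the relative offset between positions.
    """
    theta = pos * base ** (-2 * i / d)
    c, s = math.cos(theta), math.sin(theta)
    return x0 * c - x1 * s, x0 * s + x1 * c
```

Low pair indices rotate quickly (fine-grained position), high indices slowly (coarse position), analogous to sinusoidal embeddings.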

Example Usage

import torch
from matcha.models.components.text_encoder import TextEncoder
from types import SimpleNamespace

# Configure encoder
encoder_params = SimpleNamespace(
    n_feats=80,
    n_channels=192,
    n_heads=2,
    n_layers=6,
    filter_channels=768,
    kernel_size=3,
    p_dropout=0.1,
    prenet=True
)

duration_params = SimpleNamespace(
    filter_channels_dp=256,
    kernel_size=3,
    p_dropout=0.5
)

# Create encoder
encoder = TextEncoder(
    encoder_type="transformer",
    encoder_params=encoder_params,
    duration_predictor_params=duration_params,
    n_vocab=148,
    n_spks=1,
    spk_emb_dim=128
)

# Example input
x = torch.randint(0, 148, (2, 50))  # batch_size=2, max_length=50
x_lengths = torch.LongTensor([50, 35])

# Forward pass
mu, logw, x_mask = encoder(x, x_lengths)

print(f"Encoder output shape: {mu.shape}")  # (2, 80, 50)
print(f"Duration prediction shape: {logw.shape}")  # (2, 1, 50)
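At inference time, the predicted log-durations are typically exponentiated, masked, optionally scaled for speaking rate, and rounded up to integer frame counts. A sketch of the usual Glow-TTS-style conversion for a single sequence (length_scale here is a common inference knob, not a forward() argument):

```python
import math

def logw_to_durations(logw, mask, length_scale=1.0):
    """Convert per-phoneme log-durations to integer frame counts.

    Padded positions (mask == 0) get zero frames; valid positions get
    ceil(exp(logw) * length_scale) frames.
    """
    return [math.ceil(math.exp(lw) * length_scale) if m else 0
            for lw, m in zip(logw, mask)]
```

The resulting frame counts determine how many mel frames each phoneme's μ is repeated for before decoding.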

Source Reference

Implementation: matcha/models/components/text_encoder.py:328
