
Overview

The TextEncoder encodes input text (phoneme sequences) into latent representations and predicts phoneme durations. It uses a transformer-based architecture with rotary positional embeddings and includes an optional prenet and a duration predictor.

Class Definition

TextEncoder

from matcha.models.components.text_encoder import TextEncoder

encoder = TextEncoder(
    encoder_type="transformer",
    encoder_params=encoder_config,
    duration_predictor_params=duration_config,
    n_vocab=148,
    n_spks=1,
    spk_emb_dim=128
)

Constructor Parameters

encoder_type
str
required
Type of encoder architecture (typically “transformer”)
encoder_params
object
required
Encoder architecture parameters:
  • n_feats: Number of mel features
  • n_channels: Hidden dimension size
  • n_heads: Number of attention heads
  • n_layers: Number of transformer layers
  • filter_channels: Feedforward network dimension
  • kernel_size: Convolution kernel size
  • p_dropout: Dropout probability
  • prenet: Whether to use convolutional prenet
duration_predictor_params
object
required
Duration predictor parameters:
  • filter_channels_dp: Filter channels for duration predictor
  • kernel_size: Convolution kernel size
  • p_dropout: Dropout probability
n_vocab
int
required
Vocabulary size (number of phoneme symbols)
n_spks
int
default:"1"
Number of speakers
spk_emb_dim
int
default:"128"
Speaker embedding dimension

Methods

forward()

Runs the forward pass through the encoder and duration predictor.
def forward(
    x: torch.Tensor,
    x_lengths: torch.Tensor,
    spks: torch.Tensor = None
) -> tuple

Parameters

x
torch.Tensor
required
Text input as phoneme IDs. Shape: (batch_size, max_text_length)
x_lengths
torch.Tensor
required
Text input lengths. Shape: (batch_size,)
spks
torch.Tensor
default:"None"
Speaker IDs for multi-speaker models. Shape: (batch_size,)

Returns

mu
torch.Tensor
Predicted mean of the latent representation (μ). Shape: (batch_size, n_feats, max_text_length)
logw
torch.Tensor
Log-scaled duration predictions. Shape: (batch_size, 1, max_text_length)
x_mask
torch.Tensor
Binary padding mask for the text input. Shape: (batch_size, 1, max_text_length)
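The mask marks valid (non-padded) positions derived from x_lengths; it is broadcast against mu and logw so padded positions contribute nothing. A minimal pure-Python sketch of the idea (this sequence_mask is illustrative, not the library helper):

```python
def sequence_mask(lengths, max_len=None):
    """Build a binary padding mask from per-example lengths.

    Entry [i][t] is 1 while t < lengths[i], else 0 -- the same idea
    behind x_mask, which additionally carries a channel dim of 1.
    """
    max_len = max_len if max_len is not None else max(lengths)
    return [[1 if t < n else 0 for t in range(max_len)] for n in lengths]
```

For example, sequence_mask([50, 35]) yields a 2×50 mask whose second row has 35 ones followed by 15 zeros.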

Components

Encoder

Transformer encoder with multi-head attention and feedforward networks.
from matcha.models.components.text_encoder import Encoder

encoder = Encoder(
    hidden_channels=192,
    filter_channels=768,
    n_heads=2,
    n_layers=6,
    kernel_size=3,
    p_dropout=0.1
)

Parameters

hidden_channels
int
required
Hidden dimension size
filter_channels
int
required
Feedforward network dimension
n_heads
int
required
Number of attention heads
n_layers
int
required
Number of transformer layers
kernel_size
int
default:"1"
Convolution kernel size in FFN
p_dropout
float
default:"0.0"
Dropout probability

DurationPredictor

Predicts phoneme durations using convolutional layers.
from matcha.models.components.text_encoder import DurationPredictor

duration_predictor = DurationPredictor(
    in_channels=192,
    filter_channels=256,
    kernel_size=3,
    p_dropout=0.5
)

Parameters

in_channels
int
required
Input dimension
filter_channels
int
required
Number of filter channels
kernel_size
int
required
Convolution kernel size
p_dropout
float
required
Dropout probability

MultiHeadAttention

Multi-head attention with rotary positional embeddings.
from matcha.models.components.text_encoder import MultiHeadAttention

attn = MultiHeadAttention(
    channels=192,
    out_channels=192,
    n_heads=2,
    heads_share=True,
    p_dropout=0.1,
    proximal_bias=False,
    proximal_init=False
)

Parameters

channels
int
required
Input channel dimension
out_channels
int
required
Output channel dimension
n_heads
int
required
Number of attention heads
heads_share
bool
default:"True"
Whether attention heads share parameters
p_dropout
float
default:"0.0"
Dropout probability
proximal_bias
bool
default:"False"
Use proximal bias in attention
proximal_init
bool
default:"False"
Initialize query and key projections similarly
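When proximal_bias is enabled, the self-attention logits receive an additive bias that favors nearby positions; the commonly used form (as in Glow-TTS) is b[i][j] = −log(1 + |i − j|). A small sketch, assuming that formula:

```python
import math

def proximal_bias(n):
    """Bias added to self-attention logits: 0 on the diagonal,
    increasingly negative as the distance |i - j| grows."""
    return [[-math.log1p(abs(i - j)) for j in range(n)] for i in range(n)]
```

This nudges each phoneme to attend mostly to its neighbors, which suits the largely monotonic structure of text-to-speech alignment.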

RotaryPositionalEmbeddings

Rotary positional embeddings (RoPE) for improved position encoding.
from matcha.models.components.text_encoder import RotaryPositionalEmbeddings

rope = RotaryPositionalEmbeddings(
    d=64,  # half of the feature dimension
    base=10000
)

Parameters

d
int
required
Number of rotary features (must be even); half of the per-head feature dimension
base
int
default:"10000"
Base value for calculating rotation angles
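RoPE encodes position by rotating each consecutive feature pair by a position-dependent angle. A simplified single-pair sketch, assuming the standard angle schedule θ = pos · base^(−2i/d) for pair index i:

```python
import math

def rotate_pair(x0, x1, pos, i, d, base=10000):
    """Rotate one feature pair by theta = pos * base**(-2*i/d).

    At pos=0 every pair is unchanged; later positions rotate further.
    Because rotations compose, dot products between rotated queries
    and keys depend only on the relative offset between positions.
    """
    theta = pos * base ** (-2 * i / d)
    c, s = math.cos(theta), math.sin(theta)
    return x0 * c - x1 * s, x0 * s + x1 * c
```

Low pair indices rotate quickly (fine-grained position), high indices slowly (coarse position), analogous to sinusoidal embeddings.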

Example Usage

import torch
from matcha.models.components.text_encoder import TextEncoder
from types import SimpleNamespace

# Configure encoder
encoder_params = SimpleNamespace(
    n_feats=80,
    n_channels=192,
    n_heads=2,
    n_layers=6,
    filter_channels=768,
    kernel_size=3,
    p_dropout=0.1,
    prenet=True
)

duration_params = SimpleNamespace(
    filter_channels_dp=256,
    kernel_size=3,
    p_dropout=0.5
)

# Create encoder
encoder = TextEncoder(
    encoder_type="transformer",
    encoder_params=encoder_params,
    duration_predictor_params=duration_params,
    n_vocab=148,
    n_spks=1,
    spk_emb_dim=128
)

# Example input
x = torch.randint(0, 148, (2, 50))  # batch_size=2, max_length=50
x_lengths = torch.LongTensor([50, 35])

# Forward pass
mu, logw, x_mask = encoder(x, x_lengths)

print(f"Encoder output shape: {mu.shape}")  # (2, 80, 50)
print(f"Duration prediction shape: {logw.shape}")  # (2, 1, 50)
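At inference time, the predicted log-durations are typically exponentiated, masked, optionally scaled for speaking rate, and rounded up to integer frame counts. A sketch of the usual Glow-TTS-style conversion for a single sequence (length_scale here is a common inference knob, not a forward() argument):

```python
import math

def logw_to_durations(logw, mask, length_scale=1.0):
    """Convert per-phoneme log-durations to integer frame counts.

    Padded positions (mask == 0) get zero frames; valid positions get
    ceil(exp(logw) * length_scale) frames.
    """
    return [math.ceil(math.exp(lw) * length_scale) if m else 0
            for lw, m in zip(logw, mask)]
```

The resulting frame counts determine how many mel frames each phoneme's μ is repeated for before decoding.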

Source Reference

Implementation: matcha/models/components/text_encoder.py:328
