Overview
The matcha-tts-get-durations utility extracts phoneme durations from a trained Matcha-TTS model using Monotonic Alignment Search (MAS). These durations can be used for training other TTS architectures or analyzing alignment patterns.
Command Line Interface
matcha-tts-get-durations
matcha-tts-get-durations -c <checkpoint> -i <config> [OPTIONS]
Arguments
- -c (required): Path to the trained Matcha-TTS checkpoint file
- -i, --input-config (str, default: "ljspeech.yaml"): Name of the YAML config file under configs/data/
- -b (int): Batch size for processing (higher is faster but uses more memory)
- -o: Output folder for duration files. Defaults to <data_path>/durations/
- -f: Force overwrite of an existing duration folder
- --cpu: Use the CPU instead of the GPU (not recommended; much slower)
Python API
compute_durations()
Compute durations for all samples in a dataloader.
import torch
from pathlib import Path

from matcha.utils.get_durations_from_trained_model import compute_durations

compute_durations(
    data_loader=train_loader,
    model=matcha_model,
    device=torch.device("cuda"),
    output_folder=Path("./durations"),
)
Parameters
- data_loader (torch.utils.data.DataLoader, required): DataLoader containing text and mel-spectrogram pairs
- model (required): Trained Matcha-TTS model used to produce the alignments
- device (torch.device): Device to run inference on ("cuda" or "cpu")
- output_folder (Path): Directory to save duration files
save_durations_to_folder()
Save durations for a single sample.
from pathlib import Path

from matcha.utils.get_durations_from_trained_model import save_durations_to_folder

save_durations_to_folder(
    attn=attention_matrix,
    x_length=text_length,
    y_length=mel_length,
    filepath="LJ001-0001.wav",
    output_folder=Path("./durations"),
    text="phoneme_sequence",
)
Parameters
- attn: Attention/alignment matrix from the model. Shape: (1, text_length, mel_length)
- x_length: Actual length of the text/phoneme sequence
- y_length: Actual length of the mel-spectrogram
- filepath: Original audio file path (used for naming the output files)
- output_folder: Directory to save duration files
- text: Phoneme sequence for the JSON output
Output Files
For each utterance, two files are generated:
NumPy Array (.npy)
import numpy as np
durations = np.load("LJ001-0001.npy")
# Shape: (text_length,)
# Example: [3, 5, 2, 8, 4, ...] # frames per phoneme
JSON File (.json)
{
  "phonemes": ["pau", "HH", "EH", "L", "OW", "pau"],
  "durations": [3, 5, 2, 8, 4, 2],
  "total_frames": 24
}
Examples
Basic Usage
matcha-tts-get-durations \
-c checkpoints/best_model.ckpt \
-i ljspeech.yaml
Output:
Loading model...
[+] matcha_custom loaded!
Computing stats for training set if exists...
🍵 Computing durations 🍵: 100%|██████████| 410/410 [02:15<00:00]
Computing stats for validation set if exists...
🍵 Computing durations 🍵: 100%|██████████| 50/50 [00:15<00:00]
[+] Done! Data statistics saved to: data/ljspeech/durations
Custom Output Folder
matcha-tts-get-durations \
-c model.ckpt \
-i vctk.yaml \
-o ./my_durations \
-b 64
Force Overwrite
matcha-tts-get-durations \
-c model.ckpt \
-i ljspeech.yaml \
-f
CPU Mode
matcha-tts-get-durations \
-c model.ckpt \
-i ljspeech.yaml \
--cpu
Python Script Example
import torch
from pathlib import Path
from matcha.models.matcha_tts import MatchaTTS
from matcha.data.text_mel_datamodule import TextMelDataModule
from matcha.utils.get_durations_from_trained_model import compute_durations
# Load model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = MatchaTTS.load_from_checkpoint(
    "checkpoints/best_model.ckpt",
    map_location=device,
)
model.eval()

# Setup data
datamodule = TextMelDataModule(
    train_filelist_path="data/ljspeech/train.txt",
    valid_filelist_path="data/ljspeech/val.txt",
    batch_size=32,
    # ... other config
)
datamodule.setup()
# Compute durations
output_folder = Path("./durations")
output_folder.mkdir(exist_ok=True)
print("Processing training set...")
train_loader = datamodule.train_dataloader()
compute_durations(train_loader, model, device, output_folder)
print("Processing validation set...")
val_loader = datamodule.val_dataloader()
compute_durations(val_loader, model, device, output_folder)
print(f"Durations saved to {output_folder}")
How It Works
- Forward Pass: Run the model with text and mel inputs
- Get Alignment: Extract the attention matrix produced by MAS
- Sum Durations: Sum attention weights along the mel dimension
- Save Files: Write the NumPy and JSON files
# Simplified extraction logic
attn = model(x, x_lengths, y, y_lengths)[-1] # Get alignment
durations = attn.squeeze().sum(dim=1)[:x_length] # Sum over mel frames
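Because MAS yields a hard monotonic alignment, each mel frame is assigned to exactly one phoneme, so summing the attention matrix over the mel axis gives integer frame counts whose total equals the mel length. A toy NumPy sketch of that summation step, using a hand-built alignment (it mirrors the torch logic above):

```python
import numpy as np

# Toy hard alignment from MAS: 4 phonemes x 10 mel frames,
# with exactly one 1 in each mel-frame column.
attn = np.zeros((1, 4, 10))
assignments = [0, 0, 0, 1, 1, 2, 2, 2, 2, 3]  # phoneme index per mel frame
for frame, ph in enumerate(assignments):
    attn[0, ph, frame] = 1.0

# Sum over the mel axis to get frames-per-phoneme
durations = attn.squeeze(0).sum(axis=1).astype(int)
print(durations.tolist())  # [3, 2, 4, 1]
assert durations.sum() == 10  # every mel frame is accounted for
```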
Use Cases
Training Non-Autoregressive TTS
Use extracted durations to train models that require ground-truth alignments:
# In your TTS model
durations = np.load(f"{audio_id}.npy")
expanded_text = expand_with_durations(text, durations)
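`expand_with_durations` above is a hypothetical helper; a common way to implement this "length regulation" is to repeat each phoneme entry by its duration, e.g. with `np.repeat`. A minimal sketch:

```python
import numpy as np


def expand_with_durations(phoneme_ids, durations):
    """Repeat each phoneme id by its duration (length regulation).

    phoneme_ids: (text_length,) sequence of ids or embeddings indices.
    durations:   (text_length,) frame counts per phoneme.
    Returns a (total_frames,) array aligned to the mel-spectrogram.
    """
    return np.repeat(np.asarray(phoneme_ids), np.asarray(durations, dtype=int))


expanded = expand_with_durations([7, 3, 9], [2, 1, 3])
print(expanded.tolist())  # [7, 7, 3, 9, 9, 9]
```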
Alignment Analysis
Analyze alignment patterns:
import matplotlib.pyplot as plt
durations = np.load("LJ001-0001.npy")
plt.bar(range(len(durations)), durations)
plt.xlabel("Phoneme Index")
plt.ylabel("Duration (frames)")
plt.title("Phoneme Durations")
plt.show()
Prosody Transfer
Transfer duration patterns between utterances:
source_durations = np.load("source.npy")
target_text = load_text("target.txt")
# Apply source durations to target text
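When the two phoneme sequences differ in length, one simple approach is to resample the source durations to the target length and rescale so the total frame count is preserved. `transfer_durations` below is an illustrative sketch of that idea, not part of Matcha-TTS:

```python
import numpy as np


def transfer_durations(source_durations, target_len):
    """Map source durations onto a target sequence of a different length.

    Nearest-index resampling picks a source duration for each target
    position; rescaling keeps the total number of mel frames the same.
    """
    src = np.asarray(source_durations, dtype=float)
    idx = np.linspace(0, len(src) - 1, target_len).round().astype(int)
    resampled = src[idx]
    # Rescale so the target keeps the source's total frame count
    resampled *= src.sum() / resampled.sum()
    # Round to whole frames; every phoneme gets at least one frame
    return np.maximum(1, resampled.round().astype(int))
```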
Processing Speed
For LJSpeech (13,100 samples):
- GPU (RTX 3090): ~3-5 minutes (batch_size=32)
- GPU (RTX 3090): ~2-3 minutes (batch_size=64)
- CPU: ~45-60 minutes (batch_size=32)
Memory Usage
- Batch size 16: ~2 GB GPU memory
- Batch size 32: ~4 GB GPU memory
- Batch size 64: ~8 GB GPU memory
Error Handling
Checkpoint Not Found
FileNotFoundError: Checkpoint not found: model.ckpt
Solution: Verify checkpoint path
Folder Already Exists
Folder already exists. Use -f to force overwrite
Solution: Use -f flag or delete folder
Out of Memory
RuntimeError: CUDA out of memory
Solution: Reduce batch size with -b 16 or -b 8
Configuration Requirements
Your data config must include:
train_filelist_path: "data/train.txt"
valid_filelist_path: "data/val.txt"
batch_size: 32
n_feats: 80
# ... other parameters
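Before launching a long extraction run, it can be worth checking that the loaded config actually contains these fields. A small sketch (parse the YAML with e.g. `yaml.safe_load` first; `validate_data_config` is an illustrative helper, not part of the library):

```python
# Required keys mirror the YAML fields listed above.
REQUIRED_KEYS = ("train_filelist_path", "valid_filelist_path", "batch_size", "n_feats")


def validate_data_config(cfg: dict) -> list:
    """Return the list of required keys missing from the config dict."""
    return [k for k in REQUIRED_KEYS if k not in cfg]


cfg = {
    "train_filelist_path": "data/train.txt",
    "valid_filelist_path": "data/val.txt",
    "batch_size": 32,
    "n_feats": 80,
}
print(validate_data_config(cfg))  # [] -> config is complete
```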
Output Structure
durations/
├── LJ001-0001.npy
├── LJ001-0001.json
├── LJ001-0002.npy
├── LJ001-0002.json
└── ...
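Given this layout, downstream code typically walks the folder and pairs each `.npy` with its `.json`. A small sketch (`iter_duration_pairs` is an illustrative helper, not part of Matcha-TTS):

```python
from pathlib import Path


def iter_duration_pairs(durations_dir):
    """Yield (npy_path, json_path) pairs, skipping incomplete utterances."""
    for npy_path in sorted(Path(durations_dir).glob("*.npy")):
        json_path = npy_path.with_suffix(".json")
        if json_path.exists():
            yield npy_path, json_path
```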
Best Practices
- Use trained model: Ensure model is well-trained for accurate alignments
- GPU inference: Run extraction on a GPU whenever possible; CPU extraction is far slower
- Batch processing: Use higher batch sizes for faster extraction
- Verify outputs: Spot-check some duration files for correctness
- Version control: Track which model checkpoint generated the durations
Integration Example
Using durations in a dataloader:
class TTSDataset(torch.utils.data.Dataset):
    def __getitem__(self, idx):
        audio_path, text = self.data[idx]
        # Load precomputed durations
        duration_path = audio_path.replace(".wav", ".npy")
        durations = np.load(duration_path)
        return {
            "text": text,
            "mel": load_mel(audio_path),
            "durations": torch.from_numpy(durations),
        }
Source Reference
Implementation: matcha/utils/get_durations_from_trained_model.py:43
Entry Point: setup.py:48