Phoneme duration extraction lets you pre-compute alignment information from a trained model and use it to train new models faster and with more stable alignment.
## Overview
Matcha-TTS learns to align text and audio automatically during training. Once you have a trained model, you can extract the learned phoneme durations and use them to:
- Train new models faster with pre-computed alignments
- Train duration-based models that require ground-truth durations
- Analyze the alignment quality of your model
- Bootstrap training on new architectures
Extract durations when you:
- Have a well-trained base model (good audio quality)
- Want to train additional models on the same dataset
- Need stable alignments for small datasets
- Are training other TTS architectures that require durations
### Train a base model

First, train a Matcha-TTS model to convergence:

```bash
python matcha/train.py experiment=ljspeech
```

Wait until the model produces good quality audio (typically 200k+ steps for LJSpeech).

### Extract durations
Use the `get_durations_from_trained_model.py` script:

```bash
python matcha/utils/get_durations_from_trained_model.py \
    -i ljspeech.yaml \
    -c /path/to/checkpoint.ckpt
```

Or use the CLI command:

```bash
matcha-tts-get-durations \
    -i ljspeech.yaml \
    -c matcha_ljspeech.ckpt
```
### Wait for processing

The script processes all audio files in your train, validation, and test sets. This may take several minutes to hours depending on dataset size. You'll see progress:

```
Preprocessing: ljspeech from training filelist: data/LJSpeech-1.1/train.txt
Loading model...
Computing stats for training set if exists...
🍵 Computing durations 🍵: 100%|████████| 412/412 [05:23<00:00, 1.27it/s]
Computing stats for validation set if exists...
🍵 Computing durations 🍵: 100%|████████| 22/22 [00:17<00:00, 1.24it/s]
[+] Done! Data statistics saved to: data/durations
```
## Command Options

### Basic Usage

```bash
matcha-tts-get-durations -i <config.yaml> -c <checkpoint.ckpt>
```
All Options
matcha-tts-get-durations \
-i ljspeech.yaml \ # Dataset config file
-c model.ckpt \ # Checkpoint path
-b 32 \ # Batch size (default: 32)
-o /path/to/output \ # Custom output folder
-f \ # Force overwrite existing
--cpu # Use CPU instead of GPU
### Parameters

- `-i`, `--input-config`: Dataset YAML config name (from `configs/data/`)
- `-c`, `--checkpoint_path`: Path to the trained model checkpoint (required)
- `-b`, `--batch-size`: Batch size for processing (default: 32; increase for faster processing)
- `-o`, `--output-folder`: Custom output directory (default: `data/durations`)
- `-f`, `--force`: Force overwrite if the output folder already exists
- `--cpu`: Run inference on the CPU (not recommended; much slower)
## Output Format

Durations are saved in two formats for each audio file.

**Binary format** (`.npy`) containing duration values:

```python
import numpy as np

durations = np.load('data/durations/LJ001-0001.npy')
# Array of integers: [3, 5, 4, 2, ...] -- duration in frames for each phoneme
```
**Human-readable format** (`.json`) with phoneme-duration pairs:

```json
{
  "phonemes": ["P", "R", "IH1", "N", "T", "IH0", "NG"],
  "durations": [3, 5, 8, 4, 6, 7, 5],
  "text": "Printing, in the only sense with which we are at present concerned."
}
```
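A quick sanity check on a record in this JSON format can be sketched as follows; `validate_record` is a hypothetical helper, not part of Matcha-TTS, that mirrors the expectation that phonemes and durations align one-to-one with positive frame counts:

```python
import json

def validate_record(rec):
    """Sanity-check one duration record (hypothetical helper).

    Checks that phonemes and durations align one-to-one and that
    every duration is a positive integer number of frames.
    """
    phonemes, durs = rec["phonemes"], rec["durations"]
    return len(phonemes) == len(durs) and all(
        isinstance(d, int) and d > 0 for d in durs
    )

# Example record in the format shown above
record = json.loads(
    '{"phonemes": ["P", "R", "IH1", "N", "T", "IH0", "NG"],'
    ' "durations": [3, 5, 8, 4, 6, 7, 5],'
    ' "text": "Printing, in the only sense ..."}'
)
print(validate_record(record))  # True
```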
## Directory Structure

Durations are saved alongside your dataset:

```
data/
└── LJSpeech-1.1/
    ├── train.txt
    ├── val.txt
    ├── wavs/
    │   ├── LJ001-0001.wav
    │   └── ...
    └── durations/            # Created by extraction
        ├── LJ001-0001.npy    # Binary duration data
        ├── LJ001-0001.json   # Human-readable format
        └── ...
```
## Training with Extracted Durations

Once durations are extracted, train a new model using them.

### Enable duration loading

In your dataset config (`configs/data/ljspeech.yaml`), set `load_durations: True`. Or create an experiment config:

```yaml
# configs/experiment/ljspeech_from_durations.yaml
# @package _global_

defaults:
  - override /data: ljspeech.yaml

tags: ["ljspeech", "durations"]
run_name: ljspeech_from_durations

data:
  load_durations: True
  batch_size: 64  # Can use larger batch size
```

### Start training

```bash
python matcha/train.py experiment=ljspeech_from_durations
```
Training with pre-computed durations:
- Converges faster (typically 50-100k steps vs 200k+ steps)
- More stable alignment learning
- Allows larger batch sizes (faster training)
## Advanced Usage

### Custom Output Location

Save durations to a specific directory:

```bash
matcha-tts-get-durations \
    -i ljspeech.yaml \
    -c checkpoint.ckpt \
    -o /path/to/custom/durations
```
Note that the dataloader expects durations at `<audio_parent_dir>/durations/<audio_filename>.npy`, so update your dataset config (or link the custom folder) to match this location.
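One way to bridge a custom output folder and the expected location is a symlink. The sketch below uses hypothetical `/tmp` paths for illustration; substitute your real dataset root and custom output folder:

```shell
# Hypothetical paths -- substitute your real dataset root and custom folder.
custom_durations=/tmp/my_custom_durations
dataset_root=/tmp/LJSpeech-1.1

mkdir -p "$custom_durations" "$dataset_root"

# Expose the custom folder at the location the dataloader expects:
# <dataset_root>/durations
ln -sfn "$custom_durations" "$dataset_root/durations"
```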
### Processing Large Datasets Faster

Increase the batch size for faster processing:

```bash
matcha-tts-get-durations \
    -i ljspeech.yaml \
    -c checkpoint.ckpt \
    -b 64  # Larger batch size
```

Larger batch sizes require more GPU memory; reduce the batch size if you encounter out-of-memory errors.
### Multi-Speaker Datasets

The extraction process works identically for multi-speaker models:

```bash
matcha-tts-get-durations \
    -i vctk.yaml \
    -c multispeaker_checkpoint.ckpt \
    -b 32
```
## Duration File Expectations

When loading durations during training, the dataloader:

- Looks for durations in `<dataset_root>/durations/`
- Expects the filename `<audio_basename>.npy`
- Validates that the duration length matches the phoneme sequence length
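The path lookup can be sketched as a small helper. This is an illustration of the layout described above, not Matcha-TTS's actual loading code; `expected_duration_path` is a hypothetical name:

```python
from pathlib import Path

def expected_duration_path(audio_path):
    """Map an audio file to the duration file the dataloader looks for.

    Sketch assuming the layout above: audio in <dataset_root>/wavs/,
    durations in <dataset_root>/durations/.
    """
    p = Path(audio_path)
    return p.parent.parent / "durations" / f"{p.stem}.npy"

print(expected_duration_path("data/LJSpeech-1.1/wavs/LJ001-0001.wav"))
# data/LJSpeech-1.1/durations/LJ001-0001.npy
```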
### Validation

The dataloader performs automatic validation:

```python
# matcha/data/text_mel_datamodule.py:195
assert len(durs) == len(text), \
    f"Length of durations {len(durs)} and text {len(text)} do not match"
```
If you see this error:
- Durations were extracted with different text processing
- Checkpoint and config mismatch
- Corrupted duration files
## Use Cases

### Fast-Track Training

```bash
# 1. Train initial model (slow, learns alignment)
python matcha/train.py experiment=ljspeech
# Wait for convergence...

# 2. Extract durations
matcha-tts-get-durations -i ljspeech.yaml -c logs/.../checkpoint.ckpt

# 3. Train new models faster (pre-computed alignment)
python matcha/train.py experiment=ljspeech_from_durations
```
### Duration Analysis

Analyze extracted durations for quality control:

```python
import json

import numpy as np

# Load durations
with open('data/durations/LJ001-0001.json', 'r') as f:
    dur_data = json.load(f)

print(f"Phonemes: {dur_data['phonemes']}")
print(f"Durations: {dur_data['durations']}")
print(f"Total frames: {sum(dur_data['durations'])}")

# Check for anomalies
durs = np.array(dur_data['durations'])
print(f"Mean duration: {durs.mean():.2f} frames")
print(f"Max duration: {durs.max()} frames")
print(f"Min duration: {durs.min()} frames")
```
### Cross-Architecture Training

Use Matcha-TTS durations to train other architectures:

```python
# Load Matcha-extracted durations for your custom TTS model
import numpy as np

durs = np.load('data/durations/audio.npy')
# Use in your model's training pipeline
```
## Troubleshooting

### Missing Duration Files

Error:

```
FileNotFoundError: Tried loading the durations but durations didn't exist at data/durations/LJ001-0001.npy
```

Solution:

- Ensure durations were extracted for all files
- Check that the output path is correct
- Verify the extraction completed without errors
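A quick scan can confirm that every audio file got a duration file. This is a sketch; `missing_durations` is a hypothetical helper, and the directory names follow the layout shown earlier:

```python
from pathlib import Path

def missing_durations(wav_dir, dur_dir):
    """Return stems of .wav files with no matching .npy duration file.

    Hypothetical helper; pass your dataset's wavs/ and durations/ folders.
    """
    have = {p.stem for p in Path(dur_dir).glob("*.npy")}
    return sorted(
        p.stem for p in Path(wav_dir).glob("*.wav") if p.stem not in have
    )
```

For LJSpeech this might be called as `missing_durations('data/LJSpeech-1.1/wavs', 'data/durations')`; an empty result means every file was processed.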
### Length Mismatch

Error:

```
AssertionError: Length of durations 45 and text 47 do not match
```

Solution:

- Re-extract durations with a matching config
- Ensure the checkpoint and config are from the same training run
- Check that text cleaning settings match
### Out of Memory

Error:

```
RuntimeError: CUDA out of memory
```

Solution:

```bash
# Reduce batch size
matcha-tts-get-durations -i ljspeech.yaml -c checkpoint.ckpt -b 8

# Or use CPU (slower)
matcha-tts-get-durations -i ljspeech.yaml -c checkpoint.ckpt --cpu
```
## Best Practices
- Extract durations from your best-performing checkpoint
- Validate duration quality before using for training
- Keep duration files in version control for reproducibility
- Document which checkpoint was used for extraction
- Verify duration lengths match your audio files
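The last check can be sketched as below. `total_frames_match` is a hypothetical helper: `n_mel_frames` would come from your own feature extraction (the mel spectrogram length for that utterance), and the tolerance of 2 frames is an arbitrary illustration, not a Matcha-TTS constant:

```python
import numpy as np

def total_frames_match(durations, n_mel_frames, tol=2):
    """Check that phoneme durations sum to the mel spectrogram length.

    A small tolerance absorbs rounding/edge-padding differences;
    the default of 2 frames is an assumption for illustration.
    """
    return abs(int(np.sum(durations)) - int(n_mel_frames)) <= tol
```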
## Next Steps