Overview
Matcha-TTS supports multi-speaker text-to-speech synthesis, allowing you to generate speech in different voices from a single model. The primary multi-speaker model is trained on the VCTK dataset with 108 speakers.

Available Multi-Speaker Models
VCTK Model
The VCTK multi-speaker model supports 108 different speakers (IDs 0-107):

- Model name: matcha_vctk
- Recommended vocoder: hifigan_univ_v1
- Recommended speaking rate: 0.85
- Speaker range: 0-107 (108 total speakers)
- Download URL: https://github.com/shivammehta25/Matcha-TTS-checkpoints/releases/download/v1.0/matcha_vctk.ckpt
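Putting these settings together, a typical invocation looks like the sketch below. Flag names are assumed from the Matcha-TTS CLI; confirm them with matcha-tts --help.

```shell
# Sketch of a multi-speaker synthesis call (flag names assumed; check matcha-tts --help)
matcha-tts --text "Hello from speaker ten." \
    --model matcha_vctk \
    --vocoder hifigan_univ_v1 \
    --spk 10 \
    --speaking_rate 0.85
```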
Using Multi-Speaker Models
Basic Synthesis
Synthesize with a specific speaker by passing its ID with --spk.

Speaker Selection
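Any ID in the valid range selects a different VCTK voice. A sketch, with the --spk flag assumed from the Matcha-TTS CLI:

```shell
# Speaker 42 renders the same text in a different voice (flag name assumed)
matcha-tts --text "The same sentence, a different voice." --model matcha_vctk --spk 42
```

Omitting --spk falls back to the default speaker (ID 0).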
Choose from the 108 available speakers (0-107).

Default Speaker
If no speaker ID is provided, speaker 0 is used by default.

Multi-Speaker Configuration
Recommended Settings
For the VCTK model, use the recommended parameters listed above (speaking rate 0.85, universal vocoder).

Vocoder Selection
Multi-speaker models should use the universal HiFi-GAN vocoder (hifigan_univ_v1) for best results across all speakers.

ONNX Export for Multi-Speaker Models
Export a multi-speaker model to ONNX. The exported model exposes a spks input in the ONNX graph.
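The export can be run via the project's ONNX export module; the checkpoint path, output name, and --n-timesteps value below are illustrative, so verify the module path and flags against your installed version:

```shell
# Export the VCTK checkpoint to ONNX (paths and step count are placeholders)
python3 -m matcha.onnx.export matcha_vctk.ckpt matcha_vctk.onnx --n-timesteps 5
```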
ONNX Inference with Speaker ID
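A minimal sketch of feeding the exported graph with a speaker ID. The input names x, x_lengths, scales, and spks are assumptions about the exported graph; verify them with session.get_inputs() before relying on them.

```python
import numpy as np

def run_matcha_onnx(session, phoneme_ids, spk_id,
                    temperature=0.667, speaking_rate=0.85):
    """Feed a multi-speaker Matcha-TTS ONNX session.

    Input names ("x", "x_lengths", "scales", "spks") are assumptions;
    confirm them with session.get_inputs().
    """
    x = np.asarray([phoneme_ids], dtype=np.int64)       # (1, T) phoneme IDs
    x_lengths = np.array([x.shape[1]], dtype=np.int64)  # per-item lengths
    scales = np.array([temperature, speaking_rate], dtype=np.float32)
    spks = np.array([spk_id], dtype=np.int64)           # speaker ID, 0-107 for VCTK
    return session.run(None, {"x": x, "x_lengths": x_lengths,
                              "scales": scales, "spks": spks})
```

With onnxruntime installed, usage would look like session = onnxruntime.InferenceSession("matcha_vctk.onnx") followed by run_matcha_onnx(session, phoneme_ids, spk_id=10).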
Batched Multi-Speaker Synthesis
Synthesize multiple utterances with the same speaker in a single run.

Speaker Parameters
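The batched mode described in the previous section can be invoked as in this sketch; the --file, --batched, and --batch_size flags are assumed from matcha/cli.py, so check matcha-tts --help:

```shell
# utterances.txt holds one utterance per line; batching flags assumed
matcha-tts --file utterances.txt --model matcha_vctk --spk 10 --batched --batch_size 32
```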
Speaking Rate
Adjust the speaking rate for multi-speaker models:

- < 1.0: Faster speech
- 0.85: Recommended default for VCTK
- > 1.0: Slower speech
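For example, assuming the --speaking_rate flag from the Matcha-TTS CLI:

```shell
# Below 1.0 speeds speech up, above 1.0 slows it down (flag name assumed)
matcha-tts --text "Pacing test." --model matcha_vctk --spk 10 --speaking_rate 0.7
matcha-tts --text "Pacing test." --model matcha_vctk --spk 10 --speaking_rate 1.2
```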
Temperature
Control synthesis variation:

- Lower (0.3-0.5): More consistent output
- Default (0.667): Balanced naturalness
- Higher (0.8-1.0): More variation
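For example, a lower temperature for more deterministic prosody (the --temperature flag is assumed from the Matcha-TTS CLI):

```shell
# Lower temperature trades variation for consistency (flag name assumed)
matcha-tts --text "Stability test." --model matcha_vctk --spk 10 --temperature 0.4
```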
Custom Multi-Speaker Models
When using a custom multi-speaker model, load it by its checkpoint path rather than by model name.

Output File Naming
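Loading a custom checkpoint as described above might look like this sketch; the --checkpoint_path flag is assumed from the Matcha-TTS CLI and the path is a placeholder:

```shell
# Custom multi-speaker checkpoint (path is a placeholder; flag name assumed)
matcha-tts --text "Custom model test." \
    --checkpoint_path /path/to/custom_model.ckpt \
    --spk 3 \
    --vocoder hifigan_univ_v1
```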
Multi-speaker synthesis includes the speaker ID in output filenames.

Implementation Details
Speaker Embedding
Multi-speaker models use speaker embeddings internally:

- The speaker ID is converted to a tensor (matcha/cli.py:285)
- The tensor is passed to the model's synthesise method (matcha/cli.py:372-379)
- Mel-spectrogram generation is conditioned on the speaker identity
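The steps above can be sketched as follows. This is an illustrative embedding-lookup scheme, not Matcha's actual implementation (which lives in the model's synthesise path); the table size, embedding dimension, and additive conditioning are all assumptions for the sake of the example:

```python
import numpy as np

# Illustrative speaker-conditioning sketch (not Matcha's actual code).
N_SPKS, EMB_DIM = 108, 64
rng = np.random.default_rng(0)
spk_table = rng.standard_normal((N_SPKS, EMB_DIM)).astype(np.float32)  # learned in practice

def condition_on_speaker(encoder_out, spk_id):
    """Add the speaker embedding to every frame of the encoder output
    (one common conditioning scheme)."""
    spk_emb = spk_table[spk_id]            # (EMB_DIM,)
    return encoder_out + spk_emb[None, :]  # broadcast over frames -> (T, EMB_DIM)
```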
Model Detection
The system automatically detects multi-speaker models from the number of speakers recorded in the model's hyperparameters.

Speaker ID Validation
The system validates speaker IDs (matcha/cli.py:174-184):

- Checks that the speaker ID is within the valid range
- Raises an error if it is out of bounds
- Warns if the default speaker ID is being used
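The checks listed above amount to logic like the following sketch (not the exact matcha/cli.py code; function and message wording are illustrative):

```python
import warnings

def validate_spk(spk_id, n_spks=108):
    """Sketch of the speaker-ID checks described above (illustrative, not matcha/cli.py)."""
    if spk_id is None:
        warnings.warn("No speaker ID provided; using default speaker 0")
        return 0
    if not 0 <= spk_id < n_spks:
        raise ValueError(f"Speaker ID {spk_id} is outside the valid range 0-{n_spks - 1}")
    return spk_id
```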
VCTK Dataset Information
The VCTK (Voice Cloning Toolkit) dataset contains:

- 108 English speakers
- Approximately 400 sentences per speaker
- High-quality studio recordings
- Diverse age ranges and accents
Tips for Best Results
- Use the recommended vocoder: hifigan_univ_v1 works best across all speakers
- Adjust speaking rate: start with 0.85 for VCTK, then adjust to preference
- Test different speakers: each speaker has unique characteristics
- Keep temperature moderate: 0.667 provides a good balance
- Batch process: use batched mode for efficiency with multiple utterances
Troubleshooting
Speaker ID Out of Range
Error: raised when the requested speaker ID falls outside the valid range (0-107 for VCTK).

Wrong Vocoder Warning

Warning: emitted when a multi-speaker model is paired with a vocoder other than the recommended hifigan_univ_v1.

Single-Speaker Model with Speaker ID

Warning: emitted when --spk is passed to a single-speaker model. Omit --spk for single-speaker models like LJ Speech.
Next Steps
- Pre-trained Models: explore all available pre-trained models
- ONNX Inference: deploy multi-speaker models with ONNX