
Overview

Matcha-TTS supports multi-speaker text-to-speech synthesis, allowing you to generate speech in different voices from a single model. The primary multi-speaker model is trained on the VCTK dataset with 108 speakers.

Available Multi-Speaker Models

VCTK Model

The VCTK multi-speaker model supports 108 different speakers (IDs 0-107):
matcha-tts --model matcha_vctk --spk 5 --text "Hello, world!"
Model Configuration (matcha/cli.py:30-32):
  • Model name: matcha_vctk
  • Recommended vocoder: hifigan_univ_v1
  • Recommended speaking rate: 0.85
  • Speaker range: 0-107 (108 total speakers)
  • Download URL: https://github.com/shivammehta25/Matcha-TTS-checkpoints/releases/download/v1.0/matcha_vctk.ckpt
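The settings above can be sketched as a small registry. This is an illustrative structure only, with hypothetical field names; see matcha/cli.py:30-32 for how the real CLI stores these values.

```python
# Illustrative registry of the VCTK model settings listed above.
# Field names are hypothetical; matcha/cli.py:30-32 holds the real values.
VCTK_CONFIG = {
    "model": "matcha_vctk",
    "vocoder": "hifigan_univ_v1",
    "speaking_rate": 0.85,
    "spk_range": (0, 107),  # inclusive range of valid speaker IDs
    "url": "https://github.com/shivammehta25/Matcha-TTS-checkpoints/releases/download/v1.0/matcha_vctk.ckpt",
}

def n_speakers(config):
    # The range is inclusive on both ends, so 0-107 gives 108 speakers.
    lo, hi = config["spk_range"]
    return hi - lo + 1
```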

Using Multi-Speaker Models

Basic Synthesis

Synthesize with a specific speaker:
matcha-tts --model matcha_vctk --spk 10 --text "Your text here"

Speaker Selection

Choose from 108 available speakers (0-107):
# Speaker 0
matcha-tts --model matcha_vctk --spk 0 --text "Hello from speaker zero"

# Speaker 50
matcha-tts --model matcha_vctk --spk 50 --text "Hello from speaker fifty"

# Speaker 107 (last speaker)
matcha-tts --model matcha_vctk --spk 107 --text "Hello from speaker one oh seven"
Speaker ID must be between 0 and 107 for the VCTK model. Values outside this range will cause an error (matcha/cli.py:175-178).

Default Speaker

If no speaker ID is provided, speaker 0 is used by default:
matcha-tts --model matcha_vctk --text "Uses speaker 0"
A warning will be displayed (matcha/cli.py:181-183):
[!] Speaker ID not provided! Using speaker ID 0
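The fallback can be sketched as below. This is assumed logic modeled on the behavior described at matcha/cli.py:181-183; resolve_speaker is a hypothetical helper, not a function in the codebase.

```python
# Sketch of the default-speaker fallback described above (assumed logic;
# resolve_speaker is a hypothetical helper, not part of matcha/cli.py).
def resolve_speaker(spk):
    if spk is None:
        print("[!] Speaker ID not provided! Using speaker ID 0")
        return 0
    return spk
```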

Multi-Speaker Configuration

For the VCTK model, use these recommended parameters:
matcha-tts --model matcha_vctk \
  --spk 10 \
  --vocoder hifigan_univ_v1 \
  --speaking_rate 0.85 \
  --text "Your text here"

Vocoder Selection

Multi-speaker models should use the universal HiFi-GAN vocoder (hifigan_univ_v1) for best results across all speakers.
The system automatically selects the correct vocoder if not specified:
# Vocoder automatically set to hifigan_univ_v1
matcha-tts --model matcha_vctk --spk 10 --text "Hello"
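The selection logic can be sketched as a lookup table. The mapping below mirrors the recommendations in this guide rather than the literal code in matcha/cli.py, and pick_vocoder is a hypothetical helper.

```python
# Hedged sketch of automatic vocoder selection. The mapping reflects the
# recommendations in this guide; pick_vocoder is a hypothetical helper.
RECOMMENDED_VOCODER = {
    "matcha_ljspeech": "hifigan_T2_v1",
    "matcha_vctk": "hifigan_univ_v1",
}

def pick_vocoder(model_name, vocoder=None):
    # Honor an explicit choice; otherwise fall back to the recommended default.
    return vocoder or RECOMMENDED_VOCODER.get(model_name, "hifigan_univ_v1")
```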

ONNX Export for Multi-Speaker Models

Export a multi-speaker model to ONNX:
python3 -m matcha.onnx.export matcha_vctk.ckpt model.onnx --n-timesteps 5
The exporter automatically detects multi-speaker models (matcha/onnx/export.py:134) and includes the spks input in the ONNX graph.

ONNX Inference with Speaker ID

python3 -m matcha.onnx.infer model.onnx \
  --text "Hello" \
  --spk 25 \
  --output-dir ./outputs
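Programmatic ONNX inference follows the same shape. The sketch below shows one way to assemble the input feed; the input names ("x", "x_lengths", "scales", "spks") and the layout of "scales" are assumptions about the exported graph, so verify them against your model with session.get_inputs() before relying on them.

```python
import numpy as np

# Sketch of assembling the input feed for a multi-speaker ONNX model.
# Input names and the scales layout are assumptions; check them with
# session.get_inputs() on your exported graph.
def build_inputs(token_ids, spk, temperature=0.667, speaking_rate=0.85):
    x = np.array([token_ids], dtype=np.int64)  # batch of one utterance
    return {
        "x": x,
        "x_lengths": np.array([x.shape[1]], dtype=np.int64),
        "scales": np.array([temperature, speaking_rate], dtype=np.float32),
        "spks": np.array([spk], dtype=np.int64),
    }

# To run it with onnxruntime (requires the exported model on disk):
#   import onnxruntime as ort
#   session = ort.InferenceSession("model.onnx")
#   outputs = session.run(None, build_inputs(token_ids, spk=25))
```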

Batched Multi-Speaker Synthesis

Synthesize multiple utterances with the same speaker:
# Create a text file
cat > sentences.txt << EOF
First sentence to synthesize.
Second sentence to synthesize.
Third sentence to synthesize.
EOF

# Batch synthesis with speaker 42
matcha-tts --model matcha_vctk \
  --file sentences.txt \
  --spk 42 \
  --batched \
  --batch_size 32
All sentences will be synthesized using the same speaker voice.
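The batching itself amounts to grouping the file's lines into fixed-size chunks, each processed with the same speaker ID. The sketch below illustrates that grouping; it is an assumed simplification, not the CLI's actual implementation.

```python
# Assumed sketch of how batched mode groups sentences: fixed-size chunks,
# each synthesised with the same speaker ID. Not the CLI's literal code.
def chunk(sentences, batch_size):
    return [sentences[i:i + batch_size]
            for i in range(0, len(sentences), batch_size)]
```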

Speaker Parameters

Speaking Rate

Adjust the speaking rate for multi-speaker models:
matcha-tts --model matcha_vctk \
  --spk 10 \
  --speaking_rate 0.85 \
  --text "Adjust my speed"
  • < 1.0: Faster speech
  • 0.85: Recommended default for VCTK
  • > 1.0: Slower speech
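The direction of these values follows from the parameter acting as a length scale on predicted durations (an assumption based on how it is passed to the model): durations are multiplied by it, so values below 1.0 shorten them and speed up the speech.

```python
# Why a smaller speaking_rate means faster speech: the value scales the
# predicted durations (length-scale semantics, assumed here).
def scale_durations(durations, speaking_rate):
    return [d * speaking_rate for d in durations]

# At 0.85, every duration shrinks by 15%, so the utterance plays faster.
```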

Temperature

Control synthesis variation:
matcha-tts --model matcha_vctk \
  --spk 10 \
  --temperature 0.667 \
  --text "Control my variation"
  • Lower (0.3-0.5): More consistent output
  • Default (0.667): Balanced naturalness
  • Higher (0.8-1.0): More variation

Custom Multi-Speaker Models

When using a custom multi-speaker model:
matcha-tts --checkpoint_path ./my_model.ckpt \
  --vocoder hifigan_univ_v1 \
  --spk 0 \
  --text "Custom model synthesis"
For custom multi-speaker models, use --vocoder hifigan_univ_v1 for best compatibility across different speakers (matcha/cli.py:150-152).

Output File Naming

Multi-speaker synthesis includes the speaker ID in output filenames.
Single utterance:
utterance_001_speaker_010.wav
Batched synthesis:
utterance_001_speaker_010.wav
utterance_002_speaker_010.wav
utterance_003_speaker_010.wav
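The pattern above corresponds to a format string like the following (zero-padded to three digits; the exact format string in the real CLI may differ):

```python
# Sketch of the output naming pattern shown above. output_name is a
# hypothetical helper; the real CLI's format string may differ slightly.
def output_name(utt_idx, spk):
    return f"utterance_{utt_idx:03d}_speaker_{spk:03d}.wav"
```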

Implementation Details

Speaker Embedding

Multi-speaker models use speaker embeddings internally:
  1. The speaker ID is converted to a tensor (matcha/cli.py:285)
  2. The tensor is passed to the model’s synthesise method (matcha/cli.py:372-379)
  3. The resulting embedding conditions mel-spectrogram generation on speaker identity
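The steps above can be sketched as the call pattern below. MatchaLike is a hypothetical stand-in for the real model class: the real model embeds spks and conditions the decoder on it, while this mock only echoes its inputs to show the interface shape.

```python
# Hedged sketch of the three steps above. MatchaLike is a hypothetical
# stand-in; the real model embeds `spks` and conditions the decoder on it.
class MatchaLike:
    def synthesise(self, x, x_lengths, n_timesteps, temperature, spks, length_scale):
        # Echo the conditioning inputs to show the interface shape only.
        return {"spks": spks, "length_scale": length_scale}

def synthesise_with_speaker(model, token_ids, spk_id, speaking_rate=0.85):
    spks = [spk_id]  # step 1: a one-element batch (a tensor in the real code)
    # steps 2-3: passed to synthesise(), conditioning generation on identity
    return model.synthesise(x=[token_ids], x_lengths=[len(token_ids)],
                            n_timesteps=10, temperature=0.667,
                            spks=spks, length_scale=speaking_rate)
```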

Model Detection

The system automatically detects multi-speaker models by checking:
is_multi_speaker = matcha.n_spks > 1
This applies to both PyTorch and ONNX inference paths.

Speaker ID Validation

The system validates speaker IDs (matcha/cli.py:174-184):
  1. Checks if speaker ID is within valid range
  2. Raises an error if out of bounds
  3. Warns if using default speaker ID
assert (
    args.spk >= spk_range[0] and args.spk <= spk_range[-1]
), f"Speaker ID must be between {spk_range} for this model."

VCTK Dataset Information

The VCTK (Voice Cloning Toolkit) dataset contains:
  • 108 English speakers with various accents
  • Approximately 400 sentences per speaker
  • High-quality studio recordings
  • Diverse age ranges and accents

Tips for Best Results

  1. Use the recommended vocoder: hifigan_univ_v1 works best across all speakers
  2. Adjust speaking rate: Start with 0.85 for VCTK, then adjust to preference
  3. Test different speakers: Each speaker has unique characteristics
  4. Keep temperature moderate: 0.667 provides good balance
  5. Batch process: Use batched mode for efficiency with multiple utterances

Troubleshooting

Speaker ID Out of Range

Error:
Speaker ID must be between (0, 107) for this model.
Solution: Use a speaker ID between 0 and 107.

Wrong Vocoder Warning

Warning:
[-] Using matcha_vctk model! I would suggest passing --vocoder hifigan_univ_v1
Solution: Specify the recommended vocoder:
matcha-tts --model matcha_vctk --vocoder hifigan_univ_v1 --spk 10 --text "Hello"

Single-Speaker Model with Speaker ID

Warning:
[-] Ignoring speaker id 10 for matcha_ljspeech
Solution: Don’t specify --spk for single-speaker models like LJ Speech.

Next Steps

Pre-trained Models

Explore all available pre-trained models

ONNX Inference

Deploy multi-speaker models with ONNX
