Overview

Matcha-TTS synthesis quality and characteristics are controlled by four main parameters. Understanding these parameters helps you achieve optimal results for different use cases.

Core Parameters

temperature

Controls prosody variation and expressiveness
temperature
float
default:"0.667"
Variance of the initial noise distribution in the ODE-based sampling process. Higher values increase prosody variation.
How it works: Temperature controls the variance of the initial noise distribution (x0) in the ODE-based synthesis process. This directly affects prosodic variation in the generated speech. From matcha/models/matcha_tts.py:76-110:
def synthesise(self, x, x_lengths, n_timesteps, temperature=1.0, spks=None, length_scale=1.0):
    """
    Args:
        temperature (float, optional): controls variance of terminal distribution.
    """
    # Temperature is passed to the decoder (CFM)
    decoder_outputs = self.decoder(mu_y, y_mask, n_timesteps, temperature, spks)
Value ranges and effects:
Temperature   Effect                            Use Case
0.0 - 0.3     Deterministic, minimal variation  Robotic, consistent speech
0.4 - 0.6     Subtle variation                  Professional narration
0.667         Natural prosody (default)         General use
0.7 - 1.0     Noticeable variation              Expressive reading
1.0 - 1.5     High variation                    Emotional speech
1.5 - 2.0     Very high variation               Experimental, may be unstable
Examples:
matcha-tts --text "This is a test." --temperature 0.3
Recommendations:
  • Audiobooks/Podcasts: 0.6 - 0.7
  • Voice assistants: 0.5 - 0.667
  • Character voices: 0.8 - 1.2
  • Consistent announcements: 0.3 - 0.5
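Conceptually, the temperature scales the standard deviation of the initial noise that seeds the ODE solve. A minimal pure-Python sketch of that scaling (toy dimensions; the actual Matcha code draws this noise with torch):

```python
import random

def sample_initial_noise(n_frames, n_mels, temperature=0.667, seed=0):
    """Toy sketch: standard Gaussian noise scaled by the temperature.
    Higher temperature -> wider starting distribution -> more prosodic
    variation in the synthesised speech."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) * temperature for _ in range(n_mels)]
            for _ in range(n_frames)]

# Same seed, different temperatures: every value differs only by scale.
low = sample_initial_noise(2, 4, temperature=0.3)
high = sample_initial_noise(2, 4, temperature=1.0)
```

With identical seeds, a temperature of 0.3 simply shrinks each noise value relative to 1.0, which is why low temperatures give near-deterministic prosody.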

speaking_rate (length_scale)

Controls speech pace and duration
speaking_rate
float
default:"0.95 (LJ) / 0.85 (VCTK)"
Also called length_scale. Controls speech pace. Higher values = slower speech, longer duration.
How it works: The speaking rate is implemented as a scaling factor applied to predicted phoneme durations. From matcha/models/matcha_tts.py:118-123:
w = torch.exp(logw) * x_mask
w_ceil = torch.ceil(w) * length_scale  # Applied here!
y_lengths = torch.clamp_min(torch.sum(w_ceil, [1, 2]), 1).long()
The parameter multiplies predicted durations, directly controlling output length:
  • length_scale = 2.0 → 2x slower (twice as long)
  • length_scale = 0.5 → 2x faster (half as long)
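A minimal sketch of this duration scaling, using hypothetical predicted per-phoneme durations (illustrative only; the real code operates on masked torch tensors):

```python
import math

# Hypothetical predicted per-phoneme durations, in mel frames
w = [3.2, 5.7, 2.1, 4.4]

def output_frames(durations, length_scale):
    """Each (ceiled) predicted duration is multiplied by length_scale;
    the clamped sum gives the total number of mel frames to generate."""
    total = sum(math.ceil(d) * length_scale for d in durations)
    return max(int(total), 1)  # mirrors the clamp_min(..., 1)

baseline = output_frames(w, 1.0)  # neutral pace
slower = output_frames(w, 2.0)    # twice as many frames -> 2x slower
```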
Value ranges and effects:
Speaking Rate  Relative Speed         Effect
0.5            2x faster              Very fast, may lose clarity
0.7            ~1.4x faster           Fast but clear
0.85           Normal (VCTK default)  Natural for multi-speaker
0.95           Normal (LJ default)    Natural for single-speaker
1.0            Baseline               Neutral pace
1.2            0.83x speed            Slightly slower
1.5            0.67x speed            Slow, deliberate
Model-specific defaults: From matcha/cli.py:30-34:
MULTISPEAKER_MODEL = {
    "matcha_vctk": {"speaking_rate": 0.85}  # Slightly faster
}
SINGLESPEAKER_MODEL = {
    "matcha_ljspeech": {"speaking_rate": 0.95}  # Near baseline
}
Examples:
matcha-tts --text "Quick announcement." --speaking_rate 0.7
Recommendations:
  • News reading: 0.95 - 1.1
  • Casual conversation: 0.85 - 0.95
  • Instructions/tutorials: 1.0 - 1.2
  • Accessibility (slower): 1.2 - 1.5
  • Time-constrained (faster): 0.7 - 0.85

steps (n_timesteps)

Controls synthesis quality and speed
steps
int
default:"10"
Number of ODE (Ordinary Differential Equation) solver steps used to generate the mel-spectrogram from noise. More steps = higher quality but slower synthesis.
How it works: Matcha-TTS uses Conditional Flow Matching (CFM), an ODE-based generative model. The n_timesteps parameter controls how many discrete steps are taken when solving the ODE from noise to mel-spectrogram. From matcha/models/matcha_tts.py:137-138:
# Generate sample tracing the probability flow
decoder_outputs = self.decoder(mu_y, y_mask, n_timesteps, temperature, spks)
More steps trace a finer-grained trajectory, giving higher quality at the cost of slower synthesis.
Quality vs. speed trade-off:
Steps  Quality              Speed      RTF (approx.)  Use Case
2      Fair                 Very fast  0.01           Rapid prototyping
4      Good                 Fast       0.02           Real-time applications
10     Very good (default)  Balanced   0.04           General use
20     Excellent            Moderate   0.08           High-quality production
50     Outstanding          Slow       0.20           Maximum quality
100    Marginal gain        Very slow  0.40           Research
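The role of n_timesteps can be illustrated with a toy fixed-step Euler solver. This is a sketch of the idea only, not Matcha's actual CFM solver, but it shows why more steps follow the flow more accurately:

```python
import math

def euler_solve(f, x0, n_timesteps):
    """Fixed-step Euler solver: trace the flow from x0 toward the target
    in n_timesteps discrete steps. More steps track the true trajectory
    more closely, at proportionally higher cost."""
    x, t, dt = x0, 0.0, 1.0 / n_timesteps
    for _ in range(n_timesteps):
        x += dt * f(x, t)  # one step along the (here: toy) vector field
        t += dt
    return x

# Toy field dx/dt = 1 - x, whose exact solution at t = 1 is 1 - e**-1.
coarse = euler_solve(lambda x, t: 1.0 - x, 0.0, 2)
fine = euler_solve(lambda x, t: 1.0 - x, 0.0, 50)
exact = 1.0 - math.exp(-1.0)
```

Here the 50-step solve lands much closer to the exact endpoint than the 2-step solve, mirroring the quality/speed trade-off in the table above.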
RTF = Real-Time Factor; RTF < 1.0 means faster than real time.
Empirical reference: the bundled Gradio demo synthesises the same sentence at several step counts (matcha/app.py:236-285):
# Examples showing step count variations
[
    [text, 2, 0.677, 0.95],   # Very fast
    [text, 4, 0.677, 0.95],   # Fast
    [text, 10, 0.677, 0.95],  # Default
    [text, 50, 0.677, 0.95],  # High quality
]
Examples:
matcha-tts --text "Quick test." --steps 4
Recommendations:
  • Development/testing: 4 - 10 steps
  • Production (quality): 10 - 20 steps
  • Real-time applications: 2 - 4 steps
  • Offline batch processing: 20 - 50 steps
  • Research/benchmarking: 50+ steps

spk (Speaker ID)

Selects voice for multi-speaker models
spk
int
default:"0 (VCTK) / None (LJ)"
Speaker ID for multi-speaker models. Range: 0-107 for VCTK. Not used for single-speaker models.
How it works: For multi-speaker models, the speaker ID is converted to an embedding: From matcha/models/matcha_tts.py:52-53, 114-116:
if n_spks > 1:
    self.spk_emb = torch.nn.Embedding(n_spks, spk_emb_dim)

# During synthesis:
if self.n_spks > 1:
    spks = self.spk_emb(spks.long())  # Convert ID to embedding
The embedding is passed to both the encoder (for duration prediction) and the decoder (for mel-spectrogram generation), conditioning the output on speaker characteristics.
VCTK Speaker Range: From matcha/cli.py:30-32:
MULTISPEAKER_MODEL = {
    "matcha_vctk": {
        "spk": 0,           # Default speaker
        "spk_range": (0, 107)  # Valid range
    }
}
The VCTK dataset contains 108 speakers with diverse characteristics:
  • Different genders (male/female)
  • Various ages
  • Multiple accents
  • Different speaking styles
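The ID-to-embedding lookup itself is just table indexing. A toy pure-Python sketch with made-up 4-dimensional vectors (the real embeddings are learned, spk_emb_dim-dimensional rows of a torch.nn.Embedding):

```python
# Toy embedding table: one made-up vector per speaker ID.
N_SPKS, EMB_DIM = 108, 4
SPK_EMB = [[(i * 13 + j) % 97 / 97.0 for j in range(EMB_DIM)]
           for i in range(N_SPKS)]

def lookup(spk_id):
    """The integer speaker ID indexes a row of the table; that row then
    conditions both the encoder and the decoder."""
    if not 0 <= spk_id < N_SPKS:
        raise ValueError(f"Speaker ID must be in [0, {N_SPKS - 1}]")
    return SPK_EMB[spk_id]

vec = lookup(42)  # conditioning vector for VCTK speaker 42
```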
Examples:
matcha-tts --model matcha_vctk --spk 42 --text "Hello from speaker 42."
Popular VCTK Speakers (from Gradio examples): From matcha/app.py:288-331:
  • Speaker 0: Default, neutral voice
  • Speaker 16: Popular in examples
  • Speaker 44: Distinct characteristics
  • Speaker 45: Alternative voice
  • Speaker 58: Used in fast synthesis example
Recommendations:
  • Explore speakers: Try IDs 0, 10, 20, 30, … to find preferred voices
  • Consistency: Use same speaker ID for related content
  • Diversity: Mix speakers for dialogue or multi-character content
  • Custom models: Check model documentation for valid speaker ranges

Parameter Combinations

Fast Preview:
matcha-tts --text "Preview mode" --steps 4 --temperature 0.667 --speaking_rate 0.95
Balanced Quality:
matcha-tts --text "General use" --steps 10 --temperature 0.667 --speaking_rate 0.95
Maximum Quality:
matcha-tts --text "Production audio" --steps 50 --temperature 0.667 --speaking_rate 0.95
Expressive Speech:
matcha-tts --text "Expressive reading" --steps 20 --temperature 1.0 --speaking_rate 0.9
Consistent Narration:
matcha-tts --text "Audiobook chapter" --steps 20 --temperature 0.5 --speaking_rate 1.0

Python Configuration Dictionary

# Define presets
PRESETS = {
    "fast": {
        "n_timesteps": 4,
        "temperature": 0.667,
        "length_scale": 0.95,
    },
    "balanced": {
        "n_timesteps": 10,
        "temperature": 0.667,
        "length_scale": 0.95,
    },
    "quality": {
        "n_timesteps": 50,
        "temperature": 0.667,
        "length_scale": 0.95,
    },
    "expressive": {
        "n_timesteps": 20,
        "temperature": 1.0,
        "length_scale": 0.9,
    },
}

# Use preset
preset = PRESETS["balanced"]
output = model.synthesise(
    text_data["x"],
    text_data["x_lengths"],
    **preset
)

Parameter Validation

The CLI validates parameters (matcha/cli.py:134-159):
def validate_args(args):
    assert args.temperature >= 0, "Sampling temperature cannot be negative"
    assert args.steps > 0, "Number of ODE steps must be greater than 0"
    assert args.speaking_rate > 0, "Speaking rate must be greater than 0"
    
    # Speaker ID validation for VCTK
    if args.model in MULTISPEAKER_MODEL:
        spk_range = MULTISPEAKER_MODEL[args.model]["spk_range"]
        if args.spk is not None:
            assert (
                args.spk >= spk_range[0] and args.spk <= spk_range[-1]
            ), f"Speaker ID must be between {spk_range} for this model."
Constraints:
  • temperature ≥ 0 (no upper limit, but > 2.0 may be unstable)
  • steps > 0 (integer)
  • speaking_rate > 0 (no upper limit, but > 2.0 is very slow)
  • spk in valid range for model (0-107 for VCTK)

Performance Considerations

Real-Time Factor (RTF)

RTF measures synthesis speed relative to audio duration:
RTF = synthesis_time / audio_duration

RTF < 1.0: Faster than real-time (e.g., 0.05 = 20x faster)
RTF = 1.0: Real-time
RTF > 1.0: Slower than real-time
From matcha/models/matcha_tts.py:141-142:
t = (dt.datetime.now() - t).total_seconds()
rtf = t * 22050 / (decoder_outputs.shape[-1] * 256)
Typical RTF values (GPU):
Steps  Matcha RTF  + Vocoder RTF
2      0.005       0.010
4      0.010       0.020
10     0.020       0.040
50     0.100       0.200
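The RTF computation from the snippet above can be wrapped in a small helper (a sketch assuming the 22.05 kHz sample rate and hop length of 256 shown in that snippet):

```python
def real_time_factor(synthesis_seconds, n_mel_frames,
                     sample_rate=22050, hop_length=256):
    """The generated audio lasts n_mel_frames * hop_length / sample_rate
    seconds; RTF divides wall-clock synthesis time by that duration."""
    audio_seconds = n_mel_frames * hop_length / sample_rate
    return synthesis_seconds / audio_seconds

# Example: 0.2 s of compute for 500 mel frames (~5.8 s of audio)
rtf = real_time_factor(0.2, 500)  # well under 1.0 -> faster than real time
```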

Memory Usage

Memory scales with:
  • Batch size (linear)
  • Sequence length (linear)
  • Number of steps (minimal impact)
Typical GPU memory:
  • Single utterance: ~1 GB
  • Batch of 32: ~4-8 GB

Advanced Usage

Parameter Sweeps

import itertools

# Grid search
temperatures = [0.3, 0.667, 1.0]
steps = [4, 10, 20]
speaking_rates = [0.8, 0.95, 1.1]

for temp, n_steps, rate in itertools.product(temperatures, steps, speaking_rates):
    output = model.synthesise(
        text_data["x"],
        text_data["x_lengths"],
        n_timesteps=n_steps,
        temperature=temp,
        length_scale=rate
    )
    # Save with descriptive filename
    filename = f"t{temp}_s{n_steps}_r{rate}.wav"
    # ...

Dynamic Parameter Adjustment

# Adjust parameters based on text length
text_length = text_data["x_lengths"].item()

if text_length < 50:
    steps = 20  # Short text, use more steps
elif text_length < 200:
    steps = 10  # Medium text, balanced
else:
    steps = 4   # Long text, prioritize speed

output = model.synthesise(
    text_data["x"],
    text_data["x_lengths"],
    n_timesteps=steps,
    temperature=0.667
)

Troubleshooting

Poor Quality

  • Increase steps: Try 20 or 50
  • Check temperature: Use 0.667 as baseline
  • Verify text: Check phonetized output for errors

Unnatural Prosody

  • Adjust temperature: Try 0.5-0.8 range
  • Check speaking_rate: Ensure appropriate for content

Too Fast/Slow

  • Adjust speaking_rate: Increase to slow down, decrease to speed up
  • Model-specific defaults: Use 0.95 (LJ) or 0.85 (VCTK)

Out of Memory

  • Reduce batch size: Use --batch_size 8 or --batch_size 4
  • Disable batching: Remove --batched flag
  • Use CPU: Add --cpu flag
