Matcha-TTS synthesis quality and characteristics are controlled by four main parameters. Understanding these parameters helps you achieve optimal results for different use cases.
Temperature: Variance of the initial noise distribution that seeds the ODE-based synthesis process. Higher values increase prosody variation.
How it works:

Temperature controls the variance of the initial noise distribution (x0) in the ODE-based synthesis process. This directly affects prosodic variation in the generated speech.

From matcha/models/matcha_tts.py:76-110:

```python
def synthesise(self, x, x_lengths, n_timesteps, temperature=1.0, spks=None, length_scale=1.0):
    """
    Args:
        temperature (float, optional): controls variance of terminal distribution.
    """
    # ... (intermediate steps omitted in this excerpt)
    # Temperature is passed to the decoder (CFM)
    decoder_outputs = self.decoder(mu_y, y_mask, n_timesteps, temperature, spks)
```
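One way to picture this: assuming the decoder draws standard Gaussian noise and multiplies it by the temperature (an assumption made for illustration; the excerpt above only shows the value being passed along), the spread of the starting point grows linearly with temperature. The helper name `sample_initial_noise` is hypothetical, and the sketch uses only the Python standard library rather than PyTorch.

```python
import random
import statistics


def sample_initial_noise(n, temperature, seed=0):
    """Draw n standard-normal values and scale them by temperature.

    Assumption: the noise is scaled by simple multiplication. This is an
    illustrative stand-in, not the Matcha-TTS decoder itself.
    """
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) * temperature for _ in range(n)]


if __name__ == "__main__":
    for temperature in (0.0, 0.3, 0.667, 1.0, 1.5):
        noise = sample_initial_noise(10_000, temperature)
        # The sample standard deviation tracks the temperature value.
        print(f"temperature={temperature:<5} std={statistics.pstdev(noise):.3f}")
```

With a temperature of 0 every sample is zero, so under this assumption the starting point is fixed and the output is deterministic.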
Value ranges and effects:
| Temperature | Effect | Use Case |
|---|---|---|
| 0.0 - 0.3 | Deterministic, minimal variation | Robotic, consistent speech |
| 0.4 - 0.6 | Subtle variation | Professional narration |
| 0.667 | Natural prosody (default) | General use |
| 0.7 - 1.0 | Noticeable variation | Expressive reading |
| 1.0 - 1.5 | High variation | Emotional speech |
| 1.5 - 2.0 | Very high variation | Experimental, may be unstable |
Examples:
matcha-tts --text "This is a test." --temperature 0.3
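To compare settings by ear, the same sentence can be synthesised at several temperatures. The sketch below only builds and prints the commands, using the `--text` and `--temperature` flags from the example above; the helper name `build_temperature_sweep` and the chosen values are illustrative.

```python
import shlex


def build_temperature_sweep(text, temperatures):
    """Return one matcha-tts argument list per temperature value."""
    return [
        ["matcha-tts", "--text", text, "--temperature", str(t)]
        for t in temperatures
    ]


if __name__ == "__main__":
    commands = build_temperature_sweep("This is a test.", [0.3, 0.667, 1.0])
    for cmd in commands:
        # shlex.join quotes the text so the line can be pasted into a shell.
        print(shlex.join(cmd))
        # To actually run a command: subprocess.run(cmd, check=True)
```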
Steps (n_timesteps): Number of ODE (ordinary differential equation) solver steps taken to go from noise to mel-spectrogram. More steps give higher quality but slower synthesis.
How it works:

Matcha-TTS uses Conditional Flow Matching (CFM), an ODE-based generative model. The n_timesteps parameter controls how many discrete steps are taken when solving the ODE from noise to mel-spectrogram.

From matcha/models/matcha_tts.py:137-138:

```python
# Generate sample tracing the probability flow
decoder_outputs = self.decoder(mu_y, y_mask, n_timesteps, temperature, spks)
```
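The effect of the step count can be illustrated with a fixed-step Euler solver on a toy problem. This is a sketch under assumptions: the solver type, the function names, and the toy vector field dx/dt = x are chosen for illustration and are not taken from the Matcha-TTS source. The exact solution at t = 1 is e times the starting value, so the error of each run can be measured directly.

```python
import math


def euler_solve(vector_field, x0, n_timesteps):
    """Integrate dx/dt = vector_field(x, t) from t=0 to t=1 in equal steps."""
    if n_timesteps <= 0:
        raise ValueError("n_timesteps must be greater than 0")
    dt = 1.0 / n_timesteps
    x, t = x0, 0.0
    for _ in range(n_timesteps):
        x = x + dt * vector_field(x, t)  # one Euler update
        t += dt
    return x


def toy_field(x, t):
    # Toy vector field with known solution x(1) = e * x(0).
    return x


if __name__ == "__main__":
    for steps in (2, 4, 10, 50):
        estimate = euler_solve(toy_field, 1.0, steps)
        error = abs(math.e - estimate)
        print(f"steps={steps:<3} estimate={estimate:.4f} error={error:.4f}")
```

Each step costs one evaluation of the vector field, so the work grows linearly with the step count while the approximation error shrinks.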
More steps = finer-grained trajectory = higher quality but slower synthesis.

Quality vs. Speed Trade-off:

| Steps | Quality | Speed | RTF (approx) | Use Case |
|---|---|---|---|---|
| 2 | Fair | Very fast | 0.01 | Rapid prototyping |
| 4 | Good | Fast | 0.02 | Real-time applications |
| 10 | Very good (default) | Balanced | 0.04 | General use |
| 20 | Excellent | Moderate | 0.08 | High-quality production |
| 50 | Outstanding | Slow | 0.20 | Maximum quality |
| 100 | Marginal gain | Very slow | 0.40 | Research |
RTF = Real-Time Factor. RTF < 1.0 means faster than real-time.

Empirical findings:

The bundled examples demonstrate different step counts (matcha/app.py:236-285):

```python
# Examples showing step count variations
[
    [text, 2, 0.677, 0.95],   # Very fast
    [text, 4, 0.677, 0.95],   # Fast
    [text, 10, 0.677, 0.95],  # Default
    [text, 50, 0.677, 0.95],  # High quality
]
```
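RTF can be measured as synthesis time divided by the duration of the generated audio. A minimal sketch, with hypothetical function names and illustrative numbers (the sample rate and timing below are assumptions, not measurements):

```python
def audio_duration_seconds(num_samples, sample_rate):
    """Duration of a waveform given its sample count and sample rate (Hz)."""
    return num_samples / sample_rate


def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesising / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


if __name__ == "__main__":
    # Illustrative values: 5 s of audio at an assumed 22,050 Hz sample rate,
    # produced in 0.2 s of wall-clock time.
    duration = audio_duration_seconds(110_250, 22_050)
    rtf = real_time_factor(0.2, duration)
    print(f"duration={duration:.1f}s RTF={rtf:.2f}")
```

Measured values depend on hardware and text length, so the RTF column above is best read as a rough guide.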
Speaker ID (spk): Selects the voice in multi-speaker models. Range: 0-107 for VCTK. Not used for single-speaker models.
How it works:

For multi-speaker models, the speaker ID is converted to an embedding.

From matcha/models/matcha_tts.py:52-53, 114-116:

```python
if n_spks > 1:
    self.spk_emb = torch.nn.Embedding(n_spks, spk_emb_dim)

# During synthesis:
if self.n_spks > 1:
    spks = self.spk_emb(spks.long())  # Convert ID to embedding
```
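Conceptually, an embedding layer is a lookup table with one vector per speaker. The plain-Python stand-in below mimics that lookup without PyTorch. The class name `SpeakerTable`, the random initial values, and the small embedding size are illustrative assumptions; in a trained model the vectors are learned rather than random.

```python
import random


class SpeakerTable:
    """Toy lookup table: one fixed-length vector per speaker ID."""

    def __init__(self, n_spks, spk_emb_dim, seed=0):
        rng = random.Random(seed)
        self.vectors = [
            [rng.gauss(0.0, 1.0) for _ in range(spk_emb_dim)]
            for _ in range(n_spks)
        ]

    def __call__(self, spk_id):
        if not 0 <= spk_id < len(self.vectors):
            raise IndexError(f"speaker ID {spk_id} is out of range")
        return self.vectors[spk_id]


if __name__ == "__main__":
    # 108 entries cover speaker IDs 0-107.
    table = SpeakerTable(n_spks=108, spk_emb_dim=4)
    print(table(0))
    print(table(107))
```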
The embedding is passed to both the encoder (for duration prediction) and the decoder (for mel-spectrogram generation), conditioning the output on speaker characteristics.

VCTK Speaker Range: speaker IDs 0-107 (defined in matcha/cli.py:30-32).
The CLI validates parameters (matcha/cli.py:134-159):
```python
def validate_args(args):
    assert args.temperature >= 0, "Sampling temperature cannot be negative"
    assert args.steps > 0, "Number of ODE steps must be greater than 0"
    assert args.speaking_rate > 0, "Speaking rate must be greater than 0"

    # Speaker ID validation for VCTK
    if args.model in MULTISPEAKER_MODEL:
        spk_range = MULTISPEAKER_MODEL[args.model]["spk_range"]
        if args.spk is not None:
            assert (
                args.spk >= spk_range[0] and args.spk <= spk_range[-1]
            ), f"Speaker ID must be between {spk_range} for this model."
```
Constraints:
temperature ≥ 0 (no upper limit, but > 2.0 may be unstable)
steps > 0 (integer)
speaking_rate > 0 (no upper limit, but > 2.0 is very slow)
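The checks above can be exercised without the CLI. The sketch below is a self-contained variant that raises `ValueError` instead of using `assert`, since asserts are skipped when Python runs with `-O`. The function name `check_synthesis_args`, the model key `"multispeaker_demo"`, and the dictionary contents are hypothetical; only the 0-107 range comes from the VCTK description above.

```python
from argparse import Namespace

# Hypothetical stand-in for the CLI's model table.
MULTISPEAKER_MODEL = {"multispeaker_demo": {"spk_range": (0, 107)}}


def check_synthesis_args(args):
    """Validate synthesis arguments, raising ValueError on the first problem."""
    if args.temperature < 0:
        raise ValueError("Sampling temperature cannot be negative")
    if args.steps <= 0:
        raise ValueError("Number of ODE steps must be greater than 0")
    if args.speaking_rate <= 0:
        raise ValueError("Speaking rate must be greater than 0")

    if args.model in MULTISPEAKER_MODEL and args.spk is not None:
        low, high = MULTISPEAKER_MODEL[args.model]["spk_range"]
        if not low <= args.spk <= high:
            raise ValueError(f"Speaker ID must be between {low} and {high}")
    return args


if __name__ == "__main__":
    ok = Namespace(model="multispeaker_demo", temperature=0.667, steps=10,
                   speaking_rate=0.95, spk=42)
    check_synthesis_args(ok)  # passes silently

    bad = Namespace(model="multispeaker_demo", temperature=0.667, steps=10,
                    speaking_rate=0.95, spk=108)
    try:
        check_synthesis_args(bad)
    except ValueError as err:
        print(f"rejected: {err}")
```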