Overview
Matcha-TTS includes a Gradio-based web interface for interactive text-to-speech synthesis. The interface provides real-time parameter adjustment, speaker selection, and immediate audio playback.
Launching the Interface
Features
Model Selection
Choose between two pre-trained models:
- Multi Speaker (VCTK): 108 speakers from the VCTK dataset
- Single Speaker (LJ Speech): High-quality single female speaker
Switching models also:
- Adjusts the speaker selector visibility
- Updates default speaking rate
- Switches example sets
- Loads appropriate vocoder
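The effect of switching models can be pictured as a lookup from the selected option to a bundle of UI settings. The sketch below is illustrative only: `ModelProfile`, `MODEL_OPTIONS`, and `on_model_change` are hypothetical names, not identifiers from matcha/app.py. The speaker counts and default speaking rates are the ones quoted on this page; vocoder and example-set names are omitted because this page does not give them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    """UI settings tied to one model choice (hypothetical structure)."""
    num_speakers: int
    default_length_scale: float

    @property
    def show_speaker_slider(self) -> bool:
        # A speaker selector only makes sense when there is more than one voice.
        return self.num_speakers > 1


# Values come from this page; the dictionary itself is an assumption.
MODEL_OPTIONS = {
    "Multi Speaker (VCTK)": ModelProfile(num_speakers=108, default_length_scale=0.85),
    "Single Speaker (LJ Speech)": ModelProfile(num_speakers=1, default_length_scale=0.95),
}


def on_model_change(option: str) -> dict:
    """Return the UI updates implied by selecting `option`."""
    profile = MODEL_OPTIONS[option]
    return {
        "speaker_slider_visible": profile.show_speaker_slider,
        "max_speaker_id": profile.num_speakers - 1,
        "length_scale": profile.default_length_scale,
    }
```

Keeping the per-model settings in one table means a new model option would only need a new entry rather than extra branching in the callback.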
Text Input
The main text box accepts any English text. The system will:
- Clean and normalize the text
- Convert to phonemes using english_cleaners2
- Display the phonetized sequence
- Synthesize speech
Speaker Selection
Multi-Speaker Mode (VCTK): A slider appears with speaker IDs from 0 to 107. Each ID represents a different speaker from the VCTK dataset with unique voice characteristics.
Single-Speaker Mode (LJ Speech): The speaker selector is hidden as only one voice is available.
Synthesis Controls
The interface exposes three key hyperparameters:
Number of ODE Steps
- Range: 1-100
- Default: 10
- Impact: Quality vs. speed tradeoff (more steps are slower but generally improve quality)
Length Scale (Speaking Rate)
- Range: 0.5-1.5
- Default: 0.85 (VCTK) / 0.95 (LJ Speech)
- Impact: Speech pace (higher values produce slower speech)
Sampling Temperature
- Range: 0.00-2.00
- Default: 0.667
- Step: 0.16675
- Impact: Prosody variation (higher values produce more varied output)
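The ranges and defaults listed above can be captured in a small validated container, which is handy when driving synthesis from a script instead of the sliders. This is a minimal sketch: `SynthesisSettings` and `VCTK_DEFAULTS` are hypothetical names, not part of Matcha-TTS, and only the numeric ranges and defaults are taken from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynthesisSettings:
    """The three synthesis controls (hypothetical container)."""
    n_steps: int = 10            # number of ODE steps, range 1-100
    length_scale: float = 0.95   # speaking rate, range 0.5-1.5 (LJ Speech default)
    temperature: float = 0.667   # sampling temperature, range 0.0-2.0

    def __post_init__(self) -> None:
        checks = (
            ("n_steps", self.n_steps, 1, 100),
            ("length_scale", self.length_scale, 0.5, 1.5),
            ("temperature", self.temperature, 0.0, 2.0),
        )
        for name, value, low, high in checks:
            if not low <= value <= high:
                raise ValueError(f"{name} must be between {low} and {high}, got {value}")


# The VCTK model uses a different default speaking rate.
VCTK_DEFAULTS = SynthesisSettings(length_scale=0.85)
```

Rejecting out-of-range values up front mirrors what the sliders enforce in the browser, so scripted runs stay within the same limits.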
Real-Time Output
After synthesis, the interface displays:
- Phonetized Text: Shows how the text was converted to phonemes
- Mel-Spectrogram: Visual representation of the generated speech
- Audio Player: Playback controls with download option
Interface Code Structure
Main Components
The Gradio app is defined in matcha/app.py:149-357.
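For orientation, a Gradio Blocks layout with the controls described on this page might be wired roughly as follows. This is a sketch under stated assumptions, not the contents of matcha/app.py: the function names, component labels, and the 0.05 length-scale step are invented for illustration, and `describe_request` is a placeholder that summarises the inputs rather than running any model. Only the ranges and defaults come from this page.

```python
def toggle_speaker_slider(model_choice: str) -> bool:
    """The speaker slider is only shown for the multi-speaker model."""
    return model_choice == "Multi Speaker (VCTK)"


def describe_request(text, speaker, n_steps, length_scale, temperature) -> str:
    """Placeholder callback: summarises the inputs instead of synthesising."""
    return (
        f"text={text!r}, speaker={int(speaker)}, steps={int(n_steps)}, "
        f"length_scale={length_scale}, temperature={temperature}"
    )


def build_demo():
    """Assemble the layout; needs the third-party `gradio` package."""
    import gradio as gr  # lazy import keeps the helpers above dependency-free

    with gr.Blocks() as demo:
        model = gr.Radio(
            choices=["Multi Speaker (VCTK)", "Single Speaker (LJ Speech)"],
            value="Multi Speaker (VCTK)",
            label="Model",
        )
        text = gr.Textbox(label="Text to synthesise", lines=3)
        speaker = gr.Slider(minimum=0, maximum=107, step=1, value=0, label="Speaker ID")
        n_steps = gr.Slider(minimum=1, maximum=100, step=1, value=10, label="Number of ODE steps")
        # The 0.05 step is an assumption; this page does not state one.
        length_scale = gr.Slider(minimum=0.5, maximum=1.5, step=0.05, value=0.85, label="Length scale")
        temperature = gr.Slider(minimum=0.0, maximum=2.0, step=0.16675, value=0.667, label="Sampling temperature")
        button = gr.Button("Synthesise")
        summary = gr.Textbox(label="Request summary", interactive=False)

        # Hide or show the speaker slider whenever the model choice changes.
        model.change(
            fn=lambda choice: gr.update(visible=toggle_speaker_slider(choice)),
            inputs=model,
            outputs=speaker,
        )
        button.click(
            fn=describe_request,
            inputs=[text, speaker, n_steps, length_scale, temperature],
            outputs=summary,
        )
    return demo
```

With Gradio installed, `build_demo().launch(share=False)` would serve this sketch locally.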
Synthesis Pipeline
The synthesis process follows two steps:
Step 1: Text Processing (matcha/app.py:102-104)
Step 2: Synthesis (matcha/app.py:108-122)
Dynamic Model Loading
The interface loads models on demand (matcha/app.py:72-98).
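One common way to load on demand is to remember which model is currently in memory and reload only when the selection changes. The sketch below illustrates that pattern; `LazyModelCache` and the stand-in loader are hypothetical and do not reproduce the code in matcha/app.py.

```python
from typing import Any, Callable, Optional


class LazyModelCache:
    """Keep one model in memory and reload only when the choice changes."""

    def __init__(self, loader: Callable[[str], Any]) -> None:
        self._loader = loader
        self._name: Optional[str] = None
        self._model: Any = None
        self.load_count = 0  # how many times the loader actually ran

    def get(self, name: str) -> Any:
        if name != self._name:
            # Selection changed (or nothing loaded yet): load and remember it.
            self._model = self._loader(name)
            self._name = name
            self.load_count += 1
        return self._model


def fake_loader(name: str) -> dict:
    """Stand-in for an expensive checkpoint load."""
    return {"name": name}
```

Requesting the same model repeatedly then costs nothing after the first call, while switching models replaces the one held in memory.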
Pre-Cached Examples
The interface includes cached examples for instant playback without synthesis, in two sets:
- Single Speaker Examples
- Multi-Speaker Examples
The examples are set up with cache_examples=True to avoid re-synthesis.
Model Configuration
The default configuration is defined in matcha/app.py:23-28.
See also matcha/app.py:42-51.
Technical Details
Audio Output Specifications
- Sample Rate: 22050 Hz
- Format: WAV (PCM_24)
- Channels: Mono
- Temporary Storage: Files created in system temp directory
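To make the output format concrete, the following sketch writes a short test tone as a mono, 24-bit PCM WAV file at 22050 Hz in the system temp directory. It uses only the standard-library `wave` module as a stand-in; the interface itself may well use a different audio library, and the function names here are hypothetical.

```python
import math
import os
import tempfile
import wave

SAMPLE_RATE = 22050  # Hz, as listed above


def write_pcm24_mono(samples, path, sample_rate=SAMPLE_RATE):
    """Write floats in [-1, 1] as a mono 24-bit PCM WAV file."""
    max_int = 2 ** 23 - 1
    frames = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, s))
        # Each sample becomes three little-endian bytes (signed 24-bit).
        frames += int(round(clipped * max_int)).to_bytes(3, "little", signed=True)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(3)   # 3 bytes = 24 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))


def write_test_tone(seconds=0.1, freq=440.0):
    """Create a temporary WAV file containing a sine tone and return its path."""
    n = int(SAMPLE_RATE * seconds)
    tone = [0.5 * math.sin(2 * math.pi * freq * i / SAMPLE_RATE) for i in range(n)]
    fd, path = tempfile.mkstemp(suffix=".wav")
    os.close(fd)
    write_pcm24_mono(tone, path)
    return path
```

Reading the file back with `wave.open(path, "rb")` is a quick way to confirm the channel count, sample width, and sample rate.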
GPU Support
The interface automatically detects GPU availability.
Model Downloads
All models are automatically downloaded on first use (matcha/app.py:54-57).
Downloaded files are stored locally (~/.local/share/matcha_tts/).
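Download-on-first-use usually comes down to checking whether the file already exists in the cache directory and fetching it only if it does not. The sketch below shows that check with an injected download function; `ensure_checkpoint`, `demo_first_use`, and the fake downloader are hypothetical, and a temporary directory stands in for the real cache location.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_checkpoint(name: str, cache_dir: Path, download: Callable[[str, Path], None]) -> Path:
    """Return the local path for `name`, downloading it only if missing."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / name
    if not target.exists():
        download(name, target)
    return target


def demo_first_use() -> int:
    """Request the same file twice and report how many downloads happened."""
    calls = []

    def fake_download(name: str, target: Path) -> None:
        # Stand-in for a real network fetch.
        calls.append(name)
        target.write_bytes(b"placeholder weights")

    with tempfile.TemporaryDirectory() as tmp:
        cache = Path(tmp) / "matcha_tts"
        ensure_checkpoint("model.ckpt", cache, fake_download)
        ensure_checkpoint("model.ckpt", cache, fake_download)
    return len(calls)
```

The second request finds the file already in place, so only the first one triggers a download.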
Customization
You can customize the interface by modifying matcha/app.py:
Add Custom Examples
Adjust Default Parameters
Share Publicly
The interface launches with share=True by default, creating a public URL.
Set share=False for local-only access.
Tips for Best Results
- Start with examples: Click cached examples to hear quality before experimenting
- Adjust one parameter at a time: Easier to understand each parameter’s effect
- Use 10 steps as baseline: Good balance for testing; increase for final output
- Keep temperature at 0.667: Sweet spot for natural prosody
- Try different speakers: VCTK has diverse voices (try 0, 16, 44, 45, 58)
- Check phonetized text: Verify correct pronunciation before synthesis