Matcha Models
Matcha is a high-quality TTS model that uses an acoustic model + vocoder pipeline to produce very natural-sounding speech. It’s designed for applications where quality is the top priority.
Model Architecture
Matcha uses a two-stage architecture:
- Acoustic Model (`acoustic_model.onnx`) – Generates mel-spectrogram from text
- Vocoder (`vocoder.onnx`) – Converts mel-spectrogram to waveform
- Tokens (`tokens.txt`) – Text token vocabulary
When to Use
High-Quality Audio
When naturalness and quality are more important than speed
Audiobook Narration
Professional-quality narration for long-form content
Content Creation
Voiceovers for videos, podcasts, and media
Expressive Speech
Natural prosody and intonation
Supported Languages
Matcha models are available for:
- English (primary focus)
- Some multilingual variants
Performance Characteristics
| Aspect | Rating | Notes |
|---|---|---|
| Streaming | ✅ Supported | Streaming generation available |
| Quality | ⭐⭐⭐⭐⭐ | Excellent, very natural-sounding |
| Speed | ⭐⭐⭐⭐ | Fast, but slower than VITS |
| Memory | ⭐⭐⭐ | Moderate (two models: acoustic + vocoder) |
| Model Size | Medium | Typically 50-100 MB (acoustic + vocoder) |
Download Links
Matcha Models
Browse and download pretrained Matcha models
Configuration Example
Basic TTS
With Model Options
Streaming TTS
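Streamed audio arrives in pieces, so the caller usually needs somewhere to put them. The following is a minimal sketch of a chunk collector, not the library's API: the `ChunkCollector` class is hypothetical, and it assumes each streamed chunk is a `Float32Array` of mono samples, which this page does not confirm.

```typescript
// Hypothetical helper: assumes each streamed chunk is a Float32Array of
// mono samples. The real chunk payload may be shaped differently.
class ChunkCollector {
  private chunks: Float32Array[] = [];
  private total = 0;

  // Call once for every chunk delivered by the stream.
  push(chunk: Float32Array): void {
    this.chunks.push(chunk);
    this.total += chunk.length;
  }

  // Number of samples received so far.
  get sampleCount(): number {
    return this.total;
  }

  // Join all received chunks into one contiguous buffer.
  toFloat32Array(): Float32Array {
    const out = new Float32Array(this.total);
    let offset = 0;
    for (const chunk of this.chunks) {
      out.set(chunk, offset);
      offset += chunk.length;
    }
    return out;
  }
}
```

In a real app each chunk would typically be handed to the audio player as it arrives, with the collector used only when the full utterance is also needed afterwards.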
Save to File
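As a rough illustration of what saving involves, the sketch below encodes raw samples as a 16-bit PCM WAV file image. It is not the library's own save routine: `encodeWav` is a hypothetical helper, and it assumes the generated audio is available as mono `Float32Array` samples in the range -1 to 1 together with the model's sample rate.

```typescript
// Hypothetical helper: encode mono float samples in [-1, 1] as a 16-bit PCM
// WAV file image. The sample format is an assumption, not taken from the API.
function encodeWav(samples: Float32Array, sampleRate: number): Uint8Array {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true); // file size minus the first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // size of the PCM fmt chunk
  view.setUint16(20, 1, true); // audio format 1 = PCM
  view.setUint16(22, 1, true); // channel count: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  // Clamp each sample, then scale to the signed 16-bit range.
  samples.forEach((sample, i) => {
    const clamped = Math.max(-1, Math.min(1, sample));
    const scaled = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
    view.setInt16(44 + i * bytesPerSample, Math.round(scaled), true);
  });

  return new Uint8Array(buffer);
}
```

The returned bytes can be written with whatever file API the app already uses. Writing the header with the model's actual sample rate matters, since a mismatched rate makes playback sound too fast or too slow.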
Model Options
Matcha models support two tuning parameters:
| Option | Type | Default | Description |
|---|---|---|---|
| `noiseScale` | number | 0.667 | Controls voice variation and expressiveness. Range: 0.0-1.0 |
| `lengthScale` | number | 1.0 | Speech speed. < 1.0 = faster, > 1.0 = slower |
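The defaults and ranges in the table can be captured in a small validation helper. This is a sketch under assumptions: the `MatchaTuning` type and `normalizeMatchaTuning` function are hypothetical names, and whether the library itself clamps or rejects out-of-range values is not stated here.

```typescript
// Hypothetical shape for the two tuning options listed in the table.
interface MatchaTuning {
  noiseScale: number;
  lengthScale: number;
}

// Defaults taken from the table above.
const MATCHA_DEFAULTS: MatchaTuning = { noiseScale: 0.667, lengthScale: 1.0 };

// Fill in defaults, clamp noiseScale to the documented 0.0-1.0 range, and
// reject a non-positive lengthScale. Clamping vs. rejecting is a design
// choice of this sketch, not documented library behavior.
function normalizeMatchaTuning(input: Partial<MatchaTuning> = {}): MatchaTuning {
  const noiseScale = input.noiseScale ?? MATCHA_DEFAULTS.noiseScale;
  const lengthScale = input.lengthScale ?? MATCHA_DEFAULTS.lengthScale;

  if (!Number.isFinite(noiseScale)) {
    throw new RangeError("noiseScale must be a finite number");
  }
  if (!Number.isFinite(lengthScale) || lengthScale <= 0) {
    throw new RangeError("lengthScale must be a positive finite number");
  }

  return {
    noiseScale: Math.min(1, Math.max(0, noiseScale)),
    lengthScale,
  };
}
```

Calling `normalizeMatchaTuning({ lengthScale: 1.2 })` keeps the default `noiseScale` and requests slower speech.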
Tuning Examples
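Since `lengthScale` stretches or compresses speech, its effect on output length can be estimated with simple arithmetic. The sketch below assumes duration grows roughly in proportion to `lengthScale`; `estimateDurationSeconds` is a hypothetical helper, and real output will deviate somewhat from the estimate.

```typescript
// Hypothetical helper: approximate the audio duration at a given lengthScale,
// assuming duration is roughly proportional to lengthScale.
function estimateDurationSeconds(baselineSeconds: number, lengthScale: number): number {
  if (baselineSeconds < 0 || lengthScale <= 0) {
    throw new RangeError("baselineSeconds must be >= 0 and lengthScale > 0");
  }
  // baselineSeconds is the duration measured at the default lengthScale of 1.0.
  return baselineSeconds * lengthScale;
}

// A clip that lasts 10 s at the default setting, estimated at other speeds.
const faster = estimateDurationSeconds(10, 0.8); // shorter than the baseline
const slower = estimateDurationSeconds(10, 1.25); // longer than the baseline
```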
Runtime Updates
Model Detection
Matcha models are detected automatically by:
- Presence of `acoustic_model.onnx` + `vocoder.onnx`
- No folder name pattern required
Expected files in the model folder:
- `acoustic_model.onnx`
- `vocoder.onnx`
- `tokens.txt`
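The detection rule can be mirrored in app code to validate a model folder before loading it. This is a sketch of the rule as described on this page, not the library's actual detection logic; `looksLikeMatchaModel` and `missingMatchaFiles` are hypothetical helpers that take a plain list of file names.

```typescript
// Files a Matcha model folder is expected to contain, per this page.
const REQUIRED_MATCHA_FILES = ["acoustic_model.onnx", "vocoder.onnx", "tokens.txt"] as const;

// Mirrors the detection rule: both ONNX files present, folder name ignored.
function looksLikeMatchaModel(fileNames: readonly string[]): boolean {
  const present = new Set(fileNames);
  return present.has("acoustic_model.onnx") && present.has("vocoder.onnx");
}

// Lists which expected files are absent, useful for error messages.
function missingMatchaFiles(fileNames: readonly string[]): string[] {
  const present = new Set(fileNames);
  return REQUIRED_MATCHA_FILES.filter((name) => !present.has(name));
}
```

A folder can pass detection yet still fail to load if `tokens.txt` is missing, which is why the second helper checks all three files.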
Performance Tips
Optimize Thread Count
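One possible way to choose a value for `numThreads` is a simple heuristic based on the number of CPU cores. The sketch below is an assumption, not a documented recommendation: `pickNumThreads` is a hypothetical helper, and both the cap of 4 and the idea of leaving one core free are arbitrary starting points to benchmark against.

```typescript
// Hypothetical heuristic: leave one core free for UI and audio playback,
// and cap the thread count. Both choices are assumptions to tune by testing.
function pickNumThreads(coreCount: number, maxThreads = 4): number {
  if (!Number.isFinite(coreCount) || coreCount < 1) {
    return 1; // unknown or invalid core count: fall back to a single thread
  }
  const usable = Math.max(1, Math.floor(coreCount) - 1);
  return Math.min(usable, maxThreads);
}
```

The result would be passed as `numThreads` when configuring the model. Measuring generation time at a few settings on target devices is more reliable than any fixed rule.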
Use Streaming for Long Text
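One way to prepare long text for incremental generation is to split it into sentence-sized pieces and synthesize them in order. The sketch below shows only the splitting step; `splitIntoSentences` is a hypothetical helper, the punctuation rule is deliberately simple, and the 200-character default is an arbitrary assumption.

```typescript
// Hypothetical helper: split text at sentence-ending punctuation and pack
// consecutive sentences into pieces of at most maxChars characters.
// A single sentence longer than maxChars is kept whole.
function splitIntoSentences(text: string, maxChars = 200): string[] {
  const sentences = text.match(/[^.!?]+[.!?]*\s*/g) ?? [];
  const pieces: string[] = [];
  let current = "";

  for (const raw of sentences) {
    const sentence = raw.trim();
    if (sentence.length === 0) {
      continue;
    }
    if (current.length > 0 && current.length + 1 + sentence.length > maxChars) {
      // Adding this sentence would overflow the piece, so start a new one.
      pieces.push(current);
      current = sentence;
    } else {
      current = current.length > 0 ? `${current} ${sentence}` : sentence;
    }
  }

  if (current.length > 0) {
    pieces.push(current);
  }
  return pieces;
}
```

Each piece can then be generated and played while the next one is being synthesized.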
For better perceived performance, stream long text so that playback can begin before the full text has been synthesized.
Hardware Acceleration
Streaming Support
Streaming: ✅ Yes
Matcha models support streaming generation. Use `generateSpeechStream()` for incremental audio generation and low-latency playback.
Advantages
- Excellent Quality: Very natural-sounding speech
- Natural Prosody: Good intonation and rhythm
- Streaming: Supports incremental generation
- Acoustic Model + Vocoder: Flexible two-stage architecture
- Multi-Speaker: Some models support multiple speakers
Limitations
- Slower than VITS: Two-stage architecture is slightly slower
- Larger Size: Requires both acoustic model and vocoder
- No Voice Cloning: Cannot synthesize custom voices
- Limited Languages: Primarily English-focused
Use Cases
Audiobook Narration
Professional-quality long-form narration
Content Production
Voiceovers for videos and media
E-Learning
High-quality educational content
Podcasts
Natural-sounding podcast narration
Common Issues
Model not loading
- Verify both `acoustic_model.onnx` and `vocoder.onnx` are present
- Check that `tokens.txt` exists
- Ensure sufficient device memory for both models
Slow generation
- Increase `numThreads` on multi-core devices
- Use hardware acceleration (`provider: 'nnapi'`)
- Consider using VITS for faster generation
- Ensure no other heavy apps are running
Audio quality issues
- Adjust `noiseScale` for more/less expressiveness
- Try different `lengthScale` values
- Ensure correct sample rate for playback
- Check that vocoder output is not being resampled incorrectly
Comparison with Other Models
| Feature | Matcha | VITS | Zipvoice | Kokoro |
|---|---|---|---|---|
| Quality | Very High | High | Very High | High |
| Speed | Fast | Very Fast | Medium | Fast |
| Streaming | Yes | Yes | No | Yes |
| Voice Cloning | No | No | Yes | No |
| Model Size | Medium | Small | Large | Small |
| Architecture | Acoustic + Vocoder | End-to-End | Encoder + Decoder + Vocoder | End-to-End |
Next Steps
TTS API
Detailed API documentation
Streaming TTS
Low-latency streaming guide
Model Setup
How to download and bundle models
Execution Providers
Hardware acceleration options