# Matcha-TTS
Matcha-TTS is a fast, high-quality text-to-speech (TTS) system that uses conditional flow matching to generate natural-sounding speech. Published at ICASSP 2024, it represents a new approach to non-autoregressive neural TTS.

Check out the demo page to hear Matcha-TTS in action, or try it in your browser on HuggingFace Spaces.
## What is Matcha-TTS?
Matcha-TTS uses conditional flow matching (similar to rectified flows) to speed up ODE-based speech synthesis. It is an encoder-decoder architecture trained using optimal-transport conditional flow matching (OT-CFM), yielding an ODE-based decoder capable of high output quality in fewer synthesis steps than models trained using score matching.

## Key Features
- Probabilistic: Uses stochastic modeling for natural speech variation
- Compact memory: Smallest memory footprint among comparable models
- Highly natural: Achieves the highest mean opinion scores in listening tests
- Very fast: Rivals the speed of the fastest models on long utterances
### Additional Capabilities
- Non-autoregressive: Generates speech in parallel for faster synthesis
- No external alignments needed: Learns to speak from scratch
- Multi-speaker support: Pre-trained models available for both single and multi-speaker scenarios
- Adjustable parameters: Control speaking rate, sampling temperature, and ODE solver steps
## Architecture Overview
Matcha-TTS consists of three main components:

### Text Encoder
The text encoder processes phoneme sequences and predicts durations. It supports:
- Speaker embeddings for multi-speaker models
- Duration prediction for alignment learning
- Flexible encoder architectures (Transformer-based)
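Duration prediction drives a simple upsampling step: each phoneme's encoder output is repeated for its predicted number of mel frames, which is how the model learns alignment without external alignments. The sketch below illustrates the idea only; `expand_by_duration` and the phoneme symbols are hypothetical, not Matcha-TTS API:

```python
def expand_by_duration(phoneme_states, durations):
    # Repeat each phoneme's encoder state `d` times, producing the
    # frame-level sequence that conditions the decoder.
    frames = []
    for state, d in zip(phoneme_states, durations):
        frames.extend([state] * d)
    return frames

# 4 phonemes with predicted durations summing to 10 mel frames:
print(expand_by_duration(["HH", "AH", "L", "OW"], [2, 1, 3, 4]))
# → ['HH', 'HH', 'AH', 'L', 'L', 'L', 'OW', 'OW', 'OW', 'OW']
```

In the real model the repeated items are hidden-state vectors rather than strings, but the expansion logic is the same.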
### Conditional Flow Matching (CFM) Decoder
The decoder uses conditional flow matching to transform noise into mel-spectrograms:
- Uses ODE-based synthesis with configurable timesteps
- Employs optimal transport for efficient training
- Generates high-quality mel-spectrograms in as few as 2-4 synthesis steps
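The few-step ODE synthesis can be illustrated with a toy Euler sampler. In the real decoder a neural network predicts the vector field; here we substitute the closed-form OT field for a point-mass target (an assumption made purely for illustration), so the straight transport paths let Euler integration land on the target regardless of step count. The function and parameter names are hypothetical, not Matcha-TTS API:

```python
import random

def point_mass_velocity(x, t, target):
    # Closed-form OT-CFM vector field when the data distribution collapses
    # to a single point: straight paths x_t = (1 - t) * x0 + t * x1
    # give v(x, t) = (x1 - x) / (1 - t).
    return [(x1i - xi) / (1.0 - t) for xi, x1i in zip(x, target)]

def euler_sample(target, n_steps=4, temperature=0.667, seed=0):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with Euler steps."""
    rng = random.Random(seed)
    # Start from temperature-scaled Gaussian noise, as in ODE-based decoders.
    x = [temperature * rng.gauss(0.0, 1.0) for _ in target]
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        v = point_mass_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

toy_mel_frame = [0.3, -1.2, 0.8]  # stand-in for one mel-spectrogram frame
print(euler_sample(toy_mel_frame, n_steps=4))  # converges to the target frame
```

Because OT-CFM makes the paths straight, a coarse solver with very few steps stays accurate, which is the source of Matcha-TTS's speed advantage over score-matching models whose curved trajectories need many steps.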
### Vocoder
Matcha-TTS uses HiFi-GAN vocoders to convert mel-spectrograms to audio:
- `hifigan_T2_v1`: Optimized for single-speaker (LJSpeech) models
- `hifigan_univ_v1`: Universal vocoder for multi-speaker (VCTK) models
Pre-trained models are automatically downloaded when using the CLI or Gradio interface.
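The model-to-vocoder pairing above can be expressed as a small selection rule. This sketch mirrors the convention described here; the function name and the model-name substrings it checks are assumptions, not code from the package:

```python
def pick_vocoder(model_name: str) -> str:
    # Multi-speaker checkpoints (e.g. trained on VCTK) pair with the
    # universal vocoder; single-speaker (LJSpeech) ones with hifigan_T2_v1.
    if "vctk" in model_name.lower() or "multispeaker" in model_name.lower():
        return "hifigan_univ_v1"
    return "hifigan_T2_v1"

print(pick_vocoder("matcha_ljspeech"))  # hifigan_T2_v1
print(pick_vocoder("matcha_vctk"))      # hifigan_univ_v1
```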
## Performance
Matcha-TTS achieves impressive real-time factors (RTF):
- Matcha-TTS only: RTF ~0.017 (58x faster than real time)
- With HiFi-GAN vocoder: RTF ~0.021 (47x faster than real time)
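RTF is simply compute time divided by the duration of the generated audio, so values below 1 mean faster-than-real-time synthesis. A quick check of the numbers above (the helper function is illustrative, not part of the package):

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    # RTF < 1 means faster than real time; the speed-up is 1 / RTF.
    return synthesis_seconds / audio_seconds

# e.g. producing 10 s of audio in 0.17 s of compute:
rtf = real_time_factor(0.17, 10.0)
print(rtf)           # ≈ 0.017
print(int(1 / rtf))  # ≈ 58x faster than real time
```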
## Use Cases
- Audiobook narration: Fast, natural-sounding long-form content
- Voice assistants: Low-latency speech synthesis
- Content creation: Multi-speaker voice generation
- Accessibility tools: Text-to-speech for screen readers
- Research: Baseline for TTS research and experimentation
## Citation
If you use Matcha-TTS in your research, please cite the ICASSP 2024 paper.

## Next Steps
- Installation: Install Matcha-TTS via pip or from source
- Quick Start: Start synthesizing speech in minutes
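A minimal first run might look like the following. The `matcha-tts` command comes with the pip install; any flags beyond `--text` are assumptions that may differ in your installed version, so check `matcha-tts --help`:

```shell
# Synthesize a sentence with the default model
# (pre-trained weights are downloaded automatically on first use)
matcha-tts --text "Hello, this is Matcha T T S."

# Flag names below are illustrative; verify against `matcha-tts --help`
matcha-tts --text "Faster synthesis." --steps 4 --temperature 0.667
```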