VITS Models
VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) models provide fast, high-quality speech synthesis. They're widely used in production applications and available from multiple sources: Piper, Coqui, MeloTTS, and MMS.

Model Architecture
VITS is a single-model end-to-end TTS architecture:
- Model (`model.onnx` or `vits-*.onnx`) – Neural TTS model
- Tokens (`tokens.txt`) – Text token vocabulary
- Optional: `lexicon.txt`, `espeak-ng-data` (for phoneme-based models)
When to Use
- Fast TTS – Real-time speech generation with low latency
- Streaming Playback – Incremental audio generation for interactive apps
- Multi-Speaker – Many voices available in a single model
- Production Apps – Battle-tested, widely deployed models
VITS Variants
Piper
Piper is a collection of high-quality VITS models with excellent voice coverage:
- Many languages and voices
- Fast inference
- Excellent quality
- Multiple speaker support
- Widely used in production
Coqui
Coqui VITS models:
- High-quality voices
- Multilingual support
- Good for expressive speech
MeloTTS
MeloTTS models:
- Optimized for speed
- Multilingual (English, Spanish, Chinese, etc.)
- Good quality with fast inference
MMS (Massively Multilingual Speech)
MMS from Meta:
- 1000+ languages
- Good for low-resource languages
- Larger models, slower inference
Supported Languages
VITS models (especially Piper) support:
- English (US, UK, and other accents) – Many voices
- Spanish, French, German, Italian, Portuguese
- Chinese, Japanese, Korean
- And many more (depends on model source)
Performance Characteristics
| Aspect | Rating | Notes |
|---|---|---|
| Streaming | ✅ Excellent | Native streaming support |
| Quality | ⭐⭐⭐⭐ | High quality, natural-sounding |
| Speed | ⭐⭐⭐⭐⭐ | Very fast, real-time capable |
| Memory | ⭐⭐⭐⭐ | Moderate, suitable for mobile |
| Model Size | Small-Medium | Typically 10-50 MB per voice |
Download Links
VITS/Piper Models
Download Piper, Coqui, MeloTTS, and MMS models
Configuration Example
Basic TTS
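The exact API depends on the TTS library in use; as a minimal sketch, a single-voice VITS configuration might look like this (the field names `model`, `tokens`, and `numThreads` are assumptions, not a documented API):

```typescript
// Hypothetical configuration for a basic VITS voice.
// Field names are assumptions about the API shape.
const basicConfig = {
  model: 'models/en_US-voice/model.onnx', // or a vits-*.onnx file
  tokens: 'models/en_US-voice/tokens.txt',
  numThreads: 2, // see Performance Tips below
};
```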
With Model Options
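The tuning parameters described under Model Options can be added to the same configuration; the `modelOptions` nesting is an assumption, and the values shown are the documented defaults:

```typescript
// VITS configuration with its three tuning parameters at their defaults.
// The modelOptions field name is an assumption.
const configWithOptions = {
  model: 'models/en_US-voice/model.onnx',
  tokens: 'models/en_US-voice/tokens.txt',
  modelOptions: {
    noiseScale: 0.667, // voice variation (default)
    noiseScaleW: 0.8,  // duration/timing variation (default)
    lengthScale: 1.0,  // speech speed (1.0 = normal)
  },
};
```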
Streaming TTS with Live Playback
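A sketch of wiring `generateSpeechStream()` to a player. The function and `onChunk` callback names come from this page; the player callback is a placeholder for whatever native audio player you use:

```typescript
// Signature assumed for the library's streaming call.
type SpeechStreamFn = (
  text: string,
  opts: { onChunk: (samples: Float32Array) => void },
) => Promise<void>;

// Stream synthesis and play each chunk as it arrives.
async function speakStreaming(
  generateSpeechStream: SpeechStreamFn,
  playChunk: (samples: Float32Array) => void,
  text: string,
): Promise<number> {
  let samplesReceived = 0;
  await generateSpeechStream(text, {
    onChunk: (samples) => {
      samplesReceived += samples.length;
      // Keep this handler lightweight: hand samples to the native
      // player immediately instead of buffering them in JS.
      playChunk(samples);
    },
  });
  return samplesReceived;
}
```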
Multi-Speaker Selection
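Multi-speaker models expose their trained voices by numeric id; the `speakerId` field name here is an assumption about the API shape:

```typescript
// Selecting a voice from a multi-speaker VITS model.
// The speakerId field name is an assumption.
const multiSpeakerConfig = {
  model: 'models/multi-voice/model.onnx',
  tokens: 'models/multi-voice/tokens.txt',
  speakerId: 3, // index into the model's trained voices
};
```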
Model Options
VITS models support three tuning parameters:

| Option | Type | Default | Description |
|---|---|---|---|
| `noiseScale` | number | 0.667 | Controls voice variation. Lower = clearer, less expressive. Range: 0.0-1.0 |
| `noiseScaleW` | number | 0.8 | Duration noise. Affects timing variation. Range: 0.0-1.0 |
| `lengthScale` | number | 1.0 | Speech speed. < 1.0 = faster, > 1.0 = slower |
Tuning Examples
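Some hedged starting points derived from the parameter descriptions above; the values are suggestions to experiment with, not recommendations from the model authors:

```typescript
// Presets based on the Model Options table (defaults: 0.667 / 0.8 / 1.0).
const clearSpeech   = { noiseScale: 0.4,   noiseScaleW: 0.6, lengthScale: 1.0 }; // clearer, less expressive
const expressive    = { noiseScale: 0.9,   noiseScaleW: 0.9, lengthScale: 1.0 }; // more variation
const slowNarration = { noiseScale: 0.667, noiseScaleW: 0.8, lengthScale: 1.2 }; // ~20% slower
const fastAlerts    = { noiseScale: 0.667, noiseScaleW: 0.8, lengthScale: 0.8 }; // ~20% faster
```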
Runtime Parameter Updates
You can update parameters without reloading the model.

Model Detection
VITS models are detected automatically:
- Folder name should contain `vits` (when used with other TTS models)
- Files: `model.onnx` or `vits-*.onnx`, plus `tokens.txt`
- Optional: `lexicon.txt`, `espeak-ng-data` directory
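The detection rules can be illustrated with a small helper. The library performs its own detection; this sketch only restates the file rules above:

```typescript
// Returns true if a folder's file list matches the VITS layout:
// a model.onnx (or vits-*.onnx) plus tokens.txt.
function looksLikeVitsModel(files: string[]): boolean {
  const hasModel = files.some(
    (f) => f === 'model.onnx' || /^vits-.*\.onnx$/.test(f),
  );
  return hasModel && files.includes('tokens.txt');
}
```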
Performance Tips
Optimize Thread Count
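A common heuristic is to use a few cores while leaving headroom for the UI thread; the cap of 4 below is an assumption to validate on target devices:

```typescript
// Conservative thread-count heuristic: at least 1, at most 4,
// leaving one core free for the UI thread.
function pickNumThreads(coreCount: number): number {
  return Math.max(1, Math.min(4, coreCount - 1));
}
```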
Use Streaming for Long Text
For text longer than a few sentences, use streaming to start playback earlier.

Hardware Acceleration
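If the library accepts an execution provider option, one hedged way to choose it per platform (NNAPI is Android-only; XNNPACK is a portable CPU backend; benchmark both, since the mapping is an assumption):

```typescript
// Pick an execution provider name per platform.
// Provider strings ('nnapi', 'xnnpack') are the ones named on this page.
function pickProvider(platform: 'android' | 'ios'): 'nnapi' | 'xnnpack' {
  return platform === 'android' ? 'nnapi' : 'xnnpack';
}
```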
Streaming Support
Streaming: ✅ Yes

VITS models have excellent streaming support. Use `generateSpeechStream()` for low-latency, incremental audio generation.

Advantages
- Fast Inference: Real-time capable on mobile devices
- High Quality: Natural-sounding speech
- Streaming: Native incremental generation
- Multi-Speaker: Many voices in a single model
- Wide Language Coverage: Especially with Piper models
- Small Size: 10-50 MB per voice
- Production-Ready: Battle-tested in many applications
Limitations
- No Voice Cloning: Cannot synthesize custom voices from reference audio (use Zipvoice or Pocket instead)
- Fixed Voices: Speaker selection limited to model’s trained voices
- Prosody Control: Limited control over emotion and emphasis
Use Cases
- Voice Assistants – Fast, responsive voice interfaces
- Screen Readers – Accessibility applications with streaming TTS
- E-Learning – Clear narration for educational content
- Audiobooks – Long-form audio generation with streaming
- Navigation – Real-time turn-by-turn directions
- Notifications – Short audio alerts and messages
Common Issues
Model not loading
- Verify `model.onnx` (or `vits-*.onnx`) and `tokens.txt` are present
- Check that the model path is correct
- Ensure sufficient device memory
- For Piper models, ensure the `espeak-ng-data` directory is included if required
Poor audio quality
- Adjust `noiseScale` (lower for clearer speech)
- Try different `lengthScale` values
- Ensure the correct sample rate for playback
- Check if audio is being resampled incorrectly
Slow generation
- Increase `numThreads` on multi-core devices
- Use hardware acceleration (`provider: 'nnapi'` or `'xnnpack'`)
- Use smaller/faster VITS models
- Ensure no other heavy apps are running
Streaming audio stutters
- Ensure the `onChunk` handler is lightweight
- Write chunks to the native player immediately (don't buffer in JS)
- Increase buffer sizes in the native audio player
- Use fewer threads to reduce chunk latency
Comparison with Other Models
| Feature | VITS | Matcha | Zipvoice | Kokoro |
|---|---|---|---|---|
| Speed | Very Fast | Fast | Medium | Fast |
| Quality | High | Very High | Very High | High |
| Streaming | Yes | Yes | No | Yes |
| Voice Cloning | No | No | Yes | No |
| Model Size | Small | Medium | Large | Small |
| Languages | Many | Limited | Limited | Multi |
Next Steps
- TTS API – Detailed API documentation
- Streaming TTS – Low-latency streaming guide
- Model Setup – How to download and bundle models
- Execution Providers – Hardware acceleration options