Other TTS Models
This page covers additional TTS model types including lightweight KittenTTS, voice cloning with Zipvoice, and flow-matching Pocket models.
Overview
KittenTTS
Lightweight, multi-speaker TTS
Zipvoice
Zero-shot voice cloning
Pocket
Flow-matching TTS with voice cloning
KittenTTS
modelType: 'kitten'
Description
KittenTTS is a lightweight, fast, multi-speaker TTS model optimized for resource-constrained devices.
Characteristics
- Streaming: ✅ Yes
- Quality: ⭐⭐⭐ Good
- Speed: ⭐⭐⭐⭐⭐ Very Fast
- Memory: ⭐⭐⭐⭐⭐ Very Low
- Size: Very Small (typically 10-30 MB)
- Multi-Speaker: ✅ Yes
Configuration
Streaming Example
Download
KittenTTS Models
Download KittenTTS models
Model Detection
- Folder name should contain kitten (not kokoro)
- Files: model.onnx, tokens.txt
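The detection rule above can be sketched as a small check. This is a minimal illustration, not the library's actual detection code: `looksLikeKitten` is a hypothetical helper, and both the case-insensitive match and the reading of "not kokoro" as an exclusion are assumptions.

```typescript
// Hypothetical helper for illustration; not part of any library API.
function looksLikeKitten(folderName: string, files: string[]): boolean {
  // Assumption: matching is case-insensitive, and a folder name that
  // mentions "kokoro" is never treated as KittenTTS.
  const nameMatches = /kitten/i.test(folderName) && !/kokoro/i.test(folderName);

  // Both files listed above must be present in the folder.
  const required = ["model.onnx", "tokens.txt"];
  const hasFiles = required.every((name) => files.some((f) => f === name));

  return nameMatches && hasFiles;
}
```

Running a check like this before loading a model gives a clearer error than a failed initialization when a folder is misnamed or incomplete.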
When to Use
Low-End Devices
Resource-constrained mobile devices
Fast Response
Applications requiring minimal latency
Battery Efficiency
Low power consumption for longer battery life
Embedded Systems
IoT devices with limited resources
Advantages
- Very Fast: Fastest of the three model types on this page
- Very Small: Minimal storage footprint
- Low Memory: Runs on constrained devices
- Streaming: Low-latency incremental generation
- Multi-Speaker: Multiple voices in one model
Limitations
- Quality: Good but not as natural as VITS or Matcha
- Limited Languages: Fewer language options
- No Voice Cloning: Fixed voice set only
Zipvoice
modelType: 'zipvoice'
Description
Zipvoice is a zero-shot voice cloning model that can synthesize speech in any voice from a short reference audio sample.
Characteristics
- Streaming: ❌ No (batch only for voice cloning)
- Quality: ⭐⭐⭐⭐⭐ Excellent
- Speed: ⭐⭐⭐ Medium
- Memory: ⭐⭐ High (requires significant RAM)
- Size: Large (~605 MB for full model)
- Voice Cloning: ✅ Yes
Architecture
Zipvoice uses a three-stage pipeline:
- Encoder – Encodes reference audio
- Decoder (flow-matching) – Generates mel-spectrogram
- Vocoder (e.g. vocos_24khz.onnx) – Converts to waveform
Configuration
Memory Requirements
Download
Zipvoice Models
Download Zipvoice models (full and int8 distill variants)
Model Detection
Zipvoice is detected by file layout:
- Encoder + decoder + vocoder files
- Optional folder name pattern (containing zipvoice)
- Files: encoder, decoder, vocos_*.onnx (vocoder), tokens.txt, lexicon.txt, espeak-ng-data
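As a rough sketch of a layout check, the helper below reports which of the listed pieces are missing from a folder listing. `missingZipvoiceParts` is a hypothetical name, not a library function. The patterns for the encoder and decoder files are assumptions, since only the vocoder pattern (vocos_*.onnx) is spelled out here. Treating every listed file as required is also an assumption.

```typescript
// Hypothetical helper for illustration; not part of any library API.
// Returns the labels of the required pieces that are missing from `files`.
function missingZipvoiceParts(files: string[]): string[] {
  // Assumption: encoder and decoder file names contain "encoder" or
  // "decoder" and end in ".onnx". The exact names are not given above.
  const required: Array<[string, RegExp]> = [
    ["encoder", /encoder.*\.onnx$/],
    ["decoder", /decoder.*\.onnx$/],
    ["vocoder", /^vocos_.*\.onnx$/],
    ["tokens.txt", /^tokens\.txt$/],
    ["lexicon.txt", /^lexicon\.txt$/],
    ["espeak-ng-data", /^espeak-ng-data$/],
  ];

  // Keep only the entries for which no file matches the pattern.
  return required
    .filter(([, pattern]) => !files.some((f) => pattern.test(f)))
    .map(([label]) => label);
}
```

An empty result means the layout matches. A result containing "vocoder" points at a model folder that ships without a vocoder file.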
When to Use
Custom Voices
Synthesize speech in any voice from reference audio
Voice Cloning Apps
Apps that need user-specific voice synthesis
Dubbing & Translation
Translate content while preserving original voice
Personalization
Personalized voice experiences
Advantages
- Zero-Shot Voice Cloning: Clone any voice from short audio
- Excellent Quality: Very natural-sounding output
- Flexible: Works with various reference voices
- Multilingual: Supports Chinese and English
Limitations
- High Memory: Full model needs 8+ GB device RAM
- No Streaming: Voice cloning only supports batch generation
- Large Size: ~605 MB (use int8 distill variant for smaller size)
- Slower: Flow-matching is computationally intensive
- Requires Vocoder: Distill-only models (no vocoder) will fail
Reference Audio Requirements
- Format: Mono, float PCM samples in [-1, 1]
- Sample Rate: Typically 22050 Hz or 24000 Hz
- Duration: 3-10 seconds recommended
- Quality: Clear speech, minimal background noise
- Transcript: Must provide accurate transcript of reference audio
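The requirements above can be checked before synthesis with a small validator. This is a sketch under stated assumptions: `ReferenceAudio` and `checkReferenceAudio` are hypothetical names, not library types, and the sample rate is only checked for being positive because the listed rates are described as typical rather than required.

```typescript
// Hypothetical shape for illustration; not a library type.
interface ReferenceAudio {
  samples: Float32Array; // mono float PCM
  sampleRate: number; // in Hz
  transcript: string; // what is spoken in the samples
}

// Returns a list of human-readable problems; an empty list means the
// reference audio meets the documented recommendations.
function checkReferenceAudio(ref: ReferenceAudio): string[] {
  const problems: string[] = [];

  if (!(ref.sampleRate > 0)) {
    problems.push("sample rate must be a positive number");
    return problems; // duration cannot be computed without a valid rate
  }

  // Recommended duration is 3-10 seconds.
  const seconds = ref.samples.length / ref.sampleRate;
  if (seconds < 3 || seconds > 10) {
    problems.push(`duration ${seconds.toFixed(1)} s is outside the recommended 3-10 s`);
  }

  // Every sample must lie in [-1, 1]; the negated test also catches NaN.
  for (let i = 0; i < ref.samples.length; i++) {
    const s = ref.samples[i];
    if (!(s >= -1 && s <= 1)) {
      problems.push(`sample ${i} is outside [-1, 1]`);
      break; // one report is enough
    }
  }

  if (ref.transcript.trim().length === 0) {
    problems.push("transcript is empty");
  }

  return problems;
}
```

Background noise and transcript accuracy cannot be verified this way, so they still need a manual check.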
Pocket
modelType: 'pocket'
Description
Pocket is a flow-matching TTS model that supports both standard synthesis and voice cloning with reference audio.
Characteristics
- Streaming: ✅ Yes (including with reference audio for Kotlin-engine models)
- Quality: ⭐⭐⭐⭐ High
- Speed: ⭐⭐⭐⭐ Fast
- Memory: ⭐⭐⭐ Moderate
- Size: Medium
- Voice Cloning: ✅ Yes
Configuration
Streaming with Voice Cloning
Unlike Zipvoice, Pocket supports streaming even with reference audio:
Extra Options
Pocket accepts model-specific options via the extra parameter:
Download
Pocket Models
Download Pocket TTS models
Model Detection
Pocket is detected by file layout:
- Files: lm_flow, lm_main, text_conditioner, vocab/token_scores
- No folder name pattern required
When to Use
Voice Cloning + Streaming
Need both voice cloning and low-latency streaming
Modern Architecture
Flow-matching for high-quality synthesis
Flexible Options
Fine-grained control with extra parameters
Interactive Apps
Real-time custom voice applications
Advantages
- Streaming + Voice Cloning: Supports both simultaneously
- Flow-Matching: Modern architecture for quality
- Fast: Good performance with streaming
- Flexible: Extra options for fine-tuning
- Good Quality: Natural-sounding speech
Limitations
- Newer: Less battle-tested than VITS or Zipvoice
- Documentation: Fewer examples and resources
- Model Availability: Fewer pretrained models
Comparison Table
| Feature | KittenTTS | Zipvoice | Pocket |
|---|---|---|---|
| Speed | Very Fast | Medium | Fast |
| Quality | Good | Excellent | High |
| Streaming | Yes | No | Yes |
| Voice Cloning | No | Yes | Yes |
| Model Size | Very Small | Large | Medium |
| Memory | Very Low | High | Moderate |
| Best For | Low-end devices | High-quality cloning | Streaming + cloning |
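The streaming and voice cloning rows of the comparison table can be encoded as a lookup, which lets an app decide between streaming and batch generation up front. This is a minimal sketch: `CAPABILITIES` and `canStream` are hypothetical names, not part of the library API.

```typescript
type ModelType = "kitten" | "zipvoice" | "pocket";

interface Capabilities {
  streaming: boolean;
  voiceCloning: boolean;
}

// Values copied from the comparison table above.
const CAPABILITIES: Record<ModelType, Capabilities> = {
  kitten: { streaming: true, voiceCloning: false },
  zipvoice: { streaming: false, voiceCloning: true },
  pocket: { streaming: true, voiceCloning: true },
};

// True when a request can be streamed: the model must support streaming,
// and if reference audio is supplied it must also support voice cloning.
// Note: for Pocket, streaming with reference audio is documented for
// Kotlin-engine models; this sketch does not model that distinction.
function canStream(modelType: ModelType, hasReferenceAudio: boolean): boolean {
  const caps = CAPABILITIES[modelType];
  if (hasReferenceAudio && !caps.voiceCloning) {
    return false;
  }
  return caps.streaming;
}
```

When `canStream` returns false, the app would fall back to batch generation or reject the request.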
Choosing Between Models
For Voice Cloning
- Zipvoice – Best quality, batch generation only, high memory
- Pocket – Streaming support, good quality, moderate memory
For Speed
- KittenTTS – Fastest, lightweight
- Pocket – Fast with streaming
For Low-End Devices
- KittenTTS – Minimal resources
- Zipvoice int8 distill – If voice cloning is needed
For High Quality
- Zipvoice – Excellent voice cloning quality
- Pocket – Good quality with more flexibility
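The guidance above can be condensed into a decision function. This is one possible reading, not an official rule: `Needs`, `Recommendation`, and `recommendModel` are hypothetical names, and the priority order (voice cloning first, then streaming, then device class) is an assumption.

```typescript
interface Needs {
  voiceCloning: boolean;
  streaming: boolean;
  lowEndDevice: boolean;
}

interface Recommendation {
  modelType: "kitten" | "zipvoice" | "pocket";
  note: string;
}

// Hypothetical helper: picks among the three model types on this page.
function recommendModel(needs: Needs): Recommendation {
  // Without voice cloning, KittenTTS is the fastest and lightest option.
  if (!needs.voiceCloning) {
    return { modelType: "kitten", note: "fastest and lightest; fixed voice set" };
  }
  // Cloning plus streaming is only covered by Pocket.
  if (needs.streaming) {
    return { modelType: "pocket", note: "streams with reference audio" };
  }
  // Batch cloning on a constrained device: the smaller Zipvoice variant.
  if (needs.lowEndDevice) {
    return { modelType: "zipvoice", note: "use the int8 distill variant" };
  }
  // Batch cloning with enough memory: full Zipvoice for best quality.
  return { modelType: "zipvoice", note: "best cloning quality; batch only" };
}
```

For example, `recommendModel({ voiceCloning: true, streaming: true, lowEndDevice: false })` selects Pocket, matching the "Streaming + cloning" row of the comparison table.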
Next Steps
TTS Overview
Compare all TTS model types
TTS API
Detailed API documentation
Streaming TTS
Low-latency streaming guide
Model Setup
How to download and bundle models