Kokoro Models
Kokoro is a multi-speaker, multi-language TTS model designed for flexible speech synthesis across different voices and languages.
Model Architecture
Kokoro uses an end-to-end neural TTS architecture:
- Model (`model.onnx` or `kokoro-*.onnx`) – Neural TTS model
- Tokens (`tokens.txt`) – Text token vocabulary
- Optional configuration files
When to Use
Multi-Language Apps
Applications serving users in multiple languages
Multiple Voices
Need for different speakers/voices in one model
Fast Streaming
Real-time speech generation with low latency
Compact Deployment
Single model for multiple languages and voices
Supported Languages
Kokoro models support multiple languages, including:
- English
- Spanish
- French
- German
- And potentially others (check specific model variant)
Performance Characteristics
| Aspect | Rating | Notes |
|---|---|---|
| Streaming | ✅ Excellent | Native streaming support |
| Quality | ⭐⭐⭐⭐ | High quality, natural speech |
| Speed | ⭐⭐⭐⭐⭐ | Fast inference |
| Memory | ⭐⭐⭐⭐ | Moderate, suitable for mobile |
| Model Size | Small-Medium | Typically 20-60 MB |
Download Links
Kokoro Models
Download Kokoro TTS models
Configuration Example
Basic TTS
With Length Scale
Streaming TTS
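The exact signature of `generateSpeechStream()` is not shown on this page, so the following is only a sketch of the consumer side of streaming. It assumes that each streamed chunk arrives as a `Float32Array` of mono PCM samples, which may not match the real API. `mergeChunks` and `bufferedSeconds` are hypothetical helper names.

```typescript
// Hypothetical helpers, not part of any library API.
// Assumption: each streamed chunk is a Float32Array of mono PCM samples.

// Merge the chunks received so far into one contiguous buffer.
function mergeChunks(chunks: Float32Array[]): Float32Array {
  const total = chunks.reduce((n, c) => n + c.length, 0);
  const out = new Float32Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset); // copy this chunk at the current write position
    offset += chunk.length;
  }
  return out;
}

// Seconds of audio buffered so far at the given sample rate.
function bufferedSeconds(chunks: Float32Array[], sampleRate: number): number {
  const samples = chunks.reduce((n, c) => n + c.length, 0);
  return samples / sampleRate;
}
```

A player could, for example, wait until `bufferedSeconds` passes a small threshold before starting playback, then keep appending chunks as they arrive.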
Multi-Speaker Usage
Model Options
Kokoro models support one tuning parameter:
| Option | Type | Default | Description |
|---|---|---|---|
| `lengthScale` | number | 1.0 | Speech speed. < 1.0 = faster, > 1.0 = slower |
Tuning Examples
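As a rough illustration of how `lengthScale` maps to speaking rate, the sketch below converts a speed multiplier into a `lengthScale` value. It assumes duration scales linearly with `lengthScale`, so that speed is its inverse. That relationship, the helper name `lengthScaleForSpeed`, and the clamp range are all assumptions made for illustration.

```typescript
// Hypothetical helper, not part of any library API.
// Assumption: duration scales linearly with lengthScale, so speed = 1 / lengthScale.
function lengthScaleForSpeed(speed: number, min = 0.5, max = 2.0): number {
  if (!Number.isFinite(speed) || speed <= 0) {
    throw new RangeError("speed must be a positive, finite number");
  }
  const scale = 1 / speed;
  // Clamp to a range chosen for illustration only.
  return Math.min(max, Math.max(min, scale));
}

const normal = lengthScaleForSpeed(1.0);  // default pace
const faster = lengthScaleForSpeed(1.25); // lengthScale below 1.0
const slower = lengthScaleForSpeed(0.8);  // lengthScale above 1.0
```

The resulting number would then be passed wherever the model options accept `lengthScale`.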
Runtime Updates
Model Detection
Kokoro models are detected by:
- Folder name should contain `kokoro` (not `kitten`)
- Files: `model.onnx` or `kokoro-*.onnx`, plus `tokens.txt`
Expected files in the model folder: `model.onnx` (or variant) and `tokens.txt`.
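The detection rules above can be sketched as a small predicate. This is an illustrative re-implementation, not the library's actual logic. The function name `looksLikeKokoroModel` is hypothetical, and both the case-insensitive match and the rejection of names containing `kitten` are interpretations of the rules as stated.

```typescript
// Hypothetical helper mirroring the documented detection rules.
// The library's real detection code may differ.
function looksLikeKokoroModel(folderName: string, files: string[]): boolean {
  const name = folderName.toLowerCase(); // case-insensitivity is an assumption
  // Folder name must mention "kokoro" and must not mention "kitten".
  if (!name.includes("kokoro") || name.includes("kitten")) {
    return false;
  }
  // Model file: model.onnx or any kokoro-*.onnx variant.
  const hasModel = files.some(
    (f) => f === "model.onnx" || /^kokoro-.*\.onnx$/.test(f)
  );
  // Token vocabulary must be present alongside the model.
  const hasTokens = files.includes("tokens.txt");
  return hasModel && hasTokens;
}
```

Running a check like this before loading can give a clearer error message than a failed model load.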
Performance Tips
Optimize Thread Count
Use Streaming for Responsiveness
For interactive apps, stream audio with `generateSpeechStream()` so playback can start before synthesis finishes.
Hardware Acceleration
Streaming Support
Streaming: ✅ Yes
Kokoro models have excellent streaming support. Use `generateSpeechStream()` for low-latency, incremental audio generation.
Advantages
- Multi-Language: Single model for multiple languages
- Multi-Speaker: Multiple voices in one model
- Fast Inference: Real-time capable
- Streaming: Native incremental generation
- Compact: One model instead of multiple language-specific models
- Good Quality: Natural-sounding speech
Limitations
- No Voice Cloning: Cannot synthesize custom voices from reference audio
- Fixed Voices: Limited to model’s trained speakers
- Less Tuning: Only `lengthScale` parameter available (no noise scale)
- Language Coverage: Fewer languages than some alternatives
Use Cases
Multilingual Apps
Apps serving users in multiple countries
Voice Assistants
Interactive voice interfaces with multiple voices
E-Learning
Educational content in multiple languages
Customer Service
Automated responses in different languages
Common Issues
Model not loading
- Verify folder name contains `kokoro` (not `kitten`)
- Check that `model.onnx` and `tokens.txt` are present
- Ensure sufficient device memory
Incorrect language output
- Kokoro may auto-detect language from text
- Ensure input text is in the correct language/script
- Some models may require language-specific prefixes
Slow generation
- Increase `numThreads` on multi-core devices
- Use hardware acceleration (`provider: 'nnapi'`)
- Ensure no other heavy apps are running
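The first two suggestions can be combined in a small helper, sketched below. The `TtsTuning` shape and the `tuneForSpeed` name are hypothetical; only the `numThreads` and `provider` field names come from this page. Leaving one core free is an illustrative heuristic, not a documented recommendation.

```typescript
// Hypothetical config shape; only the field names come from the options above.
interface TtsTuning {
  numThreads: number;
  provider?: string;
}

// Pick settings aimed at faster generation. The caller supplies `cores` and
// `platform`; how they are obtained is app-specific.
function tuneForSpeed(cores: number, platform: "android" | "ios"): TtsTuning {
  // Illustrative heuristic: leave one core free, but always use at least one.
  const numThreads = Math.max(1, Math.floor(cores) - 1);
  // NNAPI is an Android API, so only request it there.
  return platform === "android"
    ? { numThreads, provider: "nnapi" }
    : { numThreads };
}
```

If generation is still slow, measure with and without the accelerated provider, since the benefit depends on the device.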
Comparison with Other Models
| Feature | Kokoro | VITS | Matcha | KittenTTS |
|---|---|---|---|---|
| Speed | Fast | Very Fast | Fast | Very Fast |
| Quality | High | High | Very High | Good |
| Streaming | Yes | Yes | Yes | Yes |
| Multi-Language | Yes | Varies | Limited | Limited |
| Multi-Speaker | Yes | Yes | Yes | Yes |
| Voice Cloning | No | No | No | No |
| Model Size | Small-Medium | Small | Medium | Small |
Next Steps
TTS API
Detailed API documentation
Streaming TTS
Low-latency streaming guide
Model Setup
How to download and bundle models
Execution Providers
Hardware acceleration options