Text-to-Speech Models Overview
react-native-sherpa-onnx supports multiple TTS model architectures, from fast VITS models to high-quality voice cloning with Zipvoice. This guide helps you choose the right model for your application.
Model Comparison
VITS Models Fast, high-quality TTS from Piper, Coqui, MeloTTS, and MMS
Matcha Models High-quality acoustic model with vocoder for natural speech
Kokoro Models Multi-speaker, multi-language TTS models
Other Models KittenTTS, Zipvoice (voice cloning), and Pocket (flow-matching)
Quick Comparison Table
| Model Type | Streaming | Multi-Speaker | Voice Cloning | Speed | Quality |
| --- | --- | --- | --- | --- | --- |
| VITS | ✅ Yes | ✅ Yes | ❌ No | Very Fast | High |
| Matcha | ✅ Yes | ✅ Yes | ❌ No | Fast | Very High |
| Kokoro | ✅ Yes | ✅ Yes | ❌ No | Fast | High |
| KittenTTS | ✅ Yes | ✅ Yes | ❌ No | Very Fast | Good |
| Zipvoice | ❌ No | ✅ Yes | ✅ Yes | Medium | Very High |
| Pocket | ✅ Yes | ✅ Yes | ✅ Yes | Fast | High |
Choosing a Model
For Fast, Real-Time TTS
If you need low latency and streaming playback:
VITS (Piper) – Fastest, excellent quality, many voices
KittenTTS – Lightweight, fast, multi-speaker
Kokoro – Fast with multi-language support
Pocket – Flow-matching with streaming and voice cloning
For Voice Cloning
If you need to clone voices from reference audio:
Zipvoice – High-quality zero-shot voice cloning (encoder + decoder + vocoder)
Pocket – Flow-matching TTS with reference audio support
For High Quality
If naturalness is your priority:
Matcha – High-quality acoustic model + vocoder
Zipvoice – Excellent quality with voice cloning
VITS – Great balance of speed and quality
By Language Support
English:
VITS (Piper) – Many voices
Matcha
Kokoro
KittenTTS
Multilingual:
Kokoro (multi-language)
MeloTTS (subset of VITS)
Zipvoice (Chinese + English)
Chinese:
Zipvoice (excellent for Chinese)
VITS variants
By Device Constraints
Low-end devices / limited RAM:
VITS (small, fast)
KittenTTS (lightweight)
Use int8 quantized variants
High-end devices:
Matcha (high quality)
Zipvoice (voice cloning, but needs memory)
Pocket (flow-matching)
Zipvoice Memory Requirements: Full Zipvoice models (~605 MB) require significant RAM. On devices with less than 8 GB of RAM, use the int8 distill variant (`sherpa-onnx-zipvoice-distill-int8-zh-en-emilia`, ~104 MB) instead.
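One way to act on this note is to pick the model variant from the device's reported memory before loading. The helper below is an illustrative sketch, not part of the SDK: the `totalRamGb` value is assumed to come from a device-info library, and the full-model path follows the `models/zipvoice-zh-en` convention used elsewhere in this guide.

```typescript
// Choose a Zipvoice asset path based on available RAM.
// Full model (~605 MB) needs headroom; the ~104 MB int8
// distill variant is safer on devices under 8 GB of RAM.
function pickZipvoiceModel(totalRamGb: number): string {
  return totalRamGb >= 8
    ? 'models/zipvoice-zh-en'
    : 'models/sherpa-onnx-zipvoice-distill-int8-zh-en-emilia';
}
```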
Model Detection
The SDK automatically detects the TTS model type from the files present in the model directory:
```typescript
import { createTTS, detectTtsModel } from 'react-native-sherpa-onnx/tts';

// Auto-detect model type
const detectedInfo = await detectTtsModel({
  type: 'asset',
  path: 'models/vits-piper-en'
});
console.log(detectedInfo.modelType); // 'vits'

// Create TTS with auto-detection
const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/vits-piper-en' },
  modelType: 'auto', // Auto-detect
});
```
Use Streaming TTS
For low latency, use streaming generation:
```typescript
const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/vits-piper-en' },
  modelType: 'vits',
});

const sampleRate = await tts.getSampleRate();
await tts.startPcmPlayer(sampleRate, 1);

await tts.generateSpeechStream('Hello world', { sid: 0, speed: 1.0 }, {
  onChunk: async (chunk) => {
    await tts.writePcmChunk(chunk.samples); // Immediate playback
  },
  onEnd: async () => {
    await tts.stopPcmPlayer();
  },
});
```
See the Streaming TTS Guide for more details.
Optimize Thread Count
```typescript
const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/vits-piper' },
  numThreads: 4, // More threads = faster generation
});
```
Use Hardware Acceleration
```typescript
const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/vits-piper' },
  provider: 'nnapi', // Android NNAPI
  // provider: 'xnnpack', // XNNPACK
});
```
See the Execution Providers guide for more details.
Tune Model Parameters
Adjust model-specific parameters for better quality or speed:
```typescript
// VITS: noise and length scale
const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/vits-piper' },
  modelType: 'vits',
  modelOptions: {
    vits: {
      noiseScale: 0.667, // Lower = clearer (less variation)
      noiseScaleW: 0.8,  // Duration noise
      lengthScale: 1.0,  // Speech speed (< 1.0 = faster)
    }
  },
});

// Kokoro: length scale only
const ttsKokoro = await createTTS({
  modelPath: { type: 'asset', path: 'models/kokoro' },
  modelType: 'kokoro',
  modelOptions: {
    kokoro: { lengthScale: 1.2 } // Slower speech
  },
});
```
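Note that `lengthScale` and a playback-speed multiplier are inverses: durations are multiplied by `lengthScale`, so faster speech means a smaller scale. A tiny helper (illustrative, not part of the SDK) keeps the two consistent:

```typescript
// Convert a speed multiplier (2.0 = twice as fast) into the
// lengthScale expected by VITS/Kokoro model options.
function speedToLengthScale(speed: number): number {
  if (speed <= 0) throw new Error('speed must be positive');
  return 1 / speed;
}
```

For example, `speedToLengthScale(2.0)` yields `0.5`, i.e. durations halved.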
Streaming vs Batch Generation
Batch Generation
Generate the entire audio buffer at once:
```typescript
import { saveAudioToFile } from 'react-native-sherpa-onnx/tts';

const audio = await tts.generateSpeech('Hello world', { sid: 0, speed: 1.0 });
console.log('Sample rate:', audio.sampleRate);
console.log('Samples:', audio.samples.length);

// Save to file
await saveAudioToFile(audio, '/path/to/output.wav');
```
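The clip's duration follows directly from these two fields. A small helper, assuming the audio shape shown above (`samples` plus `sampleRate`):

```typescript
// Duration in seconds = number of samples / samples per second.
function audioDurationSeconds(audio: {
  samples: Float32Array;
  sampleRate: number;
}): number {
  return audio.samples.length / audio.sampleRate;
}
```

For instance, 44,100 samples at 22,050 Hz is a 2-second clip.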
Streaming Generation
Receive incremental chunks for low-latency playback:
```typescript
await tts.generateSpeechStream('Hello world', { sid: 0, speed: 1.0 }, {
  onChunk: (chunk) => {
    // Play chunk.samples immediately
    console.log('Chunk:', chunk.samples.length, 'samples');
  },
  onEnd: () => {
    console.log('Generation complete');
  },
});
```
Streaming is recommended for:
Interactive voice applications
Long text generation
Low time-to-first-byte
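You can also get the best of both modes: stream for latency, and collect the chunks if you need the complete buffer afterwards (e.g. to save a file). A sketch, assuming each chunk carries a `Float32Array` of samples as in the callbacks above:

```typescript
// Concatenate streamed sample chunks into one contiguous buffer.
function concatChunks(chunks: Float32Array[]): Float32Array {
  const total = chunks.reduce((n, c) => n + c.length, 0);
  const out = new Float32Array(total);
  let offset = 0;
  for (const c of chunks) {
    out.set(c, offset);
    offset += c.length;
  }
  return out;
}
```

Push `chunk.samples` into an array inside `onChunk`, then call `concatChunks` in `onEnd`.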
Download Links
All TTS model downloads are available from:
TTS Models Repository Download VITS, Kokoro, KittenTTS, and Pocket models
Voice Cloning
For applications that need to synthesize speech in a custom voice , use models that support reference audio:
Zipvoice (Full Voice Cloning)
Best quality, requires full model (encoder + decoder + vocoder):
```typescript
const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/zipvoice-zh-en' },
  modelType: 'zipvoice',
});

const audio = await tts.generateSpeech('Target text to speak', {
  referenceAudio: { samples: refSamples, sampleRate: 22050 },
  referenceText: 'Transcript of the reference recording',
  speed: 1.0,
});
```
Zipvoice Distill: Models that contain only an encoder + decoder (no vocoder) will fail during initialization. Use full Zipvoice models with a vocoder file (e.g. `vocos_24khz.onnx`).
Pocket (Flow-Matching with Reference Audio)
```typescript
const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/pocket' },
  modelType: 'pocket',
});

const audio = await tts.generateSpeech('Target text', {
  referenceAudio: { samples: refSamples, sampleRate: 22050 },
  referenceText: 'Reference transcript',
  numSteps: 20,
  extra: { temperature: '0.7' },
});
```
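In both examples, `refSamples` must be float samples in [-1, 1]. If your reference recording is 16-bit PCM, a conversion like the following works (a sketch, assuming the recording has already been decoded into an `Int16Array`):

```typescript
// Scale 16-bit PCM values (-32768..32767) into floats in [-1, 1].
function pcm16ToFloat32(pcm: Int16Array): Float32Array {
  const out = new Float32Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) {
    out[i] = pcm[i] / 32768;
  }
  return out;
}
```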
See the TTS API Reference for more details on voice cloning.
Common Use Cases
Voice Assistants Use VITS or KittenTTS for fast, interactive responses
Audiobook Narration Use Matcha or Zipvoice for high-quality, natural speech
Real-Time Translation Use streaming TTS (VITS, Kokoro) for low latency
Custom Voice Apps Use Zipvoice or Pocket for voice cloning
E-Learning Use VITS (Piper) for clear, consistent narration
Accessibility Use fast streaming TTS for screen readers
Multi-Speaker Models
Many TTS models support multiple speakers (voices). Use the `sid` (speaker ID) parameter:
```typescript
const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/vits-piper-multi' },
  modelType: 'vits',
});

const numSpeakers = await tts.getNumSpeakers();
console.log('Available speakers:', numSpeakers);

// Generate with different speakers
const audio1 = await tts.generateSpeech('Hello', { sid: 0 });
const audio2 = await tts.generateSpeech('Hello', { sid: 1 });
```
Sample Rate Handling
Different models output different sample rates (typically 16000, 22050, or 24000 Hz). Always check the model’s sample rate:
```typescript
const tts = await createTTS({ ... });
const sampleRate = await tts.getSampleRate();
console.log('Model sample rate:', sampleRate);

// Use this for playback
await tts.startPcmPlayer(sampleRate, 1); // mono
```
If you need a specific sample rate for your playback system, resample the audio using the Audio Conversion API .
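If the SDK's conversion helpers don't fit your pipeline, a linear-interpolation resampler can bridge sample rates in a pinch. This is a sketch only; for production audio quality, prefer a proper polyphase or windowed-sinc resampler:

```typescript
// Resample mono float samples from one rate to another by
// linearly interpolating between neighboring input samples.
function resampleLinear(
  input: Float32Array,
  fromRate: number,
  toRate: number
): Float32Array {
  if (fromRate === toRate) return input;
  const outLength = Math.round(input.length * (toRate / fromRate));
  const out = new Float32Array(outLength);
  const ratio = fromRate / toRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * ratio;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1);
    const frac = pos - i0;
    out[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return out;
}
```

For example, resampling a 22,050 Hz model's output to 44,100 Hz doubles the sample count.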
Next Steps
TTS API Reference Detailed API documentation
Streaming TTS Low-latency streaming generation
Model Setup How to download and bundle models
Execution Providers Hardware acceleration options