Overview
The TTS (Text-to-Speech) module provides offline speech synthesis using various model types. Generate complete audio from text, with support for multiple speakers, adjustable speed, and model-specific parameters.
Key features:
- Multiple model types (VITS, Matcha, Kokoro, Kitten, Pocket, Zipvoice)
- Multi-speaker support (speaker selection by ID)
- Adjustable speech speed
- Voice cloning with reference audio (Pocket, Zipvoice)
- Timestamp generation
- Save to WAV files or play directly
Quick Start
Supported Model Types
| Model Type | Description | Config Options |
|---|---|---|
| vits | VITS models (Piper, Coqui, MeloTTS, MMS) | noiseScale, noiseScaleW, lengthScale |
| matcha | Matcha models | noiseScale, lengthScale |
| kokoro | Kokoro (multi-speaker, multi-language) | lengthScale |
| kitten | KittenTTS (lightweight) | lengthScale |
| pocket | Pocket TTS (voice cloning) | Voice cloning via referenceAudio |
| zipvoice | Zipvoice (voice cloning) | Voice cloning via referenceAudio |
| auto | Auto-detect from files | n/a |
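Because each model type accepts its own set of tuning keys, it can help to mirror the table in a small type and read only the block that matches the model in use. The following is a minimal sketch under that assumption: `ModelOptionsSketch`, `CONFIG_KEYS`, and `pickModelOptions` are hypothetical names for illustration and are not part of the module; only the key names come from the table.

```typescript
// Hypothetical helper types; the key names are taken from the table above.
type TunableModelType = 'vits' | 'matcha' | 'kokoro' | 'kitten';

interface ModelOptionsSketch {
  vits?: { noiseScale?: number; noiseScaleW?: number; lengthScale?: number };
  matcha?: { noiseScale?: number; lengthScale?: number };
  kokoro?: { lengthScale?: number };
  kitten?: { lengthScale?: number };
}

// Config keys each model type accepts, as listed in the table.
const CONFIG_KEYS: Record<TunableModelType, readonly string[]> = {
  vits: ['noiseScale', 'noiseScaleW', 'lengthScale'],
  matcha: ['noiseScale', 'lengthScale'],
  kokoro: ['lengthScale'],
  kitten: ['lengthScale'],
};

// Return only the recognised numeric keys from the block matching modelType.
function pickModelOptions(
  modelType: TunableModelType,
  options: ModelOptionsSketch,
): Record<string, number> {
  const block = (options[modelType] ?? {}) as Record<string, number | undefined>;
  const picked: Record<string, number> = {};
  for (const key of CONFIG_KEYS[modelType]) {
    const value = block[key];
    if (typeof value === 'number') picked[key] = value;
  }
  return picked;
}
```

Blocks for other model types are ignored, so a shared options object can be kept for several models.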
API Reference
createTTS(options)
Creates a TTS engine for batch (one-shot) speech generation. Source file: src/tts/index.ts
Options:
- Model directory path. Use { type: 'asset', path: 'models/...' } for bundled assets.
- Model type: 'vits', 'matcha', 'kokoro', 'kitten', 'pocket', 'zipvoice', or 'auto'.
- Number of threads for inference. More threads generally make synthesis faster at the cost of higher CPU usage.
- Execution provider (e.g., 'cpu', 'coreml', 'xnnpack'). See Execution Providers.
- Enable debug logging.
- Model-specific configuration. Only the block for the loaded model type is applied:
  - vits: { noiseScale, noiseScaleW, lengthScale }
  - matcha: { noiseScale, lengthScale }
  - kokoro: { lengthScale }
  - kitten: { lengthScale }
- Path(s) to rule FSTs for text normalization.
- Path(s) to rule FARs for text normalization.
- Max sentences per streaming callback.
- Silence scale set at the config level.
TtsEngine: generateSpeech(text, options?)
Generate speech audio from text. Returns a GeneratedAudio object.
Options:
- Speaker ID for multi-speaker models. Use getNumSpeakers() to check available speakers.
- Speech speed multiplier: 1.0 = normal speed, 0.5 = half speed (slower), 2.0 = double speed (faster).
- Silence scale at generation time (model-dependent).
- Reference audio for voice cloning (Pocket, Zipvoice). Mono float samples in [-1, 1].
- Transcript of reference audio (required when using referenceAudio).
- Flow-matching steps (Pocket TTS).
- Model-specific options (e.g., Pocket: { temperature: '0.7', chunk_size: '15' }).
TtsEngine: generateSpeechWithTimestamps(text, options?)
Generate speech with word-level timestamps.
TtsEngine: updateParams(options)
Update model parameters at runtime without reloading.
TtsEngine: getModelInfo()
Get model information (sample rate and number of speakers).
TtsEngine: getSampleRate()
Get the model’s sample rate.
TtsEngine: getNumSpeakers()
Get the number of available speakers.
TtsEngine: destroy()
Release native resources. Must be called when done.
Saving Audio
Save to File
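The module's own save helper is not shown on this page, so the following is only an illustration of what writing a WAV file involves, not the module's implementation. It assumes the generated audio is mono float samples in [-1, 1] (as described for reference audio) together with a sample rate, and encodes them as 16-bit PCM. `encodeWav` is a hypothetical name.

```typescript
// Encode mono float samples in [-1, 1] as a 16-bit PCM WAV byte array.
function encodeWav(samples: ArrayLike<number>, sampleRate: number): Uint8Array {
  const dataBytes = samples.length * 2; // 2 bytes per 16-bit sample
  const buffer = new ArrayBuffer(44 + dataBytes);
  const view = new DataView(buffer);
  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, 'RIFF');
  view.setUint32(4, 36 + dataBytes, true); // file size minus the first 8 bytes
  writeAscii(8, 'WAVE');
  writeAscii(12, 'fmt ');
  view.setUint32(16, 16, true); // size of the fmt chunk
  view.setUint16(20, 1, true); // audio format 1 = PCM
  view.setUint16(22, 1, true); // channel count: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true); // byte rate = rate * channels * 2
  view.setUint16(32, 2, true); // block align = channels * 2
  view.setUint16(34, 16, true); // bits per sample
  writeAscii(36, 'data');
  view.setUint32(40, dataBytes, true);

  for (let i = 0; i < samples.length; i++) {
    // Clamp before scaling so out-of-range values do not wrap around.
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(44 + i * 2, Math.round(clamped * 32767), true);
  }
  return new Uint8Array(buffer);
}
```

The resulting bytes can be handed to whatever file API the app already uses. Where the module provides its own save function, prefer that.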
Android: Save via SAF (Storage Access Framework)
Share Audio File
Model-Specific Configuration
VITS Models
VITS models support three tuning parameters: noiseScale, noiseScaleW, and lengthScale.
Matcha Models
Kokoro Models
Voice Cloning
Pocket and Zipvoice models support voice cloning via reference audio.
Pocket TTS (Voice Cloning)
Zipvoice (Voice Cloning)
Zipvoice Memory Requirements: The full fp32 Zipvoice model (~605 MB) requires significant RAM. On devices with less than 8 GB RAM, use the int8 distill variant (sherpa-onnx-zipvoice-distill-int8-zh-en-emilia, ~104 MB) to avoid crashes. The SDK checks free memory before loading and rejects initialization if below ~800 MB.
Multi-Speaker Models
Some models include multiple speakers (voices).
Model Detection
Detect the TTS model type without initializing an engine.
Performance Optimization
Threading
Hardware Acceleration
Speed Control
Adjust speech speed at generation time with the speed parameter.
Common Use Cases
Generate and Play
Batch Generation
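A batch run can reuse one engine for several texts and release it afterwards. The sketch below assumes an engine whose generateSpeech(text) resolves to an object with `samples` and `sampleRate`. That shape, along with the names `TtsEngineLike`, `GeneratedAudioLike`, and `generateBatch`, is an assumption for illustration; the real types may differ.

```typescript
// Assumed shapes: the real GeneratedAudio and engine types may differ.
interface GeneratedAudioLike {
  samples: ArrayLike<number>;
  sampleRate: number;
}

interface TtsEngineLike {
  generateSpeech(text: string): Promise<GeneratedAudioLike>;
  destroy(): Promise<void> | void;
}

interface BatchItem {
  text: string;
  audio: GeneratedAudioLike;
  durationSeconds: number;
}

// Generate audio for each text in order with a single engine, then
// release the engine even if one of the calls throws.
async function generateBatch(
  engine: TtsEngineLike,
  texts: string[],
): Promise<BatchItem[]> {
  const results: BatchItem[] = [];
  try {
    for (const text of texts) {
      const audio = await engine.generateSpeech(text);
      results.push({
        text,
        audio,
        durationSeconds: audio.samples.length / audio.sampleRate,
      });
    }
    return results;
  } finally {
    await engine.destroy();
  }
}
```

Here the helper owns the engine and destroys it at the end. If the engine is shared with other code, drop the `finally` block and call destroy() where the engine is created.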
Dynamic Speaker Selection
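When the speaker is chosen at run time (for example from a list index or a per-user setting), the value should be mapped into the range reported by getNumSpeakers(). The sketch below assumes speaker IDs are zero-based and contiguous, which this page does not state explicitly. `pickSpeakerId` is a hypothetical helper.

```typescript
// Map any integer onto a valid speaker ID in [0, numSpeakers).
// Assumes zero-based, contiguous speaker IDs.
function pickSpeakerId(requested: number, numSpeakers: number): number {
  if (!Number.isInteger(requested)) {
    throw new RangeError('requested must be an integer');
  }
  // Treat a missing or zero speaker count as a single-speaker model.
  if (!Number.isInteger(numSpeakers) || numSpeakers < 1) {
    return 0;
  }
  // Double modulo keeps the result non-negative for negative input.
  return ((requested % numSpeakers) + numSpeakers) % numSpeakers;
}
```

The result can then be passed as the speaker ID option of generateSpeech().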
Troubleshooting
Error: TTS initialization failed
- Verify model directory exists and contains required files
- For VITS: need model.onnx, tokens.txt, espeak-ng-data (some models)
- For Zipvoice: need encoder, decoder, vocoder, tokens, lexicon, espeak-ng-data
- Try modelType: 'auto' for automatic detection
- Enable debug: true for detailed logs
Out of memory with Zipvoice
The full Zipvoice model (~605 MB) requires significant RAM:
- Use the int8 distill variant: sherpa-onnx-zipvoice-distill-int8-zh-en-emilia (~104 MB)
- Close other apps to free memory
- Target devices with 8+ GB RAM for the full model
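The guidance above can be applied before loading a model. The following is a minimal sketch under stated assumptions: the caller obtains total and free RAM in bytes from some device-info source (not shown), `'full-fp32'` is a placeholder label rather than a real model directory name, and the 8 GB and ~800 MB thresholds are the approximate figures quoted on this page.

```typescript
const MIB = 1024 * 1024;
const GIB = 1024 * MIB;

// 'full-fp32' is a placeholder label, not an actual model name.
type ZipvoiceVariant =
  | 'full-fp32'
  | 'sherpa-onnx-zipvoice-distill-int8-zh-en-emilia';

// Prefer the int8 distill variant on devices with less than 8 GB of RAM.
function chooseZipvoiceVariant(totalRamBytes: number): ZipvoiceVariant {
  return totalRamBytes < 8 * GIB
    ? 'sherpa-onnx-zipvoice-distill-int8-zh-en-emilia'
    : 'full-fp32';
}

// Rough pre-check mirroring the ~800 MB free-memory threshold quoted above.
// The SDK performs its own check; this only lets the app warn earlier.
function hasEnoughFreeMemory(freeRamBytes: number): boolean {
  return freeRamBytes >= 800 * MIB;
}
```

A pre-check like this does not replace the SDK's own memory check, so initialization errors still need to be handled.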
Audio sounds robotic or poor quality
- Adjust noiseScale (VITS/Matcha): try 0.667-1.0
- Adjust lengthScale: values close to 1.0 are more natural
- Try a larger/better model
- Increase numSteps for flow-matching models (Pocket)
Speech too fast or too slow
Use the speed parameter at generation time, or adjust lengthScale in model options (permanent).
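The two settings pull in opposite directions. The sketch below assumes the usual VITS convention, in which lengthScale stretches durations so that larger values slow speech down and lengthScale is roughly 1 / speed. This page does not state the relationship, so treat it as an assumption and verify it by ear. `lengthScaleForSpeed` is a hypothetical helper.

```typescript
// Convert a speed multiplier into the equivalent lengthScale,
// assuming lengthScale = 1 / speed (the usual VITS convention).
function lengthScaleForSpeed(speed: number): number {
  if (!isFinite(speed) || speed <= 0) {
    throw new RangeError('speed must be a positive, finite number');
  }
  return 1 / speed;
}

// Double speed corresponds to lengthScale 0.5; half speed to lengthScale 2.
const fasterLengthScale = lengthScaleForSpeed(2.0);
const slowerLengthScale = lengthScaleForSpeed(0.5);
```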
Voice cloning not working
- Ensure model supports voice cloning (Pocket, Zipvoice)
- Reference audio should be 3-10 seconds, clear, mono
- Provide accurate referenceText transcript
- For Zipvoice, use generateSpeech() not streaming
- Increase numSteps for better quality
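Several of these checks can be run on the reference clip before it is passed to the engine. The following is a minimal sketch: `validateReferenceAudio` is a hypothetical helper, and it only tests what can be derived from a flat sample array (duration, value range, non-empty transcript). Whether the clip is truly mono and clearly spoken still has to be ensured when recording or decoding it.

```typescript
// Return a list of problems with a voice-cloning reference clip.
// An empty list means the basic checks passed.
function validateReferenceAudio(
  samples: ArrayLike<number>,
  sampleRate: number,
  referenceText: string,
): string[] {
  const problems: string[] = [];
  if (!(sampleRate > 0)) {
    problems.push('sample rate must be a positive number');
    return problems;
  }

  // Recommended clip length from the list above: 3-10 seconds.
  const seconds = samples.length / sampleRate;
  if (seconds < 3 || seconds > 10) {
    problems.push(`duration ${seconds.toFixed(1)} s is outside 3-10 s`);
  }

  // Samples are expected as floats in [-1, 1]; NaN also counts as invalid.
  let outOfRange = 0;
  for (let i = 0; i < samples.length; i++) {
    const value = samples[i];
    if (!(value >= -1 && value <= 1)) outOfRange++;
  }
  if (outOfRange > 0) {
    problems.push(`${outOfRange} sample(s) fall outside [-1, 1]`);
  }

  if (referenceText.trim().length === 0) {
    problems.push('referenceText is empty');
  }
  return problems;
}
```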
Next Steps
- Streaming TTS: Low-latency streaming generation
- Model Setup: Learn how to bundle and load models
- Speech-to-Text: Transcribe audio to text
- Execution Providers: Hardware acceleration options