Overview
The TTS module enables high-quality speech synthesis from text. It can generate complete audio buffers, adjust voice parameters, and save the result to files. It supports multiple model architectures, some of which offer voice cloning.
Quick Start
Supported Model Types
| Model Type | Description | Features |
|---|---|---|
| `vits` | VITS (Piper) | Multi-speaker, noise/length control |
| `matcha` | Matcha-TTS | Fast, flow-matching |
| `kokoro` | Kokoro | Length scale control |
| `kitten` | Kitten | Compact model |
| `pocket` | Pocket TTS | Voice cloning, temperature control |
| `zipvoice` | ZipVoice | Zero-shot voice cloning |
Use `modelType: 'auto'` for automatic detection.
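The feature column of the table above can be encoded as a lookup so that calling code can check for voice cloning before passing reference audio. This is a minimal sketch: `TtsModelType`, `VOICE_CLONING`, and `supportsVoiceCloning` are hypothetical names introduced for illustration, not exports of the module.

```typescript
// Hypothetical union mirroring the "Model Type" column of the table.
type TtsModelType = 'vits' | 'matcha' | 'kokoro' | 'kitten' | 'pocket' | 'zipvoice';

// Which model types the table lists as supporting voice cloning.
// Using Record<TtsModelType, boolean> makes the compiler flag any missing entry.
const VOICE_CLONING: Record<TtsModelType, boolean> = {
  vits: false,
  matcha: false,
  kokoro: false,
  kitten: false,
  pocket: true,   // "Voice cloning, temperature control"
  zipvoice: true, // "Zero-shot voice cloning"
};

// Returns true when the given model type is listed with voice cloning support.
function supportsVoiceCloning(modelType: TtsModelType): boolean {
  return VOICE_CLONING[modelType];
}
```

A caller might use this to reject `referenceAudio` early for models such as `vits` instead of waiting for a failure at generation time.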
Generate Speech
Basic Generation
With Options
Generation Options
| Option | Type | Description |
|---|---|---|
| `sid` | `number` | Speaker ID for multi-speaker models (default: 0) |
| `speed` | `number` | Speed multiplier (default: 1.0) |
| `silenceScale` | `number` | Silence scale |
| `referenceAudio` | `{ samples, sampleRate }` | For voice cloning |
| `referenceText` | `string` | Transcript of reference audio |
| `numSteps` | `number` | Flow-matching steps (model-dependent) |
| `extra` | `Record<string, string>` | Model-specific options |
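To make the option shapes and defaults concrete, the table can be written out as a TypeScript interface with a helper that fills in the documented defaults. This is a sketch under assumptions: the field names and the defaults for `sid` and `speed` come from the table, but the interface names, the `ArrayLike<number>` sample type, the validation rules, and `resolveOptions` itself are hypothetical and not the module's actual exports.

```typescript
// Assumed shape of the reference audio object; the table only names its two fields.
interface ReferenceAudio {
  samples: ArrayLike<number>;
  sampleRate: number;
}

// Hypothetical interface mirroring the options table.
interface GenerationOptions {
  sid?: number;
  speed?: number;
  silenceScale?: number;
  referenceAudio?: ReferenceAudio;
  referenceText?: string;
  numSteps?: number;
  extra?: Record<string, string>;
}

type ResolvedOptions = GenerationOptions & { sid: number; speed: number };

// Applies the documented defaults (sid = 0, speed = 1.0).
// The range checks are assumptions added for illustration.
function resolveOptions(options: GenerationOptions = {}): ResolvedOptions {
  const sid = options.sid ?? 0;
  const speed = options.speed ?? 1.0;
  if (!Number.isInteger(sid) || sid < 0) {
    throw new RangeError('sid must be a non-negative integer');
  }
  if (!(speed > 0)) {
    throw new RangeError('speed must be greater than 0');
  }
  return { ...options, sid, speed };
}
```

For example, `resolveOptions({ speed: 1.25 })` keeps the custom speed and fills in `sid: 0`.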
Generate with Timestamps
Get word/phoneme timing information:
Model-Specific Configuration
VITS
Control voice characteristics:
Kokoro
Matcha
Kitten
Update Parameters at Runtime
Change voice parameters without reloading the model:
Voice Cloning
Clone a voice using reference audio (Pocket, ZipVoice models):
Pocket TTS Extra Options
Multi-Speaker Models
Save Audio to File
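For orientation, writing generated audio as WAV amounts to a 44-byte header followed by PCM samples. The sketch below is not the module's own save function. It assumes the generated audio is mono `Float32Array` data in the range [-1, 1] and encodes it as 16-bit PCM; `encodeWavPcm16` is a hypothetical helper name.

```typescript
// Encodes mono float samples as a 16-bit PCM WAV file held in memory.
function encodeWavPcm16(samples: Float32Array, sampleRate: number): Uint8Array {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, 'RIFF');
  view.setUint32(4, 36 + dataSize, true);                // RIFF chunk size
  writeAscii(8, 'WAVE');
  writeAscii(12, 'fmt ');
  view.setUint32(16, 16, true);                          // fmt chunk size
  view.setUint16(20, 1, true);                           // audio format: PCM
  view.setUint16(22, 1, true);                           // channels: mono (assumed)
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true);              // block align
  view.setUint16(34, 16, true);                          // bits per sample
  writeAscii(36, 'data');
  view.setUint32(40, dataSize, true);

  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1] so out-of-range values do not wrap around.
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(44 + i * bytesPerSample, Math.round(clamped * 32767), true);
  }
  return new Uint8Array(buffer);
}
```

The resulting bytes can be handed to whatever file-writing facility the platform provides.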
Standard File Path
Android SAF (Storage Access Framework)
Save to user-selected directories:
Copy to Cache
Audio Format Conversion
Convert WAV to other formats:
Get Model Information
Advanced Configuration
Text Normalization
Config-Level Options
ZipVoice Models
Full vs Distill
- Full ZipVoice: Encoder + decoder + vocoder (e.g., `vocos_24khz.onnx`)
  - Required for initialization
  - ~605 MB compressed (fp32)
  - Needs ~8 GB RAM
- ZipVoice Distill: Encoder + decoder only (no vocoder)
  - Will fail initialization (vocoder required)
  - Use full model or int8 variant instead
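The constraints above can be sketched as a small selection helper: the full fp32 model when the device has roughly 8 GB of RAM or more, the int8 variant otherwise, and never the distill model because it fails initialization. The names `ZipVoiceVariant` and `pickZipVoiceVariant` are hypothetical, and how the device RAM is obtained is left to the caller.

```typescript
// Hypothetical labels for the two usable variants; distill is excluded
// because it fails initialization without a vocoder.
type ZipVoiceVariant = 'full-fp32' | 'int8';

// From the list above: the full model needs about 8 GB of RAM.
const FULL_MODEL_RAM_GB = 8;

// Picks a variant from the device's total RAM in gigabytes.
function pickZipVoiceVariant(deviceRamGb: number): ZipVoiceVariant {
  return deviceRamGb >= FULL_MODEL_RAM_GB ? 'full-fp32' : 'int8';
}
```

Since the 8 GB figure is approximate, a more conservative threshold may be safer on devices that run other memory-heavy work at the same time.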
Memory Requirements
For devices with less than 8 GB RAM, use the int8 quantized variant:
Best Practices
Memory Management
Resampling for Playback
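When the model's output sample rate differs from the rate the playback path expects, the samples need to be converted first. The following is a minimal linear-interpolation sketch, assuming mono `Float32Array` audio; `resampleLinear` is a hypothetical helper and not a function of the module.

```typescript
// Resamples mono audio by linear interpolation between neighbouring samples.
function resampleLinear(input: Float32Array, fromRate: number, toRate: number): Float32Array {
  if (fromRate <= 0 || toRate <= 0) {
    throw new RangeError('sample rates must be positive');
  }
  if (input.length === 0 || fromRate === toRate) {
    return input.slice(); // nothing to convert; return a copy
  }
  const outLength = Math.round((input.length * toRate) / fromRate);
  const output = new Float32Array(outLength);
  const step = fromRate / toRate; // input samples advanced per output sample
  const last = input.length - 1;
  for (let i = 0; i < outLength; i++) {
    const pos = Math.min(i * step, last); // position in the input, clamped to the end
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, last);
    const frac = pos - i0;
    output[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return output;
}

// One second of audio at 22050 Hz becomes one second at 48000 Hz.
const upsampled = resampleLinear(new Float32Array(22050), 22050, 48000);
console.log(upsampled.length);
```

Linear interpolation is simple and usually adequate for upsampling speech. When downsampling, a low-pass filter before resampling is advisable to avoid aliasing.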
If the model outputs 22050 Hz but playback expects 48000 Hz, resample the audio before playing it.
Performance Tips
- Threading: Increase `numThreads` on multi-core devices
- Quantization: Use int8 models for faster generation
- Batch processing: Reuse engine for multiple generations
- Pre-warm: Generate a short sample at startup to avoid first-use latency
Error Handling
Complete Example
Next Steps
- Streaming TTS: Low-latency incremental speech generation
- Model Setup: Download and configure TTS models