
Text-to-Speech Models Overview

react-native-sherpa-onnx supports multiple TTS model architectures, from fast VITS models to high-quality voice cloning with Zipvoice. This guide helps you choose the right model for your application.

Model Comparison

VITS Models

Fast, high-quality TTS from Piper, Coqui, MeloTTS, and MMS

Matcha Models

High-quality acoustic model with vocoder for natural speech

Kokoro Models

Multi-speaker, multi-language TTS models

Other Models

KittenTTS, Zipvoice (voice cloning), and Pocket (flow-matching)

Quick Comparison Table

| Model Type | Streaming | Multi-Speaker | Voice Cloning | Speed     | Quality   |
|------------|-----------|---------------|---------------|-----------|-----------|
| VITS       | ✅ Yes    | ✅ Yes        | ❌ No         | Very Fast | High      |
| Matcha     | ✅ Yes    | ✅ Yes        | ❌ No         | Fast      | Very High |
| Kokoro     | ✅ Yes    | ✅ Yes        | ❌ No         | Fast      | High      |
| KittenTTS  | ✅ Yes    | ✅ Yes        | ❌ No         | Very Fast | Good      |
| Zipvoice   | ❌ No     | ✅ Yes        | ✅ Yes        | Medium    | Very High |
| Pocket     | ✅ Yes    | ✅ Yes        | ✅ Yes        | Fast      | High      |

Choosing a Model

For Fast, Real-Time TTS

If you need low latency and streaming playback:
  • VITS (Piper) – Fastest, excellent quality, many voices
  • KittenTTS – Lightweight, fast, multi-speaker
  • Kokoro – Fast with multi-language support
  • Pocket – Flow-matching with streaming and voice cloning

For Voice Cloning

If you need to clone voices from reference audio:
  • Zipvoice – High-quality zero-shot voice cloning (encoder + decoder + vocoder)
  • Pocket – Flow-matching TTS with reference audio support

For High Quality

If naturalness is your priority:
  • Matcha – High-quality acoustic model + vocoder
  • Zipvoice – Excellent quality with voice cloning
  • VITS – Great balance of speed and quality

By Language Support

English:
  • VITS (Piper) – Many voices
  • Matcha
  • Kokoro
  • KittenTTS
Multilingual:
  • Kokoro (multi-language)
  • MeloTTS (subset of VITS)
  • Zipvoice (Chinese + English)
Chinese:
  • Zipvoice (excellent for Chinese)
  • VITS variants

By Device Constraints

Low-end devices / limited RAM:
  • VITS (small, fast)
  • KittenTTS (lightweight)
  • Use int8 quantized variants
High-end devices:
  • Matcha (high quality)
  • Zipvoice (voice cloning, but needs memory)
  • Pocket (flow-matching)
Zipvoice Memory Requirements: Full Zipvoice models (~605 MB) require significant RAM. On devices with less than 8 GB RAM, use the int8 distill variant (sherpa-onnx-zipvoice-distill-int8-zh-en-emilia, ~104 MB) instead.
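The RAM guidance above can be encoded as a small selection helper. This is an illustrative sketch, not part of the SDK: `pickZipvoiceVariant` is a hypothetical name, and the variant paths and the 8 GB threshold simply mirror the note above.

```typescript
// Hypothetical helper: pick a Zipvoice variant based on total device RAM.
// Paths and the 8 GB threshold follow the memory note above; adjust for your app.
interface ZipvoiceVariant {
  path: string;
  approxSizeMb: number;
}

const FULL_VARIANT: ZipvoiceVariant = {
  path: 'models/sherpa-onnx-zipvoice-zh-en-emilia',
  approxSizeMb: 605,
};

const DISTILL_INT8_VARIANT: ZipvoiceVariant = {
  path: 'models/sherpa-onnx-zipvoice-distill-int8-zh-en-emilia',
  approxSizeMb: 104,
};

function pickZipvoiceVariant(totalRamGb: number): ZipvoiceVariant {
  // Below 8 GB of RAM, prefer the much smaller int8 distill variant.
  return totalRamGb < 8 ? DISTILL_INT8_VARIANT : FULL_VARIANT;
}
```

The returned `path` can then be passed as the `modelPath` when creating the TTS instance.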

Model Detection

The SDK automatically detects TTS model types based on file layouts:
import { createTTS, detectTtsModel } from 'react-native-sherpa-onnx/tts';

// Auto-detect model type
const detectedInfo = await detectTtsModel({
  type: 'asset',
  path: 'models/vits-piper-en'
});
console.log(detectedInfo.modelType); // 'vits'

// Create TTS with auto-detection
const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/vits-piper-en' },
  modelType: 'auto', // Auto-detect
});

Performance Tips

Use Streaming TTS

For low latency, use streaming generation:
const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/vits-piper-en' },
  modelType: 'vits',
});

const sampleRate = await tts.getSampleRate();
await tts.startPcmPlayer(sampleRate, 1);

await tts.generateSpeechStream('Hello world', { sid: 0, speed: 1.0 }, {
  onChunk: async (chunk) => {
    await tts.writePcmChunk(chunk.samples); // Immediate playback
  },
  onEnd: async () => {
    await tts.stopPcmPlayer();
  },
});
See the Streaming TTS Guide for more details.

Optimize Thread Count

const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/vits-piper' },
  numThreads: 4, // More threads can speed up generation, with diminishing returns beyond the CPU core count
});

Use Hardware Acceleration

const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/vits-piper' },
  provider: 'nnapi', // Android NNAPI
  // provider: 'xnnpack', // XNNPACK
});
See the Execution Providers guide for more details.

Tune Model Parameters

Adjust model-specific parameters for better quality or speed:
// VITS: noise and length scale
const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/vits-piper' },
  modelType: 'vits',
  modelOptions: {
    vits: {
      noiseScale: 0.667,   // Lower = clearer (less variation)
      noiseScaleW: 0.8,    // Duration noise
      lengthScale: 1.0,    // Speech speed (< 1.0 = faster)
    }
  },
});

// Kokoro: length scale only
const ttsKokoro = await createTTS({
  modelPath: { type: 'asset', path: 'models/kokoro' },
  modelType: 'kokoro',
  modelOptions: {
    kokoro: { lengthScale: 1.2 } // Slower speech
  },
});

Streaming vs Batch Generation

Batch Generation

Generate the entire audio buffer at once:
const audio = await tts.generateSpeech('Hello world', { sid: 0, speed: 1.0 });
console.log('Sample rate:', audio.sampleRate);
console.log('Samples:', audio.samples.length);

// Save to file
import { saveAudioToFile } from 'react-native-sherpa-onnx/tts';
await saveAudioToFile(audio, '/path/to/output.wav');

Streaming Generation

Receive incremental chunks for low-latency playback:
await tts.generateSpeechStream('Hello world', { sid: 0, speed: 1.0 }, {
  onChunk: (chunk) => {
    // Play chunk.samples immediately
    console.log('Chunk:', chunk.samples.length, 'samples');
  },
  onEnd: () => {
    console.log('Generation complete');
  },
});
Streaming is recommended for:
  • Interactive voice applications
  • Long text generation
  • Low time-to-first-byte
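If you also need the complete waveform after streaming (for example, to save it once playback finishes), you can collect the chunks and merge them. A minimal sketch, assuming each chunk's samples arrive as a Float32Array (as in `onChunk` above); `mergeChunks` is a hypothetical helper, not an SDK function.

```typescript
// Minimal sketch: merge streamed PCM chunks into a single buffer.
// Assumes each chunk's samples arrive as a Float32Array.
function mergeChunks(chunks: Float32Array[]): Float32Array {
  const total = chunks.reduce((n, c) => n + c.length, 0);
  const merged = new Float32Array(total);
  let offset = 0;
  for (const c of chunks) {
    merged.set(c, offset); // copy chunk at its position in the output
    offset += c.length;
  }
  return merged;
}
```

Collect chunks in `onChunk`, then call `mergeChunks` in `onEnd` to get a buffer equivalent to what batch generation would have returned.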
All TTS model downloads are available from:

TTS Models Repository

Download VITS, Kokoro, KittenTTS, Pocket, and additional specialized models

Voice Cloning

For applications that need to synthesize speech in a custom voice, use models that support reference audio:

Zipvoice (Full Voice Cloning)

Best quality, requires full model (encoder + decoder + vocoder):
const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/zipvoice-zh-en' },
  modelType: 'zipvoice',
});

const audio = await tts.generateSpeech('Target text to speak', {
  referenceAudio: { samples: refSamples, sampleRate: 22050 },
  referenceText: 'Transcript of the reference recording',
  speed: 1.0,
});
Zipvoice Distill: Models with only encoder + decoder (no vocoder) will fail during initialization. Use full Zipvoice models with a vocoder file (e.g. vocos_24khz.onnx).

Pocket (Flow-Matching with Reference Audio)

const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/pocket' },
  modelType: 'pocket',
});

const audio = await tts.generateSpeech('Target text', {
  referenceAudio: { samples: refSamples, sampleRate: 22050 },
  referenceText: 'Reference transcript',
  numSteps: 20,
  extra: { temperature: '0.7' },
});
See the TTS API Reference for more details on voice cloning.

Common Use Cases

Voice Assistants

Use VITS or KittenTTS for fast, interactive responses

Audiobook Narration

Use Matcha or Zipvoice for high-quality, natural speech

Real-Time Translation

Use streaming TTS (VITS, Kokoro) for low latency

Custom Voice Apps

Use Zipvoice or Pocket for voice cloning

E-Learning

Use VITS (Piper) for clear, consistent narration

Accessibility

Use fast streaming TTS for screen readers

Multi-Speaker Models

Many TTS models support multiple speakers (voices). Use the sid (speaker ID) parameter:
const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/vits-piper-multi' },
  modelType: 'vits',
});

const numSpeakers = await tts.getNumSpeakers();
console.log('Available speakers:', numSpeakers);

// Generate with different speakers
const audio1 = await tts.generateSpeech('Hello', { sid: 0 });
const audio2 = await tts.generateSpeech('Hello', { sid: 1 });

Sample Rate Handling

Different models output different sample rates (typically 16000, 22050, or 24000 Hz). Always check the model’s sample rate:
const tts = await createTTS({ ... });
const sampleRate = await tts.getSampleRate();
console.log('Model sample rate:', sampleRate);

// Use this for playback
await tts.startPcmPlayer(sampleRate, 1); // mono
If you need a specific sample rate for your playback system, resample the audio using the Audio Conversion API.
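To show what resampling involves, here is an illustrative linear-interpolation resampler. It is a hypothetical helper, not part of the SDK; for production audio, prefer the Audio Conversion API mentioned above.

```typescript
// Illustrative only: resample mono PCM with linear interpolation.
function resampleLinear(
  input: Float32Array,
  fromRate: number,
  toRate: number,
): Float32Array {
  if (fromRate === toRate) return input.slice();
  const outLength = Math.round((input.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const ratio = fromRate / toRate; // input samples per output sample
  for (let i = 0; i < outLength; i++) {
    const pos = i * ratio;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1);
    const frac = pos - i0;
    // Interpolate between the two nearest input samples.
    out[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return out;
}
```

Linear interpolation is adequate for speech playback; dedicated conversion utilities typically apply proper low-pass filtering for higher fidelity.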

Next Steps

TTS API Reference

Detailed API documentation

Streaming TTS

Low-latency streaming generation

Model Setup

How to download and bundle models

Execution Providers

Hardware acceleration options
