Text-to-Speech Models Overview
react-native-sherpa-onnx supports multiple TTS model architectures, from fast VITS models to high-quality voice cloning with Zipvoice. This guide helps you choose the right model for your application.
Model Comparison
VITS Models Fast, high-quality TTS from Piper, Coqui, MeloTTS, and MMS
Matcha Models High-quality acoustic model with vocoder for natural speech
Kokoro Models Multi-speaker, multi-language TTS models
Other Models KittenTTS, Zipvoice (voice cloning), and Pocket (flow-matching)
Quick Comparison Table
| Model Type | Streaming | Multi-Speaker | Voice Cloning | Speed | Quality |
| --- | --- | --- | --- | --- | --- |
| VITS | ✅ Yes | ✅ Yes | ❌ No | Very Fast | High |
| Matcha | ✅ Yes | ✅ Yes | ❌ No | Fast | Very High |
| Kokoro | ✅ Yes | ✅ Yes | ❌ No | Fast | High |
| KittenTTS | ✅ Yes | ✅ Yes | ❌ No | Very Fast | Good |
| Zipvoice | ❌ No | ✅ Yes | ✅ Yes | Medium | Very High |
| Pocket | ✅ Yes | ✅ Yes | ✅ Yes | Fast | High |
Choosing a Model
For Fast, Real-Time TTS
If you need low latency and streaming playback:
VITS (Piper) – Fastest, excellent quality, many voices
KittenTTS – Lightweight, fast, multi-speaker
Kokoro – Fast with multi-language support
Pocket – Flow-matching with streaming and voice cloning
For Voice Cloning
If you need to clone voices from reference audio:
Zipvoice – High-quality zero-shot voice cloning (encoder + decoder + vocoder)
Pocket – Flow-matching TTS with reference audio support
For High Quality
If naturalness is your priority:
Matcha – High-quality acoustic model + vocoder
Zipvoice – Excellent quality with voice cloning
VITS – Great balance of speed and quality
By Language Support
English:
VITS (Piper) – Many voices
Matcha
Kokoro
KittenTTS
Multilingual:
Kokoro (multi-language)
MeloTTS (subset of VITS)
Zipvoice (Chinese + English)
Chinese:
Zipvoice (excellent for Chinese)
VITS variants
By Device Constraints
Low-end devices / limited RAM:
VITS (small, fast)
KittenTTS (lightweight)
Use int8 quantized variants
High-end devices:
Matcha (high quality)
Zipvoice (voice cloning, but needs memory)
Pocket (flow-matching)
Zipvoice Memory Requirements: Full Zipvoice models (~605 MB) require significant RAM. On devices with less than 8 GB of RAM, use the int8 distill variant (`sherpa-onnx-zipvoice-distill-int8-zh-en-emilia`, ~104 MB) instead.
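One way to act on this note is to pick the model variant from the device's reported memory before loading. The helper below is an illustrative sketch, not part of the SDK: the `totalRamGb` value is assumed to come from a device-info library, and the full-model path follows the `models/zipvoice-zh-en` convention used elsewhere in this guide.

```typescript
// Choose a Zipvoice asset path based on available RAM.
// Full model (~605 MB) needs headroom; the ~104 MB int8
// distill variant is safer on devices under 8 GB of RAM.
function pickZipvoiceModel(totalRamGb: number): string {
  return totalRamGb >= 8
    ? 'models/zipvoice-zh-en'
    : 'models/sherpa-onnx-zipvoice-distill-int8-zh-en-emilia';
}
```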
Model Detection
The SDK automatically detects the TTS model type from the files present in the model directory:
```typescript
import { createTTS, detectTtsModel } from 'react-native-sherpa-onnx/tts';

// Auto-detect model type
const detectedInfo = await detectTtsModel({
  type: 'asset',
  path: 'models/vits-piper-en'
});
console.log(detectedInfo.modelType); // 'vits'

// Create TTS with auto-detection
const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/vits-piper-en' },
  modelType: 'auto', // Auto-detect
});
```
Use Streaming TTS
For low latency, use streaming generation:
```typescript
const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/vits-piper-en' },
  modelType: 'vits',
});

const sampleRate = await tts.getSampleRate();
await tts.startPcmPlayer(sampleRate, 1);

await tts.generateSpeechStream('Hello world', { sid: 0, speed: 1.0 }, {
  onChunk: async (chunk) => {
    await tts.writePcmChunk(chunk.samples); // Immediate playback
  },
  onEnd: async () => {
    await tts.stopPcmPlayer();
  },
});
```
See the Streaming TTS Guide for more details.
Optimize Thread Count
```typescript
const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/vits-piper' },
  numThreads: 4, // More threads = faster generation
});
```
Use Hardware Acceleration
```typescript
const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/vits-piper' },
  provider: 'nnapi', // Android NNAPI
  // provider: 'xnnpack', // XNNPACK
});
```
See the Execution Providers guide for more details.
Tune Model Parameters
Adjust model-specific parameters for better quality or speed:
```typescript
// VITS: noise and length scale
const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/vits-piper' },
  modelType: 'vits',
  modelOptions: {
    vits: {
      noiseScale: 0.667, // Lower = clearer (less variation)
      noiseScaleW: 0.8,  // Duration noise
      lengthScale: 1.0,  // Speech speed (< 1.0 = faster)
    }
  },
});

// Kokoro: length scale only
const ttsKokoro = await createTTS({
  modelPath: { type: 'asset', path: 'models/kokoro' },
  modelType: 'kokoro',
  modelOptions: {
    kokoro: { lengthScale: 1.2 } // Slower speech
  },
});
```
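Note that `lengthScale` and a playback-speed multiplier are inverses: durations are multiplied by `lengthScale`, so faster speech means a smaller scale. A tiny helper (illustrative, not part of the SDK) keeps the two consistent:

```typescript
// Convert a speed multiplier (2.0 = twice as fast) into the
// lengthScale expected by VITS/Kokoro model options.
function speedToLengthScale(speed: number): number {
  if (speed <= 0) throw new Error('speed must be positive');
  return 1 / speed;
}
```

For example, `speedToLengthScale(2.0)` yields `0.5`, i.e. durations halved.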
Streaming vs Batch Generation
Batch Generation
Generate the entire audio buffer at once:
```typescript
import { saveAudioToFile } from 'react-native-sherpa-onnx/tts';

const audio = await tts.generateSpeech('Hello world', { sid: 0, speed: 1.0 });
console.log('Sample rate:', audio.sampleRate);
console.log('Samples:', audio.samples.length);

// Save to file
await saveAudioToFile(audio, '/path/to/output.wav');
```
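The clip's duration follows directly from these two fields. A small helper, assuming the audio shape shown above (`samples` plus `sampleRate`):

```typescript
// Duration in seconds = number of samples / samples per second.
function audioDurationSeconds(audio: {
  samples: Float32Array;
  sampleRate: number;
}): number {
  return audio.samples.length / audio.sampleRate;
}
```

For instance, 44,100 samples at 22,050 Hz is a 2-second clip.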
Streaming Generation
Receive incremental chunks for low-latency playback:
```typescript
await tts.generateSpeechStream('Hello world', { sid: 0, speed: 1.0 }, {
  onChunk: (chunk) => {
    // Play chunk.samples immediately
    console.log('Chunk:', chunk.samples.length, 'samples');
  },
  onEnd: () => {
    console.log('Generation complete');
  },
});
```
Streaming is recommended for:
Interactive voice applications
Long text generation
Low time-to-first-byte
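You can also get the best of both modes: stream for latency, and collect the chunks if you need the complete buffer afterwards (e.g. to save a file). A sketch, assuming each chunk carries a `Float32Array` of samples as in the callbacks above:

```typescript
// Concatenate streamed sample chunks into one contiguous buffer.
function concatChunks(chunks: Float32Array[]): Float32Array {
  const total = chunks.reduce((n, c) => n + c.length, 0);
  const out = new Float32Array(total);
  let offset = 0;
  for (const c of chunks) {
    out.set(c, offset);
    offset += c.length;
  }
  return out;
}
```

Push `chunk.samples` into an array inside `onChunk`, then call `concatChunks` in `onEnd`.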
Download Links
All TTS model downloads are available from:
TTS Models Repository Download VITS, Kokoro, KittenTTS, and Pocket models
Voice Cloning
For applications that need to synthesize speech in a custom voice , use models that support reference audio:
Zipvoice (Full Voice Cloning)
Best quality, requires full model (encoder + decoder + vocoder):
```typescript
const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/zipvoice-zh-en' },
  modelType: 'zipvoice',
});

const audio = await tts.generateSpeech('Target text to speak', {
  referenceAudio: { samples: refSamples, sampleRate: 22050 },
  referenceText: 'Transcript of the reference recording',
  speed: 1.0,
});
```
Zipvoice Distill: Models that contain only an encoder + decoder (no vocoder) will fail during initialization. Use full Zipvoice models with a vocoder file (e.g. `vocos_24khz.onnx`).
Pocket (Flow-Matching with Reference Audio)
```typescript
const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/pocket' },
  modelType: 'pocket',
});

const audio = await tts.generateSpeech('Target text', {
  referenceAudio: { samples: refSamples, sampleRate: 22050 },
  referenceText: 'Reference transcript',
  numSteps: 20,
  extra: { temperature: '0.7' },
});
```
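In both examples, `refSamples` must be float samples in [-1, 1]. If your reference recording is 16-bit PCM, a conversion like the following works (a sketch, assuming the recording has already been decoded into an `Int16Array`):

```typescript
// Scale 16-bit PCM values (-32768..32767) into floats in [-1, 1].
function pcm16ToFloat32(pcm: Int16Array): Float32Array {
  const out = new Float32Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) {
    out[i] = pcm[i] / 32768;
  }
  return out;
}
```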
See the TTS API Reference for more details on voice cloning.
Common Use Cases
Voice Assistants Use VITS or KittenTTS for fast, interactive responses
Audiobook Narration Use Matcha or Zipvoice for high-quality, natural speech
Real-Time Translation Use streaming TTS (VITS, Kokoro) for low latency
Custom Voice Apps Use Zipvoice or Pocket for voice cloning
E-Learning Use VITS (Piper) for clear, consistent narration
Accessibility Use fast streaming TTS for screen readers
Multi-Speaker Models
Many TTS models support multiple speakers (voices). Use the `sid` (speaker ID) parameter:
```typescript
const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/vits-piper-multi' },
  modelType: 'vits',
});

const numSpeakers = await tts.getNumSpeakers();
console.log('Available speakers:', numSpeakers);

// Generate with different speakers
const audio1 = await tts.generateSpeech('Hello', { sid: 0 });
const audio2 = await tts.generateSpeech('Hello', { sid: 1 });
```
Sample Rate Handling
Different models output different sample rates (typically 16000, 22050, or 24000 Hz). Always check the model’s sample rate:
```typescript
const tts = await createTTS({ ... });
const sampleRate = await tts.getSampleRate();
console.log('Model sample rate:', sampleRate);

// Use this for playback
await tts.startPcmPlayer(sampleRate, 1); // mono
```
If you need a specific sample rate for your playback system, resample the audio using the Audio Conversion API .
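If the SDK's conversion helpers don't fit your pipeline, a linear-interpolation resampler can bridge sample rates in a pinch. This is a sketch only; for production audio quality, prefer a proper polyphase or windowed-sinc resampler:

```typescript
// Resample mono float samples from one rate to another by
// linearly interpolating between neighboring input samples.
function resampleLinear(
  input: Float32Array,
  fromRate: number,
  toRate: number
): Float32Array {
  if (fromRate === toRate) return input;
  const outLength = Math.round(input.length * (toRate / fromRate));
  const out = new Float32Array(outLength);
  const ratio = fromRate / toRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * ratio;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1);
    const frac = pos - i0;
    out[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return out;
}
```

For example, resampling a 22,050 Hz model's output to 44,100 Hz doubles the sample count.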
Next Steps
TTS API Reference Detailed API documentation
Streaming TTS Low-latency streaming generation
Model Setup How to download and bundle models
Execution Providers Hardware acceleration options