Other TTS Models

This page covers three additional TTS model types: the lightweight KittenTTS, the zero-shot voice-cloning Zipvoice, and the flow-matching Pocket.

Overview

KittenTTS

Lightweight, multi-speaker TTS

Zipvoice

Zero-shot voice cloning

Pocket

Flow-matching TTS with voice cloning

KittenTTS

modelType: 'kitten'

Description

KittenTTS is a lightweight, fast, multi-speaker TTS model optimized for resource-constrained devices.

Characteristics

Streaming: ✅ Yes
Quality: ⭐⭐⭐ Good
Speed: ⭐⭐⭐⭐⭐ Very Fast
Memory: ⭐⭐⭐⭐⭐ Very Low
Size: Very Small (typically 10-30 MB)
Multi-Speaker: ✅ Yes

Configuration

import { createTTS } from 'react-native-sherpa-onnx/tts';

const tts = await createTTS({
  modelPath: {
    type: 'asset',
    path: 'models/kitten-tts-en'
  },
  modelType: 'kitten', // or 'auto'
  numThreads: 2,
});

const audio = await tts.generateSpeech('Hello from KittenTTS!');
console.log('Generated:', audio.samples.length, 'samples');

await tts.destroy();
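
The sample count logged above can be turned into a playback duration by dividing it by the sample rate. The sketch below assumes mono output (one channel, as in the streaming example's startPcmPlayer(sampleRate, 1) call); durationSeconds is a hypothetical helper, not part of the SDK.

```typescript
// Hypothetical helper (not an SDK function): converts a mono sample count
// into seconds of audio.
function durationSeconds(sampleCount: number, sampleRate: number): number {
  if (!(sampleRate > 0)) {
    throw new Error('sampleRate must be positive');
  }
  return sampleCount / sampleRate;
}

// Example: 44100 samples at 22050 Hz is 2 seconds of audio.
const seconds = durationSeconds(44100, 22050);
```

With a live instance, the inputs would be audio.samples.length and the value returned by tts.getSampleRate().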

Streaming Example

const sampleRate = await tts.getSampleRate();
await tts.startPcmPlayer(sampleRate, 1);

await tts.generateSpeechStream('Fast streaming speech', { sid: 0, speed: 1.0 }, {
  onChunk: async (chunk) => {
    await tts.writePcmChunk(chunk.samples);
  },
  onEnd: async () => {
    await tts.stopPcmPlayer();
  },
});

Download

KittenTTS Models

Download KittenTTS models

Model Detection

Folder name should contain kitten and should not contain kokoro
Files: model.onnx, tokens.txt
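
As a rough illustration of the two rules above, the following sketch checks a folder name and file list before passing modelType: 'auto'. It is not the SDK's actual detector: looksLikeKittenModel is a hypothetical helper, and the case-insensitive name match is an assumption.

```typescript
// Hypothetical pre-check that mirrors the rules listed above.
// Assumption: the folder-name match is case-insensitive.
function looksLikeKittenModel(folderName: string, fileNames: string[]): boolean {
  const name = folderName.toLowerCase();
  const nameMatches = name.includes('kitten') && !name.includes('kokoro');
  const requiredFiles = ['model.onnx', 'tokens.txt'];
  const hasFiles = requiredFiles.every((file) => fileNames.includes(file));
  return nameMatches && hasFiles;
}
```

A folder named kitten-tts-en that contains model.onnx and tokens.txt passes this check; a folder whose name contains kokoro does not, whatever files it holds.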

When to Use

Low-End Devices

Resource-constrained mobile devices

Fast Response

Applications requiring minimal latency

Battery Efficiency

Low power consumption for longer battery life

Embedded Systems

IoT devices with limited resources

Advantages

Very Fast: Fastest of the three models on this page
Very Small: Minimal storage footprint
Low Memory: Runs on constrained devices
Streaming: Low-latency incremental generation
Multi-Speaker: Multiple voices in one model

Limitations

Quality: Good but not as natural as VITS or Matcha
Limited Languages: Fewer language options
No Voice Cloning: Fixed voice set only

Zipvoice

modelType: 'zipvoice'

Description

Zipvoice is a zero-shot voice cloning model that can synthesize speech in any voice from a short reference audio sample.

Characteristics

Streaming: ❌ No (batch only for voice cloning)
Quality: ⭐⭐⭐⭐⭐ Excellent
Speed: ⭐⭐⭐ Medium
Memory: ⭐⭐ High (requires significant RAM)
Size: Large (~605 MB for full model)
Voice Cloning: ✅ Yes

Architecture

Zipvoice uses a three-stage pipeline:

Encoder – Encodes reference audio
Decoder (flow-matching) – Generates mel-spectrogram
Vocoder (e.g. vocos_24khz.onnx) – Converts to waveform

Vocoder Required: Zipvoice requires a vocoder file (e.g. vocos_24khz.onnx) in the model folder. A “distill” download that contains only the encoder and decoder will fail during initialization.

Configuration

import { createTTS } from 'react-native-sherpa-onnx/tts';

const tts = await createTTS({
  modelPath: {
    type: 'asset',
    path: 'models/sherpa-onnx-zipvoice-zh-en-emilia'
  },
  modelType: 'zipvoice',
  numThreads: 4,
});

// Voice cloning with reference audio
const audio = await tts.generateSpeech('Target text to speak in reference voice', {
  referenceAudio: {
    samples: referenceSamples,  // mono float PCM samples in [-1, 1]
    sampleRate: 22050,
  },
  referenceText: 'Transcript of the reference audio',
  numSteps: 20,  // Flow-matching steps (higher = better quality)
  speed: 1.0,
});

await tts.destroy();

Memory Requirements

High Memory Usage: Full Zipvoice models (~605 MB) require substantial RAM:

Recommended: 8+ GB device RAM
Minimum: ~800 MB free memory

For devices with less than 8 GB RAM, use the int8 distill variant: sherpa-onnx-zipvoice-distill-int8-zh-en-emilia (~104 MB). The SDK checks free memory before loading and rejects with an error if there is not enough.
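
The guidance above can be sketched as a small selection function. The thresholds (8 GB device RAM, ~800 MB free) and folder names come from this page; pickZipvoiceVariant is a hypothetical helper, and how the RAM figures are obtained (for example from a device-info library) is left to the app.

```typescript
const MB = 1024 * 1024;
const GB = 1024 * MB;

// Folder names as given on this page.
const FULL_MODEL = 'sherpa-onnx-zipvoice-zh-en-emilia';
const DISTILL_INT8_MODEL = 'sherpa-onnx-zipvoice-distill-int8-zh-en-emilia';

// Hypothetical helper: picks the full model only when both the recommended
// device RAM and the minimum free memory from this page are met.
function pickZipvoiceVariant(totalRamBytes: number, freeRamBytes: number): string {
  const enoughTotal = totalRamBytes >= 8 * GB;
  const enoughFree = freeRamBytes >= 800 * MB;
  return enoughTotal && enoughFree ? FULL_MODEL : DISTILL_INT8_MODEL;
}
```

The SDK's own free-memory check still runs at load time, so this only helps decide which folder to try first.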

Download

Zipvoice Models

Download Zipvoice models (full and int8 distill variants)

Model Detection

Zipvoice is detected by file layout:

Encoder + decoder + vocoder files
Optional folder name pattern (containing zipvoice)
Files: encoder, decoder, vocos_*.onnx (vocoder), tokens.txt, lexicon.txt, espeak-ng-data
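
Because a missing vocoder only surfaces as an initialization failure, it can help to check the folder contents first. The sketch below is a hypothetical pre-check, not the SDK's detector: it assumes the encoder and decoder can be recognized by a substring of their file names, and it treats every listed file as required, which may be stricter than the SDK.

```typescript
// Hypothetical pre-check based on the file list above.
// Returns the labels of the parts that were not found (empty = all present).
function missingZipvoiceParts(fileNames: string[]): string[] {
  const names = fileNames.map((name) => name.toLowerCase());
  const checks: Array<[string, (name: string) => boolean]> = [
    ['encoder', (name) => name.includes('encoder')],
    ['decoder', (name) => name.includes('decoder')],
    ['vocoder (vocos_*.onnx)', (name) => /^vocos_.*\.onnx$/.test(name)],
    ['tokens.txt', (name) => name === 'tokens.txt'],
    ['lexicon.txt', (name) => name === 'lexicon.txt'],
    ['espeak-ng-data', (name) => name === 'espeak-ng-data'],
  ];
  return checks
    .filter(([, matches]) => !names.some(matches))
    .map(([label]) => label);
}
```

An empty result means the layout matches the list above; a result containing the vocoder label corresponds to the encoder + decoder only case that fails during initialization.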

When to Use

Custom Voices

Synthesize speech in any voice from reference audio

Voice Cloning Apps

Apps that need user-specific voice synthesis

Dubbing & Translation

Translate content while preserving original voice

Personalization

Personalized voice experiences

Advantages

Zero-Shot Voice Cloning: Clone any voice from short audio
Excellent Quality: Very natural-sounding output
Flexible: Works with various reference voices
Multilingual: Supports Chinese and English

Limitations

High Memory: Full model is recommended only for devices with 8+ GB RAM
No Streaming: Voice cloning only supports batch generation
Large Size: ~605 MB (use int8 distill variant for smaller size)
Slower: Flow-matching is computationally intensive
Requires Vocoder: Distill-only models (no vocoder) will fail

Reference Audio Requirements

Format: Mono, float PCM samples in [-1, 1]
Sample Rate: Typically 22050 Hz or 24000 Hz
Duration: 3-10 seconds recommended
Quality: Clear speech, minimal background noise
Transcript: Must provide accurate transcript of reference audio
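
These requirements can be checked before calling generateSpeech. The sketch below is a minimal, hypothetical validator (checkReferenceAudio and int16ToFloat are not SDK functions). It assumes the samples are already mono and treats the 3-10 second recommendation as a hard limit, which an app may want to relax.

```typescript
interface ReferenceAudio {
  samples: number[]; // mono float PCM
  sampleRate: number; // Hz
}

// Hypothetical validator: returns a list of problems (empty = looks usable).
function checkReferenceAudio(audio: ReferenceAudio, transcript: string): string[] {
  const problems: string[] = [];
  if (!(audio.sampleRate > 0) || audio.samples.length === 0) {
    problems.push('audio is empty or the sample rate is invalid');
    return problems;
  }
  const seconds = audio.samples.length / audio.sampleRate;
  if (seconds < 3 || seconds > 10) {
    problems.push('duration is outside the recommended 3-10 seconds');
  }
  if (audio.samples.some((s) => !Number.isFinite(s) || s < -1 || s > 1)) {
    problems.push('samples must be finite floats in [-1, 1]');
  }
  if (transcript.trim().length === 0) {
    problems.push('a transcript of the reference audio is required');
  }
  return problems;
}

// Converts 16-bit PCM to floats in [-1, 1) by dividing by 32768.
function int16ToFloat(pcm: Int16Array): number[] {
  return Array.from(pcm, (value) => value / 32768);
}
```

The validator cannot judge recording quality or transcript accuracy; those still need a human check or a separate tool.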

Pocket

modelType: 'pocket'

Description

Pocket is a flow-matching TTS model that supports both standard synthesis and voice cloning with reference audio.

Characteristics

Streaming: ✅ Yes (including with reference audio for Kotlin-engine models)
Quality: ⭐⭐⭐⭐ High
Speed: ⭐⭐⭐⭐ Fast
Memory: ⭐⭐⭐ Moderate
Size: Medium
Voice Cloning: ✅ Yes

Configuration

import { createTTS } from 'react-native-sherpa-onnx/tts';

const tts = await createTTS({
  modelPath: {
    type: 'asset',
    path: 'models/pocket-tts'
  },
  modelType: 'pocket', // or 'auto'
  numThreads: 2,
});

// Standard generation
const audio1 = await tts.generateSpeech('Hello from Pocket TTS');

// Voice cloning with reference audio
const audio2 = await tts.generateSpeech('Target text', {
  referenceAudio: { samples: refSamples, sampleRate: 22050 },
  referenceText: 'Reference transcript',
  numSteps: 20,
  extra: {
    temperature: '0.7',
    chunk_size: '15',
  },
});

await tts.destroy();

Streaming with Voice Cloning

Unlike Zipvoice, Pocket supports streaming even with reference audio:

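// Start the PCM player before streaming, as in the KittenTTS example
const sampleRate = await tts.getSampleRate();
await tts.startPcmPlayer(sampleRate, 1);

await tts.generateSpeechStream('Target text', {
  referenceAudio: { samples: refSamples, sampleRate: 22050 },
  referenceText: 'Reference transcript',
  numSteps: 20,
}, {
  onChunk: async (chunk) => {
    await tts.writePcmChunk(chunk.samples);
  },
  onEnd: async () => {
    await tts.stopPcmPlayer();
  },
});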

Extra Options

Pocket accepts model-specific options via the extra parameter:

const audio = await tts.generateSpeech('Text', {
  sid: 0,
  speed: 1.0,
  numSteps: 20,
  extra: {
    temperature: '0.7',    // Sampling temperature
    chunk_size: '15',      // Processing chunk size
    // Add other model-specific options as needed
  },
});
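
The examples on this page pass every extra value as a string ('0.7', '15'). Assuming string values are what the extra map expects, a small hypothetical helper such as toExtra can convert numeric settings so they do not have to be quoted by hand.

```typescript
// Hypothetical helper (not part of the SDK): converts each option value to
// the string form used in the examples above.
function toExtra(options: Record<string, string | number | boolean>): Record<string, string> {
  const extra: Record<string, string> = {};
  for (const key of Object.keys(options)) {
    extra[key] = String(options[key]);
  }
  return extra;
}

// Produces { temperature: '0.7', chunk_size: '15' }
const extra = toExtra({ temperature: 0.7, chunk_size: 15 });
```

The result can be passed as the extra field of the generation options.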

Download

Pocket Models

Download Pocket TTS models

Model Detection

Pocket is detected by file layout:

Files: lm_flow, lm_main, text_conditioner, vocab/token_scores
No folder name pattern required

When to Use

Voice Cloning + Streaming

Need both voice cloning and low-latency streaming

Modern Architecture

Flow-matching for high-quality synthesis

Flexible Options

Fine-grained control with extra parameters

Interactive Apps

Real-time custom voice applications

Advantages

Streaming + Voice Cloning: Supports both simultaneously
Flow-Matching: Modern architecture for quality
Fast: Good performance with streaming
Flexible: Extra options for fine-tuning
Good Quality: Natural-sounding speech

Limitations

Newer: Less battle-tested than VITS or Zipvoice
Documentation: Fewer examples and resources
Model Availability: Fewer pretrained models

Comparison Table

Feature	KittenTTS	Zipvoice	Pocket
Speed	Very Fast	Medium	Fast
Quality	Good	Excellent	High
Streaming	Yes	No	Yes
Voice Cloning	No	Yes	Yes
Model Size	Very Small	Large	Medium
Memory	Very Low	High	Moderate
Best For	Low-end devices	High-quality cloning	Streaming + cloning

Choosing Between Models

For Voice Cloning

Zipvoice – Best quality, batch generation only, high memory
Pocket – Streaming support, good quality, moderate memory

For Speed

KittenTTS – Fastest, lightweight
Pocket – Fast with streaming

For Low-End Devices

KittenTTS – Minimal resources
Zipvoice int8 distill – If voice cloning is needed

For High Quality

Zipvoice – Excellent voice cloning quality
Pocket – Good quality with more flexibility
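
The recommendations above can be condensed into a small decision function. This is only a sketch of this page's guidance: suggestModel and its input fields are hypothetical names, and real projects will likely weigh language support and model availability as well.

```typescript
type TtsModelType = 'kitten' | 'zipvoice' | 'pocket';

interface TtsNeeds {
  voiceCloning: boolean;
  streaming: boolean;
  lowEndDevice: boolean;
}

interface Suggestion {
  modelType: TtsModelType;
  note: string;
}

// Hypothetical helper that encodes the guidance from this section.
function suggestModel(needs: TtsNeeds): Suggestion {
  if (!needs.voiceCloning) {
    return { modelType: 'kitten', note: 'fastest and smallest; fixed voice set' };
  }
  if (needs.streaming) {
    return { modelType: 'pocket', note: 'voice cloning with streaming' };
  }
  if (needs.lowEndDevice) {
    return { modelType: 'zipvoice', note: 'use the int8 distill variant' };
  }
  return { modelType: 'zipvoice', note: 'full model; best cloning quality, batch only' };
}
```

For example, an app that needs cloned voices with low-latency playback gets 'pocket', while a batch dubbing tool on a high-memory device gets the full 'zipvoice' model.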

Next Steps

TTS Overview

Compare all TTS model types

TTS API

Detailed API documentation

Streaming TTS

Low-latency streaming guide

Model Setup

How to download and bundle models
