Text-to-Speech - React Native Sherpa-ONNX

Overview

The TTS (Text-to-Speech) module provides offline speech synthesis using various model types. Generate complete audio from text, with support for multiple speakers, adjustable speed, and model-specific parameters. Key features:

Multiple model types (VITS, Matcha, Kokoro, Kitten, Pocket, Zipvoice)
Multi-speaker support (speaker selection by ID)
Adjustable speech speed
Voice cloning with reference audio (Pocket, Zipvoice)
Timestamp generation
Save to WAV files or play directly

Quick Start

import { createTTS, saveAudioToFile } from 'react-native-sherpa-onnx/tts';

// Create a TTS engine
const tts = await createTTS({
  modelPath: {
    type: 'asset',
    path: 'models/sherpa-onnx-vits-piper-en_US-libritts_r-medium',
  },
  modelType: 'auto',  // Auto-detect from files
  numThreads: 2,
});

// Generate speech
const audio = await tts.generateSpeech('Hello, world!');
console.log('Sample rate:', audio.sampleRate);
console.log('Samples:', audio.samples.length);

// Save to file
await saveAudioToFile(audio, '/path/to/output.wav');

// Clean up
await tts.destroy();

Supported Model Types

Model Type	Description	Config Options
`vits`	VITS models (Piper, Coqui, MeloTTS, MMS)	noiseScale, noiseScaleW, lengthScale
`matcha`	Matcha models	noiseScale, lengthScale
`kokoro`	Kokoro (multi-speaker, multi-language)	lengthScale
`kitten`	KittenTTS (lightweight)	lengthScale
`pocket`	Pocket TTS (voice cloning)	Voice cloning via referenceAudio
`zipvoice`	Zipvoice (voice cloning)	Voice cloning via referenceAudio
`auto`	Auto-detect from files	—

Use modelType: 'auto' to let the SDK detect the model type from the files in the model directory.

API Reference

createTTS(options)

Creates a TTS engine for batch (one-shot) speech generation.

src/tts/index.ts

export async function createTTS(
  options: TTSInitializeOptions | ModelPathConfig
): Promise<TtsEngine>;

Options:

modelPath (ModelPathConfig, required)
Model directory path. Use { type: 'asset', path: 'models/...' } for bundled assets.

modelType (TTSModelType, default: 'auto')
Model type: 'vits', 'matcha', 'kokoro', 'kitten', 'pocket', 'zipvoice', or 'auto'.

numThreads (number, default: 2)
Number of threads for inference. More threads can speed up generation at the cost of higher CPU usage.

provider (string, default: 'cpu')
Execution provider (e.g., 'cpu', 'coreml', 'xnnpack'). See Execution Providers.

debug (boolean, default: false)
Enable debug logging.

modelOptions (TtsModelOptions)
Model-specific configuration. Only the block for the loaded model type is applied:

vits: { noiseScale, noiseScaleW, lengthScale }
matcha: { noiseScale, lengthScale }
kokoro: { lengthScale }
kitten: { lengthScale }

ruleFsts (string)
Path(s) to rule FSTs for text normalization.

ruleFars (string)
Path(s) to rule FARs for text normalization.

maxNumSentences (number, default: 1)
Max sentences per streaming callback.

silenceScale (number, default: 0.2)
Silence scale set at the config level.

TtsEngine: generateSpeech(text, options?)

Generate speech audio from text.

const audio = await tts.generateSpeech(
  'Hello, world!',
  {
    sid: 0,        // Speaker ID
    speed: 1.0,    // Speech speed
  }
);

Returns GeneratedAudio:

interface GeneratedAudio {
  samples: number[];    // Float PCM in [-1.0, 1.0]
  sampleRate: number;   // Sample rate in Hz (e.g., 22050)
}
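
Because samples is a flat array of float PCM, the clip length follows directly from the sample count and the sample rate. The sketch below shows that arithmetic; the helper names audioDurationSeconds and peakAmplitude are illustrative and are not part of the SDK, and the interface is a local copy of the shape documented above.

```typescript
// Local copy of the documented GeneratedAudio shape so the sketch is self-contained.
interface GeneratedAudio {
  samples: number[];
  sampleRate: number;
}

// Duration in seconds = number of samples / samples per second.
function audioDurationSeconds(audio: GeneratedAudio): number {
  if (audio.sampleRate <= 0) {
    throw new RangeError('sampleRate must be positive');
  }
  return audio.samples.length / audio.sampleRate;
}

// Largest absolute sample value; anything above 1.0 would clip when written as PCM.
function peakAmplitude(audio: GeneratedAudio): number {
  let peak = 0;
  for (const sample of audio.samples) {
    peak = Math.max(peak, Math.abs(sample));
  }
  return peak;
}
```

For example, 22050 samples at a 22050 Hz sample rate correspond to one second of audio.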

Generation Options:

sid (number, default: 0)
Speaker ID for multi-speaker models. Use getNumSpeakers() to check available speakers.

speed (number, default: 1.0)
Speech speed multiplier:

1.0 = normal speed
0.5 = half speed (slower)
2.0 = double speed (faster)

silenceScale (number)
Silence scale at generation time (model-dependent).

referenceAudio ({ samples: number[], sampleRate: number })
Reference audio for voice cloning (Pocket, Zipvoice). Mono float samples in [-1, 1].

referenceText (string)
Transcript of reference audio (required when using referenceAudio).

numSteps (number)
Flow-matching steps (Pocket TTS).

extra (Record<string, string>)
Model-specific options (e.g., Pocket: { temperature: '0.7', chunk_size: '15' }).

TtsEngine: generateSpeechWithTimestamps(text, options?)

Generate speech with word-level timestamps.

const result = await tts.generateSpeechWithTimestamps('Hello world');

console.log('Samples:', result.samples.length);
console.log('Sample rate:', result.sampleRate);
console.log('Subtitles:', result.subtitles);
// [{ text: 'Hello', start: 0, end: 0.5 }, { text: 'world', start: 0.6, end: 1.0 }]
console.log('Estimated:', result.estimated);  // true if timestamps are estimated
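
One common use for the subtitles array is writing a caption file. The sketch below converts entries of the shape shown in the example output into SubRip (SRT) text. It assumes start and end are in seconds, which the example values suggest but this page does not state; the names SubtitleEntry, formatSrtTime, and subtitlesToSrt are illustrative and not part of the SDK.

```typescript
// Shape of one subtitle entry as shown in the example output (times assumed to be seconds).
interface SubtitleEntry {
  text: string;
  start: number;
  end: number;
}

// Left-pad a non-negative integer with zeros to the given width.
function padNumber(value: number, width: number): string {
  const digits = String(value);
  return digits.length >= width ? digits : '0'.repeat(width - digits.length) + digits;
}

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function formatSrtTime(seconds: number): string {
  const totalMs = Math.max(0, Math.round(seconds * 1000));
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  return `${padNumber(h, 2)}:${padNumber(m, 2)}:${padNumber(s, 2)},${padNumber(ms, 3)}`;
}

// Build SRT text: a 1-based cue number, a time range line, the cue text, then a blank line.
function subtitlesToSrt(subtitles: SubtitleEntry[]): string {
  return subtitles
    .map((cue, index) =>
      `${index + 1}\n${formatSrtTime(cue.start)} --> ${formatSrtTime(cue.end)}\n${cue.text}\n`
    )
    .join('\n');
}
```

When result.estimated is true, the resulting cue times are approximations and may drift from the audio.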

TtsEngine: updateParams(options)

Update model parameters at runtime without reloading.

await tts.updateParams({
  modelOptions: {
    vits: {
      noiseScale: 0.7,
      lengthScale: 1.2,
    },
  },
});

TtsEngine: getModelInfo()

Get model information (sample rate and number of speakers).

const info = await tts.getModelInfo();
console.log('Sample rate:', info.sampleRate);
console.log('Number of speakers:', info.numSpeakers);

TtsEngine: getSampleRate()

Get the model’s sample rate.

const sampleRate = await tts.getSampleRate();
console.log('Sample rate:', sampleRate);  // e.g., 22050

TtsEngine: getNumSpeakers()

Get the number of available speakers.

const numSpeakers = await tts.getNumSpeakers();
console.log('Speakers:', numSpeakers);  // 0 or 1 = single-speaker, >1 = multi-speaker

TtsEngine: destroy()

Release native resources. Must be called when done.

await tts.destroy();
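
Since destroy() must run even when generation throws, a try/finally wrapper is one way to make cleanup hard to forget. This is a minimal sketch, not an SDK feature: withEngine and Destroyable are hypothetical names, and the only assumption about the engine is that it has an async destroy() method as documented above.

```typescript
// Anything with an async destroy() method, such as the engine returned by createTTS.
interface Destroyable {
  destroy(): Promise<void>;
}

// Create an engine, run the callback, and always release the engine afterwards.
async function withEngine<E extends Destroyable, R>(
  create: () => Promise<E>,
  use: (engine: E) => Promise<R>
): Promise<R> {
  const engine = await create();
  try {
    return await use(engine);
  } finally {
    // Runs on success and on error, so native resources are not leaked.
    await engine.destroy();
  }
}
```

Usage might look like withEngine(() => createTTS({ modelPath }), (tts) => tts.generateSpeech('Hello')), where modelPath is your own model configuration.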

Saving Audio

Save to File

import { saveAudioToFile } from 'react-native-sherpa-onnx/tts';

const audio = await tts.generateSpeech('Hello, world!');
await saveAudioToFile(audio, '/path/to/output.wav');
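
To make the float-to-WAV step concrete, the sketch below encodes mono float samples as a 16-bit PCM WAV byte array using the standard 44-byte RIFF header. It illustrates the file format only; this page does not say how saveAudioToFile encodes audio internally, and encodeWavPcm16 is a hypothetical helper, not an SDK function.

```typescript
// Encode mono float samples in [-1, 1] as a 16-bit PCM WAV byte array.
function encodeWavPcm16(samples: number[], sampleRate: number): Uint8Array {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, 'RIFF');
  view.setUint32(4, 36 + dataSize, true); // remaining file size after this field
  writeAscii(8, 'WAVE');
  writeAscii(12, 'fmt ');
  view.setUint32(16, 16, true); // fmt chunk size for PCM
  view.setUint16(20, 1, true); // audio format: 1 = PCM
  view.setUint16(22, 1, true); // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeAscii(36, 'data');
  view.setUint32(40, dataSize, true);

  samples.forEach((sample, i) => {
    const clamped = Math.max(-1, Math.min(1, sample));
    // Scale each half separately so -1 maps to -32768 and 1 maps to 32767.
    const scaled = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
    view.setInt16(44 + i * bytesPerSample, Math.round(scaled), true);
  });

  return new Uint8Array(buffer);
}
```

All multi-byte fields are little-endian, which is why every DataView write passes true as the last argument.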

Android: Save via SAF (Storage Access Framework)

import { saveAudioToContentUri } from 'react-native-sherpa-onnx/tts';

// User selects directory via SAF
const directoryUri = 'content://...';

const audio = await tts.generateSpeech('Hello, world!');
const fileUri = await saveAudioToContentUri(
  audio,
  directoryUri,
  'output.wav'
);

console.log('Saved to:', fileUri);

Share Audio File

import { saveAudioToFile, shareAudioFile } from 'react-native-sherpa-onnx/tts';

// Save first
const audio = await tts.generateSpeech('Hello, world!');
const filePath = '/path/to/output.wav';
await saveAudioToFile(audio, filePath);

// Share
await shareAudioFile(filePath, 'audio/wav');

Model-Specific Configuration

VITS Models

VITS models support three tuning parameters:

const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/vits-piper-en' },
  modelType: 'vits',
  modelOptions: {
    vits: {
      noiseScale: 0.667,    // Controls randomness (0.0-1.0)
      noiseScaleW: 0.8,     // Duration randomness (0.0-1.0)
      lengthScale: 1.0,     // Duration scale (larger = slower, smaller = faster)
    },
  },
});

Matcha Models

const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/matcha-en' },
  modelType: 'matcha',
  modelOptions: {
    matcha: {
      noiseScale: 0.667,
      lengthScale: 1.0,
    },
  },
});

Kokoro Models

const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/kokoro-en' },
  modelType: 'kokoro',
  modelOptions: {
    kokoro: {
      lengthScale: 1.2,  // Only lengthScale for Kokoro
    },
  },
});

Voice Cloning

Pocket and Zipvoice models support voice cloning via reference audio.

Pocket TTS (Voice Cloning)

const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/pocket-tts' },
  modelType: 'pocket',
});

// Load reference audio (3-10 seconds recommended)
const refAudio = loadReferenceAudio();  // Your audio loading function

const audio = await tts.generateSpeech(
  'This is the target text to speak.',
  {
    referenceAudio: {
      samples: refAudio.samples,  // Float32 mono [-1, 1]
      sampleRate: refAudio.sampleRate,  // Must match the reference recording
    },
    referenceText: 'This is what the reference recording says.',
    numSteps: 20,      // Flow-matching steps (higher = better quality, slower)
    speed: 1.0,
    extra: {
      temperature: '0.7',   // Randomness (0.0-1.0)
      chunk_size: '15',     // Chunk size for generation
    },
  }
);
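
The reference samples must be mono floats in [-1, 1], so a stereo recording needs to be downmixed before it is passed as referenceAudio. The sketch below shows one way to do that inside your own loading code; toMonoFloat is a hypothetical helper, and it assumes the decoded input is interleaved float PCM (left, right, left, right, ...).

```typescript
// Convert interleaved multi-channel float PCM to mono by averaging the channels
// of each frame, then clamp the result to [-1, 1].
function toMonoFloat(interleaved: number[], channels: number): number[] {
  if (!Number.isInteger(channels) || channels < 1) {
    throw new RangeError('channels must be a positive integer');
  }
  // Any trailing partial frame is dropped.
  const frames = Math.floor(interleaved.length / channels);
  const mono: number[] = new Array<number>(frames);
  for (let frame = 0; frame < frames; frame++) {
    let sum = 0;
    for (let channel = 0; channel < channels; channel++) {
      sum += interleaved[frame * channels + channel];
    }
    mono[frame] = Math.max(-1, Math.min(1, sum / channels));
  }
  return mono;
}
```

Averaging keeps the level of each channel's contribution equal; resampling, if your model needs a specific rate, is a separate step not covered here.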

Zipvoice (Voice Cloning)

const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/zipvoice-zh-en' },
  modelType: 'zipvoice',
});

const audio = await tts.generateSpeech(
  'Target speech text.',
  {
    referenceAudio: {
      samples: refAudioSamples,
      sampleRate: 24000,
    },
    referenceText: 'Transcript of reference.',
    numSteps: 20,
    speed: 1.0,
  }
);

Zipvoice streaming with voice cloning is not supported. Use generateSpeech() (batch mode) for voice cloning with Zipvoice. For Pocket TTS, both batch and streaming modes support voice cloning.

Zipvoice Memory Requirements: The full fp32 Zipvoice model (~605 MB) requires significant RAM. On devices with less than 8 GB RAM, use the int8 distill variant (sherpa-onnx-zipvoice-distill-int8-zh-en-emilia, ~104 MB) to avoid crashes. The SDK checks free memory before loading and rejects initialization if below ~800 MB.
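
The 8 GB guidance can be turned into a simple selection rule. This is a sketch under stated assumptions: chooseZipvoiceVariant is a hypothetical helper, the total RAM figure must come from a device-info source of your choice, and the default threshold of 8e9 bytes is only an approximation of the "8 GB" guidance above.

```typescript
type ZipvoiceVariant = 'full-fp32' | 'distill-int8';

// Pick a Zipvoice variant from the device's total RAM in bytes.
// Devices often report somewhat less than their nominal RAM, so the
// threshold is a parameter you may want to lower after testing.
function chooseZipvoiceVariant(
  totalRamBytes: number,
  minFullModelBytes: number = 8e9
): ZipvoiceVariant {
  return totalRamBytes >= minFullModelBytes ? 'full-fp32' : 'distill-int8';
}
```

The result only tells you which model directory to bundle or download; the SDK's own free-memory check at initialization still applies.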

Multi-Speaker Models

Some models include multiple speakers (voices).

const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/vits-multi-speaker' },
});

// Check available speakers
const numSpeakers = await tts.getNumSpeakers();
console.log('Available speakers:', numSpeakers);

// Generate with different speakers
for (let sid = 0; sid < numSpeakers; sid++) {
  const audio = await tts.generateSpeech('Hello from speaker ' + sid, { sid });
  await saveAudioToFile(audio, `/path/speaker_${sid}.wav`);
}

await tts.destroy();

Model Detection

Detect TTS model type without initializing:

import { detectTtsModel } from 'react-native-sherpa-onnx/tts';

const result = await detectTtsModel(
  { type: 'asset', path: 'models/vits-piper-en' }
);

if (result.success) {
  console.log('Detected type:', result.modelType);
  console.log('Models:', result.detectedModels);
  
  if (result.modelType === 'vits' || result.modelType === 'matcha') {
    // Show noise/length scale options in UI
  }
}
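
The detected model type can drive which tuning controls a settings screen shows. The sketch below encodes the Config Options column of the Supported Model Types table as a lookup; TUNING_OPTIONS and tuningOptionsFor are illustrative names, not SDK exports.

```typescript
type TtsModelType = 'vits' | 'matcha' | 'kokoro' | 'kitten' | 'pocket' | 'zipvoice';
type TuningOption = 'noiseScale' | 'noiseScaleW' | 'lengthScale';

// Tuning options per model type, taken from the Supported Model Types table.
// Pocket and Zipvoice are configured through generation options instead.
const TUNING_OPTIONS: Record<TtsModelType, TuningOption[]> = {
  vits: ['noiseScale', 'noiseScaleW', 'lengthScale'],
  matcha: ['noiseScale', 'lengthScale'],
  kokoro: ['lengthScale'],
  kitten: ['lengthScale'],
  pocket: [],
  zipvoice: [],
};

// Return the tuning options for a detected type, or an empty list for unknown types.
function tuningOptionsFor(modelType: string): TuningOption[] {
  return Object.prototype.hasOwnProperty.call(TUNING_OPTIONS, modelType)
    ? TUNING_OPTIONS[modelType as TtsModelType]
    : [];
}
```

With this, a UI can render one slider per entry returned by tuningOptionsFor(result.modelType) instead of branching on each type.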

Performance Optimization

Threading

const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/vits-piper-en' },
  numThreads: 4,  // More threads can speed up generation, at the cost of higher CPU usage
});

Hardware Acceleration

import { getCoreMlSupport } from 'react-native-sherpa-onnx';
import { createTTS } from 'react-native-sherpa-onnx/tts';

// iOS: use Core ML when an accelerator is available, otherwise fall back to CPU
const coremlSupport = await getCoreMlSupport();
const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/vits-piper-en' },
  provider: coremlSupport.hasAccelerator ? 'coreml' : 'cpu',
});

Speed Control

Adjust speech speed at generation time:

// Slower speech
const slowAudio = await tts.generateSpeech('Hello', { speed: 0.75 });

// Faster speech
const fastAudio = await tts.generateSpeech('Hello', { speed: 1.5 });
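
Because speed is a rate multiplier, the expected clip length scales roughly with 1 / speed. The sketch below works through that arithmetic; estimateDurationSeconds is a hypothetical helper, and the result is only an estimate since actual output length depends on the model.

```typescript
// Estimate how long a clip will be at a given speed, given its length at speed 1.0.
function estimateDurationSeconds(baseSeconds: number, speed: number): number {
  if (!(speed > 0)) {
    throw new RangeError('speed must be greater than 0');
  }
  return baseSeconds / speed;
}
```

A clip that lasts 4 seconds at speed 1.0 would be estimated at 8 seconds at speed 0.5 and 2 seconds at speed 2.0.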

Common Use Cases

Generate and Play

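import RNFS from 'react-native-fs';
import Sound from 'react-native-sound';
import { saveAudioToFile } from 'react-native-sherpa-onnx/tts';

const audio = await tts.generateSpeech('Hello, world!');

// Save to a temporary file in the cache directory
const tempPath = `${RNFS.CachesDirectoryPath}/temp.wav`;
await saveAudioToFile(audio, tempPath);

// Load the file, play it, and release the player when playback ends
const sound = new Sound(tempPath, '', (error) => {
  if (error) {
    console.warn('Failed to load audio:', error);
    return;
  }
  sound.play(() => sound.release());
});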

Batch Generation

const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/vits-piper-en' },
});

const phrases = [
  'Good morning',
  'How are you?',
  'Have a nice day',
];

for (const [index, phrase] of phrases.entries()) {
  const audio = await tts.generateSpeech(phrase);
  await saveAudioToFile(audio, `/path/phrase_${index}.wav`);
}

await tts.destroy();
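
If the phrases should end up in a single file rather than one file each, the generated clips can be joined in memory before saving. The sketch below concatenates clips with a short silence between them; concatClips and AudioClip are illustrative names, and the 0.3 second default gap is an arbitrary choice.

```typescript
// Same shape as the audio returned by generateSpeech().
interface AudioClip {
  samples: number[];
  sampleRate: number;
}

// Join clips into one, inserting gapSeconds of silence between consecutive clips.
function concatClips(clips: AudioClip[], gapSeconds: number = 0.3): AudioClip {
  if (clips.length === 0) {
    throw new Error('at least one clip is required');
  }
  const sampleRate = clips[0].sampleRate;
  const gapLength = Math.max(0, Math.round(gapSeconds * sampleRate));
  const samples: number[] = [];
  clips.forEach((clip, index) => {
    if (clip.sampleRate !== sampleRate) {
      throw new Error('all clips must share the same sample rate');
    }
    if (index > 0) {
      for (let i = 0; i < gapLength; i++) samples.push(0); // silence
    }
    // Push in a loop; spreading a large array into push() can overflow the call stack.
    for (const sample of clip.samples) samples.push(sample);
  });
  return { samples, sampleRate };
}
```

Clips from the same engine share a sample rate, so the rate check mainly guards against mixing output from different models.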

Dynamic Speaker Selection

const tts = await createTTS({
  modelPath: { type: 'asset', path: 'models/kokoro-multi' },
});

const numSpeakers = await tts.getNumSpeakers();

// User selects speaker
const selectedSpeaker = 2;

if (selectedSpeaker < numSpeakers) {
  const audio = await tts.generateSpeech(
    'User-entered text',
    { sid: selectedSpeaker }
  );
}
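
The bounds check above can be factored into a small function that also handles single-speaker models, where getNumSpeakers() may report 0 or 1. This is a sketch; resolveSpeakerId is a hypothetical helper, and falling back to speaker 0 (the documented default sid) is a design choice you may prefer to replace with an error.

```typescript
// Return a speaker ID that is safe to pass as sid, falling back to 0 when the
// requested ID is not a valid index for the model.
function resolveSpeakerId(requested: number, numSpeakers: number): number {
  // 0 or 1 reported speakers both mean a single-speaker model: only sid 0 applies.
  const count = Math.max(1, Math.floor(numSpeakers));
  if (!Number.isInteger(requested) || requested < 0 || requested >= count) {
    return 0;
  }
  return requested;
}
```

The result can be passed straight through as { sid: resolveSpeakerId(selectedSpeaker, numSpeakers) }.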

Troubleshooting

Error: TTS initialization failed

Verify model directory exists and contains required files
For VITS: need model.onnx, tokens.txt, espeak-ng-data (some models)
For Zipvoice: need encoder, decoder, vocoder, tokens, lexicon, espeak-ng-data
Try modelType: 'auto' for automatic detection
Enable debug: true for detailed logs

Out of memory with Zipvoice

The full Zipvoice model (~605 MB) requires significant RAM:

Use the int8 distill variant: sherpa-onnx-zipvoice-distill-int8-zh-en-emilia (~104 MB)
Close other apps to free memory
Target devices with 8+ GB RAM for full model

Audio sounds robotic or poor quality

Adjust noiseScale (VITS/Matcha): try 0.667-1.0
Adjust lengthScale: values close to 1.0 are more natural
Try a larger/better model
Increase numSteps for flow-matching models (Pocket)

Speech too fast or too slow

Use the speed parameter at generation time:

const audio = await tts.generateSpeech('Text', { speed: 0.8 });  // Slower

Or adjust lengthScale in model options, which applies to all subsequent generations until changed.

Voice cloning not working

Ensure model supports voice cloning (Pocket, Zipvoice)
Reference audio should be 3-10 seconds, clear, mono
Provide accurate referenceText transcript
For Zipvoice, use generateSpeech() not streaming
Increase numSteps for better quality
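
Several of the checks above can be automated before calling generateSpeech(). The sketch below validates a reference clip against the guidance in this list (3-10 seconds, samples in [-1, 1], non-empty transcript); checkReferenceClip and ReferenceClip are illustrative names, and the limits are the recommendations from this page rather than hard SDK constraints.

```typescript
// Same shape as the referenceAudio generation option.
interface ReferenceClip {
  samples: number[];
  sampleRate: number;
}

// Return a list of human-readable problems; an empty list means the clip looks usable.
function checkReferenceClip(clip: ReferenceClip, referenceText: string): string[] {
  const problems: string[] = [];
  if (clip.sampleRate <= 0) {
    problems.push('sampleRate must be positive');
    return problems; // duration cannot be computed without a valid rate
  }
  const seconds = clip.samples.length / clip.sampleRate;
  if (seconds < 3 || seconds > 10) {
    problems.push(`duration is ${seconds.toFixed(1)} s; 3-10 s is recommended`);
  }
  if (clip.samples.some((s) => !Number.isFinite(s) || Math.abs(s) > 1)) {
    problems.push('samples must be finite floats in [-1, 1]');
  }
  if (referenceText.trim().length === 0) {
    problems.push('referenceText is required when referenceAudio is used');
  }
  return problems;
}
```

The function cannot tell whether the clip is actually mono or whether the transcript is accurate, so those two points still need a manual check.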

Next Steps

Streaming TTS: Low-latency streaming generation
Model Setup: Learn how to bundle and load models
Speech-to-Text: Transcribe audio to text
Execution Providers: Hardware acceleration options
