The useTextToSpeech hook converts text into natural-sounding speech using the Kokoro TTS model. It supports both complete audio generation and streaming playback for real-time applications.

Basic Usage

import { useTextToSpeech } from 'react-native-executorch';
import { View, Text, Button } from 'react-native';

function TextReader() {
  const { forward, isReady, error } = useTextToSpeech({
    model: {
      type: 'kokoro',
      durationPredictorSource: require('./models/duration-predictor.pte'),
      synthesizerSource: require('./models/synthesizer.pte'),
    },
    voice: {
      lang: 'en-us',
      voiceSource: require('./voices/en-us-voice.bin'),
      extra: {
        taggerSource: require('./models/tagger.pte'),
        lexiconSource: require('./models/lexicon.bin'),
      },
    },
  });

  const speak = async () => {
    if (!isReady) return;

    const audio = await forward({
      text: 'Hello, this is a text to speech demo.',
      speed: 1.0,
    });

    // Play the audio using your audio player
    console.log('Generated audio samples:', audio.length);
  };

  return (
    <View>
      {error && <Text>Error: {error.message}</Text>}
      <Button onPress={speak} title="Speak" disabled={!isReady} />
    </View>
  );
}

Hook Signature

useTextToSpeech(props)

function useTextToSpeech(props: TextToSpeechProps): TextToSpeechType;

Parameters

model
KokoroConfig
required
Kokoro TTS model configuration.
voice
VoiceConfig
required
Voice configuration including language and embeddings.
preventLoad
boolean
default:"false"
Prevent automatic model loading on mount. Useful for lazy loading scenarios.

Returns

error
RnExecutorchError | null
Contains error details if model loading or generation fails.
isReady
boolean
Indicates whether the model has loaded successfully and is ready for synthesis.
isGenerating
boolean
Indicates whether audio generation is currently in progress.
downloadProgress
number
Download progress as a value between 0 and 1.
forward
(input: TextToSpeechInput) => Promise<Float32Array>
Generate complete audio for the given text in a single pass. Returns 22.05 kHz mono audio.
stream
(input: TextToSpeechStreamingInput) => Promise<void>
Generate audio incrementally with callbacks for real-time playback. Best suited for long text.
streamStop
() => void
Stop the current streaming generation process.

Generation Methods

Complete Audio Generation

Generate the entire audio at once:
const { forward, isReady } = useTextToSpeech({ model, voice });

const audio = await forward({
  text: 'Welcome to React Native ExecuTorch.',
  speed: 1.0, // Normal speed
});

// audio is a Float32Array of 22.05 kHz mono samples
console.log('Sample rate: 22050 Hz');
console.log('Duration:', audio.length / 22050, 'seconds');

Streaming Audio Generation

Generate and play audio incrementally:
const { stream, isReady } = useTextToSpeech({ model, voice });

await stream({
  text: 'This is a longer text that will be synthesized in chunks.',
  speed: 1.2, // 20% faster
  onBegin: async () => {
    console.log('Starting audio generation...');
    // Initialize audio player
  },
  onNext: async (audioChunk: Float32Array) => {
    console.log('Received chunk:', audioChunk.length, 'samples');
    // Play chunk immediately
    await audioPlayer.playChunk(audioChunk);
  },
  onEnd: async () => {
    console.log('Audio generation complete');
    // Cleanup
  },
});

Types

TextToSpeechInput

Input for audio generation:
interface TextToSpeechInput {
  text: string; // Text to synthesize
  speed?: number; // Speed multiplier (default: 1.0)
}

TextToSpeechStreamingInput

Input for streaming generation with lifecycle callbacks:
interface TextToSpeechStreamingInput extends TextToSpeechInput {
  onBegin?: () => void | Promise<void>; // Called when generation starts
  onNext?: (audio: Float32Array) => void | Promise<void>; // Called for each chunk
  onEnd?: () => void | Promise<void>; // Called when generation completes
}

TextToSpeechLanguage

Supported language codes:
type TextToSpeechLanguage =
  | 'en-us' // American English
  | 'en-gb'; // British English

VoiceConfig

Voice configuration structure:
interface VoiceConfig {
  lang: TextToSpeechLanguage;
  voiceSource: ResourceSource;
  extra?: KokoroVoiceExtras;
}

KokoroVoiceExtras

Kokoro-specific voice resources:
interface KokoroVoiceExtras {
  taggerSource: ResourceSource; // Phoneme tagger model
  lexiconSource: ResourceSource; // Pronunciation lexicon
}

KokoroConfig

Kokoro TTS model configuration:
interface KokoroConfig {
  type: 'kokoro';
  durationPredictorSource: ResourceSource;
  synthesizerSource: ResourceSource;
}

Audio Format

The generated audio has the following characteristics:
  • Sample rate: 22,050 Hz (22.05 kHz)
  • Channels: Mono (single channel)
  • Data type: Float32Array
  • Value range: -1.0 to 1.0 (normalized)
  • Buffer layout: Contiguous samples in time order
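These properties can be checked directly on a returned buffer. The sketch below derives duration and peak level; the `audioStats` helper is illustrative, not part of the library:

```typescript
// Derive basic stats from the 22,050 Hz mono Float32Array format described above.
const SAMPLE_RATE = 22050;

function audioStats(samples: Float32Array): { durationSeconds: number; peak: number } {
  let peak = 0;
  for (let i = 0; i < samples.length; i++) {
    // Samples are normalized, so the peak should stay within [0, 1]
    peak = Math.max(peak, Math.abs(samples[i]));
  }
  return {
    // Contiguous mono samples in time order: duration is just length / rate
    durationSeconds: samples.length / SAMPLE_RATE,
    peak,
  };
}
```

A peak above 1.0 would indicate the buffer is not normalized as documented.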

Playing Generated Audio

Example using a typical audio player:
import { Audio } from 'expo-av';

const { forward } = useTextToSpeech({ model, voice });

const speakText = async (text: string) => {
  // Generate audio
  const audioData = await forward({ text, speed: 1.0 });

  // Convert the Float32Array to a playable source (e.g. a WAV data URI);
  // convertToAudioDataUri stands in for your own conversion helper
  const audioUri = convertToAudioDataUri(audioData, 22050);

  // Play audio (expo-av expects a URI string here)
  const sound = new Audio.Sound();
  await sound.loadAsync({ uri: audioUri });
  await sound.playAsync();
};
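If your player needs a file rather than raw samples, one option is to wrap the Float32Array in a minimal 16-bit PCM WAV container. The `encodeWav` helper below is a sketch, not part of react-native-executorch; base64-encode its output to build a data URI for a player like expo-av:

```typescript
// Encode 22,050 Hz mono Float32Array samples as a 16-bit PCM WAV byte buffer.
function encodeWav(samples: Float32Array, sampleRate = 22050): Uint8Array {
  const dataSize = samples.length * 2; // 16-bit PCM: 2 bytes per sample
  const buffer = new ArrayBuffer(44 + dataSize); // 44-byte WAV header + data
  const view = new DataView(buffer);
  const writeString = (offset: number, s: string) => {
    for (let i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i));
  };

  writeString(0, 'RIFF');
  view.setUint32(4, 36 + dataSize, true); // RIFF chunk size = file size - 8
  writeString(8, 'WAVE');
  writeString(12, 'fmt ');
  view.setUint32(16, 16, true);           // fmt chunk size
  view.setUint16(20, 1, true);            // audio format: PCM
  view.setUint16(22, 1, true);            // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true); // byte rate = rate * block align
  view.setUint16(32, 2, true);            // block align = channels * 2 bytes
  view.setUint16(34, 16, true);           // bits per sample
  writeString(36, 'data');
  view.setUint32(40, dataSize, true);

  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp to [-1, 1]
    view.setInt16(44 + i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return new Uint8Array(buffer);
}
```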

Advanced Usage

Speed Control

Adjust speech rate for different contexts:
// Slower speech for clarity (0.8x speed)
await forward({ text: 'Important instructions here.', speed: 0.8 });

// Normal speed (1.0x)
await forward({ text: 'Regular conversation.', speed: 1.0 });

// Faster speech for quick playback (1.5x speed)
await forward({ text: 'Quick summary.', speed: 1.5 });

Streaming with Progress Tracking

function TTSWithProgress() {
  const [progress, setProgress] = useState(0);
  const [totalChunks, setTotalChunks] = useState(0);
  const { stream } = useTextToSpeech({ model, voice });

  const speakWithTracking = async (text: string) => {
    let chunkCount = 0;

    await stream({
      text,
      onBegin: async () => {
        setProgress(0);
        setTotalChunks(0);
      },
      onNext: async (audioChunk) => {
        chunkCount++;
        setTotalChunks(chunkCount);
        setProgress((prev) => prev + audioChunk.length);
        
        // Play chunk
        await playAudioChunk(audioChunk);
      },
      onEnd: async () => {
        console.log(`Completed ${chunkCount} chunks`);
      },
    });
  };

  return (
    <View>
      <Text>Chunks: {totalChunks}</Text>
      <Text>Samples: {progress}</Text>
    </View>
  );
}

Multiple Voices

Switch between different voice configurations:
const americanVoice: VoiceConfig = {
  lang: 'en-us',
  voiceSource: require('./voices/en-us-male.bin'),
  extra: {
    taggerSource: require('./models/tagger.pte'),
    lexiconSource: require('./models/en-us-lexicon.bin'),
  },
};

const britishVoice: VoiceConfig = {
  lang: 'en-gb',
  voiceSource: require('./voices/en-gb-female.bin'),
  extra: {
    taggerSource: require('./models/tagger.pte'),
    lexiconSource: require('./models/en-gb-lexicon.bin'),
  },
};

// Use different hooks for different voices
const american = useTextToSpeech({ model, voice: americanVoice });
const british = useTextToSpeech({ model, voice: britishVoice });

Interrupting Playback

const { stream, streamStop } = useTextToSpeech({ model, voice });

// Start streaming (keep the promise if you need to await completion)
const speakPromise = stream({
  text: 'This is a very long text that will take time to synthesize...',
  onNext: async (chunk) => {
    await playAudioChunk(chunk);
  },
});

// Stop mid-stream
const handleStop = () => {
  streamStop(); // Interrupts generation
  stopAudioPlayback(); // Stop playing audio
};

Error Handling

const { forward, error, isReady } = useTextToSpeech({ model, voice });

if (error) {
  console.error('TTS Error:', error.message);
  // Handle specific error codes
}

try {
  const audio = await forward({ text: 'Hello world' });
} catch (err) {
  if (err.code === 'MODULE_NOT_LOADED') {
    console.error('Model not ready yet');
  } else if (err.code === 'MODEL_GENERATING') {
    console.error('Already generating audio');
  } else {
    console.error('Generation failed:', err.message);
  }
}
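In strict TypeScript the caught value is typed `unknown`, so reading `err.code` requires narrowing first. A small type guard sketch (a hypothetical helper, not a library export):

```typescript
// Narrow an unknown caught value to an error shape carrying a string code,
// matching the { code, message } fields used in the example above.
function isRnExecutorchError(e: unknown): e is { code: string; message: string } {
  return (
    typeof e === 'object' &&
    e !== null &&
    typeof (e as { code?: unknown }).code === 'string' &&
    typeof (e as { message?: unknown }).message === 'string'
  );
}
```

Inside the `catch` block, `if (isRnExecutorchError(err))` then makes `err.code` safe to switch on.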

Best Practices

  1. Text Length: For long text, use streaming mode to start playback sooner and reduce memory usage.
  2. Speed Range: Keep speed between 0.5 and 2.0 for natural-sounding speech. Extreme values may degrade quality.
  3. Memory Management: Clear audio buffers after playback to free memory, especially for long content.
  4. Error Recovery: Always check isReady before calling forward() or stream().
  5. Concurrent Requests: The hook prevents concurrent generation. Wait for completion or use streamStop() before starting new generation.
  6. Text Preprocessing: Clean up text (remove special characters, normalize numbers) for better pronunciation.
  7. Resource Caching: Models and voices are cached after first download. Reuse the same sources to avoid re-downloading.
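The text preprocessing in point 6 can be sketched as a small cleanup pass. This is an illustrative helper, not a library API; adapt the rules to your content:

```typescript
// Normalize raw text before synthesis: collapse whitespace, strip
// markdown-style symbols, and spell out decimal points for clearer reading.
function normalizeForTts(text: string): string {
  return text
    .replace(/\s+/g, ' ')                  // collapse runs of whitespace
    .replace(/[*_#`~]/g, '')               // strip markdown-style symbols
    .replace(/(\d)\.(\d)/g, '$1 point $2') // "3.5" -> "3 point 5"
    .trim();
}
```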

Performance Tips

  • Streaming vs. Complete: Use streaming for text longer than a few sentences to reduce perceived latency.
  • Chunk Processing: Process audio chunks asynchronously to maintain smooth playback.
  • Preload Models: Set preventLoad: false (default) to load models on component mount.
  • Voice Selection: Choose appropriate voice embeddings for your use case (male/female, accent, etc.).
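One way to process chunks asynchronously while keeping playback ordered is to serialize them through a promise chain, so `onNext` returns immediately and generation never waits on playback. A sketch (the `ChunkQueue` class and its `play` callback are assumptions, not library API):

```typescript
// Serialize audio chunks: enqueue without awaiting, chunks play in order.
class ChunkQueue {
  private chain: Promise<void> = Promise.resolve();

  constructor(private play: (chunk: Float32Array) => Promise<void>) {}

  // Append a chunk to the playback chain; returns immediately.
  enqueue(chunk: Float32Array): void {
    this.chain = this.chain.then(() => this.play(chunk));
  }

  // Resolves once every enqueued chunk has finished playing.
  drain(): Promise<void> {
    return this.chain;
  }
}
```

Call `enqueue` from `onNext` without awaiting, then `await queue.drain()` in `onEnd` before cleanup.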

Common Use Cases

Audio Book Reader

function AudioBookReader({ chapters }: { chapters: string[] }) {
  const { stream, isReady } = useTextToSpeech({ model, voice });
  const [currentChapter, setCurrentChapter] = useState(0);

  const readChapter = async (chapterText: string) => {
    await stream({
      text: chapterText,
      speed: 1.1, // Slightly faster for continuous listening
      onNext: async (chunk) => {
        await playAudioChunk(chunk);
      },
      onEnd: async () => {
        // Auto-advance to next chapter
        if (currentChapter < chapters.length - 1) {
          setCurrentChapter((prev) => prev + 1);
        }
      },
    });
  };

  return <AudioPlayer onPlay={() => readChapter(chapters[currentChapter])} />;
}

Accessibility Screen Reader

function ScreenReader({ content }: { content: string }) {
  const { forward, isReady } = useTextToSpeech({ model, voice });

  const speak = async () => {
    const audio = await forward({
      text: content,
      speed: 1.0,
    });
    await playAudio(audio);
  };

  return (
    <TouchableOpacity onPress={speak} disabled={!isReady}>
      <Text>{content}</Text>
    </TouchableOpacity>
  );
}
