Core Features
Speech to Text (STT)
Convert spoken audio into text using on-device Whisper models. Supports both single-pass transcription and streaming modes with real-time results.
- Multilingual support: 96+ languages with automatic language detection
- Streaming transcription: Real-time audio processing with committed and non-committed results
- Word-level timestamps: Detailed timing information when verbose mode is enabled
- Audio format: Requires 16kHz mono audio as Float32Array
Text to Speech (TTS)
Generate natural-sounding speech from text using the Kokoro TTS model. Supports both complete audio generation and streaming playback.
- Multiple voices: English (US and GB) with customizable voice embeddings
- Speed control: Adjustable speech rate for different use cases
- Streaming mode: Start playback before full synthesis completes
- Audio output: Returns 22kHz mono audio as Float32Array
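To persist generated audio, the Float32Array output can be wrapped in a standard WAV container. A minimal sketch, assuming the 22kHz figure corresponds to the common 22,050 Hz rate; the `floatTo16BitWav` helper below is illustrative, not part of the library API:

```typescript
// Wrap mono Float32Array samples in a 16-bit PCM WAV container.
// `sampleRate` defaults to 22050 to match the TTS output described above.
function floatTo16BitWav(samples: Float32Array, sampleRate = 22050): Uint8Array {
  const headerSize = 44;
  const dataSize = samples.length * 2; // 16-bit = 2 bytes per sample
  const buffer = new ArrayBuffer(headerSize + dataSize);
  const view = new DataView(buffer);

  const writeAscii = (offset: number, text: string) => {
    for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i));
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true);   // remaining chunk size
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true);             // fmt chunk size
  view.setUint16(20, 1, true);              // PCM format
  view.setUint16(22, 1, true);              // mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true); // byte rate
  view.setUint16(32, 2, true);              // block align
  view.setUint16(34, 16, true);             // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  // Clamp to [-1.0, 1.0] and convert to signed 16-bit little-endian.
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(headerSize + i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return new Uint8Array(buffer);
}
```

The resulting bytes can be written to a file or fed to any WAV-capable player.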
Voice Activity Detection (VAD)
Detect speech segments in audio streams with precise timestamp boundaries. Essential for building voice-activated features and efficient audio processing.
- Segment detection: Identifies start and end times of speech activity
- Low latency: Fast on-device processing for real-time applications
- Audio format: Requires 16kHz mono audio as Float32Array
- Timestamp precision: Returns segments in seconds
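Segment timestamps can be mapped back to sample indices for downstream processing. A hedged sketch, assuming segments arrive as `{ start, end }` objects in seconds (the exact result shape may vary):

```typescript
interface SpeechSegment {
  start: number; // seconds
  end: number;   // seconds
}

const SAMPLE_RATE = 16_000;

// Extract only the speech portions of a 16kHz mono buffer,
// given VAD segments expressed in seconds.
function extractSpeech(audio: Float32Array, segments: SpeechSegment[]): Float32Array[] {
  return segments.map(({ start, end }) => {
    const from = Math.max(0, Math.floor(start * SAMPLE_RATE));
    const to = Math.min(audio.length, Math.ceil(end * SAMPLE_RATE));
    return audio.subarray(from, to); // zero-copy view into the original buffer
  });
}
```

Each returned chunk can be passed directly to transcription, so silence is never processed.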
Audio Format Requirements
All speech and audio models require audio in a specific format:
- Sample rate: 16kHz (except TTS output, which is 22kHz)
- Channels: Mono (single channel)
- Data type: Float32Array with values normalized between -1.0 and 1.0
- Buffer format: Contiguous samples in time order
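Microphone APIs commonly deliver signed 16-bit integer PCM; a minimal sketch of converting such a buffer into the normalized Float32Array format described above (the `int16ToFloat32` helper is illustrative, not part of the library):

```typescript
// Convert signed 16-bit PCM samples to Float32Array in [-1.0, 1.0].
function int16ToFloat32(pcm: Int16Array): Float32Array {
  const out = new Float32Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) {
    // Divide by 32768 so that -32768 maps exactly to -1.0.
    out[i] = pcm[i] / 32768;
  }
  return out;
}
```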
Common Use Cases
Voice Assistant
Combine VAD, STT, and TTS to build a complete voice assistant that listens, understands, and responds.
Live Transcription
Use streaming STT with VAD to provide real-time captions for meetings, lectures, or media.
Audio Books
Generate natural-sounding narration from text content with speed control.
Voice Commands
Detect when users speak and transcribe commands for hands-free interaction.
Best Practices
Audio Preprocessing
- Always resample audio to 16kHz before processing
- Convert stereo audio to mono by averaging channels
- Normalize audio samples to the range [-1.0, 1.0]
- Remove DC offset and apply appropriate filtering
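The steps above can be sketched with plain Float32Array helpers. Note that the linear-interpolation resampler is a simplification; production code should apply a low-pass (anti-aliasing) filter before downsampling:

```typescript
// Average interleaved stereo samples (L, R, L, R, ...) down to mono.
function stereoToMono(interleaved: Float32Array): Float32Array {
  const mono = new Float32Array(interleaved.length / 2);
  for (let i = 0; i < mono.length; i++) {
    mono[i] = (interleaved[2 * i] + interleaved[2 * i + 1]) / 2;
  }
  return mono;
}

// Subtract the mean so the signal is centered on zero (DC offset removal).
function removeDcOffset(audio: Float32Array): Float32Array {
  let mean = 0;
  for (const s of audio) mean += s;
  mean /= audio.length;
  return audio.map((s) => s - mean);
}

// Naive linear-interpolation resampler (no anti-aliasing filter).
function resampleLinear(audio: Float32Array, fromRate: number, toRate: number): Float32Array {
  const outLength = Math.round((audio.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const ratio = fromRate / toRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * ratio;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, audio.length - 1);
    const frac = pos - i0;
    out[i] = audio[i0] * (1 - frac) + audio[i1] * frac;
  }
  return out;
}
```

A typical pipeline chains these: `resampleLinear(removeDcOffset(stereoToMono(raw)), inputRate, 16000)`.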
Memory Management
- Reuse Float32Array buffers when possible to reduce allocations
- Process audio in chunks for long recordings
- Clean up resources when components unmount
- Monitor download progress for large model files
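Chunked processing and reduced allocation can be combined by iterating zero-copy views over a long recording; a minimal sketch:

```typescript
// Yield fixed-size windows over a long recording as zero-copy views.
// The final chunk may be shorter than `chunkSize`.
function* audioChunks(audio: Float32Array, chunkSize: number): Generator<Float32Array> {
  for (let offset = 0; offset < audio.length; offset += chunkSize) {
    yield audio.subarray(offset, Math.min(offset + chunkSize, audio.length));
  }
}
```

Because `subarray` returns a view into the same underlying buffer, no per-chunk allocation occurs; call `chunk.slice()` explicitly if a chunk must outlive the source buffer.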
Performance Optimization
- Use streaming modes for real-time requirements
- Batch audio processing when latency is not critical
- Leverage VAD to process only speech segments
- Cache models to avoid repeated downloads
Model Downloads
All speech models support automatic downloading from remote sources.
Error Handling
All speech hooks provide error states for handling failures.
Next Steps
Speech to Text
Convert audio to text
Text to Speech
Generate spoken audio
Voice Activity Detection
Detect speech segments