The useSpeechToText hook provides on-device speech recognition powered by Whisper models. It supports both single-pass transcription and real-time streaming with word-level timestamps.

Basic Usage

import { useSpeechToText } from 'react-native-executorch';
import { Button, Text, View } from 'react-native';

function VoiceRecorder() {
  const { transcribe, isReady, error } = useSpeechToText({
    model: {
      isMultilingual: false,
      encoderSource: require('./models/whisper-encoder.pte'),
      decoderSource: require('./models/whisper-decoder.pte'),
      tokenizerSource: require('./models/tokenizer.json'),
    },
  });

  const handleTranscribe = async (audioBuffer: Float32Array) => {
    if (!isReady) return;

    const result = await transcribe(audioBuffer, {
      language: 'en',
      verbose: true,
    });

    console.log('Transcription:', result.text);
    console.log('Language:', result.language);
    console.log('Duration:', result.duration);
  };

  return (
    <View>
      {error && <Text>Error: {error.message}</Text>}
      {isReady ? (
        // recordedAudio: a 16kHz mono Float32Array captured elsewhere in your app
        <Button onPress={() => handleTranscribe(recordedAudio)} title="Transcribe" />
      ) : (
        <Text>Loading model...</Text>
      )}
    </View>
  );
}

Hook Signature

useSpeechToText(props)

function useSpeechToText(props: SpeechToTextProps): SpeechToTextType;

Parameters

model
SpeechToTextModelConfig
required
Configuration object containing model sources.
preventLoad
boolean
default: false
Prevent automatic model loading on mount. Useful for lazy loading scenarios.

Returns

error
RnExecutorchError | null
Contains error details if model loading or inference fails.
isReady
boolean
Indicates whether the model has loaded successfully and is ready for transcription.
isGenerating
boolean
Indicates whether a transcription is currently in progress.
downloadProgress
number
Download progress as a value between 0 and 1.
transcribe
(waveform: Float32Array, options?: DecodingOptions) => Promise<TranscriptionResult>
Transcribe audio in a single pass. Accepts a 16kHz mono audio waveform and optional decoding options.
stream
(options?: DecodingOptions) => AsyncGenerator<StreamResult>
Start a streaming transcription session. Returns an async generator yielding committed and non-committed results.
streamInsert
(waveform: Float32Array) => void
Insert audio chunks into the active streaming session. Audio must be 16kHz mono.
streamStop
() => void
Stop the current streaming session and finalize transcription.
encode
(waveform: Float32Array) => Promise<Float32Array>
Run only the encoder on the audio waveform. Advanced usage for custom decoding.
decode
(tokens: Int32Array, encoderOutput: Float32Array) => Promise<Float32Array>
Run only the decoder. Advanced usage for custom decoding strategies.

Transcription Methods

Single-Pass Transcription

Transcribe complete audio files or recordings:
const { transcribe, isReady } = useSpeechToText({ model });

const result = await transcribe(audioBuffer, {
  language: 'en',
  verbose: true,
});

console.log(result.text); // Full transcription text
console.log(result.duration); // Audio duration in seconds
console.log(result.language); // Detected/specified language

Streaming Transcription

For real-time transcription with live audio:
const { stream, streamInsert, streamStop, isReady } = useSpeechToText({ model });

const startStreaming = async () => {
  // Start the stream
  const generator = stream({ language: 'en' });

  // Process results as they arrive
  for await (const result of generator) {
    console.log('Committed:', result.committed.text);
    console.log('Non-committed:', result.nonCommitted.text);
  }
};

// Elsewhere (e.g. from your recorder callback), feed 16kHz mono chunks
// while the stream is active
streamInsert(audioChunk1);
streamInsert(audioChunk2);
streamInsert(audioChunk3);

// Stop and finalize
streamStop();

Types

DecodingOptions

Options for controlling transcription behavior:
interface DecodingOptions {
  language?: SpeechToTextLanguage; // 'en', 'es', 'fr', etc.
  verbose?: boolean; // Include segments and timestamps
}

TranscriptionResult

The result object returned from transcription:
interface TranscriptionResult {
  task?: 'transcribe' | 'stream';
  language: string; // Language code (e.g., 'en')
  duration: number; // Audio duration in seconds
  text: string; // Complete transcription text
  segments?: TranscriptionSegment[]; // Present when verbose=true
}

TranscriptionSegment

Detailed segment information (when verbose: true):
interface TranscriptionSegment {
  start: number; // Start time in seconds
  end: number; // End time in seconds
  text: string; // Segment text
  words?: Word[]; // Word-level timestamps
  tokens: number[]; // Raw token IDs
  temperature: number; // Generation temperature
  avgLogprob: number; // Average log probability
  compressionRatio: number; // Compression ratio
}

Word

Word-level timestamp information:
interface Word {
  word: string; // The word text
  start: number; // Start time in seconds
  end: number; // End time in seconds
}
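Word and segment timestamps are plain seconds. If you want to emit subtitles, SRT files expect them in `HH:MM:SS,mmm` form; a small conversion helper (illustrative, not part of the library):

```typescript
// Convert a timestamp in seconds to SRT format: HH:MM:SS,mmm
function toSrtTimestamp(seconds: number): string {
  const ms = Math.round(seconds * 1000);
  const h = Math.floor(ms / 3_600_000);
  const m = Math.floor((ms % 3_600_000) / 60_000);
  const s = Math.floor((ms % 60_000) / 1000);
  const rem = ms % 1000;
  const pad = (n: number, w: number) => String(n).padStart(w, '0');
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(rem, 3)}`;
}
```

For example, `toSrtTimestamp(segment.start)` and `toSrtTimestamp(segment.end)` give the two halves of an SRT cue's timing line.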

Supported Languages

For multilingual models (isMultilingual: true), you can specify any of these language codes: af, sq, ar, hy, az, eu, be, bn, bs, bg, my, ca, zh, hr, cs, da, nl, et, en, fi, fr, gl, ka, de, el, gu, ht, he, hi, hu, is, id, it, ja, kn, kk, km, ko, lo, lv, lt, mk, mg, ms, ml, mt, mr, ne, no, fa, pl, pt, pa, ro, ru, sr, si, sk, sl, es, su, sw, sv, tl, tg, ta, te, th, tr, uk, ur, uz, vi, cy, yi.

For English-only models (isMultilingual: false), only 'en' is supported.
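If language codes come from user input or configuration, it can help to validate them before passing one as DecodingOptions.language. A small guard built from the list above (illustrative, not part of the library):

```typescript
// Language codes accepted by multilingual Whisper models (from the list above)
const MULTILINGUAL_CODES = new Set([
  'af','sq','ar','hy','az','eu','be','bn','bs','bg','my','ca','zh','hr','cs',
  'da','nl','et','en','fi','fr','gl','ka','de','el','gu','ht','he','hi','hu',
  'is','id','it','ja','kn','kk','km','ko','lo','lv','lt','mk','mg','ms','ml',
  'mt','mr','ne','no','fa','pl','pt','pa','ro','ru','sr','si','sk','sl','es',
  'su','sw','sv','tl','tg','ta','te','th','tr','uk','ur','uz','vi','cy','yi',
]);

// True when `code` is usable with the given model configuration
function isLanguageSupported(code: string, isMultilingual: boolean): boolean {
  return isMultilingual ? MULTILINGUAL_CODES.has(code) : code === 'en';
}
```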

Audio Format Requirements

All audio input must be in the correct format or transcription will fail.
  • Sample rate: 16kHz (16,000 samples per second)
  • Channels: Mono (single channel)
  • Data type: Float32Array
  • Value range: -1.0 to 1.0 (normalized)
  • Buffer layout: Contiguous samples in time order
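Most of these requirements can be checked at runtime before calling transcribe. A minimal validator sketch (not part of the library; note that the sample rate itself cannot be inferred from a raw buffer and must be guaranteed by your capture/resampling pipeline):

```typescript
// Check that a buffer satisfies the format requirements above:
// non-empty Float32Array with finite samples in [-1.0, 1.0].
function isValidWaveform(waveform: unknown): waveform is Float32Array {
  if (!(waveform instanceof Float32Array) || waveform.length === 0) {
    return false;
  }
  for (let i = 0; i < waveform.length; i++) {
    const s = waveform[i];
    if (!Number.isFinite(s) || s < -1 || s > 1) return false;
  }
  return true;
}
```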

Converting Audio

Example of converting typical audio to the required format:
function convertAudioTo16kHz(audioBuffer: AudioBuffer): Float32Array {
  // Resample to 16kHz if needed; resampleAudio is your own helper
  // (e.g. built on OfflineAudioContext) returning an AudioBuffer
  const targetSampleRate = 16000;
  const resampled = resampleAudio(audioBuffer, targetSampleRate);

  // Downmix to mono by averaging all channels
  const mono = new Float32Array(resampled.length);
  for (let ch = 0; ch < resampled.numberOfChannels; ch++) {
    const data = resampled.getChannelData(ch);
    for (let i = 0; i < mono.length; i++) {
      mono[i] += data[i] / resampled.numberOfChannels;
    }
  }

  // Clamp to [-1.0, 1.0]
  for (let i = 0; i < mono.length; i++) {
    mono[i] = Math.max(-1, Math.min(1, mono[i]));
  }

  return mono;
}
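Many native recorder libraries deliver signed 16-bit PCM rather than Web Audio buffers. In that case the conversion is a simple rescale; a sketch (helper name is illustrative):

```typescript
// Convert signed 16-bit PCM samples to a normalized Float32Array.
// Signed 16-bit values range from -32768 to 32767, so divide by 32768.
function pcm16ToFloat32(pcm: Int16Array): Float32Array {
  const out = new Float32Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) {
    out[i] = pcm[i] / 32768;
  }
  return out;
}
```

The result stays within [-1.0, 1.0] by construction, so no extra clamping is needed.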

Advanced Usage

Verbose Mode with Timestamps

Get detailed segment and word-level timing:
const result = await transcribe(audioBuffer, {
  language: 'en',
  verbose: true,
});

result.segments?.forEach((segment) => {
  console.log(`[${segment.start}s - ${segment.end}s]: ${segment.text}`);
  
  segment.words?.forEach((word) => {
    console.log(`  ${word.word} (${word.start}s - ${word.end}s)`);
  });
});

Custom Encoding and Decoding

For advanced use cases where you need control over the encoding and decoding process:
const { encode, decode } = useSpeechToText({ model });

// Encode audio to features
const encoderOutput = await encode(audioBuffer);

// Use custom token sequence
const tokens = new Int32Array([50258, 50259, 50359, /* ... */]);

// Decode with custom tokens
const logits = await decode(tokens, encoderOutput);
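If you build your own decoding loop on top of encode and decode, greedy decoding is the simplest strategy: at each step, pick the highest-scoring token and append it. A sketch, assuming decode returns the logits for the last position (check the library source for the exact output shape; EOT_TOKEN stands in for the model-specific end-of-text id):

```typescript
// Pick the index of the highest logit (one greedy decoding step)
function argmax(logits: Float32Array): number {
  let best = 0;
  for (let i = 1; i < logits.length; i++) {
    if (logits[i] > logits[best]) best = i;
  }
  return best;
}

// Greedy loop sketch (not runnable as-is; decode/encoderOutput come
// from the hook, and EOT_TOKEN is model-specific):
//
// const tokens = [50258, 50259, 50359]; // start-of-transcript prompt
// while (tokens[tokens.length - 1] !== EOT_TOKEN) {
//   const logits = await decode(new Int32Array(tokens), encoderOutput);
//   tokens.push(argmax(logits));
// }
```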

Streaming with Real-Time Display

function LiveTranscription() {
  const [committed, setCommitted] = useState('');
  const [tentative, setTentative] = useState('');
  const { stream, streamInsert, streamStop } = useSpeechToText({ model });

  const startLiveTranscription = async () => {
    const generator = stream({ language: 'en' });

    for await (const result of generator) {
      setCommitted(result.committed.text);
      setTentative(result.nonCommitted.text);
    }
  };

  return (
    <View>
      <Text style={{ fontWeight: 'bold' }}>{committed}</Text>
      <Text style={{ opacity: 0.6 }}>{tentative}</Text>
    </View>
  );
}

Error Handling

const { transcribe, error } = useSpeechToText({ model });

try {
  const result = await transcribe(audioBuffer);
} catch (err) {
  if (err.code === 'MODULE_NOT_LOADED') {
    console.error('Model not ready yet');
  } else if (err.code === 'MODEL_GENERATING') {
    console.error('Already processing audio');
  } else {
    console.error('Transcription failed:', err.message);
  }
}

Best Practices

  1. Audio Quality: Use clean, clear audio for best results. Remove background noise when possible.
  2. Chunk Size: For streaming, send audio chunks of 1-3 seconds for optimal latency and accuracy.
  3. Language Specification: Always specify the language when known to improve accuracy and speed.
  4. Verbose Mode: Enable verbose: true only when you need timestamps; it adds processing overhead.
  5. Memory Management: Clear large audio buffers after transcription to free memory.
  6. Model Selection: Use whisper.en (English-only) for better performance when only English is needed.
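Practice #2 above can be sketched as a small helper that splits a recorded buffer into fixed-size chunks for streamInsert (the function name and default are illustrative):

```typescript
// Split a waveform into fixed-size chunks for streamInsert.
// At 16kHz, chunkSize = 32000 corresponds to 2 seconds of audio.
function splitIntoChunks(
  waveform: Float32Array,
  chunkSize = 32_000
): Float32Array[] {
  const chunks: Float32Array[] = [];
  for (let offset = 0; offset < waveform.length; offset += chunkSize) {
    // subarray creates a view over the same buffer, avoiding a copy
    chunks.push(waveform.subarray(offset, offset + chunkSize));
  }
  return chunks;
}
```

With an active stream, the chunks can then be fed in order, e.g. `splitIntoChunks(recorded).forEach(streamInsert);`.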
