
Overview

SpeechToTextModule provides a class-based interface for speech-to-text (STT). It supports both single-shot and streaming transcription with Whisper-based models.

When to Use

Use SpeechToTextModule when:
  • You need manual control over transcription lifecycle
  • You’re working outside React components
  • You need streaming transcription support
  • You want to integrate speech recognition into non-React code
Use useSpeechToText hook when:
  • Building React components
  • You want automatic lifecycle management
  • You prefer declarative state management
  • You need React state integration

Constructor

new SpeechToTextModule()
Creates a new speech-to-text module instance.

Example

import { SpeechToTextModule } from 'react-native-executorch';

const stt = new SpeechToTextModule();

Methods

load()

async load(
  model: SpeechToTextModelConfig,
  onDownloadProgressCallback?: (progress: number) => void
): Promise<void>
Loads the speech-to-text model (encoder and decoder) and tokenizer.

Parameters

model
SpeechToTextModelConfig
required
Configuration object containing:
  • encoderSource: Resource location of the encoder model
  • decoderSource: Resource location of the decoder model
  • tokenizerSource: Resource location of the tokenizer
  • isMultilingual: Boolean indicating if the model supports multiple languages
onDownloadProgressCallback
(progress: number) => void
Optional callback to monitor download progress (value between 0 and 1).

Example

await stt.load(
  {
    encoderSource: 'https://example.com/whisper_encoder.pte',
    decoderSource: 'https://example.com/whisper_decoder.pte',
    tokenizerSource: 'https://example.com/tokenizer.json',
    isMultilingual: false
  },
  (progress) => {
    console.log(`Download: ${(progress * 100).toFixed(1)}%`);
  }
);

transcribe()

async transcribe(
  waveform: Float32Array,
  options?: DecodingOptions
): Promise<TranscriptionResult>
Transcribes the provided audio waveform (16kHz) to text.

Parameters

waveform
Float32Array
required
Audio data as a Float32Array (mono, 16kHz sample rate).
options
DecodingOptions
Decoding options:
  • language: Language code (required for multilingual models, e.g., 'en', 'es', 'fr')
  • verbose: If true, returns detailed transcription with timestamps

Returns

A TranscriptionResult object containing:
  • text: The transcribed text
  • segments: Array of segment objects (if verbose: true)

Example

// Simple transcription
const result = await stt.transcribe(audioWaveform);
console.log('Transcription:', result.text);

// Multilingual with verbose output
const verboseResult = await stt.transcribe(audioWaveform, {
  language: 'es',
  verbose: true
});
console.log('Text:', verboseResult.text);
console.log('Segments:', verboseResult.segments);

stream()

async *stream(
  options?: DecodingOptions
): AsyncGenerator<{
  committed: TranscriptionResult;
  nonCommitted: TranscriptionResult;
}>
Starts a streaming transcription session. Yields objects with committed and non-committed transcriptions.
  • Committed transcription: Finalized text that will not change
  • Non-committed transcription: Partial text still being processed
Use with streamInsert() and streamStop() to control the stream.

Parameters

options
DecodingOptions
Decoding options including language and verbose settings.

Returns

An async generator yielding transcription updates.

Example

// Start streaming session
const streamGenerator = stt.stream({ language: 'en' });

// In another part of your code, feed audio chunks
stt.streamInsert(audioChunk1);
stt.streamInsert(audioChunk2);

// Process streaming results
for await (const update of streamGenerator) {
  console.log('Committed:', update.committed.text);
  console.log('Partial:', update.nonCommitted.text);
  
  // Display both for real-time feedback
  setTranscript(update.committed.text + update.nonCommitted.text);
}

// Stop when done
stt.streamStop();

streamInsert()

streamInsert(waveform: Float32Array): void
Inserts a new audio chunk into the active streaming transcription session.

Parameters

waveform
Float32Array
required
Audio chunk to insert (mono, 16kHz).

Example

stt.streamInsert(audioChunk);

streamStop()

streamStop(): void
Stops the current streaming transcription session.

Example

stt.streamStop();

encode()

async encode(waveform: Float32Array): Promise<Float32Array>
Runs the encoding part of the model on the provided waveform. Returns the encoded representation.

Parameters

waveform
Float32Array
required
Input audio waveform.

Returns

The encoded output as a Float32Array.

Example

const encodedAudio = await stt.encode(audioWaveform);

decode()

async decode(
  tokens: Int32Array,
  encoderOutput: Float32Array
): Promise<Float32Array>
Runs the decoder of the model with provided tokens and encoder output.

Parameters

tokens
Int32Array
required
Input token IDs.
encoderOutput
Float32Array
required
Output from the encoder.

Returns

Decoded output as a Float32Array.

Example

const decodedOutput = await stt.decode(tokens, encoderOutput);
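
The decoder output comes back as a raw Float32Array. Assuming it is a flat logits vector over the vocabulary (an assumption about the output layout, not something the API above guarantees), a greedy next-token pick is a simple argmax:

```typescript
// Sketch: greedily pick the highest-scoring token from a logits vector.
// Assumes `logits` is one flat score per vocabulary entry.
function argmax(logits: Float32Array): number {
  let best = 0;
  for (let i = 1; i < logits.length; i++) {
    if (logits[i] > logits[best]) best = i;
  }
  return best;
}

// Hypothetical usage with the low-level API:
// const encoderOutput = await stt.encode(audioWaveform);
// const logits = await stt.decode(tokens, encoderOutput);
// const nextToken = argmax(logits);
```

In practice transcribe() handles this loop for you; encode()/decode() are only needed for custom decoding strategies.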

delete()

delete(): void
Unloads the model from memory.

Example

stt.delete();

Complete Example: Single-shot Transcription

import { SpeechToTextModule } from 'react-native-executorch';
import AudioRecorder from 'react-native-audio-recorder';

class VoiceTranscriber {
  private stt: SpeechToTextModule;

  constructor() {
    this.stt = new SpeechToTextModule();
  }

  async initialize() {
    console.log('Loading speech-to-text model...');
    await this.stt.load(
      {
        encoderSource: 'https://example.com/whisper_encoder.pte',
        decoderSource: 'https://example.com/whisper_decoder.pte',
        tokenizerSource: 'https://example.com/tokenizer.json',
        isMultilingual: true
      },
      (progress) => {
        console.log(`Loading: ${(progress * 100).toFixed(0)}%`);
      }
    );
    console.log('Model ready!');
  }

  async transcribeAudio(
    audioPath: string,
    language: string = 'en'
  ): Promise<string> {
    // Load and convert audio to 16kHz mono Float32Array
    const waveform = await this.loadAudioFile(audioPath);
    
    const result = await this.stt.transcribe(waveform, {
      language,
      verbose: false
    });
    
    return result.text;
  }

  private async loadAudioFile(path: string): Promise<Float32Array> {
    // Implementation depends on your audio library (e.g., expo-av, react-native-sound)
    const audioData = await AudioRecorder.loadFile(path);
    return new Float32Array(audioData);
  }

  cleanup() {
    this.stt.delete();
  }
}

// Usage
const transcriber = new VoiceTranscriber();
await transcriber.initialize();

const text = await transcriber.transcribeAudio(
  '/path/to/audio.wav',
  'en'
);
console.log('Transcription:', text);

transcriber.cleanup();

Complete Example: Streaming Transcription

import { SpeechToTextModule } from 'react-native-executorch';

class StreamingTranscriber {
  private stt: SpeechToTextModule;
  private isStreaming = false;

  constructor() {
    this.stt = new SpeechToTextModule();
  }

  async initialize() {
    await this.stt.load({
      encoderSource: 'https://example.com/encoder.pte',
      decoderSource: 'https://example.com/decoder.pte',
      tokenizerSource: 'https://example.com/tokenizer.json',
      isMultilingual: false
    });
  }

  async startStreaming(
    onTranscript: (committed: string, partial: string) => void
  ) {
    this.isStreaming = true;
    const streamGenerator = this.stt.stream({ language: 'en' });

    // Process streaming results in the background
    (async () => {
      try {
        for await (const update of streamGenerator) {
          if (!this.isStreaming) break;
          
          onTranscript(
            update.committed.text,
            update.nonCommitted.text
          );
        }
      } catch (error) {
        console.error('Streaming error:', error);
      }
    })();
  }

  feedAudio(audioChunk: Float32Array) {
    if (this.isStreaming) {
      this.stt.streamInsert(audioChunk);
    }
  }

  stopStreaming() {
    this.isStreaming = false;
    this.stt.streamStop();
  }

  cleanup() {
    this.stt.delete();
  }
}

// Usage
const streamingTranscriber = new StreamingTranscriber();
await streamingTranscriber.initialize();

// Start streaming
await streamingTranscriber.startStreaming((committed, partial) => {
  console.log('Committed:', committed);
  console.log('Partial:', partial);
  
  // Update UI with combined text
  const fullText = committed + partial;
  updateTranscriptionDisplay(fullText);
});

// Feed audio chunks as they arrive
streamingTranscriber.feedAudio(chunk1);
streamingTranscriber.feedAudio(chunk2);

// Stop when done
streamingTranscriber.stopStreaming();
streamingTranscriber.cleanup();

Audio Format Requirements

  • Sample rate: 16kHz (16,000 Hz)
  • Channels: Mono (single channel)
  • Format: Float32Array with normalized values (-1.0 to 1.0)
  • Duration: Recommended 30 seconds or less per chunk for best results
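
Recorders commonly deliver 16-bit integer PCM, which must be normalized into the expected Float32Array range before being passed to transcribe() or streamInsert(). A minimal conversion sketch (assuming the input is already mono and sampled at 16 kHz; resampling is not covered here):

```typescript
// Convert signed 16-bit PCM samples to the normalized Float32Array
// format described above. Int16 spans -32768..32767, so dividing by
// 32768 maps samples into [-1.0, 1.0).
function pcm16ToFloat32(pcm: Int16Array): Float32Array {
  const out = new Float32Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) {
    out[i] = pcm[i] / 32768;
  }
  return out;
}
```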

Multilingual Support

// For multilingual models, specify the language
const result = await stt.transcribe(waveform, {
  language: 'es'  // Spanish
});

// Supported languages (examples):
// 'en' - English
// 'es' - Spanish
// 'fr' - French
// 'de' - German
// 'zh' - Chinese
// 'ja' - Japanese
// 'ko' - Korean

Verbose Mode

const result = await stt.transcribe(waveform, {
  language: 'en',
  verbose: true
});

// Result includes segments with timestamps
console.log(result.segments);
// [
//   { text: 'Hello', start: 0.0, end: 0.5 },
//   { text: 'world', start: 0.5, end: 1.0 }
// ]

Performance Considerations

  • Transcription speed depends on audio length and model size
  • Streaming mode provides real-time feedback but uses more resources
  • Use single-shot transcription for pre-recorded audio
  • Always call delete() when done to free memory
  • Consider audio quality for better accuracy
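
One way to guarantee delete() runs even when transcription throws is a try/finally wrapper. A sketch using a minimal stand-in interface for the module (the interface is illustrative; in real code you would pass a loaded SpeechToTextModule instance):

```typescript
// Minimal stand-in for the parts of SpeechToTextModule used here.
interface SttLike {
  transcribe(waveform: Float32Array): Promise<{ text: string }>;
  delete(): void;
}

// Transcribe once and always free the model afterwards, even on error.
async function transcribeOnce(
  stt: SttLike,
  waveform: Float32Array
): Promise<string> {
  try {
    const result = await stt.transcribe(waveform);
    return result.text;
  } finally {
    stt.delete(); // runs on both success and failure
  }
}
```

This pattern suits one-off transcriptions; keep the instance loaded instead if you expect repeated calls, since reloading the model is expensive.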
