Overview
SpeechToTextModule provides a class-based interface for Speech-to-Text (STT) functionality. It supports both single-shot and streaming transcription with Whisper-based models.
When to Use
Use SpeechToTextModule when:
- You need manual control over transcription lifecycle
- You’re working outside React components
- You need streaming transcription support
- You want to integrate speech recognition into non-React code
Use useSpeechToText hook when:
- Building React components
- You want automatic lifecycle management
- You prefer declarative state management
- You need React state integration
Constructor
Creates a new speech-to-text module instance.
Example
import { SpeechToTextModule } from 'react-native-executorch';
const stt = new SpeechToTextModule();
Methods
load()
async load(
  model: SpeechToTextModelConfig,
  onDownloadProgressCallback?: (progress: number) => void
): Promise<void>
Loads the speech-to-text model (encoder and decoder) and tokenizer.
Parameters
model
SpeechToTextModelConfig
required
Configuration object containing:
encoderSource: Resource location of the encoder model
decoderSource: Resource location of the decoder model
tokenizerSource: Resource location of the tokenizer
isMultilingual: Boolean indicating if the model supports multiple languages
onDownloadProgressCallback
(progress: number) => void
Optional callback to monitor download progress (value between 0 and 1).
Example
await stt.load(
  {
    encoderSource: 'https://example.com/whisper_encoder.pte',
    decoderSource: 'https://example.com/whisper_decoder.pte',
    tokenizerSource: 'https://example.com/tokenizer.json',
    isMultilingual: false
  },
  (progress) => {
    console.log(`Download: ${(progress * 100).toFixed(1)}%`);
  }
);
transcribe()
async transcribe(
  waveform: Float32Array,
  options?: DecodingOptions
): Promise<TranscriptionResult>
Transcribes the provided audio waveform (16kHz) to text.
Parameters
waveform
Float32Array
required
Audio data as a Float32Array (mono, 16kHz sample rate).
options
DecodingOptions
Decoding options:
language: Language code (required for multilingual models, e.g., 'en', 'es', 'fr')
verbose: If true, returns detailed transcription with timestamps
Returns
A TranscriptionResult object containing:
text: The transcribed text
segments: Array of segment objects (if verbose: true)
Example
// Simple transcription
const result = await stt.transcribe(audioWaveform);
console.log('Transcription:', result.text);
// Multilingual with verbose output
const verboseResult = await stt.transcribe(audioWaveform, {
  language: 'es',
  verbose: true
});
console.log('Text:', verboseResult.text);
console.log('Segments:', verboseResult.segments);
stream()
async *stream(
  options?: DecodingOptions
): AsyncGenerator<{
  committed: TranscriptionResult;
  nonCommitted: TranscriptionResult;
}>
Starts a streaming transcription session. Yields objects with committed and non-committed transcriptions.
- Committed transcription: Finalized text that will not change
- Non-committed transcription: Partial text still being processed
Use with streamInsert() and streamStop() to control the stream.
Parameters
options
DecodingOptions
Decoding options including language and verbose settings.
Returns
An async generator yielding transcription updates.
Example
// Start streaming session
const streamGenerator = stt.stream({ language: 'en' });
// In another part of your code, feed audio chunks
stt.streamInsert(audioChunk1);
stt.streamInsert(audioChunk2);
// Process streaming results
for await (const update of streamGenerator) {
  console.log('Committed:', update.committed.text);
  console.log('Partial:', update.nonCommitted.text);
  // Display both for real-time feedback
  setTranscript(update.committed.text + update.nonCommitted.text);
}
// Stop when done
stt.streamStop();
streamInsert()
streamInsert(waveform: Float32Array): void
Inserts a new audio chunk into the active streaming transcription session.
Parameters
waveform
Float32Array
required
Audio chunk to insert (mono, 16kHz).
Example
stt.streamInsert(audioChunk);
streamStop()
streamStop(): void
Stops the current streaming transcription session.
Example
stt.streamStop();
encode()
async encode(waveform: Float32Array): Promise<Float32Array>
Runs the encoding part of the model on the provided waveform. Returns the encoded representation.
Parameters
waveform
Float32Array
required
Audio data as a Float32Array (mono, 16kHz sample rate).
Returns
The encoded output as a Float32Array.
Example
const encodedAudio = await stt.encode(audioWaveform);
decode()
async decode(
  tokens: Int32Array,
  encoderOutput: Float32Array
): Promise<Float32Array>
Runs the decoder of the model with provided tokens and encoder output.
Parameters
tokens
Int32Array
required
Token IDs to feed to the decoder.
encoderOutput
Float32Array
required
Encoded representation returned by encode().
Returns
Decoded output as a Float32Array.
Example
const decodedOutput = await stt.decode(tokens, encoderOutput);
delete()
delete(): void
Unloads the model from memory.
Example
stt.delete();
Complete Example: Single-shot Transcription
import { SpeechToTextModule } from 'react-native-executorch';
import AudioRecorder from 'react-native-audio-recorder';
class VoiceTranscriber {
  private stt: SpeechToTextModule;

  constructor() {
    this.stt = new SpeechToTextModule();
  }

  async initialize() {
    console.log('Loading speech-to-text model...');
    await this.stt.load(
      {
        encoderSource: 'https://example.com/whisper_encoder.pte',
        decoderSource: 'https://example.com/whisper_decoder.pte',
        tokenizerSource: 'https://example.com/tokenizer.json',
        isMultilingual: true
      },
      (progress) => {
        console.log(`Loading: ${(progress * 100).toFixed(0)}%`);
      }
    );
    console.log('Model ready!');
  }

  async transcribeAudio(
    audioPath: string,
    language: string = 'en'
  ): Promise<string> {
    // Load and convert audio to 16kHz mono Float32Array
    const waveform = await this.loadAudioFile(audioPath);
    const result = await this.stt.transcribe(waveform, {
      language,
      verbose: false
    });
    return result.text;
  }

  private async loadAudioFile(path: string): Promise<Float32Array> {
    // Implementation depends on your audio library (e.g., expo-av, react-native-sound)
    const audioData = await AudioRecorder.loadFile(path);
    return new Float32Array(audioData);
  }

  cleanup() {
    this.stt.delete();
  }
}
// Usage
const transcriber = new VoiceTranscriber();
await transcriber.initialize();
const text = await transcriber.transcribeAudio(
'/path/to/audio.wav',
'en'
);
console.log('Transcription:', text);
transcriber.cleanup();
Complete Example: Streaming Transcription
import { SpeechToTextModule } from 'react-native-executorch';
class StreamingTranscriber {
  private stt: SpeechToTextModule;
  private isStreaming = false;

  constructor() {
    this.stt = new SpeechToTextModule();
  }

  async initialize() {
    await this.stt.load({
      encoderSource: 'https://example.com/encoder.pte',
      decoderSource: 'https://example.com/decoder.pte',
      tokenizerSource: 'https://example.com/tokenizer.json',
      isMultilingual: false
    });
  }

  async startStreaming(
    onTranscript: (committed: string, partial: string) => void
  ) {
    this.isStreaming = true;
    const streamGenerator = this.stt.stream({ language: 'en' });
    // Process streaming results in the background
    (async () => {
      try {
        for await (const update of streamGenerator) {
          if (!this.isStreaming) break;
          onTranscript(
            update.committed.text,
            update.nonCommitted.text
          );
        }
      } catch (error) {
        console.error('Streaming error:', error);
      }
    })();
  }

  feedAudio(audioChunk: Float32Array) {
    if (this.isStreaming) {
      this.stt.streamInsert(audioChunk);
    }
  }

  stopStreaming() {
    this.isStreaming = false;
    this.stt.streamStop();
  }

  cleanup() {
    this.stt.delete();
  }
}
// Usage
const streamingTranscriber = new StreamingTranscriber();
await streamingTranscriber.initialize();
// Start streaming
await streamingTranscriber.startStreaming((committed, partial) => {
  console.log('Committed:', committed);
  console.log('Partial:', partial);
  // Update UI with combined text
  const fullText = committed + partial;
  updateTranscriptionDisplay(fullText);
});
// Feed audio chunks as they arrive
streamingTranscriber.feedAudio(chunk1);
streamingTranscriber.feedAudio(chunk2);
// Stop when done
streamingTranscriber.stopStreaming();
streamingTranscriber.cleanup();
Audio Requirements
- Sample rate: 16kHz (16,000 Hz)
- Channels: Mono (single channel)
- Format: Float32Array with normalized values (-1.0 to 1.0)
- Duration: Recommended 30 seconds or less per chunk for best results
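Many recorders emit 16-bit PCM rather than normalized floats. The helper below is a sketch of the conversion (the pcm16ToFloat32 name and the Int16Array input are assumptions, not part of the library's API): dividing by 32768 maps the Int16 range onto [-1.0, 1.0).

```typescript
// Hypothetical helper: convert 16-bit PCM samples into the normalized
// Float32Array the module expects. 32768 is the magnitude of the most
// negative Int16 value, so the result stays within [-1.0, 1.0).
function pcm16ToFloat32(pcm: Int16Array): Float32Array {
  const out = new Float32Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) {
    out[i] = pcm[i] / 32768;
  }
  return out;
}
```

The resulting array can be passed directly to transcribe() or streamInsert(), provided the source audio is already mono at 16kHz.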
Multilingual Support
// For multilingual models, specify the language
const result = await stt.transcribe(waveform, {
  language: 'es' // Spanish
});
// Supported languages (examples):
// 'en' - English
// 'es' - Spanish
// 'fr' - French
// 'de' - German
// 'zh' - Chinese
// 'ja' - Japanese
// 'ko' - Korean
Verbose Mode
const result = await stt.transcribe(waveform, {
  language: 'en',
  verbose: true
});
// Result includes segments with timestamps
console.log(result.segments);
// [
//   { text: 'Hello', start: 0.0, end: 0.5 },
//   { text: 'world', start: 0.5, end: 1.0 }
// ]
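The per-segment timestamps are handy for building subtitle-style output. A small sketch, assuming segments have the { text, start, end } shape shown above with times in seconds (the Segment interface and formatSegments helper are illustrative, not part of the library):

```typescript
// Hypothetical segment shape, matching the verbose output shown above.
interface Segment {
  text: string;
  start: number;
  end: number;
}

// Render segments as "[mm:ss.s - mm:ss.s] text" lines.
function formatSegments(segments: Segment[]): string {
  const stamp = (t: number) => {
    const m = Math.floor(t / 60);
    const s = (t % 60).toFixed(1).padStart(4, '0');
    return `${String(m).padStart(2, '0')}:${s}`;
  };
  return segments
    .map((seg) => `[${stamp(seg.start)} - ${stamp(seg.end)}] ${seg.text}`)
    .join('\n');
}
```

For the example segments above this would yield lines like `[00:00.0 - 00:00.5] Hello`.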
Notes
- Transcription speed depends on audio length and model size
- Streaming mode provides real-time feedback but uses more resources
- Use single-shot transcription for pre-recorded audio
- Always call delete() when done to free memory
- Consider audio quality for better accuracy
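For long pre-recorded audio, splitting the waveform keeps each transcribe() call within the recommended 30-second window. This is a naive sketch (the chunkWaveform helper is an assumption, not a library function): it cuts on fixed sample counts, which can split words; a real pipeline would prefer cutting on silence.

```typescript
// Samples per second for the module's required input format.
const SAMPLE_RATE = 16000;

// Hypothetical helper: split a long 16kHz waveform into chunks of at
// most maxSeconds. subarray() returns views, so no audio is copied.
function chunkWaveform(
  waveform: Float32Array,
  maxSeconds = 30
): Float32Array[] {
  const maxSamples = maxSeconds * SAMPLE_RATE;
  const chunks: Float32Array[] = [];
  for (let i = 0; i < waveform.length; i += maxSamples) {
    chunks.push(waveform.subarray(i, i + maxSamples));
  }
  return chunks;
}
```

Each chunk can then be transcribed in turn and the resulting texts concatenated.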