
Speech to Text Types

SpeechToTextProps

Configuration for Speech to Text model.
model
SpeechToTextModelConfig
required
Configuration object containing model sources.
preventLoad
boolean
If true, prevents automatic model loading (and downloading of the model data on first use) when the hook runs.

SpeechToTextType

React hook for managing Speech to Text (STT) instance.
error
RnExecutorchError | null
Contains the error object if the model failed to load.
isReady
boolean
Indicates whether the model has successfully loaded and is ready for inference.
isGenerating
boolean
Indicates whether the model is currently processing an inference.
downloadProgress
number
Tracks the progress of the model download as a value between 0 and 1.
encode
(waveform: Float32Array) => Promise<Float32Array>
Runs the encoding part of the model on the provided waveform. Parameters:
  • waveform (Float32Array) - The input audio waveform array.
Returns: A promise resolving to the encoded data.
decode
(tokens: Int32Array, encoderOutput: Float32Array) => Promise<Float32Array>
Runs the decoder of the model. Parameters:
  • tokens (Int32Array) - The sequence of token IDs generated so far.
  • encoderOutput (Float32Array) - The output from the encoder.
Returns: A promise resolving to the decoder output.
transcribe
(waveform: Float32Array, options?: DecodingOptions) => Promise<TranscriptionResult>
Starts a transcription process for a given input array, which should be a waveform sampled at 16 kHz. Parameters:
  • waveform (Float32Array) - The input audio waveform.
  • options (DecodingOptions, optional) - Decoding options; see the DecodingOptions reference for details.
Returns: A promise resolving to the transcription as an object of type TranscriptionResult.
stream
(options?: DecodingOptions) => AsyncGenerator
Starts a streaming transcription process. Use in combination with streamInsert to feed audio chunks and streamStop to end the stream. Updates committedTranscription and nonCommittedTranscription as the transcription progresses. Parameters:
  • options (DecodingOptions, optional) - Decoding options, including language.
Returns: An asynchronous generator that yields the committed and nonCommitted transcriptions, both of type TranscriptionResult.
streamInsert
(waveform: Float32Array) => void
Inserts a chunk of audio data (sampled at 16 kHz) into the ongoing streaming transcription. Parameters:
  • waveform (Float32Array) - The audio chunk to insert.
streamStop
() => void
Stops the ongoing streaming transcription process.
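The streaming methods above expect audio fed in chunks at 16 kHz. A minimal sketch of a helper that slices a recorded waveform into fixed-size chunks suitable for streamInsert; the 0.5 s chunk length is an illustrative choice, not a value the library mandates:

```typescript
// Slice a 16 kHz waveform into fixed-size chunks for streamInsert.
// chunkSeconds is an arbitrary illustration, not a library requirement.
function chunkWaveform(
  waveform: Float32Array,
  chunkSeconds = 0.5,
  sampleRate = 16000
): Float32Array[] {
  const chunkSize = Math.floor(chunkSeconds * sampleRate);
  const chunks: Float32Array[] = [];
  for (let i = 0; i < waveform.length; i += chunkSize) {
    // subarray creates a view, so no audio data is copied.
    chunks.push(waveform.subarray(i, i + chunkSize));
  }
  return chunks;
}

// Usage with a hypothetical hook instance `stt` of type SpeechToTextType:
// for (const chunk of chunkWaveform(recorded)) stt.streamInsert(chunk);
// stt.streamStop();
```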

SpeechToTextLanguage

Languages supported by multilingual Whisper models (not the English-only whisper.en variant).
type SpeechToTextLanguage =
  | 'af' | 'sq' | 'ar' | 'hy' | 'az' | 'eu' | 'be' | 'bn' | 'bs' | 'bg'
  | 'my' | 'ca' | 'zh' | 'hr' | 'cs' | 'da' | 'nl' | 'et' | 'en' | 'fi'
  | 'fr' | 'gl' | 'ka' | 'de' | 'el' | 'gu' | 'ht' | 'he' | 'hi' | 'hu'
  | 'is' | 'id' | 'it' | 'ja' | 'kn' | 'kk' | 'km' | 'ko' | 'lo' | 'lv'
  | 'lt' | 'mk' | 'mg' | 'ms' | 'ml' | 'mt' | 'mr' | 'ne' | 'no' | 'fa'
  | 'pl' | 'pt' | 'pa' | 'ro' | 'ru' | 'sr' | 'si' | 'sk' | 'sl' | 'es'
  | 'su' | 'sw' | 'sv' | 'tl' | 'tg' | 'ta' | 'te' | 'th' | 'tr' | 'uk'
  | 'ur' | 'uz' | 'vi' | 'cy' | 'yi';

DecodingOptions

Options for decoding speech to text.
language
SpeechToTextLanguage
Optional language code to guide the transcription.
verbose
boolean
Optional flag. If set, the transcription result includes timestamps and additional parameters; for details, refer to TranscriptionResult.
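For illustration, a DecodingOptions value requesting German transcription with verbose output could look like this (the type is restated locally, with the language union abbreviated for the example):

```typescript
// DecodingOptions restated from the fields above; both fields are optional.
type SpeechToTextLanguage = 'de' | 'en' | 'fr'; // abbreviated for the example
interface DecodingOptions {
  language?: SpeechToTextLanguage;
  verbose?: boolean;
}

const options: DecodingOptions = {
  language: 'de', // guide the model toward German
  verbose: true,  // request segments and word-level timestamps
};
```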

Word

Structure that represents a single token with timestamp information.
word
string
required
Token as a string value.
start
number
required
Timestamp of the beginning of the token in audio (in seconds).
end
number
required
Timestamp of the end of the token in audio (in seconds).

TranscriptionSegment

Structure that represents a single segment of the transcription.
start
number
required
Timestamp of the beginning of the segment in audio (in seconds).
end
number
required
Timestamp of the end of the segment in audio (in seconds).
text
string
required
Full text of the given segment as a string.
words
Word[]
If verbose is set to true in DecodingOptions, contains word-level timestamps as an array of Word.
tokens
number[]
required
Raw tokens represented as an array of integers.
temperature
number
required
Temperature at which the given segment was computed.
avgLogprob
number
required
Average log probability calculated across all tokens in a segment.
compressionRatio
number
required
Compression ratio achieved on the given segment.

TranscriptionResult

Structure that represents the result of a single transcription call (either transcribe or stream).
task
'transcribe' | 'stream'
String indicating the task, either 'transcribe' or 'stream'.
language
string
required
Language chosen for transcription.
duration
number
required
Duration of the transcribed audio in seconds.
text
string
required
The full text of the transcription as a string.
segments
TranscriptionSegment[]
If verbose is set to true in DecodingOptions, contains an array of TranscriptionSegment objects with the transcription split into separate segments.
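To make the verbose structure concrete, here is a sketch that flattens word-level timestamps out of a verbose TranscriptionResult. The interfaces are restated locally from the tables above:

```typescript
// Types restated from the reference tables above.
interface Word { word: string; start: number; end: number; }
interface TranscriptionSegment {
  start: number; end: number; text: string;
  words?: Word[]; tokens: number[];
  temperature: number; avgLogprob: number; compressionRatio: number;
}
interface TranscriptionResult {
  task?: 'transcribe' | 'stream';
  language: string; duration: number; text: string;
  segments?: TranscriptionSegment[];
}

// Collect all word-level timestamps from a verbose result.
// Returns an empty array when verbose mode was not used.
function allWords(result: TranscriptionResult): Word[] {
  return (result.segments ?? []).flatMap((s) => s.words ?? []);
}
```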

SpeechToTextModelConfig

Configuration for Speech to Text model.
isMultilingual
boolean
required
A boolean flag indicating whether the model supports multiple languages.
encoderSource
ResourceSource
required
A string that specifies the location of a .pte file for the encoder.
decoderSource
ResourceSource
required
A string that specifies the location of a .pte file for the decoder.
tokenizerSource
ResourceSource
required
A string that specifies the location of the tokenizer for the model.
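A hypothetical SpeechToTextModelConfig for a multilingual model might look as follows. The URLs are placeholders, not real asset locations, and ResourceSource is simplified to a string here (the library may also accept other source kinds):

```typescript
type ResourceSource = string; // simplified for the example
interface SpeechToTextModelConfig {
  isMultilingual: boolean;
  encoderSource: ResourceSource;
  decoderSource: ResourceSource;
  tokenizerSource: ResourceSource;
}

const modelConfig: SpeechToTextModelConfig = {
  isMultilingual: true,
  encoderSource: 'https://example.com/whisper_encoder.pte', // placeholder URL
  decoderSource: 'https://example.com/whisper_decoder.pte', // placeholder URL
  tokenizerSource: 'https://example.com/tokenizer.json',    // placeholder URL
};
```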

Text to Speech Types

TextToSpeechLanguage

Lists all the languages available in TTS models (as language shorthands).
type TextToSpeechLanguage =
  | 'en-us' // American English
  | 'en-gb'; // British English

VoiceConfig

Voice configuration. So far in Kokoro, each voice is directly associated with a language.
lang
TextToSpeechLanguage
required
Speaker’s language.
voiceSource
ResourceSource
required
A source of a binary file with the voice embedding.
extra
KokoroVoiceExtras
Optional extra sources or properties related to a specific voice.

KokoroVoiceExtras

Kokoro-specific voice extra props.
taggerSource
ResourceSource
required
Source to Kokoro’s tagger model binary.
lexiconSource
ResourceSource
required
Source to Kokoro’s lexicon binary.

KokoroConfig

Kokoro model configuration. Contains only the core Kokoro model sources; phonemizer sources are included in the voice configuration.
type
'kokoro'
required
Model type identifier.
durationPredictorSource
ResourceSource
required
Source to Kokoro’s duration predictor model binary.
synthesizerSource
ResourceSource
required
Source to Kokoro’s synthesizer model binary.

TextToSpeechConfig

General Text to Speech module configuration.
model
KokoroConfig
required
The selected TTS model.
voice
VoiceConfig
required
A selected speaker’s voice.
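Putting the pieces together, a hypothetical TextToSpeechConfig could be assembled like this. All URLs are placeholders, and the types are restated locally from the tables above:

```typescript
type ResourceSource = string; // simplified for the example
type TextToSpeechLanguage = 'en-us' | 'en-gb';

interface KokoroVoiceExtras {
  taggerSource: ResourceSource;
  lexiconSource: ResourceSource;
}
interface VoiceConfig {
  lang: TextToSpeechLanguage;
  voiceSource: ResourceSource;
  extra?: KokoroVoiceExtras;
}
interface KokoroConfig {
  type: 'kokoro';
  durationPredictorSource: ResourceSource;
  synthesizerSource: ResourceSource;
}
interface TextToSpeechConfig { model: KokoroConfig; voice: VoiceConfig; }

const ttsConfig: TextToSpeechConfig = {
  model: {
    type: 'kokoro',
    durationPredictorSource: 'https://example.com/kokoro_duration.pte', // placeholder
    synthesizerSource: 'https://example.com/kokoro_synth.pte',          // placeholder
  },
  voice: {
    lang: 'en-us',
    voiceSource: 'https://example.com/voice.bin', // placeholder
  },
};
```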

TextToSpeechProps

Props for the useTextToSpeech hook.
model
KokoroConfig
required
The selected TTS model.
voice
VoiceConfig
required
A selected speaker’s voice.
preventLoad
boolean
If true, prevents automatic model loading (and downloading of the model data on first use) when the hook runs.

TextToSpeechInput

Text to Speech module input definition.
text
string
required
A text to be spoken.
speed
number
Optional speed argument - the higher it is, the faster the speech becomes.

TextToSpeechType

Return type for the useTextToSpeech hook. Manages the state and operations for Text-to-Speech generation.
error
RnExecutorchError | null
Contains the error object if the model failed to load or encountered an error during inference.
isReady
boolean
Indicates whether the Text-to-Speech model is loaded and ready to accept inputs.
isGenerating
boolean
Indicates whether the model is currently generating audio.
downloadProgress
number
Represents the download progress of the model and voice assets as a value between 0 and 1.
forward
(input: TextToSpeechInput) => Promise<Float32Array>
Runs the model to convert the provided text into speech audio in a single pass. Parameters:
  • input (TextToSpeechInput) - The TextToSpeechInput object containing the text to synthesize and optional speed.
Returns: A promise that resolves with the generated audio data as a Float32Array. Throws: RnExecutorchError if the model is not loaded or is currently generating.
stream
(input: TextToSpeechStreamingInput) => Promise<void>
Streams the generated audio data incrementally. This is optimal for real-time playback, allowing audio to start playing before the full text is synthesized. Parameters:
  • input (TextToSpeechStreamingInput) - The TextToSpeechStreamingInput object containing the text, optional speed, and lifecycle callbacks (onBegin, onNext, onEnd).
Returns: A promise that resolves when the streaming process is complete. Throws: RnExecutorchError if the model is not loaded or is currently generating.
streamStop
() => void
Interrupts and stops the currently active audio generation stream.

TextToSpeechStreamingInput

Text to Speech streaming input definition. Streaming mode in TTS is synchronized by passing specific callbacks executed at given moments of the stream. Actions such as playing the audio should happen within the onNext callback. Callbacks can be either synchronous or asynchronous.
text
string
required
A text to be spoken.
speed
number
Optional speed argument - the higher it is, the faster the speech becomes.
onBegin
() => void | Promise<void>
Called when streaming begins.
onNext
(audio: Float32Array) => void | Promise<void>
Called after each audio chunk gets calculated.
onEnd
() => void | Promise<void>
Called when streaming ends.
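The callback lifecycle can be illustrated with a small sketch. fakeStream below is a hypothetical stand-in for the real stream method, showing only the order in which the callbacks fire (onBegin once, onNext per chunk, onEnd once); in a real app, onNext would enqueue each chunk for playback:

```typescript
// Shape of the streaming input, restated from the fields above.
interface TextToSpeechStreamingInput {
  text: string;
  speed?: number;
  onBegin?: () => void | Promise<void>;
  onNext?: (audio: Float32Array) => void | Promise<void>;
  onEnd?: () => void | Promise<void>;
}

const received: Float32Array[] = [];
const input: TextToSpeechStreamingInput = {
  text: 'Hello world',
  onBegin: () => { /* e.g. prepare the audio player */ },
  onNext: (audio) => { received.push(audio); }, // enqueue for playback in a real app
  onEnd: () => { /* e.g. flush and release the player */ },
};

// Hypothetical driver that stands in for the real stream() method,
// awaiting each callback so async callbacks are also handled correctly.
async function fakeStream(
  inp: TextToSpeechStreamingInput,
  chunks: Float32Array[]
): Promise<void> {
  await inp.onBegin?.();
  for (const chunk of chunks) {
    await inp.onNext?.(chunk);
  }
  await inp.onEnd?.();
}
```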

Voice Activity Detection Types

VADProps

Props for the useVAD hook.
model
object
required
An object containing the model source.
preventLoad
boolean
If true, prevents automatic model loading (and downloading of the model data on first use) when the hook runs.

Segment

Represents a detected audio segment with start and end timestamps.
start
number
required
Start time of the segment in seconds.
end
number
required
End time of the segment in seconds.
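The Segment array returned by the VAD model can be post-processed directly. A sketch computing the total detected speech time and filtering out very short segments; the 0.2 s threshold is an arbitrary illustration, not a library default:

```typescript
// Segment restated from the fields above.
interface Segment { start: number; end: number; }

// Total speech duration in seconds across all detected segments.
function totalSpeech(segments: Segment[]): number {
  return segments.reduce((sum, s) => sum + (s.end - s.start), 0);
}

// Drop segments shorter than minSeconds (0.2 s here is arbitrary).
function dropShort(segments: Segment[], minSeconds = 0.2): Segment[] {
  return segments.filter((s) => s.end - s.start >= minSeconds);
}
```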

VADType

React hook state and methods for managing a Voice Activity Detection (VAD) model instance.
error
RnExecutorchError | null
Contains the error object if the VAD model failed to load or an error occurred during processing.
isReady
boolean
Indicates whether the VAD model has successfully loaded and is ready for inference.
isGenerating
boolean
Indicates whether the model is currently processing an inference.
downloadProgress
number
Represents the download progress as a value between 0 and 1.
forward
(waveform: Float32Array) => Promise<Segment[]>
Runs the Voice Activity Detection model on the provided audio waveform. Parameters:
  • waveform (Float32Array) - The input audio waveform array.
Returns: A promise resolving to an array of detected audio segments (e.g., timestamps for speech). Throws: RnExecutorchError if the model is not loaded or is currently processing another request.
