
Speech to Text Types

SpeechToTextProps

Configuration for Speech to Text model.
model
SpeechToTextModelConfig
required
Configuration object containing model sources.
preventLoad
boolean
If true, prevents automatic model loading (and downloading of the model data on first use) when the hook runs.

SpeechToTextType

React hook for managing Speech to Text (STT) instance.
error
RnExecutorchError | null
Contains the error object if the model failed to load.
isReady
boolean
Indicates whether the model has successfully loaded and is ready for inference.
isGenerating
boolean
Indicates whether the model is currently processing an inference.
downloadProgress
number
Tracks the progress of the model download as a value between 0 and 1.
encode
(waveform: Float32Array) => Promise<Float32Array>
Runs the encoding part of the model on the provided waveform. Parameters:
  • waveform (Float32Array) - The input audio waveform array.
Returns: A promise resolving to the encoded data.
decode
(tokens: Int32Array, encoderOutput: Float32Array) => Promise<Float32Array>
Runs the decoder of the model. Parameters:
  • tokens (Int32Array) - The sequence of token IDs generated so far.
  • encoderOutput (Float32Array) - The output from the encoder.
Returns: A promise resolving to the decoder output.
transcribe
(waveform: Float32Array, options?: DecodingOptions) => Promise<TranscriptionResult>
Starts a transcription process for a given input array, which should be a waveform sampled at 16 kHz. Parameters:
  • waveform (Float32Array) - The input audio waveform.
  • options (DecodingOptions, optional) - Decoding options; see the DecodingOptions reference for details.
Returns: A promise resolving to the transcription as an object of type TranscriptionResult.
stream
(options?: DecodingOptions) => AsyncGenerator
Starts a streaming transcription process. Use in combination with streamInsert to feed audio chunks and streamStop to end the stream. Updates committedTranscription and nonCommittedTranscription as the transcription progresses. Parameters:
  • options (DecodingOptions, optional) - Decoding options, including language.
Returns: An asynchronous generator that yields the committed and nonCommitted transcriptions, both of type TranscriptionResult.
streamInsert
(waveform: Float32Array) => void
Inserts a chunk of audio data (sampled at 16 kHz) into the ongoing streaming transcription. Parameters:
  • waveform (Float32Array) - The audio chunk to insert.
streamStop
() => void
Stops the ongoing streaming transcription process.
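The streaming methods above expect audio fed in chunks at 16 kHz. A minimal sketch of a helper that slices a recorded waveform into fixed-size chunks suitable for streamInsert; the 0.5 s chunk length is an illustrative choice, not a value the library mandates:

```typescript
// Slice a 16 kHz waveform into fixed-size chunks for streamInsert.
// chunkSeconds is an arbitrary illustration, not a library requirement.
function chunkWaveform(
  waveform: Float32Array,
  chunkSeconds = 0.5,
  sampleRate = 16000
): Float32Array[] {
  const chunkSize = Math.floor(chunkSeconds * sampleRate);
  const chunks: Float32Array[] = [];
  for (let i = 0; i < waveform.length; i += chunkSize) {
    // subarray creates a view, so no audio data is copied.
    chunks.push(waveform.subarray(i, i + chunkSize));
  }
  return chunks;
}

// Usage with a hypothetical hook instance `stt` of type SpeechToTextType:
// for (const chunk of chunkWaveform(recorded)) stt.streamInsert(chunk);
// stt.streamStop();
```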

SpeechToTextLanguage

Languages supported by multilingual Whisper models (not the English-only whisper.en variant).
type SpeechToTextLanguage =
  | 'af' | 'sq' | 'ar' | 'hy' | 'az' | 'eu' | 'be' | 'bn' | 'bs' | 'bg'
  | 'my' | 'ca' | 'zh' | 'hr' | 'cs' | 'da' | 'nl' | 'et' | 'en' | 'fi'
  | 'fr' | 'gl' | 'ka' | 'de' | 'el' | 'gu' | 'ht' | 'he' | 'hi' | 'hu'
  | 'is' | 'id' | 'it' | 'ja' | 'kn' | 'kk' | 'km' | 'ko' | 'lo' | 'lv'
  | 'lt' | 'mk' | 'mg' | 'ms' | 'ml' | 'mt' | 'mr' | 'ne' | 'no' | 'fa'
  | 'pl' | 'pt' | 'pa' | 'ro' | 'ru' | 'sr' | 'si' | 'sk' | 'sl' | 'es'
  | 'su' | 'sw' | 'sv' | 'tl' | 'tg' | 'ta' | 'te' | 'th' | 'tr' | 'uk'
  | 'ur' | 'uz' | 'vi' | 'cy' | 'yi';

DecodingOptions

Options for decoding speech to text.
language
SpeechToTextLanguage
Optional language code to guide the transcription.
verbose
boolean
Optional flag. If set, the transcription result includes timestamps and additional parameters; for details, refer to TranscriptionResult.
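For illustration, a DecodingOptions value requesting German transcription with verbose output could look like this (the type is restated locally, with the language union abbreviated for the example):

```typescript
// DecodingOptions restated from the fields above; both fields are optional.
type SpeechToTextLanguage = 'de' | 'en' | 'fr'; // abbreviated for the example
interface DecodingOptions {
  language?: SpeechToTextLanguage;
  verbose?: boolean;
}

const options: DecodingOptions = {
  language: 'de', // guide the model toward German
  verbose: true,  // request segments and word-level timestamps
};
```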

Word

Structure that represents a single token with timestamp information.
word
string
required
Token as a string value.
start
number
required
Timestamp of the beginning of the token in audio (in seconds).
end
number
required
Timestamp of the end of the token in audio (in seconds).

TranscriptionSegment

Structure that represents a single segment of the transcription.
start
number
required
Timestamp of the beginning of the segment in audio (in seconds).
end
number
required
Timestamp of the end of the segment in audio (in seconds).
text
string
required
Full text of the given segment as a string.
words
Word[]
If verbose is set to true in DecodingOptions, contains word-level timestamps as an array of Word.
tokens
number[]
required
Raw tokens represented as an array of integers.
temperature
number
required
Temperature at which the given segment was computed.
avgLogprob
number
required
Average log probability calculated across all tokens in a segment.
compressionRatio
number
required
Compression ratio achieved on the given segment.

TranscriptionResult

Structure that represents the result of a single transcription call (either transcribe or stream).
task
'transcribe' | 'stream'
String indicating the task, either 'transcribe' or 'stream'.
language
string
required
Language chosen for transcription.
duration
number
required
Duration of the transcribed audio in seconds.
text
string
required
The full text of the transcription as a string.
segments
TranscriptionSegment[]
If verbose is set to true in DecodingOptions, contains an array of TranscriptionSegment objects with the transcription split into separate segments.
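To make the verbose structure concrete, here is a sketch that flattens word-level timestamps out of a verbose TranscriptionResult. The interfaces are restated locally from the tables above:

```typescript
// Types restated from the reference tables above.
interface Word { word: string; start: number; end: number; }
interface TranscriptionSegment {
  start: number; end: number; text: string;
  words?: Word[]; tokens: number[];
  temperature: number; avgLogprob: number; compressionRatio: number;
}
interface TranscriptionResult {
  task?: 'transcribe' | 'stream';
  language: string; duration: number; text: string;
  segments?: TranscriptionSegment[];
}

// Collect all word-level timestamps from a verbose result.
// Returns an empty array when verbose mode was not used.
function allWords(result: TranscriptionResult): Word[] {
  return (result.segments ?? []).flatMap((s) => s.words ?? []);
}
```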

SpeechToTextModelConfig

Configuration for Speech to Text model.
isMultilingual
boolean
required
A boolean flag indicating whether the model supports multiple languages.
encoderSource
ResourceSource
required
A string that specifies the location of a .pte file for the encoder.
decoderSource
ResourceSource
required
A string that specifies the location of a .pte file for the decoder.
tokenizerSource
ResourceSource
required
A string that specifies the location of the tokenizer for the model.
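A hypothetical SpeechToTextModelConfig for a multilingual model might look as follows. The URLs are placeholders, not real asset locations, and ResourceSource is simplified to a string here (the library may also accept other source kinds):

```typescript
type ResourceSource = string; // simplified for the example
interface SpeechToTextModelConfig {
  isMultilingual: boolean;
  encoderSource: ResourceSource;
  decoderSource: ResourceSource;
  tokenizerSource: ResourceSource;
}

const modelConfig: SpeechToTextModelConfig = {
  isMultilingual: true,
  encoderSource: 'https://example.com/whisper_encoder.pte', // placeholder URL
  decoderSource: 'https://example.com/whisper_decoder.pte', // placeholder URL
  tokenizerSource: 'https://example.com/tokenizer.json',    // placeholder URL
};
```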

Text to Speech Types

TextToSpeechLanguage

Lists all the languages available in TTS models (as language shorthands).
type TextToSpeechLanguage =
  | 'en-us' // American English
  | 'en-gb'; // British English

VoiceConfig

Voice configuration. So far in Kokoro, each voice is directly associated with a language.
lang
TextToSpeechLanguage
required
Speaker’s language.
voiceSource
ResourceSource
required
A source of a binary file with the voice embedding.
extra
KokoroVoiceExtras
Optional extra sources or properties related to a specific voice.

KokoroVoiceExtras

Kokoro-specific voice extra props.
taggerSource
ResourceSource
required
Source to Kokoro’s tagger model binary.
lexiconSource
ResourceSource
required
Source to Kokoro’s lexicon binary.

KokoroConfig

Kokoro model configuration. Contains only the core Kokoro model sources; phonemizer sources are included in the voice configuration.
type
'kokoro'
required
Model type identifier.
durationPredictorSource
ResourceSource
required
Source to Kokoro’s duration predictor model binary.
synthesizerSource
ResourceSource
required
Source to Kokoro’s synthesizer model binary.

TextToSpeechConfig

General Text to Speech module configuration.
model
KokoroConfig
required
The selected TTS model.
voice
VoiceConfig
required
A selected speaker’s voice.
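Putting the pieces together, a hypothetical TextToSpeechConfig could be assembled like this. All URLs are placeholders, and the types are restated locally from the tables above:

```typescript
type ResourceSource = string; // simplified for the example
type TextToSpeechLanguage = 'en-us' | 'en-gb';

interface KokoroVoiceExtras {
  taggerSource: ResourceSource;
  lexiconSource: ResourceSource;
}
interface VoiceConfig {
  lang: TextToSpeechLanguage;
  voiceSource: ResourceSource;
  extra?: KokoroVoiceExtras;
}
interface KokoroConfig {
  type: 'kokoro';
  durationPredictorSource: ResourceSource;
  synthesizerSource: ResourceSource;
}
interface TextToSpeechConfig { model: KokoroConfig; voice: VoiceConfig; }

const ttsConfig: TextToSpeechConfig = {
  model: {
    type: 'kokoro',
    durationPredictorSource: 'https://example.com/kokoro_duration.pte', // placeholder
    synthesizerSource: 'https://example.com/kokoro_synth.pte',          // placeholder
  },
  voice: {
    lang: 'en-us',
    voiceSource: 'https://example.com/voice.bin', // placeholder
  },
};
```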

TextToSpeechProps

Props for the useTextToSpeech hook.
model
KokoroConfig
required
The selected TTS model.
voice
VoiceConfig
required
A selected speaker’s voice.
preventLoad
boolean
If true, prevents automatic model loading (and downloading of the model data on first use) when the hook runs.

TextToSpeechInput

Text to Speech module input definition.
text
string
required
A text to be spoken.
speed
number
Optional speed argument - the higher it is, the faster the speech becomes.

TextToSpeechType

Return type for the useTextToSpeech hook. Manages the state and operations for Text-to-Speech generation.
error
RnExecutorchError | null
Contains the error object if the model failed to load or encountered an error during inference.
isReady
boolean
Indicates whether the Text-to-Speech model is loaded and ready to accept inputs.
isGenerating
boolean
Indicates whether the model is currently generating audio.
downloadProgress
number
Represents the download progress of the model and voice assets as a value between 0 and 1.
forward
(input: TextToSpeechInput) => Promise<Float32Array>
Runs the model to convert the provided text into speech audio in a single pass. Parameters:
  • input (TextToSpeechInput) - The TextToSpeechInput object containing the text to synthesize and optional speed.
Returns: A promise that resolves with the generated audio data as a Float32Array. Throws: RnExecutorchError if the model is not loaded or is currently generating.
stream
(input: TextToSpeechStreamingInput) => Promise<void>
Streams the generated audio data incrementally. This is optimal for real-time playback, allowing audio to start playing before the full text is synthesized. Parameters:
  • input (TextToSpeechStreamingInput) - The TextToSpeechStreamingInput object containing the text, optional speed, and lifecycle callbacks (onBegin, onNext, onEnd).
Returns: A promise that resolves when the streaming process is complete. Throws: RnExecutorchError if the model is not loaded or is currently generating.
streamStop
() => void
Interrupts and stops the currently active audio generation stream.

TextToSpeechStreamingInput

Text to Speech streaming input definition. Streaming mode in TTS is synchronized by passing specific callbacks executed at given moments of the stream. Actions such as playing the audio should happen within the onNext callback. Callbacks can be either synchronous or asynchronous.
text
string
required
A text to be spoken.
speed
number
Optional speed argument - the higher it is, the faster the speech becomes.
onBegin
() => void | Promise<void>
Called when streaming begins.
onNext
(audio: Float32Array) => void | Promise<void>
Called after each audio chunk gets calculated.
onEnd
() => void | Promise<void>
Called when streaming ends.
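The callback lifecycle can be illustrated with a small sketch. fakeStream below is a hypothetical stand-in for the real stream method, showing only the order in which the callbacks fire (onBegin once, onNext per chunk, onEnd once); in a real app, onNext would enqueue each chunk for playback:

```typescript
// Shape of the streaming input, restated from the fields above.
interface TextToSpeechStreamingInput {
  text: string;
  speed?: number;
  onBegin?: () => void | Promise<void>;
  onNext?: (audio: Float32Array) => void | Promise<void>;
  onEnd?: () => void | Promise<void>;
}

const received: Float32Array[] = [];
const input: TextToSpeechStreamingInput = {
  text: 'Hello world',
  onBegin: () => { /* e.g. prepare the audio player */ },
  onNext: (audio) => { received.push(audio); }, // enqueue for playback in a real app
  onEnd: () => { /* e.g. flush and release the player */ },
};

// Hypothetical driver that stands in for the real stream() method,
// awaiting each callback so async callbacks are also handled correctly.
async function fakeStream(
  inp: TextToSpeechStreamingInput,
  chunks: Float32Array[]
): Promise<void> {
  await inp.onBegin?.();
  for (const chunk of chunks) {
    await inp.onNext?.(chunk);
  }
  await inp.onEnd?.();
}
```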

Voice Activity Detection Types

VADProps

Props for the useVAD hook.
model
object
required
An object containing the model source.
preventLoad
boolean
If true, prevents automatic model loading (and downloading of the model data on first use) when the hook runs.

Segment

Represents a detected audio segment with start and end timestamps.
start
number
required
Start time of the segment in seconds.
end
number
required
End time of the segment in seconds.
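The Segment array returned by the VAD model can be post-processed directly. A sketch computing the total detected speech time and filtering out very short segments; the 0.2 s threshold is an arbitrary illustration, not a library default:

```typescript
// Segment restated from the fields above.
interface Segment { start: number; end: number; }

// Total speech duration in seconds across all detected segments.
function totalSpeech(segments: Segment[]): number {
  return segments.reduce((sum, s) => sum + (s.end - s.start), 0);
}

// Drop segments shorter than minSeconds (0.2 s here is arbitrary).
function dropShort(segments: Segment[], minSeconds = 0.2): Segment[] {
  return segments.filter((s) => s.end - s.start >= minSeconds);
}
```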

VADType

React hook state and methods for managing a Voice Activity Detection (VAD) model instance.
error
RnExecutorchError | null
Contains the error object if the VAD model failed to load or an error occurred during processing.
isReady
boolean
Indicates whether the VAD model has successfully loaded and is ready for inference.
isGenerating
boolean
Indicates whether the model is currently processing an inference.
downloadProgress
number
Represents the download progress as a value between 0 and 1.
forward
(waveform: Float32Array) => Promise<Segment[]>
Runs the Voice Activity Detection model on the provided audio waveform. Parameters:
  • waveform (Float32Array) - The input audio waveform array.
Returns: A promise resolving to an array of detected audio segments (e.g., timestamps for speech). Throws: RnExecutorchError if the model is not loaded or is currently processing another request.
