Speech to Text Types
SpeechToTextProps
Configuration for the Speech to Text model.
Configuration object containing the model sources.
Boolean flag that prevents automatic model loading (and, on first use, downloading the model data) when the hook runs.
SpeechToTextType
React hook for managing a Speech to Text (STT) instance.
Contains the error message if the model failed to load.
Indicates whether the model has successfully loaded and is ready for inference.
Indicates whether the model is currently processing an inference.
Tracks the progress of the model download process.
Runs the encoding part of the model on the provided waveform.
Parameters:
waveform (Float32Array) - The input audio waveform array.
Runs the decoder of the model.
Parameters:
tokens (Int32Array) - The tokens generated so far.
encoderOutput (Float32Array) - The output from the encoder.
Starts a transcription process for a given input array, which should be a waveform at 16kHz.
Parameters:
waveform (Float32Array) - The input audio waveform.
options (DecodingOptions, optional) - Decoding options; check the API reference for more details.
Returns: TranscriptionResult.
Starts a streaming transcription process. Use in combination with streamInsert to feed audio chunks and streamStop to end the stream. Updates committedTranscription and nonCommittedTranscription as the transcription progresses.
Parameters:
options (DecodingOptions, optional) - Decoding options, including the language.
Returns: committed and nonCommitted transcription; both are of type TranscriptionResult.
Inserts a chunk of audio data (sampled at 16kHz) into the ongoing streaming transcription.
Parameters:
waveform (Float32Array) - The audio chunk to insert.
Stops the ongoing streaming transcription process.
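The streaming flow above (stream, repeated streamInsert, streamStop) expects 16 kHz Float32Array chunks. A minimal sketch of chunking a full waveform for insertion is shown below; the 1600-sample (100 ms) chunk size and the `stt` instance name are illustrative assumptions, not values mandated by the API.

```typescript
// Hypothetical helper: split a 16 kHz waveform into fixed-size chunks
// suitable for feeding to streamInsert one at a time.
// The 1600-sample (100 ms) chunk size is an assumption, not an API requirement.
function chunkWaveform(waveform: Float32Array, chunkSize = 1600): Float32Array[] {
  const chunks: Float32Array[] = [];
  for (let i = 0; i < waveform.length; i += chunkSize) {
    // subarray creates a view, avoiding copies of the underlying buffer
    chunks.push(waveform.subarray(i, Math.min(i + chunkSize, waveform.length)));
  }
  return chunks;
}

// Usage sketch (assuming a hook instance named `stt` from useSpeechToText):
//   stt.stream();
//   for (const chunk of chunkWaveform(waveform)) stt.streamInsert(chunk);
//   stt.streamStop();
```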
SpeechToTextLanguage
Languages supported by whisper (not whisper.en).
DecodingOptions
Options for decoding speech to text.
Optional language code to guide the transcription.
Optional flag. If set, the transcription result is presented with timestamps and additional parameters. For more details, refer to TranscriptionResult.
Word
Structure that represents a single token with timestamp information.
Token as a string value.
Timestamp of the beginning of the token in audio (in seconds).
Timestamp of the end of the token in audio (in seconds).
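Word entries pair a token with its position in the audio, which makes them easy to render as timed captions. A minimal sketch, assuming the field names `word`, `start`, and `end` (the descriptions above do not name the fields, so these are assumptions):

```typescript
// Mirrors the documented Word structure; field names are assumptions
// inferred from the descriptions, not confirmed exports of the library.
interface Word {
  word: string;  // token as a string value
  start: number; // beginning of the token in audio, in seconds
  end: number;   // end of the token in audio, in seconds
}

// Hypothetical helper: render a word with [mm:ss.mmm] timestamps.
function formatWord(w: Word): string {
  const fmt = (t: number) => {
    const minutes = Math.floor(t / 60);
    const seconds = (t % 60).toFixed(3).padStart(6, '0');
    return `${String(minutes).padStart(2, '0')}:${seconds}`;
  };
  return `[${fmt(w.start)} -> ${fmt(w.end)}] ${w.word}`;
}
```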
TranscriptionSegment
Structure that represents a single segment of the transcription.
Timestamp of the beginning of the segment in audio (in seconds).
Timestamp of the end of the segment in audio (in seconds).
Full text of the given segment as a string.
If verbose is set to true in DecodingOptions, contains word-level timestamps as an array of Word.
Raw tokens represented as an array of integers.
Temperature at which the given segment was computed.
Average log probability calculated across all tokens in the segment.
Compression ratio achieved on the given segment.
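The per-segment temperature, average log probability, and compression ratio can be used to filter out low-quality segments, in the spirit of Whisper's own decoding heuristics. A sketch under stated assumptions: the field names and the thresholds (-1.0 log probability, 2.4 compression ratio, Whisper's common defaults) are illustrative, not values prescribed by this library.

```typescript
// Mirrors the documented TranscriptionSegment fields; the camelCase
// field names are assumptions inferred from the descriptions.
interface TranscriptionSegment {
  start: number;            // segment start, in seconds
  end: number;              // segment end, in seconds
  text: string;             // full text of the segment
  avgLogprob: number;       // average log probability across tokens
  compressionRatio: number; // compression ratio of the segment text
}

// Hypothetical quality filter: drop segments whose average log probability
// is too low, or whose compression ratio suggests degenerate repetition.
function filterSegments(
  segments: TranscriptionSegment[],
  logprobThreshold = -1.0,
  compressionRatioThreshold = 2.4
): TranscriptionSegment[] {
  return segments.filter(
    (s) =>
      s.avgLogprob >= logprobThreshold &&
      s.compressionRatio <= compressionRatioThreshold
  );
}
```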
TranscriptionResult
Structure that represents the result of transcription for a single function call (either transcribe or stream).
String indicating the task, either ‘transcribe’ or ‘stream’.
Language chosen for transcription.
Duration in seconds of a given transcription.
The whole text of the transcription as a string.
If verbose is set to true in DecodingOptions, contains an array of TranscriptionSegment with details split into separate transcription segments.
SpeechToTextModelConfig
Configuration for the Speech to Text model.
A boolean flag indicating whether the model supports multiple languages.
A string that specifies the location of a .pte file for the encoder.
A string that specifies the location of a .pte file for the decoder.
A string that specifies the location of the tokenizer for the model.
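Putting the fields above together, a SpeechToTextModelConfig value might look like the sketch below. The field names and URLs are illustrative assumptions; in practice you would use the sources documented by the library rather than these placeholder paths.

```typescript
// Hypothetical SpeechToTextModelConfig value. Field names and URLs are
// illustrative assumptions, not constants exported by the library.
const modelConfig = {
  isMultilingual: true, // whether the model supports multiple languages
  encoderSource: 'https://example.com/whisper_encoder.pte', // encoder .pte
  decoderSource: 'https://example.com/whisper_decoder.pte', // decoder .pte
  tokenizerSource: 'https://example.com/tokenizer.json',    // tokenizer
};
```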
Text to Speech Types
TextToSpeechLanguage
Lists all the languages available in TTS models (as language shorthands).
VoiceConfig
Voice configuration. So far in Kokoro, each voice is directly associated with a language.
Speaker’s language.
A source to a binary file with voice embedding.
Optional extra sources or properties related to a specific voice.
KokoroVoiceExtras
Kokoro-specific voice extra props.
Source to Kokoro’s tagger model binary.
Source to Kokoro’s lexicon binary.
KokoroConfig
Kokoro model configuration. Contains only the core Kokoro model sources, as phonemizer sources are included in the voice configuration.
Model type identifier.
Source to Kokoro’s duration predictor model binary.
Source to Kokoro’s synthesizer model binary.
TextToSpeechConfig
General Text to Speech module configuration.
A selected T2S model.
A selected speaker’s voice.
TextToSpeechProps
Props for the useTextToSpeech hook.
A selected T2S model.
A selected speaker’s voice.
Boolean flag that prevents automatic model loading (and, on first use, downloading the model data) when the hook runs.
TextToSpeechInput
Text to Speech module input definition.
The text to be spoken.
Optional speed argument - the higher it is, the faster the speech becomes.
TextToSpeechType
Return type for the useTextToSpeech hook. Manages the state and operations for Text-to-Speech generation.
Contains the error object if the model failed to load or encountered an error during inference.
Indicates whether the Text-to-Speech model is loaded and ready to accept inputs.
Indicates whether the model is currently generating audio.
Represents the download progress of the model and voice assets as a value between 0 and 1.
Runs the model to convert the provided text into speech audio in a single pass.
Parameters:
input (TextToSpeechInput) - The TextToSpeechInput object containing the text to synthesize and an optional speed.
Returns: the generated audio data (Float32Array).
Throws: RnExecutorchError if the model is not loaded or is currently generating.
Streams the generated audio data incrementally. This is optimal for real-time playback, allowing audio to start playing before the full text is synthesized.
Parameters:
input (TextToSpeechStreamingInput) - The TextToSpeechStreamingInput object containing text, optional speed, and lifecycle callbacks (onBegin, onNext, onEnd).
Throws: RnExecutorchError if the model is not loaded or is currently generating.
Interrupts and stops the currently active audio generation stream.
TextToSpeechStreamingInput
Text to Speech streaming input definition. Streaming mode in T2S is synchronized by passing specific callbacks executed at given moments of the streaming. Actions such as playing the audio should happen within the onNext callback. Callbacks can be either synchronous or asynchronous.
The text to be spoken.
Optional speed argument - the higher it is, the faster the speech becomes.
Called when streaming begins.
Called after each audio chunk gets calculated.
Called when streaming ends.
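The lifecycle callbacks above drive streaming playback: onBegin resets state, onNext receives each audio chunk, and onEnd signals completion. A minimal sketch that collects streamed chunks into one buffer; the field names mirror the descriptions above, and in a real app onNext would enqueue the chunk to an audio player instead of buffering it.

```typescript
// Mirrors the documented streaming input; field names are assumptions
// inferred from the descriptions above.
interface TextToSpeechStreamingInput {
  text: string;                          // text to be spoken
  speed?: number;                        // optional speech speed
  onBegin?: () => void;                  // called when streaming begins
  onNext?: (chunk: Float32Array) => void; // called per generated audio chunk
  onEnd?: () => void;                    // called when streaming ends
}

// Hypothetical sketch: buffer streamed chunks, then merge them.
function makeCollector() {
  const chunks: Float32Array[] = [];
  const input: TextToSpeechStreamingInput = {
    text: 'Hello world',
    onBegin: () => { chunks.length = 0; }, // reset on a new stream
    onNext: (chunk) => { chunks.push(chunk); },
    onEnd: () => {},
  };
  const merged = () => {
    const total = chunks.reduce((n, c) => n + c.length, 0);
    const out = new Float32Array(total);
    let offset = 0;
    for (const c of chunks) { out.set(c, offset); offset += c.length; }
    return out;
  };
  return { input, merged };
}
```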
Voice Activity Detection Types
VADProps
Props for the useVAD hook.
An object containing the model source.
Boolean flag that prevents automatic model loading (and, on first use, downloading the model data) when the hook runs.
Segment
Represents a detected audio segment with start and end timestamps.
Start time of the segment in seconds.
End time of the segment in seconds.
VADType
React hook state and methods for managing a Voice Activity Detection (VAD) model instance.
Contains the error message if the VAD model failed to load or encountered an error during processing.
Indicates whether the VAD model has successfully loaded and is ready for inference.
Indicates whether the model is currently processing an inference.
Represents the download progress as a value between 0 and 1.
Runs the Voice Activity Detection model on the provided audio waveform.
Parameters:
waveform(Float32Array) - The input audio waveform array.
Throws: RnExecutorchError if the model is not loaded or is currently processing another request.
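Detected Segment values are often post-processed before use, for example by merging speech segments separated by short silences. A sketch under stated assumptions: the 0.3-second gap threshold is illustrative, not a value the library prescribes.

```typescript
// Mirrors the documented Segment structure.
interface Segment {
  start: number; // start time in seconds
  end: number;   // end time in seconds
}

// Hypothetical post-processing: merge speech segments separated by less
// than `maxGap` seconds. Assumes segments are sorted by start time.
function mergeSegments(segments: Segment[], maxGap = 0.3): Segment[] {
  const merged: Segment[] = [];
  for (const seg of segments) {
    const last = merged[merged.length - 1];
    if (last && seg.start - last.end <= maxGap) {
      last.end = Math.max(last.end, seg.end); // extend the previous segment
    } else {
      merged.push({ ...seg }); // copy so the input array is not mutated
    }
  }
  return merged;
}
```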