Overview

The offline STT (Speech-to-Text) module provides complete audio file transcription using sherpa-onnx models. Use this when you have complete audio files to transcribe, as opposed to real-time streaming recognition. Key features:
  • Transcribe complete audio files or PCM samples
  • Support for multiple model types (Whisper, Paraformer, Transducer, and more)
  • Automatic model type detection
  • Hotwords support for contextual biasing (transducer models)
  • Token timestamps and language detection (model-dependent)
  • Runtime configuration updates

Quick Start

import { createSTT } from 'react-native-sherpa-onnx/stt';

// Create an STT engine with auto-detection
const stt = await createSTT({
  modelPath: { type: 'asset', path: 'models/sherpa-onnx-whisper-tiny-en' },
  modelType: 'auto',
  preferInt8: true,
});

// Transcribe a WAV file
const result = await stt.transcribeFile('/path/to/audio.wav');
console.log('Transcription:', result.text);

// Access additional metadata
console.log('Tokens:', result.tokens);
console.log('Timestamps:', result.timestamps);
console.log('Language:', result.lang);

// Clean up when done
await stt.destroy();

Supported Model Types

The following model types are supported with automatic detection:
| Model Type | Description | Typical Files |
| --- | --- | --- |
| transducer | Transducer models | encoder.onnx, decoder.onnx, joiner.onnx, tokens.txt |
| nemo_transducer | NeMo Transducer | encoder.onnx, decoder.onnx, joiner.onnx, tokens.txt |
| paraformer | Paraformer models | model.onnx, tokens.txt |
| whisper | OpenAI Whisper | encoder.onnx, decoder.onnx, tokens.txt |
| sense_voice | SenseVoice models | model.onnx, tokens.txt |
| nemo_ctc | NeMo CTC models | model.onnx, tokens.txt |
| zipformer_ctc | Zipformer CTC | model.onnx, tokens.txt |
| wenet_ctc | WeNet CTC | model.onnx, tokens.txt |
| funasr_nano | FunASR Nano | encoder_adaptor, llm, embedding, tokenizer |
| fire_red_asr | FireRed ASR | encoder, decoder |
| moonshine | Moonshine | preprocess.onnx, encode.onnx, decode.onnx, tokens.txt |
| dolphin | Dolphin | model.onnx, tokens.txt |
| canary | Canary | encoder, decoder |
| auto | Auto-detect | Detects based on files present |
Use modelType: 'auto' to automatically detect the model type based on files in the directory. This is the recommended approach.

API Reference

createSTT(options)

Creates an STT engine instance for offline transcription.
src/stt/index.ts
export async function createSTT(
  options: STTInitializeOptions | ModelPathConfig
): Promise<SttEngine>;
Options:
  • modelPath (ModelPathConfig, required) — Model directory path configuration. One of:
      • { type: 'asset', path: 'models/...' } for bundled assets
      • { type: 'file', path: '/absolute/path' } for filesystem paths
      • { type: 'auto', path: '...' } to try asset first, then file
  • modelType (STTModelType, default 'auto') — Model type to use. Set to 'auto' for automatic detection based on the files present.
  • preferInt8 (boolean) — Prefer int8 quantized models (faster, smaller) when available:
      • true: prefer int8 models
      • false: prefer full precision
      • undefined: try int8 first, fall back to full precision (default)
  • numThreads (number, default 1) — Number of threads for inference.
  • provider (string, default 'cpu') — Execution provider (e.g., 'cpu', 'qnn', 'nnapi', 'xnnpack'). See Execution Providers for details.
  • hotwordsFile (string) — Path to a hotwords file for contextual biasing. Only supported for transducer models (transducer, nemo_transducer).
  • hotwordsScore (number, default 1.5) — Hotwords boost score (applies only when hotwordsFile is set).
  • debug (boolean, default false) — Enable debug logging in the native layer.
  • modelOptions (SttModelOptions) — Model-specific options. Only the block matching the loaded model type is applied:
      • whisper: { language, task, tailPaddings, enableTokenTimestamps, enableSegmentTimestamps }
      • senseVoice: { language, useItn }
      • canary: { srcLang, tgtLang, usePnc }
      • funasrNano: { systemPrompt, userPrompt, maxNewTokens, temperature, topP, seed, language, itn, hotwords }

SttEngine: transcribeFile(filePath)

Transcribe a complete audio file.
const result = await stt.transcribeFile('/path/to/audio.wav');
Input requirements:
  • Format: WAV (PCM)
  • Sample rate: 16 kHz (recommended, model-dependent)
  • Channels: Mono
  • Bit depth: 16-bit
Returns SttRecognitionResult:
interface SttRecognitionResult {
  text: string;          // Transcribed text
  tokens: string[];      // Token strings
  timestamps: number[];  // Timestamps per token (model-dependent)
  lang: string;          // Detected/specified language
  emotion: string;       // Emotion label (SenseVoice)
  event: string;         // Event label (model-dependent)
  durations: number[];   // Duration info (TDT models)
}
Audio format is critical. Most models expect 16 kHz mono WAV. Use ffmpeg to convert:
ffmpeg -i input.mp3 -ar 16000 -ac 1 -sample_fmt s16 output.wav
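Many transcription failures trace back to audio format, so it can be worth validating a file before handing it to transcribeFile. The sketch below reads the WAV header directly; checkWavFormat is a hypothetical helper (not part of the library) and assumes the standard 44-byte PCM layout, ignoring files with extra chunks before 'fmt ':

```typescript
// Parse a canonical RIFF/WAVE header and check it matches what most models
// expect (16 kHz, mono, 16-bit PCM). Minimal sketch: does not handle files
// with extra chunks before the 'fmt ' chunk.
function checkWavFormat(bytes: Uint8Array): {
  sampleRate: number;
  channels: number;
  bitsPerSample: number;
  ok: boolean;
} {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  // Canonical PCM WAV layout: channel count at byte 22,
  // sample rate at byte 24, bits per sample at byte 34 (little-endian).
  const channels = view.getUint16(22, true);
  const sampleRate = view.getUint32(24, true);
  const bitsPerSample = view.getUint16(34, true);
  return {
    sampleRate,
    channels,
    bitsPerSample,
    ok: sampleRate === 16000 && channels === 1 && bitsPerSample === 16,
  };
}
```

If ok is false, re-encode with the ffmpeg command above before transcribing.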

SttEngine: transcribeSamples(samples, sampleRate)

Transcribe from raw PCM samples (e.g., from microphone or decoder).
const result = await stt.transcribeSamples(
  samples,    // Float32Array or number[] in [-1, 1]
  16000       // Sample rate in Hz
);
Parameters:
  • samples: Float PCM samples in range [-1, 1], mono
  • sampleRate: Sample rate in Hz
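Recorders and decoders typically deliver signed 16-bit integer PCM, while transcribeSamples expects floats in [-1, 1]. A minimal conversion sketch, where int16ToFloat32 is a hypothetical helper rather than part of the library:

```typescript
// Convert signed 16-bit PCM (e.g., from a microphone recorder) to
// Float32 samples in [-1, 1]. Dividing by 32768 is the usual convention
// for int16 audio; adjust if your recorder emits a different format.
function int16ToFloat32(pcm: Int16Array): Float32Array {
  const out = new Float32Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) {
    out[i] = pcm[i] / 32768;
  }
  return out;
}

// Example: full negative range, silence, and half amplitude
const samples = int16ToFloat32(new Int16Array([-32768, 0, 16384]));
// samples ≈ [-1, 0, 0.5]
```

The resulting Float32Array can be passed straight to stt.transcribeSamples(samples, 16000).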

SttEngine: setConfig(config)

Update recognizer configuration at runtime.
await stt.setConfig({
  decodingMethod: 'greedy_search',
  maxActivePaths: 4,
  hotwordsFile: '/path/to/hotwords.txt',
  hotwordsScore: 1.5,
  blankPenalty: 0.0,
  ruleFsts: '/path/to/rule.fst',
  ruleFars: '/path/to/rule.far',
});

SttEngine: destroy()

Release native resources. Must be called when the engine is no longer needed.
await stt.destroy();

Model-Specific Options

Whisper Models

const stt = await createSTT({
  modelPath: { type: 'asset', path: 'models/sherpa-onnx-whisper-tiny' },
  modelType: 'whisper',
  modelOptions: {
    whisper: {
      language: 'de',           // Language code
      task: 'transcribe',       // 'transcribe' or 'translate'
      tailPaddings: 1000,       // Padding at end
      enableTokenTimestamps: true,    // Android only
      enableSegmentTimestamps: true,  // Android only
    },
  },
});
Language codes must be valid. Invalid language codes can crash the app. Use getWhisperLanguages() to get the list of supported languages:
import { getWhisperLanguages } from 'react-native-sherpa-onnx/stt';

const languages = getWhisperLanguages();
// [{ id: 'en', name: 'english' }, { id: 'de', name: 'german' }, ...]
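One way to guard against the crash mentioned above is to resolve user input against that list before calling createSTT. resolveWhisperLanguage below is a hypothetical helper (not part of the library) that accepts either an id or a name and falls back to English when the code is unknown:

```typescript
// Resolve a user-supplied language string against the supported list
// (as returned by getWhisperLanguages()). Falls back to 'en' rather than
// passing an invalid code to the native layer.
function resolveWhisperLanguage(
  languages: Array<{ id: string; name: string }>,
  requested: string
): string {
  const needle = requested.toLowerCase();
  const match = languages.find(l => l.id === needle || l.name === needle);
  return match ? match.id : 'en';
}
```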

SenseVoice Models

const stt = await createSTT({
  modelPath: { type: 'asset', path: 'models/sherpa-onnx-sense-voice' },
  modelType: 'sense_voice',
  modelOptions: {
    senseVoice: {
      language: 'zh',     // 'auto', 'zh', 'en', 'yue', 'ja', 'ko'
      useItn: true,       // Inverse text normalization
    },
  },
});

Canary Models

const stt = await createSTT({
  modelPath: { type: 'asset', path: 'models/sherpa-onnx-canary' },
  modelType: 'canary',
  modelOptions: {
    canary: {
      srcLang: 'en',      // Source language
      tgtLang: 'es',      // Target language
      usePnc: true,       // Use punctuation
    },
  },
});

Hotwords (Contextual Biasing)

Hotwords allow you to boost recognition of specific words or phrases.
Hotwords are only supported for transducer models (transducer, nemo_transducer). Check support before showing UI:
import { sttSupportsHotwords } from 'react-native-sherpa-onnx/stt';

if (sttSupportsHotwords(modelType)) {
  // Show hotwords options
}

Hotwords File Format

Create a text file with one phrase per line, optionally with a boost factor:
REACT NATIVE 2.0
SHERPA ONNX
TURBOMODULE 1.5
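If the hotwords list is built dynamically (e.g., from user contacts), a small helper can produce the file contents in this format. formatHotwords is a hypothetical sketch, not part of the library; writing the result to disk is left to whatever filesystem API your app uses:

```typescript
// Build hotwords file contents: one phrase per line, with an optional
// per-phrase boost factor appended after a space.
function formatHotwords(
  entries: Array<{ phrase: string; boost?: number }>
): string {
  return entries
    .map(e => (e.boost != null ? `${e.phrase} ${e.boost}` : e.phrase))
    .join('\n');
}
```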

Using Hotwords

// At initialization
const stt = await createSTT({
  modelPath: { type: 'asset', path: 'models/transducer-en' },
  modelType: 'transducer',
  hotwordsFile: '/path/to/hotwords.txt',
  hotwordsScore: 1.5,
  modelingUnit: 'bpe',      // Required for BPE models
  bpeVocab: '/path/to/bpe.vocab',  // Required when modelingUnit is 'bpe'
});

// Or update at runtime
await stt.setConfig({
  hotwordsFile: '/path/to/new-hotwords.txt',
  hotwordsScore: 2.0,
});

Model Detection

Detect model type without initializing:
import { detectSttModel } from 'react-native-sherpa-onnx/stt';

const result = await detectSttModel(
  { type: 'asset', path: 'models/sherpa-onnx-whisper-tiny-en' },
  { preferInt8: true }
);

if (result.success && result.modelType === 'whisper') {
  // Show Whisper-specific options
  console.log('Detected:', result.modelType);
  console.log('Models:', result.detectedModels);
}

Performance Optimization

Quantization

Int8 quantized models are faster and use less memory:
const stt = await createSTT({
  modelPath: { type: 'asset', path: 'models/whisper-tiny' },
  preferInt8: true,  // Prefer model.int8.onnx over model.onnx
});

Threading

Increase threads for faster processing on multi-core devices:
const stt = await createSTT({
  modelPath: { type: 'asset', path: 'models/paraformer' },
  numThreads: 4,  // More threads = faster, but more CPU
});

Hardware Acceleration

Use hardware acceleration when available:
import { getQnnSupport, getNnapiSupport } from 'react-native-sherpa-onnx';

// Check QNN (Qualcomm NPU) support
const qnnSupport = await getQnnSupport();
if (qnnSupport.canInit) {
  const stt = await createSTT({
    modelPath: { type: 'asset', path: 'models/transducer' },
    provider: 'qnn',  // Use Qualcomm NPU
  });
}

// Check NNAPI (Android GPU/DSP/NPU) support
const nnapiSupport = await getNnapiSupport();
if (nnapiSupport.canInit) {
  const stt = await createSTT({
    modelPath: { type: 'asset', path: 'models/paraformer' },
    provider: 'nnapi',  // Use NNAPI
  });
}
See Execution Providers for detailed information on hardware acceleration.

Common Use Cases

Transcribe with Language Detection

import { createSTT } from 'react-native-sherpa-onnx/stt';

const stt = await createSTT({
  modelPath: { type: 'asset', path: 'models/whisper-multilingual' },
  modelType: 'whisper',
});

const result = await stt.transcribeFile(audioPath);
console.log('Text:', result.text);
console.log('Detected language:', result.lang);

await stt.destroy();

Batch Processing Multiple Files

const stt = await createSTT({
  modelPath: { type: 'asset', path: 'models/paraformer-zh' },
  numThreads: 4,
});

const files = ['/path/1.wav', '/path/2.wav', '/path/3.wav'];
const results = [];

for (const file of files) {
  const result = await stt.transcribeFile(file);
  results.push({ file, text: result.text });
}

await stt.destroy();
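One corrupt file in the batch would reject the loop above and skip the remaining files. A sketch that records per-file failures instead; transcribeAll is a hypothetical wrapper that accepts any transcription function (e.g., f => stt.transcribeFile(f).then(r => r.text)) so the batching logic stays independent of the engine:

```typescript
// Transcribe files sequentially, capturing per-file failures instead of
// aborting the whole batch on the first error.
async function transcribeAll(
  files: string[],
  transcribe: (file: string) => Promise<string>
): Promise<Array<{ file: string; text?: string; error?: string }>> {
  const results: Array<{ file: string; text?: string; error?: string }> = [];
  for (const file of files) {
    try {
      results.push({ file, text: await transcribe(file) });
    } catch (e) {
      results.push({
        file,
        error: e instanceof Error ? e.message : String(e),
      });
    }
  }
  return results;
}
```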

Troubleshooting

Model fails to load:
  • Verify the model directory exists and contains all required files
  • Check that the model files match the expected structure for the model type
  • Try modelType: 'auto' to let the SDK detect the type
  • Enable debug: true to see detailed initialization logs

Poor transcription accuracy:
  • Ensure the audio is 16 kHz mono WAV (most models)
  • Check audio quality and noise levels
  • Try a larger or higher-quality model
  • Use preferInt8: false for full precision

Hotwords not applied:
  • Verify the model type supports hotwords (transducer and nemo_transducer only)
  • Check that modelingUnit and bpeVocab are set correctly for BPE models
  • Ensure the hotwords file format is correct (one phrase per line, optional boost)
  • Increase hotwordsScore for stronger boosting

High memory usage:
  • Use preferInt8: true for smaller models
  • Reduce numThreads
  • Process shorter audio segments
  • Close other apps to free memory

Next Steps

  • Streaming STT: Real-time recognition with partial results
  • Model Setup: Learn how to bundle and load models
  • Execution Providers: Hardware acceleration (QNN, NNAPI, XNNPACK)
  • Text-to-Speech: Convert text to speech
