
Whisper Models

Whisper is OpenAI’s multilingual speech recognition model with excellent zero-shot performance across 90+ languages. It’s robust to diverse audio conditions and accents.

Model Architecture

Whisper uses an encoder-decoder architecture (without a joiner):
  • Encoder (encoder.onnx or encoder.int8.onnx) – Processes audio
  • Decoder (decoder.onnx or decoder.int8.onnx) – Generates text tokens
  • Tokens (tokens.txt) – Multilingual token vocabulary
The absence of a joiner component distinguishes Whisper from transducer models.

When to Use

Multilingual Content

Transcribe audio in 90+ languages without language-specific models

Diverse Audio

Robust to accents, background noise, and varying audio quality

Translation

Built-in translation to English (set task: 'translate')

Zero-Shot Recognition

Good accuracy without language-specific fine-tuning

Supported Languages

Whisper supports 90+ languages including:
  • English, Spanish, French, German, Italian, Portuguese
  • Chinese (Mandarin, Cantonese), Japanese, Korean
  • Arabic, Russian, Hindi, Bengali
  • And many more…
Use the SDK’s language helpers to get the full list:
import { getWhisperLanguages } from 'react-native-sherpa-onnx/stt';

const languages = getWhisperLanguages();
console.log(languages[0]); // { id: 'en', name: 'english' }

Performance Characteristics

Aspect     | Rating           | Notes
Streaming  | ❌ Not Supported | Offline/batch only (encoder-decoder architecture)
Accuracy   | ⭐⭐⭐⭐⭐       | Excellent multilingual accuracy
Speed      | ⭐⭐⭐           | Slower than CTC/transducer, but acceptable
Memory     | ⭐⭐⭐           | Larger models need significant RAM
Model Size | Large            | Tiny: ~40 MB, Base: ~75 MB, Small: ~250 MB, Large: 1+ GB

Whisper Models

Browse and download pretrained Whisper models (Tiny, Base, Small, Medium, Large)

Configuration Example

Basic Transcription

import { createSTT } from 'react-native-sherpa-onnx/stt';

const stt = await createSTT({
  modelPath: {
    type: 'asset',
    path: 'models/sherpa-onnx-whisper-tiny-en'
  },
  modelType: 'whisper', // or 'auto'
  preferInt8: true,
  numThreads: 2,
});

const result = await stt.transcribeFile('/path/to/audio.wav');
console.log('Transcription:', result.text);

await stt.destroy();

With Language Selection

import { createSTT, getWhisperLanguages } from 'react-native-sherpa-onnx/stt';

const stt = await createSTT({
  modelPath: {
    type: 'asset',
    path: 'models/sherpa-onnx-whisper-base'
  },
  modelType: 'whisper',
  modelOptions: {
    whisper: {
      language: 'de',           // German
      task: 'transcribe',       // or 'translate' for English translation
    }
  },
});

const result = await stt.transcribeFile('/path/to/german-audio.wav');
console.log('Result:', result.text); // German text

Translation to English

const stt = await createSTT({
  modelPath: { type: 'asset', path: 'models/whisper-base' },
  modelType: 'whisper',
  modelOptions: {
    whisper: {
      language: 'fr',        // French source
      task: 'translate',     // Translate to English
    }
  },
});

const result = await stt.transcribeFile('/path/to/french-audio.wav');
console.log('English translation:', result.text);

Model Options

Whisper supports several configuration options via modelOptions.whisper:
Option                  | Type                         | Description
language                | string                       | Language code (e.g. 'en', 'de', 'zh'). Use getWhisperLanguages() for valid codes. Omit for auto-detection.
task                    | 'transcribe' or 'translate'  | 'transcribe' returns text in the source language; 'translate' translates to English.
tailPaddings            | number                       | Padding at the end of the audio (default comes from the model)
enableTokenTimestamps   | boolean                      | Enable token-level timestamps (Android only)
enableSegmentTimestamps | boolean                      | Enable segment timestamps (Android only)
Important: Only use valid language codes from getWhisperLanguages(). Invalid values can crash the app.

Note: iOS currently supports only language, task, and tailPaddings; the timestamp options are Android-only.
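The crash warning above can be enforced with a small guard before calling createSTT(). A minimal sketch, assuming the { id, name } shape returned by getWhisperLanguages(); resolveWhisperLanguage is a hypothetical helper, not part of the SDK:

```typescript
type WhisperLanguage = { id: string; name: string };

// Map free-form user input ('de', 'German', ' german ') to a valid
// language id, or return undefined so the caller can omit the option
// and let Whisper auto-detect instead of crashing on an invalid code.
function resolveWhisperLanguage(
  input: string,
  languages: WhisperLanguage[],
): string | undefined {
  const normalized = input.trim().toLowerCase();
  return languages.find(
    (lang) => lang.id === normalized || lang.name === normalized,
  )?.id;
}
```

Feed it the list from getWhisperLanguages() and pass the result into modelOptions.whisper.language only when it is defined.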

Language Helpers

import { 
  getWhisperLanguages,
  WHISPER_LANGUAGES 
} from 'react-native-sherpa-onnx/stt';

// Get language list at runtime
const languages = getWhisperLanguages();
// [{ id: 'en', name: 'english' }, { id: 'zh', name: 'chinese' }, ...]

// Build a picker/dropdown
<Picker>
  {languages.map(lang => (
    <Picker.Item 
      key={lang.id} 
      label={lang.name} 
      value={lang.id} 
    />
  ))}
</Picker>

Model Variants

Variant | Size    | Speed     | Accuracy  | Use Case
Tiny    | ~40 MB  | Very Fast | Good      | Mobile devices, quick transcription
Base    | ~75 MB  | Fast      | Good      | Balanced mobile performance
Small   | ~250 MB | Medium    | Very Good | High-quality mobile transcription
Medium  | ~800 MB | Slow      | Excellent | High-end devices, best quality
Large   | 1+ GB   | Very Slow | Best      | Server-side, maximum accuracy
For mobile apps, Tiny and Base models are recommended. Use Small on high-end devices.
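One way to act on this recommendation is to pick a variant from the device's available memory. The thresholds below are assumptions for illustration; the page itself only says Tiny/Base for mobile and Small for high-end devices, and pickWhisperVariant is a hypothetical helper:

```typescript
// Hypothetical selection: more RAM allows a larger (more accurate) variant.
function pickWhisperVariant(deviceRamGb: number): 'tiny' | 'base' | 'small' {
  if (deviceRamGb >= 8) return 'small'; // high-end devices
  if (deviceRamGb >= 4) return 'base';  // mid-range devices
  return 'tiny';                        // low-memory devices
}

// e.g. modelPath: { type: 'asset', path: `models/whisper-${pickWhisperVariant(ramGb)}` }
```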

Model Detection

Whisper models are detected by:
  • Presence of encoder.onnx + decoder.onnx (no joiner.onnx)
  • Optional folder name pattern (containing whisper)
Expected files:
  • encoder.onnx (or encoder.int8.onnx)
  • decoder.onnx (or decoder.int8.onnx)
  • tokens.txt
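The heuristic above can be sketched as a pure function over a folder's file listing. This is illustrative only, not the SDK's actual detection code:

```typescript
// A folder with encoder + decoder (fp32 or int8) but no joiner, plus a
// tokens.txt, is treated as a Whisper-style encoder-decoder model.
// A folder name containing "whisper" is only an optional hint on top.
function looksLikeWhisperModel(files: string[]): boolean {
  const has = (base: string) =>
    files.includes(`${base}.onnx`) || files.includes(`${base}.int8.onnx`);
  return (
    has('encoder') && has('decoder') && !has('joiner') &&
    files.includes('tokens.txt')
  );
}
```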

Performance Tips

Use Quantized Models

Int8 quantization significantly reduces size and improves speed:
const stt = await createSTT({
  modelPath: { type: 'asset', path: 'models/whisper-tiny' },
  preferInt8: true, // Use encoder.int8.onnx and decoder.int8.onnx
});

Choose the Right Variant

Balance size, speed, and accuracy:
// For mobile apps
const stt = await createSTT({
  modelPath: { type: 'asset', path: 'models/whisper-tiny' }, // Fast, small
  preferInt8: true,
});

// For high accuracy (high-end devices)
const stt = await createSTT({
  modelPath: { type: 'asset', path: 'models/whisper-small' }, // Better quality
  numThreads: 4,
});

Optimize Thread Count

const stt = await createSTT({
  modelPath: { type: 'asset', path: 'models/whisper-base' },
  numThreads: 4, // More threads for faster inference
});

Streaming Support

Streaming: ❌ Not Supported

Whisper models use an encoder-decoder architecture that processes the entire audio sequence at once, so they cannot be used with createStreamingSTT(). For real-time recognition, use Transducer, NeMo CTC, or Tone CTC models instead.
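If your app handles several model types, route Whisper to batch transcription up front. A sketch with assumed type strings (only 'whisper' appears on this page; the exact streaming-capable strings are placeholders, and the SDK's getOnlineTypeOrNull() is the authoritative check):

```typescript
// Model types this page describes as real-time capable; the exact
// strings here are illustrative placeholders, not confirmed SDK values.
const STREAMING_TYPES = new Set(['transducer', 'nemo_ctc', 'tone_ctc']);

function recognitionMode(modelType: string): 'streaming' | 'batch' {
  // 'whisper' (and any other offline-only type) falls through to batch.
  return STREAMING_TYPES.has(modelType) ? 'streaming' : 'batch';
}
```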

Advantages

  1. Multilingual: 90+ languages without separate models
  2. Robust: Handles accents, noise, and varying audio quality
  3. Translation: Built-in translation to English
  4. Zero-Shot: Good accuracy without fine-tuning
  5. Widely Used: Battle-tested, well-documented

Limitations

  1. No Streaming: Cannot be used for real-time recognition
  2. Slower: Encoder-decoder is slower than CTC models
  3. Larger Models: Bigger files and memory footprint
  4. No Hotwords: Does not support contextual biasing

Use Cases

Multilingual Apps

Apps serving users in multiple countries/languages

Content Transcription

Transcribing podcasts, interviews, or videos

Subtitle Generation

Creating subtitles for pre-recorded content

Translation

Translating audio from any language to English

Common Issues

Crashes with invalid language codes
  • Use getWhisperLanguages() to get valid language codes
  • Never use free-text input for the language option
  • Omit language for auto-detection

Streaming not available
  • Whisper does not support streaming
  • Use Transducer, NeMo CTC, or Tone CTC for real-time recognition
  • Use getOnlineTypeOrNull(modelType) to check if a model supports streaming

Slow transcription
  • Use smaller variants (Tiny or Base)
  • Enable preferInt8: true for quantized models
  • Increase numThreads on multi-core devices
  • Consider Paraformer or CTC models for faster batch processing

High memory usage
  • Use Tiny or Base variants instead of Small/Medium/Large
  • Enable int8 quantization
  • Ensure no other heavy apps are running

Next Steps

STT API

Detailed API documentation

Model Setup

How to download and bundle models

Transducer Models

For streaming recognition

Execution Providers

Hardware acceleration options
