
Whisper Models

Whisper is OpenAI’s multilingual speech recognition model with excellent zero-shot performance across 90+ languages. It’s robust to diverse audio conditions and accents.

Model Architecture

Whisper uses an encoder-decoder architecture (without a joiner):
  • Encoder (encoder.onnx or encoder.int8.onnx) – Processes audio
  • Decoder (decoder.onnx or decoder.int8.onnx) – Generates text tokens
  • Tokens (tokens.txt) – Multilingual token vocabulary
The absence of a joiner component distinguishes Whisper from transducer models.

When to Use

Multilingual Content

Transcribe audio in 90+ languages without language-specific models

Diverse Audio

Robust to accents, background noise, and varying audio quality

Translation

Built-in translation to English (set task: 'translate')

Zero-Shot Recognition

Good accuracy without language-specific fine-tuning

Supported Languages

Whisper supports 90+ languages including:
  • English, Spanish, French, German, Italian, Portuguese
  • Chinese (Mandarin, Cantonese), Japanese, Korean
  • Arabic, Russian, Hindi, Bengali
  • And many more…
Use the SDK’s language helpers to get the full list:
import { getWhisperLanguages } from 'react-native-sherpa-onnx/stt';

const languages = getWhisperLanguages();
console.log(languages[0]); // { id: 'en', name: 'english' }

Performance Characteristics

Aspect     | Rating           | Notes
Streaming  | ❌ Not Supported | Offline/batch only (encoder-decoder architecture)
Accuracy   | ⭐⭐⭐⭐⭐       | Excellent multilingual accuracy
Speed      | ⭐⭐⭐           | Slower than CTC/transducer, but acceptable
Memory     | ⭐⭐⭐           | Larger models need significant RAM
Model Size | Large            | Tiny: ~40 MB, Base: ~75 MB, Small: ~250 MB, Large: 1+ GB

Whisper Models

Browse and download pretrained Whisper models (Tiny, Base, Small, Medium, Large)

Configuration Example

Basic Transcription

import { createSTT } from 'react-native-sherpa-onnx/stt';

const stt = await createSTT({
  modelPath: {
    type: 'asset',
    path: 'models/sherpa-onnx-whisper-tiny-en'
  },
  modelType: 'whisper', // or 'auto'
  preferInt8: true,
  numThreads: 2,
});

const result = await stt.transcribeFile('/path/to/audio.wav');
console.log('Transcription:', result.text);

await stt.destroy();

With Language Selection

import { createSTT, getWhisperLanguages } from 'react-native-sherpa-onnx/stt';

const stt = await createSTT({
  modelPath: {
    type: 'asset',
    path: 'models/sherpa-onnx-whisper-base'
  },
  modelType: 'whisper',
  modelOptions: {
    whisper: {
      language: 'de',           // German
      task: 'transcribe',       // or 'translate' for English translation
    }
  },
});

const result = await stt.transcribeFile('/path/to/german-audio.wav');
console.log('Result:', result.text); // German text

Translation to English

const stt = await createSTT({
  modelPath: { type: 'asset', path: 'models/whisper-base' },
  modelType: 'whisper',
  modelOptions: {
    whisper: {
      language: 'fr',        // French source
      task: 'translate',     // Translate to English
    }
  },
});

const result = await stt.transcribeFile('/path/to/french-audio.wav');
console.log('English translation:', result.text);

Model Options

Whisper supports several configuration options via modelOptions.whisper:
Option                  | Type                         | Description
language                | string                       | Language code (e.g. 'en', 'de', 'zh'). Use getWhisperLanguages() for valid codes. Omit for auto-detection.
task                    | 'transcribe' or 'translate'  | 'transcribe' returns text in the source language; 'translate' translates to English.
tailPaddings            | number                       | Padding at the end of the audio (default comes from the model)
enableTokenTimestamps   | boolean                      | Enable token-level timestamps (Android only)
enableSegmentTimestamps | boolean                      | Enable segment timestamps (Android only)
Important: Only use valid language codes from getWhisperLanguages(). Invalid values can crash the app.

Note: iOS currently supports only language, task, and tailPaddings; the timestamp options are Android-only.
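The crash warning above can be enforced with a small guard before calling createSTT(). A minimal sketch, assuming the { id, name } shape returned by getWhisperLanguages(); resolveWhisperLanguage is a hypothetical helper, not part of the SDK:

```typescript
type WhisperLanguage = { id: string; name: string };

// Map free-form user input ('de', 'German', ' german ') to a valid
// language id, or return undefined so the caller can omit the option
// and let Whisper auto-detect instead of crashing on an invalid code.
function resolveWhisperLanguage(
  input: string,
  languages: WhisperLanguage[],
): string | undefined {
  const normalized = input.trim().toLowerCase();
  return languages.find(
    (lang) => lang.id === normalized || lang.name === normalized,
  )?.id;
}
```

Feed it the list from getWhisperLanguages() and pass the result into modelOptions.whisper.language only when it is defined.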

Language Helpers

import { 
  getWhisperLanguages,
  WHISPER_LANGUAGES 
} from 'react-native-sherpa-onnx/stt';

// Get language list at runtime
const languages = getWhisperLanguages();
// [{ id: 'en', name: 'english' }, { id: 'zh', name: 'chinese' }, ...]

// Build a picker/dropdown
<Picker>
  {languages.map(lang => (
    <Picker.Item 
      key={lang.id} 
      label={lang.name} 
      value={lang.id} 
    />
  ))}
</Picker>

Model Variants

Variant | Size    | Speed     | Accuracy  | Use Case
Tiny    | ~40 MB  | Very Fast | Good      | Mobile devices, quick transcription
Base    | ~75 MB  | Fast      | Good      | Balanced mobile performance
Small   | ~250 MB | Medium    | Very Good | High-quality mobile transcription
Medium  | ~800 MB | Slow      | Excellent | High-end devices, best quality
Large   | 1+ GB   | Very Slow | Best      | Server-side, maximum accuracy
For mobile apps, Tiny and Base models are recommended. Use Small on high-end devices.
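One way to act on this recommendation is to pick a variant from the device's available memory. The thresholds below are assumptions for illustration; the page itself only says Tiny/Base for mobile and Small for high-end devices, and pickWhisperVariant is a hypothetical helper:

```typescript
// Hypothetical selection: more RAM allows a larger (more accurate) variant.
function pickWhisperVariant(deviceRamGb: number): 'tiny' | 'base' | 'small' {
  if (deviceRamGb >= 8) return 'small'; // high-end devices
  if (deviceRamGb >= 4) return 'base';  // mid-range devices
  return 'tiny';                        // low-memory devices
}

// e.g. modelPath: { type: 'asset', path: `models/whisper-${pickWhisperVariant(ramGb)}` }
```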

Model Detection

Whisper models are detected by:
  • Presence of encoder.onnx + decoder.onnx (no joiner.onnx)
  • Optional folder name pattern (containing whisper)
Expected files:
  • encoder.onnx (or encoder.int8.onnx)
  • decoder.onnx (or decoder.int8.onnx)
  • tokens.txt
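The heuristic above can be sketched as a pure function over a folder's file listing. This is illustrative only, not the SDK's actual detection code:

```typescript
// A folder with encoder + decoder (fp32 or int8) but no joiner, plus a
// tokens.txt, is treated as a Whisper-style encoder-decoder model.
// A folder name containing "whisper" is only an optional hint on top.
function looksLikeWhisperModel(files: string[]): boolean {
  const has = (base: string) =>
    files.includes(`${base}.onnx`) || files.includes(`${base}.int8.onnx`);
  return (
    has('encoder') && has('decoder') && !has('joiner') &&
    files.includes('tokens.txt')
  );
}
```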

Performance Tips

Use Quantized Models

Int8 quantization significantly reduces size and improves speed:
const stt = await createSTT({
  modelPath: { type: 'asset', path: 'models/whisper-tiny' },
  preferInt8: true, // Use encoder.int8.onnx and decoder.int8.onnx
});

Choose the Right Variant

Balance size, speed, and accuracy:
// For mobile apps
const stt = await createSTT({
  modelPath: { type: 'asset', path: 'models/whisper-tiny' }, // Fast, small
  preferInt8: true,
});

// For high accuracy (high-end devices)
const stt = await createSTT({
  modelPath: { type: 'asset', path: 'models/whisper-small' }, // Better quality
  numThreads: 4,
});

Optimize Thread Count

const stt = await createSTT({
  modelPath: { type: 'asset', path: 'models/whisper-base' },
  numThreads: 4, // More threads for faster inference
});

Streaming Support

Streaming: ❌ Not Supported

Whisper models use an encoder-decoder architecture that processes the entire audio sequence at once, so they cannot be used with createStreamingSTT(). For real-time recognition, use Transducer, NeMo CTC, or Tone CTC models instead.
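If your app handles several model types, route Whisper to batch transcription up front. A sketch with assumed type strings (only 'whisper' appears on this page; the exact streaming-capable strings are placeholders, and the SDK's getOnlineTypeOrNull() is the authoritative check):

```typescript
// Model types this page describes as real-time capable; the exact
// strings here are illustrative placeholders, not confirmed SDK values.
const STREAMING_TYPES = new Set(['transducer', 'nemo_ctc', 'tone_ctc']);

function recognitionMode(modelType: string): 'streaming' | 'batch' {
  // 'whisper' (and any other offline-only type) falls through to batch.
  return STREAMING_TYPES.has(modelType) ? 'streaming' : 'batch';
}
```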

Advantages

  1. Multilingual: 90+ languages without separate models
  2. Robust: Handles accents, noise, and varying audio quality
  3. Translation: Built-in translation to English
  4. Zero-Shot: Good accuracy without fine-tuning
  5. Widely Used: Battle-tested, well-documented

Limitations

  1. No Streaming: Cannot be used for real-time recognition
  2. Slower: Encoder-decoder is slower than CTC models
  3. Larger Models: Bigger files and memory footprint
  4. No Hotwords: Does not support contextual biasing

Use Cases

Multilingual Apps

Apps serving users in multiple countries/languages

Content Transcription

Transcribing podcasts, interviews, or videos

Subtitle Generation

Creating subtitles for pre-recorded content

Translation

Translating audio from any language to English

Common Issues

Crashes with invalid language codes
  • Use getWhisperLanguages() to get valid language codes
  • Never use free-text input for the language option
  • Omit language for auto-detection

Streaming not available
  • Whisper does not support streaming
  • Use Transducer, NeMo CTC, or Tone CTC for real-time recognition
  • Use getOnlineTypeOrNull(modelType) to check if a model supports streaming

Slow transcription
  • Use smaller variants (Tiny or Base)
  • Enable preferInt8: true for quantized models
  • Increase numThreads on multi-core devices
  • Consider Paraformer or CTC models for faster batch processing

High memory usage
  • Use Tiny or Base variants instead of Small/Medium/Large
  • Enable int8 quantization
  • Ensure no other heavy apps are running

Next Steps

STT API

Detailed API documentation

Model Setup

How to download and bundle models

Transducer Models

For streaming recognition

Execution Providers

Hardware acceleration options
