Speech-to-Text Models Overview

react-native-sherpa-onnx supports a wide range of speech-to-text (STT) model architectures, from fast streaming transducers to multilingual models like Whisper. This guide helps you choose the right model for your use case.

Model Comparison

Zipformer/Transducer

Fast streaming recognition with excellent accuracy. Best for real-time use cases.

Paraformer

Non-autoregressive ASR with very fast inference, for both streaming and batch transcription.

Whisper

Multilingual, robust zero-shot recognition. Strong for diverse audio conditions.

NeMo CTC

Excellent for English and for streaming use. Good balance of speed and accuracy.

Other Models

WeNet, SenseVoice, FunASR, Moonshine, and specialized models.

Quick Comparison Table

| Model Type | Streaming | Multilingual | Speed | Use Case |
| --- | --- | --- | --- | --- |
| Zipformer/Transducer | ✅ Yes | Depends | Fast | Real-time recognition, voice assistants |
| LSTM Transducer | ✅ Yes | Depends | Fast | Streaming ASR, mobile apps |
| Paraformer | ✅ Yes | Limited | Very Fast | Fast batch transcription |
| Whisper | ❌ No | ✅ Yes (90+ langs) | Medium | Multilingual transcription, diverse audio |
| NeMo CTC | ✅ Yes | Limited | Fast | English streaming, live captions |
| WeNet CTC | ❌ No | Limited | Fast | Compact deployment |
| SenseVoice | ❌ No | ✅ Yes | Medium | Emotion detection, punctuation |
| FunASR Nano | ❌ No | Limited | Medium | LLM-based ASR with prompts |
| Moonshine | ✅ Yes (v1 & v2) | Limited | Fast | Streaming-capable lightweight ASR |
| Fire Red ASR | ❌ No | Limited | Medium | Encoder-decoder ASR |
| Dolphin | ❌ No | Limited | Fast | Single-model CTC |
| Canary | ❌ No | ✅ Yes | Medium | Multilingual NeMo model |
| Omnilingual | ❌ No | ✅ Yes | Medium | Wide language coverage |
| Tone CTC | ✅ Yes | Limited | Very Fast | Lightweight streaming CTC |
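If you select a model at runtime, the capabilities in the table above can be encoded as plain data and filtered. A minimal sketch (the entries are transcribed from the table; none of this is a library API):

```typescript
// Capabilities transcribed from the comparison table above (abridged).
interface SttModelInfo {
  name: string;
  streaming: boolean;
  multilingual: 'yes' | 'limited' | 'depends';
  speed: 'very-fast' | 'fast' | 'medium';
}

const STT_MODELS: SttModelInfo[] = [
  { name: 'Zipformer/Transducer', streaming: true, multilingual: 'depends', speed: 'fast' },
  { name: 'LSTM Transducer', streaming: true, multilingual: 'depends', speed: 'fast' },
  { name: 'Paraformer', streaming: true, multilingual: 'limited', speed: 'very-fast' },
  { name: 'Whisper', streaming: false, multilingual: 'yes', speed: 'medium' },
  { name: 'NeMo CTC', streaming: true, multilingual: 'limited', speed: 'fast' },
  { name: 'SenseVoice', streaming: false, multilingual: 'yes', speed: 'medium' },
  { name: 'Tone CTC', streaming: true, multilingual: 'limited', speed: 'very-fast' },
];

// List models that can run live from a microphone.
function streamingModels(models: SttModelInfo[]): string[] {
  return models.filter((m) => m.streaming).map((m) => m.name);
}
```

The same data can be filtered on `multilingual` or `speed` for the other decision axes below.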

Choosing a Model

For Real-Time Recognition (Streaming)

If you need live recognition from a microphone, choose one of these streaming-capable models:
  • Zipformer/Transducer – Best overall for streaming, excellent accuracy
  • NeMo CTC – Great for English streaming applications
  • Tone CTC – Lightweight option for resource-constrained devices
  • LSTM Transducer – LSTM-based streaming alternative
  • Paraformer – Fast streaming with a non-autoregressive approach
  • Moonshine – Modern streaming-capable architecture

For Batch/Offline Transcription

If you’re transcribing pre-recorded audio files:
  • Whisper – Best for multilingual content, robust to noise
  • Paraformer – Fastest for single-language batch processing
  • SenseVoice – When you need emotion labels and punctuation
  • Canary – Multilingual with good accuracy

By Language Support

English Only:
  • NeMo CTC (streaming)
  • Tone CTC (streaming)
  • Many Zipformer variants
Multilingual (90+ languages):
  • Whisper (offline)
  • Canary (offline)
  • Omnilingual (offline)
  • SenseVoice (5 languages + emotion)
Chinese:
  • Paraformer (excellent for Mandarin)
  • FunASR Nano (LLM-based with prompts)
  • SenseVoice (Chinese + emotion)

By Device Constraints

Low-end devices / limited RAM:
  • Tone CTC (lightweight streaming)
  • Dolphin (compact single-model)
  • WeNet CTC (compact deployment)
  • Use int8 quantized variants when available
High-end devices:
  • Whisper (large models)
  • Canary (multilingual)
  • Full Zipformer models
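A simple way to act on these constraints is a device-tier heuristic. The sketch below picks a model tier from available RAM; the 2 GB threshold is an illustrative assumption, not a library default:

```typescript
// Rough heuristic for picking a model family from available RAM (in MB).
// The 2048 MB cutoff is an illustrative assumption, not an SDK default.
type ModelTier = 'lightweight' | 'full';

function pickTier(availableRamMb: number): { tier: ModelTier; preferInt8: boolean } {
  if (availableRamMb < 2048) {
    // Low-end: compact CTC models (Tone CTC, Dolphin, WeNet CTC) plus int8.
    return { tier: 'lightweight', preferInt8: true };
  }
  // High-end: full Zipformer, Whisper, or Canary models.
  return { tier: 'full', preferInt8: false };
}
```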

Model Detection

The SDK automatically detects model types based on folder name patterns and file layouts. You can also force a specific type:
import { createSTT, detectSttModel } from 'react-native-sherpa-onnx/stt';

// Auto-detect model type
const detectedInfo = await detectSttModel({
  type: 'asset',
  path: 'models/sherpa-onnx-whisper-tiny-en'
});
console.log(detectedInfo.modelType); // 'whisper'

// Create STT with auto-detection
const stt = await createSTT({
  modelPath: { type: 'asset', path: 'models/sherpa-onnx-whisper-tiny-en' },
  modelType: 'auto', // Auto-detect
  preferInt8: true,
});
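To illustrate what folder-name detection might look like, here is a simplified string-matching sketch. The patterns are assumptions for illustration only; the SDK's real detector also inspects the file layout inside the folder:

```typescript
// Illustrative folder-name matching; not the SDK's actual detection rules.
function guessModelType(folderName: string): string {
  const name = folderName.toLowerCase();
  if (name.includes('whisper')) return 'whisper';
  if (name.includes('paraformer')) return 'paraformer';
  if (name.includes('zipformer') || name.includes('transducer')) return 'transducer';
  if (name.includes('sense-voice') || name.includes('sensevoice')) return 'sense-voice';
  if (name.includes('moonshine')) return 'moonshine';
  return 'unknown';
}
```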

Performance Tips

Use Quantized Models

Set preferInt8: true to automatically use int8 quantized models when available:
const stt = await createSTT({
  modelPath: { type: 'asset', path: 'models/whisper-tiny' },
  preferInt8: true, // Faster inference, smaller memory footprint
});
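Conceptually, `preferInt8` is a file-resolution preference: when a model folder contains both a full-precision and an int8 ONNX file, the int8 one wins. A sketch of that idea (the `.int8.onnx` naming follows common sherpa-onnx conventions, but this is not the SDK's actual resolver):

```typescript
// Given the ONNX files in a model folder, prefer an int8 variant when asked.
// A sketch of the preferInt8 idea, not the SDK's actual file resolver.
function resolveModelFile(files: string[], preferInt8: boolean): string | undefined {
  const int8 = files.find((f) => f.endsWith('.int8.onnx'));
  const fp32 = files.find((f) => f.endsWith('.onnx') && !f.endsWith('.int8.onnx'));
  return preferInt8 && int8 ? int8 : fp32 ?? int8;
}
```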

Adjust Thread Count

Increase threads on multi-core devices:
const stt = await createSTT({
  modelPath: { type: 'asset', path: 'models/zipformer' },
  numThreads: 4, // Use multiple cores
});
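More threads are not always better: on mobile, inference often stops scaling past a few cores, and extra threads cost battery. A hedged heuristic for picking `numThreads` (the cap of 4 is an illustrative assumption, not a measured optimum):

```typescript
// Heuristic: clamp the thread count to [1, 4]. Mobile inference often sees
// diminishing returns beyond 4 threads; the cap is an illustrative assumption.
function chooseNumThreads(coreCount: number): number {
  return Math.max(1, Math.min(coreCount, 4));
}
```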

Use Execution Providers

Leverage hardware acceleration:
const stt = await createSTT({
  modelPath: { type: 'asset', path: 'models/paraformer' },
  provider: 'nnapi',     // Android NNAPI
  // provider: 'xnnpack', // XNNPACK
  // provider: 'qnn',     // Qualcomm QNN
});
See the Execution Providers guide for more details.

All pretrained models are available from the sherpa-onnx pretrained models repository.
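Since the best provider differs per platform, apps often pick a default from the OS at runtime. A sketch written as a pure function over an OS string (such as react-native's `Platform.OS`); the mapping is an assumption for illustration, and `'cpu'` is the safe fallback:

```typescript
// Illustrative per-platform default provider. The android -> nnapi mapping
// follows the example above; the rest of the mapping is an assumption.
function defaultProvider(os: string): string {
  if (os === 'android') return 'nnapi';
  return 'cpu'; // safe fallback everywhere else
}
```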

Next Steps

Model Setup Guide

Learn how to download and bundle models with your app

STT API Reference

Detailed API documentation for speech recognition

Streaming STT

Real-time recognition from microphone

Hotwords

Contextual biasing for improved accuracy