
Paraformer Models

Paraformer is a non-autoregressive speech recognition model that offers excellent speed and accuracy for both offline and streaming use cases.

Model Architecture

Paraformer uses a single-model architecture:
  • Model (model.onnx or model.int8.onnx) – Single neural network
  • Tokens (tokens.txt) – Token vocabulary
Unlike transducer models, Paraformer doesn’t require separate encoder, decoder, and joiner components, making it simpler to deploy.

When to Use

Fast Batch Processing

Excellent for transcribing multiple audio files quickly

Chinese Speech

Outstanding accuracy for Mandarin Chinese

Streaming Recognition

Supports streaming mode for real-time transcription

Resource-Constrained Devices

Single-model architecture uses less memory

Supported Languages

Paraformer models are primarily available for:
  • Chinese (Mandarin) – Excellent accuracy, widely used
  • English – Some bilingual variants available
  • Chinese + English – Bilingual models
Paraformer models excel at Chinese speech recognition and are widely adopted in Chinese applications.

Performance Characteristics

| Aspect | Rating | Notes |
|---|---|---|
| Streaming | ✅ Supported | Streaming-capable with good latency |
| Accuracy | ⭐⭐⭐⭐⭐ | Very high accuracy, especially for Chinese |
| Speed | ⭐⭐⭐⭐⭐ | Fast non-autoregressive inference |
| Memory | ⭐⭐⭐⭐⭐ | Low memory usage (single model) |
| Model Size | Small-Medium | Typically 50-200 MB depending on variant |

Paraformer Models

Browse and download pretrained Paraformer models

Configuration Example

Offline Transcription

import { createSTT } from 'react-native-sherpa-onnx/stt';

const stt = await createSTT({
  modelPath: {
    type: 'asset',
    path: 'models/sherpa-onnx-paraformer-zh-2023-03-28'
  },
  modelType: 'paraformer', // or 'auto'
  preferInt8: true,
  numThreads: 2,
});

const result = await stt.transcribeFile('/path/to/audio.wav');
console.log('Transcription:', result.text);
console.log('Tokens:', result.tokens);

await stt.destroy();

Transcribe from Samples

const samples = getPcmSamples(); // float[] in [-1, 1]
const result = await stt.transcribeSamples(samples, 16000);
console.log('Result:', result.text);
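transcribeSamples() expects float samples in [-1, 1], but audio captured from a microphone or decoded from a WAV file is often 16-bit signed PCM. A minimal conversion sketch (the function name here is illustrative, not part of the library):

```typescript
// Convert 16-bit signed PCM to the float range [-1, 1] expected by
// transcribeSamples(). Dividing by 32768 (the magnitude of the most
// negative Int16 value) keeps every output sample within [-1, 1].
function int16ToFloat(pcm: Int16Array): Float32Array {
  const out = new Float32Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) {
    out[i] = pcm[i] / 32768;
  }
  return out;
}
```

With a raw PCM buffer in hand, usage would look like `await stt.transcribeSamples(int16ToFloat(rawPcm), 16000)`, where `rawPcm` stands in for whatever Int16 source your app uses.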

Streaming Recognition

import { createStreamingSTT } from 'react-native-sherpa-onnx/stt';

const engine = await createStreamingSTT({
  modelPath: {
    type: 'asset',
    path: 'models/sherpa-onnx-streaming-paraformer-zh'
  },
  modelType: 'paraformer',
  enableEndpoint: true,
});

const stream = await engine.createStream();

// Process audio chunks
const { result, isEndpoint } = await stream.processAudioChunk(samples, 16000);
console.log('Partial:', result.text);

await stream.release();
await engine.destroy();
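In practice audio arrives in small buffers rather than one large array. To simulate that (for example, when replaying a recorded file through the streaming API), you can split the recording into fixed-size chunks and feed them to processAudioChunk() one at a time. The chunk size of 1600 samples (100 ms at 16 kHz) is illustrative; the helper below is a sketch, not part of the library:

```typescript
// Split a long recording into fixed-size chunks so it can be fed to
// processAudioChunk() incrementally, the way a live microphone source
// would deliver audio. subarray() creates views, not copies.
function chunkSamples(samples: Float32Array, chunkSize: number): Float32Array[] {
  const chunks: Float32Array[] = [];
  for (let i = 0; i < samples.length; i += chunkSize) {
    chunks.push(samples.subarray(i, Math.min(i + chunkSize, samples.length)));
  }
  return chunks;
}

// Feeding the chunks (sketch; `stream` comes from engine.createStream()):
// for (const chunk of chunkSamples(recording, 1600)) {
//   const { result, isEndpoint } = await stream.processAudioChunk(chunk, 16000);
//   console.log(isEndpoint ? 'Final:' : 'Partial:', result.text);
// }
```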

Model Detection

Paraformer models are detected automatically by the presence of model.onnx (or model.int8.onnx) and tokens.txt. No folder name pattern is required. Expected files:
  • model.onnx (or model.int8.onnx)
  • tokens.txt
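The detection rule above can be sketched as a pure function. This is an illustrative model of the documented behavior, not the library's actual internals, and the function name is hypothetical:

```typescript
// Sketch of the detection rule: a folder counts as a Paraformer model
// if it contains tokens.txt plus model.onnx and/or model.int8.onnx.
// preferInt8 selects the quantized file when both are present.
function detectParaformer(
  files: string[],
  preferInt8: boolean
): { modelFile: string } | null {
  if (!files.includes('tokens.txt')) return null;
  const hasFp32 = files.includes('model.onnx');
  const hasInt8 = files.includes('model.int8.onnx');
  if (!hasFp32 && !hasInt8) return null;
  const modelFile =
    preferInt8 && hasInt8 ? 'model.int8.onnx'
    : hasFp32 ? 'model.onnx'
    : 'model.int8.onnx';
  return { modelFile };
}
```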

Performance Tips

Use Quantized Models

Int8-quantized Paraformer models trade a small amount of accuracy for noticeably faster inference and a smaller file on disk:
const stt = await createSTT({
  modelPath: { type: 'asset', path: 'models/paraformer-zh' },
  preferInt8: true, // Automatically use model.int8.onnx if available
});

Optimize for Batch Processing

For transcribing multiple files:
const stt = await createSTT({
  modelPath: { type: 'asset', path: 'models/paraformer-zh' },
  numThreads: 4, // Use more threads for faster batch processing
});

const files = ['audio1.wav', 'audio2.wav', 'audio3.wav'];
const results = await Promise.all(
  files.map(file => stt.transcribeFile(file))
);
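Note that Promise.all starts every transcription at once, which can spike memory on large collections. A bounded-concurrency variant processes at most a few files at a time while preserving input order; the helper below is a sketch, not part of the library:

```typescript
// Run at most `limit` transcriptions concurrently instead of launching
// every file at once. Each worker pulls the next index until the list
// is exhausted; results are stored at their original positions.
async function mapWithLimit<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, () => worker())
  );
  return results;
}

// const results = await mapWithLimit(files, 2, f => stt.transcribeFile(f));
```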

Hardware Acceleration

const stt = await createSTT({
  modelPath: { type: 'asset', path: 'models/paraformer-zh' },
  provider: 'nnapi', // Android NNAPI
  // provider: 'xnnpack', // XNNPACK for broader compatibility
});
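Since NNAPI is Android-only while XNNPACK is a portable CPU backend, the provider choice usually follows the platform. A small sketch (in a real React Native app the `os` argument would come from `Platform.OS`; the helper name is hypothetical):

```typescript
// Pick an execution provider based on the platform: NNAPI on Android,
// XNNPACK everywhere else as a broadly compatible CPU backend.
function pickProvider(os: string): 'nnapi' | 'xnnpack' {
  return os === 'android' ? 'nnapi' : 'xnnpack';
}

// const stt = await createSTT({
//   modelPath: { type: 'asset', path: 'models/paraformer-zh' },
//   provider: pickProvider(Platform.OS),
// });
```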

Streaming Support

Streaming: ✅ Yes. Paraformer supports streaming recognition. Use createStreamingSTT() with modelType: 'paraformer' for real-time transcription.

Advantages

  1. Fast Inference: Non-autoregressive decoding is faster than autoregressive models
  2. Simple Deployment: Single model file, no separate encoder/decoder/joiner
  3. Excellent for Chinese: State-of-the-art accuracy for Mandarin
  4. Low Memory: Single-model architecture uses less RAM
  5. Streaming Capable: Supports real-time recognition

Limitations

  1. Language Coverage: Primarily Chinese-focused, fewer English-only variants
  2. No Hotwords: Does not support contextual biasing (use transducer models for hotwords)
  3. Domain-Specific: Best suited for general Chinese speech (not specialized domains without fine-tuning)

Use Cases

Chinese Transcription

Transcribing Chinese audio files, podcasts, or videos

Real-Time Subtitles

Live Chinese captions for streaming or conferencing

Voice Input

Chinese voice input for apps and forms

Batch Processing

Transcribing large collections of Chinese audio

Common Issues

Model fails to load:
  • Verify model.onnx and tokens.txt are present
  • Check that the model path is correct
  • Ensure sufficient device memory

Poor accuracy on non-Chinese speech:
  • Paraformer models are optimized for Chinese
  • Use Whisper or transducer models for other languages
  • Check if you’re using a bilingual (Chinese+English) variant

Slow performance:
  • Enable preferInt8: true for quantized models
  • Increase numThreads on multi-core devices
  • Use hardware acceleration (provider: 'nnapi')

Comparison with Other Models

| Feature | Paraformer | Transducer | Whisper |
|---|---|---|---|
| Speed | Very Fast | Fast | Medium |
| Chinese Accuracy | Excellent | Good | Good |
| Streaming | Yes | Yes | No |
| Hotwords | No | Yes | No |
| Multilingual | Limited | Varies | Excellent |
| Model Size | Small | Medium | Large |

Next Steps

STT API

Detailed API documentation

Streaming STT

Real-time recognition guide

Model Setup

How to download and bundle models

Execution Providers

Hardware acceleration options
