# Zipformer & Transducer Models

Zipformer and Transducer models use an encoder-decoder-joiner architecture (also known as RNN-T) and provide the best balance of speed and accuracy for streaming speech recognition.

## Model Architecture
Transducer models consist of three components, plus a token vocabulary file:

- Encoder (`encoder.onnx`) – processes audio features
- Decoder (`decoder.onnx`) – language model component
- Joiner (`joiner.onnx`) – combines encoder and decoder outputs
- Tokens (`tokens.txt`) – token vocabulary
## Variants

### Zipformer (Standard)

Modern transformer-based transducer models:

- Excellent accuracy
- Fast inference
- Streaming capable
- Lower memory usage than LSTM variants
### LSTM Transducer

LSTM-based transducer models:

- Same encoder-decoder-joiner layout
- Good for streaming ASR
- Detected automatically as `transducer` type
- May have a lower memory footprint
## When to Use

### Real-Time Recognition

Live transcription from a microphone with low latency and partial results.

### Voice Assistants

Interactive voice interfaces with fast response times.

### Live Captions

Real-time subtitle generation for videos or meetings.

### Contextual Biasing

Supports hotwords for domain-specific vocabulary (see below).
Supported Languages
Available in many languages including:- English (multiple variants)
- Chinese (Mandarin, Cantonese)
- German, French, Spanish
- Russian, Japanese, Korean
- And many more
## Performance Characteristics
| Aspect | Rating | Notes |
|---|---|---|
| Streaming | ✅ Excellent | Native streaming support with low latency |
| Accuracy | ⭐⭐⭐⭐⭐ | Very high accuracy, especially for trained languages |
| Speed | ⭐⭐⭐⭐⭐ | Fast inference, real-time capable |
| Memory | ⭐⭐⭐⭐ | Moderate memory usage, int8 models available |
| Model Size | Medium | Typically 50-150 MB (varies by language/variant) |
## Download Links

### Zipformer Models

Browse and download pretrained Zipformer models.

### LSTM Transducer Models

Browse and download LSTM transducer models.
## Configuration Example
### Offline Transcription
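A minimal configuration sketch for offline transcription. The option names `modelType`, `numThreads`, and `preferInt8` appear elsewhere in these docs; `modelPath` and the `createSTT()` call are assumptions, so check your release for the exact names.

```typescript
// Sketch only: modelPath and createSTT() are assumed names; modelType,
// numThreads, and preferInt8 are option names used elsewhere in these docs.
const offlineConfig = {
  modelType: 'transducer',           // Zipformer/LSTM transducer models
  modelPath: 'models/zipformer-en',  // folder with encoder/decoder/joiner/tokens
  numThreads: 2,
  preferInt8: true,                  // load *.int8.onnx variants when present
};

// const stt = await createSTT(offlineConfig);
// const text = await stt.transcribeFile('audio.wav');
```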
### Streaming Recognition
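A streaming sketch along the same lines. `createStreamingSTT()` is named in these docs, but its exact signature and the callback API below are assumptions.

```typescript
// Sketch only: the folder name and the partial-result callback are
// illustrative; consult your release for the actual streaming API.
const streamingConfig = {
  modelType: 'transducer',
  modelPath: 'models/zipformer-en-streaming', // assumed folder name
  numThreads: 2,
};

// const stream = await createStreamingSTT(streamingConfig);
// stream.onPartialResult((text: string) => console.log('partial:', text));
```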
## Hotwords Support

Transducer models are the only model type that supports hotwords (contextual biasing) for boosting domain-specific vocabulary. Hotwords are supplied in a text file (`hotwords.txt`), one phrase per line with an optional boost value.
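A sketch of what such a file might look like. The phrases and boost values here are invented for illustration, and the trailing `:boost` syntax is an assumption; check your release for the exact format.

```text
SPEECH RECOGNITION
ZIPFORMER :2.0
ACOUSTIC MODEL :1.5
```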
## Runtime Configuration

You can update recognition parameters at runtime.

## Model Detection

The folder name should contain `zipformer` or `transducer` for auto-detection. LSTM models may contain `lstm` in the folder name.
Expected files:

- `encoder.onnx` (or `encoder.int8.onnx`)
- `decoder.onnx` (or `decoder.int8.onnx`)
- `joiner.onnx` (or `joiner.int8.onnx`)
- `tokens.txt`
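A small sketch that checks a model folder for the files listed above before loading, accepting the int8 variants where present. The function name is invented; this is not part of the library API.

```typescript
import * as fs from 'node:fs';

// Sketch: verify a transducer model folder contains the required files.
// Each component may be present as either the float or the int8 variant.
function hasTransducerFiles(dir: string): boolean {
  const required: string[][] = [
    ['encoder.onnx', 'encoder.int8.onnx'],
    ['decoder.onnx', 'decoder.int8.onnx'],
    ['joiner.onnx', 'joiner.int8.onnx'],
    ['tokens.txt'],
  ];
  // Every component must exist in at least one accepted variant.
  return required.every((alts) =>
    alts.some((f) => fs.existsSync(`${dir}/${f}`)),
  );
}
```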
## Performance Tips

### Use Quantized Models

Int8-quantized models are:
- 3-4x smaller
- 2-3x faster
- Minimal accuracy loss
### Optimize Thread Count

Increase `numThreads` when the device has multiple CPU cores.
### Use Hardware Acceleration

Enable a hardware execution provider such as `nnapi` or `xnnpack` where available.
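The three tips above can be combined in one configuration. The option names `numThreads`, `preferInt8`, and `provider` are used elsewhere in these docs; the specific values are illustrative.

```typescript
// Sketch combining the performance tips; values are illustrative.
const tunedConfig = {
  modelType: 'transducer',
  preferInt8: true,   // prefer *.int8.onnx: ~3-4x smaller, ~2-3x faster
  numThreads: 4,      // match the number of available CPU cores
  provider: 'nnapi',  // Android NNAPI; 'xnnpack' is a CPU alternative
};
```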
## Streaming Support

Streaming: ✅ Yes

Transducer models have native streaming support. Use `createStreamingSTT()` for real-time recognition.

## Common Issues
### Model not loading

- Ensure all three model files are present: `encoder.onnx`, `decoder.onnx`, `joiner.onnx`
- Check that `tokens.txt` exists
- Verify the folder name contains `zipformer` or `transducer` for auto-detection
### Hotwords not working

- Verify `modelType` is `'transducer'` (hotwords only work with transducer models)
- Check the hotwords file format (one phrase per line, optional boost value)
- Use `sttSupportsHotwords(modelType)` to verify compatibility
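The transducer-only rule can be restated as a tiny local helper. This is not the library's `sttSupportsHotwords()`, just an illustration of the documented behavior.

```typescript
// Local illustration of the documented rule: only the 'transducer'
// model type supports hotwords. Not the actual library function.
function supportsHotwordsLocal(modelType: string): boolean {
  return modelType === 'transducer';
}

console.log(supportsHotwordsLocal('transducer')); // true
console.log(supportsHotwordsLocal('whisper'));    // false
```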
### Poor streaming performance

- Increase `numThreads` if the device has multiple cores
- Use `preferInt8: true` to load int8 quantized models
- Enable hardware acceleration with `provider: 'nnapi'` or `provider: 'xnnpack'`
## Next Steps

- **Streaming STT** – Learn about real-time recognition
- **Hotwords** – Boost domain-specific vocabulary
- **Model Setup** – How to download and bundle models
- **Execution Providers** – Hardware acceleration options