This guide covers techniques for improving inference performance for speech-to-text (STT) and text-to-speech (TTS) in React Native Sherpa-ONNX.
Model Quantization
Choose quantized models
Quantized models (e.g., INT8, FP16) offer faster inference and a smaller memory footprint than full-precision (FP32) models:
- INT8 models: Best balance of speed and quality for most mobile devices
- FP16 models: Good performance on devices with FP16 hardware support
- FP32 models: Highest quality but slower and larger
For devices with less than 8 GB RAM, prefer INT8 quantized models. For example, use sherpa-onnx-zipvoice-distill-int8-zh-en-emilia (~104 MB) instead of the full FP32 Zipvoice model (~605 MB).
Model size vs. quality tradeoff
Smaller models are faster but may have slightly lower accuracy:
```ts
// Example: choosing between model variants
const modelConfig = {
  // For high-end devices (8 GB+ RAM)
  highEnd: 'sherpa-onnx-zipvoice-zh-en-emilia', // ~605 MB, best quality
  // For mid-range devices (4-8 GB RAM)
  midRange: 'sherpa-onnx-streaming-zipformer-en-2023-06-26', // ~67 MB
  // For low-end devices (<4 GB RAM)
  lowEnd: 'sherpa-onnx-zipvoice-distill-int8-zh-en-emilia', // ~104 MB
};
```
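The tier selection itself can be a small pure function. The sketch below is illustrative, not part of the SDK; how you obtain the device's total RAM is up to you (e.g. `getTotalMemory()` from `react-native-device-info`):

```typescript
// Hypothetical helper: map device RAM (in GB) to one of the model
// tiers in modelConfig above.
type ModelTier = 'highEnd' | 'midRange' | 'lowEnd';

function pickModelTier(totalRamGB: number): ModelTier {
  if (totalRamGB >= 8) return 'highEnd';  // full FP32 model
  if (totalRamGB >= 4) return 'midRange'; // smaller streaming model
  return 'lowEnd';                        // INT8 distilled model
}
```

You would then look up `modelConfig[pickModelTier(ram)]` when constructing the recognizer or TTS instance.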
Execution Providers
Hardware acceleration
Use execution providers to leverage dedicated hardware accelerators:
| Provider | Platform | Hardware | Use Case |
|---|---|---|---|
| QNN | Android | Qualcomm NPU | Best performance on Qualcomm devices |
| NNAPI | Android | GPU/DSP/NPU | General Android hardware acceleration |
| XNNPACK | Android/iOS | CPU-optimized | CPU inference optimization |
| Core ML | iOS | Apple Neural Engine | Best performance on iOS devices |
Check provider availability
Always check if a provider is available before using it:
```ts
import { getQnnSupport, getNnapiSupport, getXnnpackSupport } from 'react-native-sherpa-onnx';

// Check QNN support
const qnnSupport = await getQnnSupport();
if (qnnSupport.canInit) {
  // Use QNN provider
  const recognizer = await createStreamingStt({
    modelPath: { type: 'asset', path: 'models/my-model' },
    provider: 'qnn',
  });
}

// Fall back to NNAPI or CPU
const nnapiSupport = await getNnapiSupport();
const provider = nnapiSupport.canInit ? 'nnapi' : 'cpu';
```
Provider selection strategy
```ts
async function selectBestProvider(): Promise<'qnn' | 'nnapi' | 'xnnpack' | 'cpu'> {
  // Try QNN first (best for Qualcomm)
  const qnn = await getQnnSupport();
  if (qnn.canInit) return 'qnn';

  // Try NNAPI (GPU/DSP/NPU)
  const nnapi = await getNnapiSupport();
  if (nnapi.canInit) return 'nnapi';

  // Try XNNPACK (CPU-optimized)
  const xnnpack = await getXnnpackSupport();
  if (xnnpack.canInit) return 'xnnpack';

  // Fall back to CPU
  return 'cpu';
}

const provider = await selectBestProvider();
const recognizer = await createStreamingStt({
  modelPath: { type: 'asset', path: 'models/my-model' },
  provider,
});
```
See the Execution Providers guide for detailed information on adding QNN runtime libraries and checking provider support.
Threading
Adjust the number of threads based on device capabilities:
```ts
import { createStreamingStt } from 'react-native-sherpa-onnx/stt';

const recognizer = await createStreamingStt({
  modelPath: { type: 'asset', path: 'models/my-model' },
  numThreads: 4, // Adjust based on device cores
});
```
Thread count guidelines
- Low-end devices (2-4 cores): `numThreads: 2`
- Mid-range devices (4-6 cores): `numThreads: 4`
- High-end devices (8+ cores): `numThreads: 4-6`
Setting too many threads can hurt performance due to overhead. Start with 2-4 threads and test on target devices.
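These guidelines can be condensed into a small pure helper. This is a sketch under the assumption that you can read the device's core count from somewhere (it is not provided by this SDK); the function name and thresholds are illustrative:

```typescript
// Hypothetical helper: clamp the thread count to the 2-6 range
// suggested by the guidelines above, given a device's CPU core count.
function pickThreadCount(coreCount: number): number {
  if (coreCount <= 4) return 2; // low-end: leave headroom for the app
  if (coreCount <= 6) return 4; // mid-range
  return Math.min(6, Math.floor(coreCount / 2)); // high-end: 4-6
}
```

Pass the result as `numThreads` when creating the recognizer, then verify against real measurements on target devices.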
Memory Management
Check available memory
The SDK automatically checks memory before loading large models (e.g., Zipvoice):
```ts
let tts;
try {
  tts = await createTTS({
    modelPath: { type: 'asset', path: 'models/zipvoice-full' },
  });
} catch (error) {
  if (error instanceof Error && error.message.includes('Not enough free memory')) {
    // Fall back to a smaller model
    tts = await createTTS({
      modelPath: { type: 'asset', path: 'models/zipvoice-int8' },
    });
  } else {
    throw error;
  }
}
```
Release resources
Release recognizers and TTS instances when not needed:
```ts
const recognizer = await createStreamingStt({ /* ... */ });

// Use recognizer...

// Clean up when done
await recognizer.release();
```
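One way to make release hard to forget is a wrapper that always releases in a `finally` block. This is a sketch: `release()` is the SDK call shown above, but the wrapper itself is illustrative:

```typescript
// Sketch: run work against a releasable resource and guarantee
// release() is called, even if the work throws.
async function withReleased<T extends { release(): Promise<void> }, R>(
  resource: T,
  work: (r: T) => Promise<R>,
): Promise<R> {
  try {
    return await work(resource);
  } finally {
    await resource.release();
  }
}
```

Usage would look like `await withReleased(recognizer, async (r) => { /* transcribe */ })`, so the instance is freed on both the success and error paths.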
Batch Processing
Process audio in optimal chunks
For streaming STT, use appropriate buffer sizes:
```ts
// Optimal chunk size: 0.1-0.2 seconds of audio
const sampleRate = 16000;
const chunkDuration = 0.1; // 100 ms
const chunkSize = sampleRate * chunkDuration; // 1600 samples

// Process audio in chunks
for (let i = 0; i < audioData.length; i += chunkSize) {
  const chunk = audioData.slice(i, i + chunkSize);
  recognizer.acceptWaveform(chunk, sampleRate);
}
```
Warm-up Inference
Pre-initialize models
Run a warm-up inference to initialize the model before actual use:
```ts
const recognizer = await createStreamingStt({ /* ... */ });

// Warm up with silent audio
const silentAudio = new Float32Array(1600).fill(0);
recognizer.acceptWaveform(silentAudio, 16000);
recognizer.getResult(); // Discard result

// Now ready for real inference
```
Android
- Use QNN on Qualcomm devices:

  ```ts
  const qnn = await getQnnSupport();
  const provider = qnn.canInit ? 'qnn' : 'cpu';
  ```
- Enable NNAPI for other devices:

  ```ts
  const nnapi = await getNnapiSupport();
  const provider = nnapi.canInit ? 'nnapi' : 'cpu';
  ```
- Use bundled models to avoid file I/O overhead:

  ```ts
  modelPath: { type: 'asset', path: 'models/my-model' }
  ```
iOS
- Use Core ML when available:

  ```ts
  const coreml = await getCoreMlSupport();
  if (coreml.hasAccelerator) {
    // Apple Neural Engine available
    const recognizer = await createStreamingStt({
      modelPath: { type: 'asset', path: 'models/my-model' },
      provider: 'coreml',
    });
  }
  ```
- Prefer XCFramework for faster loading times.
Measure inference time
```ts
const start = Date.now();
await recognizer.acceptWaveform(audioData, sampleRate);
const result = await recognizer.getResult();
const duration = Date.now() - start;
console.log(`Inference took ${duration}ms`);
```
Track real-time factor (RTF)
```ts
const audioLengthMs = (audioData.length / sampleRate) * 1000;
const rtf = duration / audioLengthMs;

if (rtf < 1.0) {
  console.log(`Real-time performance achieved (RTF: ${rtf.toFixed(2)})`);
} else {
  console.log(`Slower than real-time (RTF: ${rtf.toFixed(2)})`);
}
```
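The RTF computation above can be factored into a small pure helper so it is easy to reuse across benchmarks (a sketch; the function name is illustrative):

```typescript
// Real-time factor: processing time divided by audio duration.
// RTF < 1.0 means the model runs faster than real time.
function computeRtf(processingMs: number, sampleCount: number, sampleRate: number): number {
  const audioLengthMs = (sampleCount / sampleRate) * 1000;
  return processingMs / audioLengthMs;
}
```

For example, processing 1 second of 16 kHz audio (16000 samples) in 500 ms gives an RTF of 0.5.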
Best Practices Summary
- **Choose the right model:** select quantized models (INT8) for mobile devices and balance size vs. quality.
- **Enable hardware acceleration:** use execution providers (QNN, NNAPI, Core ML) based on platform and device.
- **Optimize threading:** set an appropriate thread count (2-4 for most devices).
- **Manage memory:** release resources when done and use smaller models on low-RAM devices.
- **Test on target devices:** always measure performance on actual target devices.
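Tying the summary together, the choices above ultimately converge on one options object. The sketch below is a pure assembly step under the assumption that the model name, provider, and thread count have been chosen elsewhere; only the `modelPath`, `provider`, and `numThreads` fields come from the examples earlier in this guide:

```typescript
type Provider = 'qnn' | 'nnapi' | 'xnnpack' | 'coreml' | 'cpu';

interface SttOptions {
  modelPath: { type: 'asset'; path: string };
  provider: Provider;
  numThreads: number;
}

// Combine model, provider, and thread choices into the options
// object that would be passed to createStreamingStt.
function buildSttOptions(model: string, provider: Provider, numThreads: number): SttOptions {
  return {
    modelPath: { type: 'asset', path: `models/${model}` },
    provider,
    numThreads,
  };
}
```

A typical call would be `createStreamingStt(buildSttOptions('my-model', provider, threads))`, with `provider` and `threads` derived from the selection helpers shown earlier.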