
Overview

Optimizing whisper.rn involves balancing accuracy, speed, and resource usage. This guide covers model selection, threading configuration, GPU acceleration, and platform-specific optimizations.

Model Selection

The most impactful optimization is choosing the right model size.

Model Comparison

| Model | Size | Speed | Accuracy | RAM Usage | Best For |
| --- | --- | --- | --- | --- | --- |
| tiny.en | 75 MB | Fastest | Good | ~100 MB | Live transcription, quick results |
| tiny | 75 MB | Fastest | Good | ~100 MB | Multi-language, real-time |
| base.en | 142 MB | Fast | Better | ~150 MB | Balanced performance |
| base | 142 MB | Fast | Better | ~150 MB | Multi-language, balanced |
| small.en | 466 MB | Medium | Great | ~500 MB | High accuracy needed |
| small | 466 MB | Medium | Great | ~500 MB | Multi-language, accurate |
| medium.en | 1.5 GB | Slow | Excellent | ~1.8 GB | Offline processing |
| large | 3 GB | Slowest | Best | ~3.5 GB | Maximum accuracy (not recommended on mobile) |

Quantized Models

Quantized models reduce size and improve speed with minimal accuracy loss.
tiny.en-q8.bin     - 8-bit quantized (recommended)
base.en-q5.bin     - 5-bit quantized (more aggressive)
small.en-q8.bin    - 8-bit quantized
Recommendations:
  • Mobile realtime: tiny.en or tiny.en-q8
  • Mobile batch: base.en or base.en-q8
  • High accuracy: small.en or small.en-q8
  • Avoid on mobile: medium, large (too slow, memory intensive)
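The recommendations above can be captured in a small helper. This is only a sketch, not part of the whisper.rn API: `pickModel` is a hypothetical name, and the filenames follow the quantized naming shown earlier.

```typescript
// Sketch: map a use case to a model filename following the
// recommendations above. `pickModel` is a hypothetical helper,
// not part of whisper.rn; adjust filenames to the models you ship.
type UseCase = 'realtime' | 'batch' | 'accuracy'

const pickModel = (useCase: UseCase): string => {
  switch (useCase) {
    case 'realtime':
      return 'tiny.en-q8.bin' // smallest and fastest, quantized
    case 'batch':
      return 'base.en-q8.bin' // balanced speed/accuracy
    case 'accuracy':
      return 'small.en-q8.bin' // highest accuracy reasonable on mobile
  }
}
```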

Threading Configuration

maxThreads

Controls CPU parallelization for transcription.
NativeRNWhisper.ts
type TranscribeOptions = {
  maxThreads?: number  // Default: 2 for 4-core, 4 for 6+ core devices
  // ...
}
maxThreads (number)
Number of CPU threads to use during computation. Default: automatically detected
  • 2 threads for 4-core devices
  • 4 threads for 6+ core devices
Range: 1-8 (device-dependent)

Finding Optimal Thread Count

Use the bench() method to test different thread counts:
const whisperContext = await initWhisper({ filePath: modelPath })

// Test with different thread counts
for (let threads = 1; threads <= 8; threads++) {
  const result = await whisperContext.bench(threads)
  console.log(`Threads: ${threads}`)
  console.log(`Encode: ${result.encodeMs}ms`)
  console.log(`Decode: ${result.decodeMs}ms`)
  console.log(`Total: ${result.encodeMs + result.decodeMs}ms`)
  console.log('---')
}
BenchResult (object)
Typical Results:
  • 1 thread: Slowest, most battery efficient
  • 2 threads: Good balance for most devices
  • 4 threads: Fast on modern devices (6+ cores)
  • 6+ threads: Diminishing returns, more heat/battery drain
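If you want to pick a thread count programmatically, the bench loop above can feed a simple selector. A sketch under the assumption that you have collected `{ threads, encodeMs, decodeMs }` rows from `bench()`; `pickThreads` is a hypothetical helper, not part of whisper.rn.

```typescript
// Sketch: choose the lowest thread count within 10% of the best
// total bench time, trading a little speed for heat and battery
// headroom. `pickThreads` is a hypothetical helper.
type BenchRow = { threads: number; encodeMs: number; decodeMs: number }

const pickThreads = (rows: BenchRow[]): number => {
  const totals = rows.map((r) => ({
    threads: r.threads,
    total: r.encodeMs + r.decodeMs,
  }))
  const best = Math.min(...totals.map((t) => t.total))
  // Prefer fewer threads when the slowdown stays within 10% of the best run
  const good = totals.filter((t) => t.total <= best * 1.1)
  return Math.min(...good.map((t) => t.threads))
}
```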

Best Practices

const transcribeOptions = {
  maxThreads: 2,  // Conservative for battery life
}

// For real-time (prioritize speed)
const realtimeOptions = {
  maxThreads: 4,  // More threads for responsiveness
}

// For background processing (prioritize efficiency)
const backgroundOptions = {
  maxThreads: 1,  // Minimal resource usage
}

GPU Acceleration

iOS Metal Acceleration

iOS devices support Metal GPU acceleration for significant speed improvements.
const whisperContext = await initWhisper({
  filePath: modelPath,
  useGpu: true,  // Enable Metal (default: true)
})

console.log('GPU active:', whisperContext.gpu)
console.log('Reason if not:', whisperContext.reasonNoGPU)
useGpu (boolean, default: true)
Enable Metal GPU acceleration on iOS. Requirements:
  • iOS device (not simulator)
  • Metal-compatible device (iPhone 5s+, iPad Air+)
  • Model supports GPU acceleration
Performance Impact:
  • 2-4x faster encoding on compatible devices
  • Reduced CPU usage
  • Lower battery consumption for sustained transcription
Check GPU Status:
if (!whisperContext.gpu) {
  console.warn('GPU not available:', whisperContext.reasonNoGPU)
  // Reasons: 'Simulator', 'Metal not supported', 'Build flag disabled'
}
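When Metal is unavailable (for example on the simulator), you can compensate with more CPU threads. A sketch only: `optionsForDevice` is a hypothetical helper, and the thread counts follow the guidance earlier in this guide.

```typescript
// Sketch: fall back to more CPU threads when the GPU is inactive.
// `optionsForDevice` is a hypothetical helper, not part of whisper.rn.
const optionsForDevice = (gpu: boolean) => ({
  maxThreads: gpu ? 2 : 4, // lean on the GPU when present, CPU otherwise
  language: 'en',
  beamSize: 1,
})
```

A possible call site, assuming a context initialized as above: `whisperContext.transcribe(audioPath, optionsForDevice(whisperContext.gpu))`.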

Disabling Metal (Build-time)

To reduce binary size or avoid Metal dependency:
Podfile
ENV['RNWHISPER_DISABLE_METAL'] = '1'
pod install

iOS Core ML Acceleration

Core ML accelerates the encoder on iOS 15.0+.
const whisperContext = await initWhisper({
  filePath: 'ggml-base.en.bin',
  useCoreMLIos: true,  // Enable Core ML (default: true)
  coreMLModelAsset: {
    filename: 'ggml-base.en-encoder.mlmodelc',
    assets: [
      require('./models/ggml-base.en-encoder.mlmodelc/weights/weight.bin'),
      require('./models/ggml-base.en-encoder.mlmodelc/model.mil'),
      require('./models/ggml-base.en-encoder.mlmodelc/coremldata.bin'),
    ],
  },
})
useCoreMLIos (boolean, default: true)
Enable Core ML encoder acceleration. Requirements:
  • iOS 15.0+
  • Core ML model files (.mlmodelc directory)
  • Model files co-located with GGML model
Core ML vs Metal:
  • Core ML: Encoder only, requires model files
  • Metal: Full acceleration, no extra files needed
  • Recommendation: Use Metal (useGpu: true) for simplicity
Disabling Core ML:
Podfile
ENV['RNWHISPER_DISABLE_COREML'] = '1'
pod install

Flash Attention (Experimental)

const whisperContext = await initWhisper({
  filePath: modelPath,
  useGpu: true,
  useFlashAttn: true,  // Experimental optimization
})
useFlashAttn (boolean, default: false)
Enable Flash Attention optimization.
Recommended: Only when GPU is available
Benefits: Faster attention computation, lower memory usage
Trade-offs: Slight accuracy impact, experimental

Transcription Options Tuning

Language Detection

// Auto-detect (slower, 1-2 second overhead)
const autoResult = await whisperContext.transcribe(audioPath, {
  language: 'auto',
})

// Specify language (faster, no detection overhead)
const enResult = await whisperContext.transcribe(audioPath, {
  language: 'en',  // English
})
Recommendation: Always specify language if known (saves 1-2 seconds).
const options = {
  beamSize: 1,   // Greedy decoding (fast, default)
  // beamSize: 5,   // Beam search (slower, more accurate)
}
beamSize (number, default: 1)
Beam size for beam search decoding
  • 1: Greedy decoding (fastest)
  • 3-5: Moderate beam search (balanced)
  • 10+: Large beam (slow, diminishing returns)
Impact: 2-3x slower with beam size 5, marginal accuracy gain

Temperature Sampling

const options = {
  temperature: 0.0,  // Deterministic (default)
  // temperature: 0.2,  // Slight randomness
  // temperature: 0.8,  // More creative (for generation tasks)
}
temperature (number, default: 0.0)
Sampling temperature for decoding
  • 0.0: Deterministic, fastest
  • 0.0-0.5: Low randomness
  • 0.5-1.0: Higher randomness (slower)

Context and Length

const options = {
  maxContext: -1,  // Unlimited context (default)
  // maxContext: 224,  // Limit context window
  
  maxLen: 0,       // No segment length limit (default)
  // maxLen: 20,      // Limit segment to 20 tokens (faster)
}
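The tuning knobs above can be combined into named presets. A sketch only: the preset names and value combinations are illustrative, not part of whisper.rn.

```typescript
// Sketch: two illustrative presets combining the options above.
// Names and value combinations are examples, not part of whisper.rn.
const PRESETS = {
  fast: { language: 'en', beamSize: 1, temperature: 0, maxLen: 20 },
  accurate: { language: 'en', beamSize: 5, temperature: 0, maxContext: -1 },
} as const

const optionsFor = (mode: keyof typeof PRESETS) => PRESETS[mode]
```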

Realtime Transcription Optimization

Slice Configuration

import { RealtimeTranscriber } from 'whisper.rn/src/realtime-transcription'

const transcriber = new RealtimeTranscriber(
  { whisperContext, audioStream },
  {
    audioSliceSec: 25,  // Shorter than 30s for better responsiveness
    audioMinSec: 1,     // Minimum before transcribing
    maxSlicesInMemory: 2,  // Lower memory usage
    
    transcribeOptions: {
      language: 'en',    // Skip auto-detection
      maxThreads: 4,     // More threads for realtime
      beamSize: 1,       // Fast greedy decoding
    },
    
    realtimeProcessingPauseMs: 200,  // Update frequency
    initRealtimeAfterMs: 200,        // Delay before first transcription
  }
)
audioSliceSec (number, default: 30)
Slice duration in seconds
  • 20-25s: More responsive, more overhead
  • 30s: Optimal (matches Whisper chunk size)
  • 40+s: Slower updates, less overhead
realtimeProcessingPauseMs (number, default: 200)
Interval between realtime processing updates
  • 100ms: Very responsive, higher CPU
  • 200ms: Balanced (default)
  • 500ms: Lower CPU, less responsive

VAD Configuration

VAD (Voice Activity Detection) reduces unnecessary transcription.
import { VAD_PRESETS } from 'whisper.rn/src/realtime-transcription/types'

const transcriber = new RealtimeTranscriber(
  {
    whisperContext,
    vadContext,  // Include VAD context
    audioStream,
  },
  {
    // Use preset or custom VAD options
    transcribeOptions: {
      // VAD preset applied via vadContext.updateOptions()
    },
  }
)
VAD Presets:
types.ts
export const VAD_PRESETS = {
  default: {
    threshold: 0.5,
    minSpeechDurationMs: 250,
    minSilenceDurationMs: 100,
  },
  
  sensitive: {
    threshold: 0.3,           // Detects quieter speech
    minSpeechDurationMs: 100,
  },
  
  conservative: {
    threshold: 0.7,           // Avoids false positives
    minSpeechDurationMs: 500,
  },
  
  noisy: {
    threshold: 0.75,          // Strict for noisy environments
    minSpeechDurationMs: 400,
  },
}
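A preset can be chosen based on the expected recording environment. A sketch only: the values are inlined from the VAD_PRESETS shown above so the snippet is self-contained, and `pickVadPreset` is a hypothetical helper.

```typescript
// Sketch: choose a VAD preset based on the expected environment.
// Values are inlined from the VAD_PRESETS shown above; `pickVadPreset`
// is a hypothetical helper, not part of whisper.rn.
const pickVadPreset = (env: 'quiet' | 'normal' | 'noisy') => {
  switch (env) {
    case 'quiet':
      return { threshold: 0.3, minSpeechDurationMs: 100 } // sensitive
    case 'normal':
      return { threshold: 0.5, minSpeechDurationMs: 250, minSilenceDurationMs: 100 } // default
    case 'noisy':
      return { threshold: 0.75, minSpeechDurationMs: 400 } // strict
  }
}
```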

Platform-Specific Optimizations

iOS

1. Use Metal Acceleration
const whisperContext = await initWhisper({
  filePath: modelPath,
  useGpu: true,  // Metal (default)
})
2. Build Optimizations
Podfile
# Use pre-built framework (faster builds)
pod install

# Or build from source with optimizations
ENV['RNWHISPER_BUILD_FROM_SOURCE'] = '1'
pod install
3. Extended Virtual Addressing (for large models)
Info.plist
<key>com.apple.developer.kernel.extended-virtual-addressing</key>
<true/>

Android

1. NDK Configuration
android/build.gradle
android {
  ndkVersion "24.0.8215888"  // Recommended for Apple Silicon Macs
  
  defaultConfig {
    ndk {
      abiFilters 'arm64-v8a', 'armeabi-v7a'  // Limit to ARM
    }
  }
}
2. ProGuard (for release builds)
proguard-rules.pro
-keep class com.rnwhisper.** { *; }
3. 16KB Page Size Support (Android 15+)
  • Automatically handled by whisper.cpp build system

Testing Performance

Release Mode

Always test performance in Release mode (Debug is 5-10x slower).
# iOS
yarn example ios --mode Release

# Android
yarn example android --mode release

Benchmarking

import { initWhisper } from 'whisper.rn'

const runBenchmark = async () => {
  const whisperContext = await initWhisper({ filePath: modelPath })
  
  console.log('GPU:', whisperContext.gpu)
  console.log('---')
  
  // Test different thread counts
  for (let threads = 1; threads <= 8; threads++) {
    const start = Date.now()
    const result = await whisperContext.bench(threads)
    const total = Date.now() - start
    
    console.log(`Threads: ${threads}`)
    console.log(`Total time: ${total}ms`)
    console.log(`Encode: ${result.encodeMs}ms`)
    console.log(`Decode: ${result.decodeMs}ms`)
    console.log(`Throughput: ${(30000 / total).toFixed(2)}x realtime`)
    console.log('---')
  }
  
  await whisperContext.release()
}

runBenchmark()

Real-world Testing

const testTranscription = async () => {
  const whisperContext = await initWhisper({
    filePath: modelPath,
    useGpu: true,
  })
  
  const configs = [
    { name: 'Fast', threads: 2, beam: 1 },
    { name: 'Balanced', threads: 4, beam: 1 },
    { name: 'Accurate', threads: 4, beam: 5 },
  ]
  
  for (const config of configs) {
    const start = Date.now()
    
    const { promise } = whisperContext.transcribe(audioPath, {
      maxThreads: config.threads,
      beamSize: config.beam,
      language: 'en',
    })
    
    const result = await promise
    const duration = Date.now() - start
    
    console.log(`${config.name}:`, duration, 'ms')
    console.log('Result:', result.result.substring(0, 50))
  }
  
  await whisperContext.release()
}

Performance Checklist

  • Use smallest model that meets accuracy needs
  • Consider quantized models (q8, q5)
  • Avoid medium/large on mobile
  • Use .en models for English-only
  • Set language (avoid 'auto')
  • Use beamSize: 1 for speed
  • Tune maxThreads with bench()
  • Enable GPU on iOS (useGpu: true)
  • Limit maxSlicesInMemory for realtime
  • Always test in Release mode
  • Benchmark on target devices
  • Monitor memory usage
  • Test with realistic audio (length, quality)
  • Release contexts after use
  • Handle out-of-memory errors
  • Provide fallback for low-end devices
  • Monitor crash analytics

Troubleshooting Performance

Slow Transcription
Symptoms: Transcription takes longer than the audio duration
Solutions:
  • Test in Release mode (not Debug)
  • Use smaller model (base/tiny instead of small/medium)
  • Increase maxThreads (test with bench())
  • Enable GPU on iOS (useGpu: true)
  • Set language explicitly (skip auto-detection)
  • Use beamSize: 1 (greedy decoding)
High Memory Usage
Solutions:
  • Use smaller model
  • Reduce maxSlicesInMemory (1-2 for realtime)
  • Release contexts: await context.release()
  • Disable promptPreviousSlices
  • Enable Extended Virtual Addressing (iOS large models)
Device Heating or Battery Drain
Solutions:
  • Reduce maxThreads (try 1-2)
  • Use smaller model
  • Increase realtimeProcessingPauseMs
  • Enable GPU (iOS) - more efficient than CPU
  • Use VAD to skip silence
GPU Not Active
Check:
console.log('GPU:', whisperContext.gpu)
console.log('Reason:', whisperContext.reasonNoGPU)
Reasons:
  • Running on simulator (Metal not available)
  • RNWHISPER_DISABLE_METAL=1 build flag set
  • Device doesn’t support Metal
Solution: Test on physical device, check Podfile
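The reasons listed above can be turned into actionable hints. A sketch only: `suggestGpuFix` is a hypothetical helper, and the exact `reasonNoGPU` strings may differ between whisper.rn versions, so the matching is deliberately loose.

```typescript
// Sketch: map the reasonNoGPU strings mentioned above to a
// suggested fix. `suggestGpuFix` is a hypothetical helper and the
// exact reason strings may vary between whisper.rn versions.
const suggestGpuFix = (reason: string): string => {
  if (/simulator/i.test(reason)) return 'Run on a physical iOS device'
  if (/metal not supported/i.test(reason)) return 'Use CPU; this device lacks Metal support'
  if (/build flag/i.test(reason)) return 'Remove RNWHISPER_DISABLE_METAL from the Podfile and reinstall pods'
  return 'Check whisperContext.reasonNoGPU and the Podfile configuration'
}
```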
