Overview

whisper.rn uses GGML-formatted models from whisper.cpp. Understanding model types, sizes, and optimization options is crucial for balancing accuracy against speed and memory use.

GGML Model Format

GGML is a tensor library for machine learning, used by whisper.cpp for efficient inference. All Whisper models must be converted to GGML format (.bin files) to work with whisper.rn.

Model Download

Official GGML models are available from Hugging Face:
# Base URL
https://huggingface.co/ggerganov/whisper.cpp/tree/main

# Example: Download tiny.en model
curl -LO https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-tiny.en.bin
Always download from trusted sources, such as the official whisper.cpp repository above, to ensure model integrity.
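Filenames in that repository follow a predictable pattern (ggml-&lt;size&gt;[.en][-&lt;quant&gt;].bin), which is handy when managing downloads. A minimal parsing sketch — the helper name and types here are our own, not part of whisper.rn:

```typescript
// Hypothetical helper: parse whisper.cpp GGML filenames like
// "ggml-tiny.en-q8_0.bin" into their parts.
interface ModelInfo {
  size: string                 // e.g. "tiny", "base", "large-v3"
  englishOnly: boolean         // true for ".en" variants
  quantization: string | null  // e.g. "q8_0", or null for full precision
}

function parseModelFilename(filename: string): ModelInfo | null {
  const match = filename.match(/^ggml-(.+?)(\.en)?(-(q\d_\d))?\.bin$/)
  if (!match) return null
  return {
    size: match[1],
    englishOnly: match[2] === '.en',
    quantization: match[4] ?? null,
  }
}
```

A parser like this lets a download manager derive display names and storage keys from the filename alone.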

Model Sizes

Whisper models come in several sizes, trading accuracy for speed and memory:
| Model | Parameters | GGML Size | Memory | Speed (iPhone 13) | Use Case |
| --- | --- | --- | --- | --- | --- |
| tiny | 39 M | 75 MB | ~75 MB | ~1x realtime | Quick drafts, testing |
| tiny.en | 39 M | 75 MB | ~75 MB | ~1x realtime | English-only, fastest |
| base | 74 M | 140 MB | ~140 MB | ~0.8x realtime | Good speed/accuracy |
| base.en | 74 M | 140 MB | ~140 MB | ~0.8x realtime | English-only |
| small | 244 M | 460 MB | ~460 MB | ~0.5x realtime | Production quality |
| small.en | 244 M | 460 MB | ~460 MB | ~0.5x realtime | English production |
| medium | 769 M | 1.5 GB | ~1.5 GB | ~0.2x realtime | High accuracy |
| medium.en | 769 M | 1.5 GB | ~1.5 GB | ~0.2x realtime | English high accuracy |
| large-v1 | 1550 M | 2.9 GB | ~2.9 GB | ~0.1x realtime | Best accuracy |
| large-v2 | 1550 M | 2.9 GB | ~2.9 GB | ~0.1x realtime | Improved v1 |
| large-v3 | 1550 M | 2.9 GB | ~2.9 GB | ~0.1x realtime | Latest, best |
  • .en models: English-only, optimized for English transcription
  • Multilingual models: Support 99 languages but are slightly slower
  • Speed metrics are approximate and vary by device, settings, and audio content
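The trade-offs in the table can be encoded as a small selection helper. This is a sketch using the approximate GGML sizes from the table above; the function and constant names are ours, not a whisper.rn API:

```typescript
// Approximate GGML model sizes in MB, from the table above, smallest first.
const MODEL_SIZES_MB: [name: string, mb: number][] = [
  ['tiny', 75],
  ['base', 140],
  ['small', 460],
  ['medium', 1500],
  ['large-v3', 2900],
]

// Hypothetical helper: pick the largest model that fits a memory budget.
function pickModel(memoryBudgetMB: number, englishOnly: boolean): string | null {
  // Walk from largest to smallest; return the first that fits.
  for (let i = MODEL_SIZES_MB.length - 1; i >= 0; i--) {
    const [name, mb] = MODEL_SIZES_MB[i]
    if (mb <= memoryBudgetMB) {
      // .en variants exist for tiny through medium only
      return englishOnly && name !== 'large-v3' ? `${name}.en` : name
    }
  }
  return null // nothing fits the budget
}
```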

Model Selection Guide

Mobile Apps (iOS/Android):
  • Recommended: tiny.en or base.en for English
  • Alternative: small for better accuracy (if memory allows)
  • Avoid: large models on mobile (too slow and memory-intensive)
Tablets/High-end Devices:
  • Recommended: small or medium
  • Use case dependent: large-v3 for offline, high-quality transcription
Real-time Transcription:
  • Required: tiny.en or base.en
  • Models must process faster than realtime (>1x speed)
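The "faster than realtime" requirement can be checked directly by dividing audio duration by processing time — a small sketch with our own names, measuring times however your app collects them:

```typescript
// Realtime factor: how many seconds of audio are processed per second of
// wall-clock time. A factor > 1 means the model keeps up with live audio.
function realtimeFactor(audioDurationMs: number, processingTimeMs: number): number {
  return audioDurationMs / processingTimeMs
}

function isRealtimeCapable(audioDurationMs: number, processingTimeMs: number): boolean {
  return realtimeFactor(audioDurationMs, processingTimeMs) > 1
}
```

In practice you would measure a few transcriptions on the target device and only enable real-time features when the factor stays comfortably above 1.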

Quantized Models

Quantization reduces model size and improves speed by using lower precision for weights:

Quantization Formats

| Format | Precision | Size vs f16 | Quality | Description |
| --- | --- | --- | --- | --- |
| f16 | 16-bit float | 100% | Best | Original precision |
| q8_0 | 8-bit int | ~50% | Very good | Recommended balance |
| q5_0 | 5-bit int | ~35% | Good | Smaller, faster |
| q4_0 | 4-bit int | ~25% | Fair | Smallest, quality loss |
Quantization below q5_0 may cause noticeable quality degradation. Test thoroughly before deploying q4_0 models.
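The size ratios in the table can be used to estimate on-disk size before downloading a quantized model. A rough sketch (our own helper, using the approximate ratios above — actual files vary by a few percent):

```typescript
// Approximate size ratio of each quantization format relative to f16,
// taken from the table above.
const QUANT_RATIO: Record<string, number> = {
  f16: 1.0,
  q8_0: 0.5,
  q5_0: 0.35,
  q4_0: 0.25,
}

// Hypothetical helper: estimate a quantized model's size from its f16 size.
function estimateQuantizedSizeMB(f16SizeMB: number, format: string): number {
  const ratio = QUANT_RATIO[format]
  if (ratio === undefined) {
    throw new Error(`Unknown quantization format: ${format}`)
  }
  return Math.round(f16SizeMB * ratio)
}
```

This is useful for pre-flight storage checks before starting a large download.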

Quantized Model Examples

# Download quantized models
ggml-tiny.en-q8_0.bin      # 8-bit quantized tiny.en (~40 MB)
ggml-base.en-q8_0.bin      # 8-bit quantized base.en (~75 MB)
ggml-small-q5_0.bin        # 5-bit quantized small (~160 MB)

Using Quantized Models

import { initWhisper } from 'whisper.rn'

// Use quantized model (same API as regular models)
const context = await initWhisper({
  filePath: 'file:///path/to/ggml-base.en-q8_0.bin',
})

// Transcription works identically
const { promise } = context.transcribe(audioFile, { language: 'en' })
const result = await promise
Quantized models are drop-in replacements. No code changes required!

Core ML Acceleration (iOS)

Core ML is Apple’s machine learning framework, providing hardware-accelerated inference on iOS and tvOS.

Core ML Model Structure

Core ML models accelerate the encoder (the slowest part of Whisper). The decoder still uses the GGML model. File structure:
ggml-tiny.en.bin                    # GGML model (required)
ggml-tiny.en-encoder.mlmodelc/      # Core ML encoder (optional)
  ├── model.mil                      # Model intermediate language
  ├── coremldata.bin                 # Core ML data
  ├── weights/
  │   └── weight.bin                 # Model weights
  ├── metadata.json                  # Optional metadata
  └── analytics/
      └── coremldata.bin             # Optional analytics
Core ML models are directories (.mlmodelc), not single files. Only 3 files are required: model.mil, coremldata.bin, and weights/weight.bin.
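Since only three files are required, a pre-flight check can validate an extracted .mlmodelc directory before initialization. A sketch that operates on a plain list of relative paths so it stays filesystem-agnostic (the directory listing itself would come from something like react-native-fs; the helper name is ours):

```typescript
// The three files whisper.cpp's Core ML encoder requires, relative to the
// .mlmodelc directory (see the file structure above).
const REQUIRED_COREML_FILES = [
  'model.mil',
  'coremldata.bin',
  'weights/weight.bin',
]

// Hypothetical helper: given the relative paths found inside an extracted
// .mlmodelc directory, report which required files are missing.
function missingCoreMLFiles(foundPaths: string[]): string[] {
  const found = new Set(foundPaths)
  return REQUIRED_COREML_FILES.filter((f) => !found.has(f))
}
```

An empty result means the directory is structurally complete; anything else indicates a failed or partial extraction.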

Downloading Core ML Models

Core ML models are hosted alongside GGML models:
# Models are distributed as ZIP archives
https://huggingface.co/ggerganov/whisper.cpp/tree/main

# Example: Download and extract tiny.en Core ML
curl -LO https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-tiny.en-encoder.mlmodelc.zip
unzip ggml-tiny.en-encoder.mlmodelc.zip

Using Core ML Models

Option 1: Runtime Download

Download and extract Core ML models at runtime:
import RNFS from 'react-native-fs'
import { unzip } from 'react-native-zip-archive'

async function downloadCoreMLModel() {
  const modelUrl = 'https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-tiny.en-encoder.mlmodelc.zip'
  const zipPath = `${RNFS.DocumentDirectoryPath}/coreml-model.zip`
  const extractPath = `${RNFS.DocumentDirectoryPath}/models/`
  
  // Download
  await RNFS.downloadFile({
    fromUrl: modelUrl,
    toFile: zipPath,
  }).promise
  
  // Extract
  await unzip(zipPath, extractPath)
  
  // Cleanup zip
  await RNFS.unlink(zipPath)
  
  return `${extractPath}ggml-tiny.en-encoder.mlmodelc`
}

// Initialize with Core ML
const context = await initWhisper({
  filePath: `${RNFS.DocumentDirectoryPath}/models/ggml-tiny.en.bin`,
  useCoreMLIos: true, // Enable Core ML (default: true)
})

if (context.gpu) {
  console.log('Using Core ML acceleration!')
} else {
  console.log('Core ML not available:', context.reasonNoGPU)
}

Option 2: Bundle with App

Bundle Core ML models using Metro bundler (increases app size):
import { Platform } from 'react-native'
import { initWhisper } from 'whisper.rn'

const context = await initWhisper({
  filePath: require('../assets/ggml-tiny.en.bin'),
  coreMLModelAsset: Platform.OS === 'ios' ? {
    filename: 'ggml-tiny.en-encoder.mlmodelc',
    assets: [
      require('../assets/ggml-tiny.en-encoder.mlmodelc/weights/weight.bin'),
      require('../assets/ggml-tiny.en-encoder.mlmodelc/model.mil'),
      require('../assets/ggml-tiny.en-encoder.mlmodelc/coremldata.bin'),
    ],
  } : undefined,
})
Update metro.config.js:
const defaultAssetExts = require('metro-config/src/defaults/defaults').assetExts

module.exports = {
  resolver: {
    assetExts: [
      ...defaultAssetExts,
      'bin',  // GGML models
      'mil',  // Core ML interface
    ],
  },
}
Bundling large models significantly increases app size:
  • tiny.en: +75 MB (GGML) + ~35 MB (Core ML) = 110 MB
  • base.en: +140 MB (GGML) + ~65 MB (Core ML) = 205 MB
For production apps, prefer runtime download.

Core ML Performance

Core ML acceleration provides significant speedups:
| Model | CPU Only | Core ML | Speedup |
| --- | --- | --- | --- |
| tiny.en | 1x realtime | 3-4x realtime | 3-4x |
| base.en | 0.8x realtime | 2-3x realtime | 2.5-3.5x |
| small | 0.5x realtime | 1.5-2x realtime | 3-4x |
Core ML speedup varies by device; the Neural Engine (A12 chip and later) provides the best acceleration.

Disabling Core ML

Disable Core ML even if model files exist:
const context = await initWhisper({
  filePath: 'file:///path/to/ggml-tiny.en.bin',
  useCoreMLIos: false, // Disable Core ML
})

console.log('GPU enabled:', context.gpu) // false

Core ML Build Configuration

Control Core ML compilation in iOS builds: Disable Core ML in Podfile:
pre_install do |installer|
  ENV['RNWHISPER_DISABLE_COREML'] = '1'
end
Check Core ML availability at runtime:
import { isUseCoreML, isCoreMLAllowFallback } from 'whisper.rn'

if (isUseCoreML) {
  console.log('Core ML support compiled in')
  
  if (isCoreMLAllowFallback) {
    console.log('Fallback to CPU enabled if Core ML fails')
  }
}

Metal/GPU Acceleration

Metal provides GPU acceleration on iOS and tvOS (alternative to Core ML).

Enabling Metal

const context = await initWhisper({
  filePath: 'file:///path/to/model.bin',
  useGpu: true,          // Enable Metal (default: true)
  useFlashAttn: false,   // Flash Attention (requires GPU, default: false)
})

if (context.gpu) {
  console.log('Using Metal GPU acceleration')
}
If both Core ML and Metal are enabled, Core ML takes priority. Set useCoreMLIos: false to force Metal.

Flash Attention

Flash Attention is an optimized attention mechanism for GPUs:
const context = await initWhisper({
  filePath: 'file:///path/to/model.bin',
  useGpu: true,
  useFlashAttn: true,  // Enable Flash Attention
})
Flash Attention only works when GPU is available. Ignored if useGpu: false.

Disabling Metal

Disable Metal compilation in Podfile:
pre_install do |installer|
  ENV['RNWHISPER_DISABLE_METAL'] = '1'
end

Model Management

Bundling Models with App

Pros:
  • Works offline immediately
  • No download wait time
  • No network dependency
Cons:
  • Large app size increase
  • Cannot update models without app update
  • App Store size limits
// Bundle model as asset
const context = await initWhisper({
  filePath: require('../assets/ggml-tiny.en.bin'),
})

Runtime Model Download

Pros:
  • Smaller app size
  • Can update models without app update
  • User can choose model size
Cons:
  • Requires network on first use
  • Storage management needed
  • Download errors to handle
import RNFS from 'react-native-fs'

async function downloadModel(modelName: string) {
  const modelUrl = `https://huggingface.co/ggerganov/whisper.cpp/resolve/main/${modelName}`
  const modelPath = `${RNFS.DocumentDirectoryPath}/models/${modelName}`
  
  // Check if already downloaded
  const exists = await RNFS.exists(modelPath)
  if (exists) {
    console.log('Model already downloaded')
    return modelPath
  }
  
  // Create directory
  await RNFS.mkdir(`${RNFS.DocumentDirectoryPath}/models`)
  
  // Download with progress
  const download = RNFS.downloadFile({
    fromUrl: modelUrl,
    toFile: modelPath,
    progressInterval: 1000,
    progressDivider: 1,
    begin: (res) => {
      console.log('Download started:', res.contentLength, 'bytes')
    },
    progress: (res) => {
      const progress = (res.bytesWritten / res.contentLength) * 100
      console.log(`Progress: ${progress.toFixed(2)}%`)
    },
  })
  
  const result = await download.promise
  
  if (result.statusCode === 200) {
    console.log('Download complete:', modelPath)
    return modelPath
  } else {
    throw new Error(`Download failed: ${result.statusCode}`)
  }
}

// Usage
const modelPath = await downloadModel('ggml-tiny.en.bin')
const context = await initWhisper({ filePath: modelPath })

Model Caching Strategy

class ModelManager {
  private static models = new Map<string, string>()
  
  static async getModel(name: string): Promise<string> {
    // Check memory cache
    if (this.models.has(name)) {
      return this.models.get(name)!
    }
    
    // Check disk cache
    const cachedPath = `${RNFS.DocumentDirectoryPath}/models/${name}`
    const exists = await RNFS.exists(cachedPath)
    
    if (exists) {
      this.models.set(name, cachedPath)
      return cachedPath
    }
    
    // Download
    const downloadedPath = await downloadModel(name)
    this.models.set(name, downloadedPath)
    return downloadedPath
  }
  
  static async clearCache() {
    const modelsDir = `${RNFS.DocumentDirectoryPath}/models`
    await RNFS.unlink(modelsDir)
    this.models.clear()
  }
}

Model Conversion

Convert original OpenAI Whisper models to GGML format using the script bundled with whisper.cpp:
# Clone whisper.cpp (conversion script) and the original Whisper repo
git clone https://github.com/ggerganov/whisper.cpp
git clone https://github.com/openai/whisper

# Install Python dependencies (the converter needs PyTorch and openai-whisper)
pip install torch openai-whisper

# Download the original checkpoint; openai-whisper caches it in ~/.cache/whisper/
python -c "import whisper; whisper.load_model('tiny.en')"

# Convert to GGML (args: checkpoint, path to the whisper repo, output dir)
python whisper.cpp/models/convert-pt-to-ggml.py ~/.cache/whisper/tiny.en.pt whisper/ models/

# Output: models/ggml-tiny.en.bin

Quantizing Models

# Build the quantization tool (run inside the whisper.cpp directory)
make quantize

# Quantize to q8_0
./quantize models/ggml-base.en.bin models/ggml-base.en-q8_0.bin q8_0

# Quantize to q5_0
./quantize models/ggml-base.en.bin models/ggml-base.en-q5_0.bin q5_0

Best Practices

1. Start Small, Scale Up

// Development: Fast iteration with tiny
const devContext = await initWhisper({
  filePath: require('../assets/ggml-tiny.en.bin'),
})

// Production: Better quality with base
const prodContext = await initWhisper({
  filePath: await ModelManager.getModel('ggml-base.en-q8_0.bin'),
})

2. Use Quantized Models in Production

// ❌ Avoid: Full precision models in production
const context = await initWhisper({
  filePath: 'ggml-base.en.bin', // 140 MB
})

// ✅ Recommended: Quantized models
const context = await initWhisper({
  filePath: 'ggml-base.en-q8_0.bin', // 75 MB, minimal quality loss
})

3. Enable Hardware Acceleration

// ✅ Best: Enable all optimizations
const context = await initWhisper({
  filePath: modelPath,
  useGpu: true,        // Metal/GPU
  useCoreMLIos: true,  // Core ML (iOS)
  useFlashAttn: false, // Conservative default
})

4. Validate Model Files

async function validateModel(filePath: string) {
  // Check file exists
  const exists = await RNFS.exists(filePath)
  if (!exists) {
    throw new Error('Model file not found')
  }
  
  // Check file size (GGML models > 1MB)
  const stat = await RNFS.stat(filePath)
  if (stat.size < 1024 * 1024) {
    throw new Error('Model file too small, may be corrupted')
  }
  
  // Check the GGML magic number: whisper.cpp writes the uint32 0x67676d6c
  // ("ggml") in little-endian order, so the first 4 bytes on disk read "lmgg"
  const header = await RNFS.read(filePath, 4, 0, 'ascii')
  if (header !== 'lmgg') {
    throw new Error('Not a valid GGML model file')
  }
}

Next Steps

Performance

Optimize transcription performance and threading

Audio Formats

Learn about audio format requirements