Overview
whisper.rn requires specific audio format constraints to work with the underlying whisper.cpp engine. Understanding these requirements is essential for successful transcription.
Core Requirements
Whisper models require audio in this exact format:
{
  sampleRate: 16000, // 16 kHz (required by Whisper)
  channels: 1,       // Mono (single channel)
  format: 'PCM',     // Pulse Code Modulation
  bitDepth: 16       // 16-bit samples
}
Audio that doesn’t meet these requirements will produce incorrect transcriptions or errors. Always convert audio to the correct format before transcription.
Why These Requirements?
16 kHz Sample Rate:
Whisper models were trained on 16 kHz audio
Higher sample rates (44.1 kHz, 48 kHz) are automatically downsampled by whisper.cpp
Lower sample rates may reduce transcription quality
Mono Channel:
Whisper processes single-channel audio
Stereo audio is automatically mixed to mono by the native layer
16-bit PCM:
Uncompressed linear PCM format
Each sample is a 16-bit signed integer (-32768 to 32767)
Supported Audio Formats
whisper.rn accepts audio in several formats, handled by the native audio utilities (cpp/rn-audioutils.cpp):
1. WAV Files
Standard WAV files with automatic format conversion:
import { initWhisper } from 'whisper.rn'

const context = await initWhisper({
  filePath: require('../assets/model.bin'),
})

// WAV file (any sample rate, mono or stereo)
const { promise } = context.transcribe(
  'file:///path/to/audio.wav',
  { language: 'en' }
)
The native layer automatically:
Resamples to 16 kHz if needed
Converts stereo to mono
Converts to 16-bit PCM if needed
2. Base64-Encoded WAV
For network transfers or embedded audio:
// Base64 WAV data must include the data URI prefix
const base64Wav = 'data:audio/wav;base64,UklGRiQAAABXQVZF...'

const { promise } = context.transcribe(base64Wav, {
  language: 'en',
})
3. Raw PCM Data (Base64)
Base64-encoded float32 PCM samples:
// Float32 PCM samples (-1.0 to 1.0), mono, 16 kHz
const base64Pcm = 'AAAAAAA...'

const { promise } = context.transcribeData(base64Pcm, {
  language: 'en',
})
For transcribeData(), the base64 string represents float32 samples (not int16), where each sample is normalized to the range -1.0 to 1.0.
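As an illustration of that layout, here is a sketch of packing int16 samples into the float32 base64 form. The helper name is hypothetical, and `Buffer` is assumed to be available (Node, or a React Native buffer polyfill):

```typescript
// Sketch: normalize 16-bit PCM to float32 (-1.0..1.0), then base64-encode
// the raw float32 bytes -- the layout transcribeData() expects for strings.
function int16ToFloat32Base64(samples: Int16Array): string {
  const floats = new Float32Array(samples.length)
  for (let i = 0; i < samples.length; i++) {
    floats[i] = samples[i] / 32768 // -32768..32767 -> -1.0..~1.0
  }
  return Buffer.from(floats.buffer).toString('base64')
}
```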
4. ArrayBuffer (Fastest)
Direct memory transfer using JSI bindings (cpp/jsi/RNWhisperJSI.cpp):
// 16-bit PCM samples, mono, 16 kHz
const audioBuffer: ArrayBuffer = new Int16Array([
  // ... 16-bit PCM samples
]).buffer

const { promise } = context.transcribeData(audioBuffer, {
  language: 'en',
})
ArrayBuffer inputs bypass JSON serialization entirely (index.ts:402-405), providing the best performance for real-time or large audio processing.
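Going the other way, microphone APIs often deliver float32 samples; a sketch of packing those into the 16-bit PCM ArrayBuffer used here (the helper name is illustrative):

```typescript
// Sketch: convert float32 samples (-1.0..1.0) to a 16-bit PCM ArrayBuffer.
function float32ToInt16Buffer(samples: Float32Array): ArrayBuffer {
  const out = new Int16Array(samples.length)
  for (let i = 0; i < samples.length; i++) {
    // Clamp before scaling so out-of-range input cannot overflow int16
    const s = Math.max(-1, Math.min(1, samples[i]))
    out[i] = Math.round(s * 32767)
  }
  return out.buffer
}
```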
Converting Audio Files
Use ffmpeg to convert audio to the required format:
# Convert any audio file to 16kHz mono WAV
ffmpeg -i input.mp3 -ar 16000 -ac 1 -sample_fmt s16 output.wav
# From stereo to mono
ffmpeg -i stereo.wav -ac 1 mono.wav
# Resample to 16kHz
ffmpeg -i input.wav -ar 16000 output.wav
JavaScript Conversion
For web-based or React Native apps:
import { Audio } from 'expo-av'

// Record audio at the correct settings
const recording = new Audio.Recording()
await recording.prepareToRecordAsync({
  android: {
    extension: '.wav',
    outputFormat: Audio.RECORDING_OPTION_ANDROID_OUTPUT_FORMAT_DEFAULT,
    audioEncoder: Audio.RECORDING_OPTION_ANDROID_AUDIO_ENCODER_DEFAULT,
    sampleRate: 16000,
    numberOfChannels: 1,
    bitRate: 128000,
  },
  ios: {
    extension: '.wav',
    audioQuality: Audio.RECORDING_OPTION_IOS_AUDIO_QUALITY_HIGH,
    sampleRate: 16000,
    numberOfChannels: 1,
    bitRate: 128000,
    linearPCMBitDepth: 16,
    linearPCMIsBigEndian: false,
    linearPCMIsFloat: false,
  },
  web: {
    mimeType: 'audio/wav',
    bitsPerSecond: 128000,
  },
})
await recording.startAsync()
PCM Stream Processing
For real-time audio streams:
import AudioPcmStream from '@fugood/react-native-audio-pcm-stream'

// Configure for 16 kHz mono
const stream = new AudioPcmStream({
  sampleRate: 16000,
  channels: 1,
  bitsPerSample: 16,
})

stream.on('data', (data: Buffer) => {
  // data contains 16-bit PCM samples
  const int16Array = new Int16Array(data.buffer)
  // Pass the underlying ArrayBuffer to transcription
  const audioBuffer = int16Array.buffer
  context.transcribeData(audioBuffer, options)
})
Validating Audio Files
Validate WAV files before transcription:
import RNFS from 'react-native-fs'

async function validateWavFile(filePath: string) {
  // Read first 44 bytes (WAV header)
  const header = await RNFS.read(filePath, 44, 0, 'base64')
  const buffer = Buffer.from(header, 'base64')

  // Check RIFF header
  const riff = buffer.toString('ascii', 0, 4)
  if (riff !== 'RIFF') {
    throw new Error('Not a valid WAV file')
  }

  // Check WAVE format
  const wave = buffer.toString('ascii', 8, 12)
  if (wave !== 'WAVE') {
    throw new Error('Not a WAVE file')
  }

  // Read audio format (offset 20, 2 bytes)
  const audioFormat = buffer.readUInt16LE(20)
  if (audioFormat !== 1) {
    throw new Error('Not PCM format')
  }

  // Read number of channels (offset 22, 2 bytes)
  const channels = buffer.readUInt16LE(22)
  // Read sample rate (offset 24, 4 bytes)
  const sampleRate = buffer.readUInt32LE(24)
  // Read bit depth (offset 34, 2 bytes)
  const bitDepth = buffer.readUInt16LE(34)

  console.log('WAV Info:', {
    channels,
    sampleRate,
    bitDepth,
    isValid: channels <= 2 && sampleRate >= 8000 && bitDepth === 16,
  })

  return { channels, sampleRate, bitDepth }
}
function detectAudioFormat(filePath: string) {
  const ext = filePath.split('.').pop()?.toLowerCase()
  switch (ext) {
    case 'wav':
      return 'wav'
    case 'mp3':
    case 'm4a':
    case 'aac':
      throw new Error(
        `${ext} format not supported. Convert to WAV with: ` +
        `ffmpeg -i input.${ext} -ar 16000 -ac 1 output.wav`
      )
    default:
      throw new Error(`Unknown audio format: ${ext}`)
  }
}
Memory Considerations
Audio Buffer Sizes
Calculate memory usage for audio buffers:
function calculateBufferSize(durationSec: number, sampleRate = 16000) {
  const samples = durationSec * sampleRate
  const bytes = samples * 2 // 16-bit = 2 bytes per sample
  const mb = bytes / (1024 * 1024)
  return {
    samples,
    bytes,
    mb: mb.toFixed(2),
  }
}

// Examples:
console.log('30 seconds:', calculateBufferSize(30))
// { samples: 480000, bytes: 960000, mb: '0.92' }
console.log('5 minutes:', calculateBufferSize(300))
// { samples: 4800000, bytes: 9600000, mb: '9.16' }
Large audio files consume significant memory. For files longer than 30 seconds, consider:
Using the RealtimeTranscriber with auto-slicing
Processing audio in chunks
Implementing a queue system
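The chunked approach can be sketched as follows; `splitIntoChunks` and the sequential loop are illustrative helpers, not part of the whisper.rn API:

```typescript
// Sketch: split a long 16 kHz mono recording into 30-second chunks and
// transcribe them one at a time to bound peak memory use.
const SAMPLES_PER_CHUNK = 30 * 16000 // 30 s at 16 kHz

function splitIntoChunks(samples: Int16Array): Int16Array[] {
  const chunks: Int16Array[] = []
  for (let i = 0; i < samples.length; i += SAMPLES_PER_CHUNK) {
    chunks.push(samples.subarray(i, i + SAMPLES_PER_CHUNK))
  }
  return chunks
}

async function transcribeInChunks(context: any, samples: Int16Array) {
  const texts: string[] = []
  for (const chunk of splitIntoChunks(samples)) {
    // slice() copies the view into its own ArrayBuffer before handoff
    const { promise } = context.transcribeData(chunk.slice().buffer, {})
    texts.push((await promise).result)
  }
  return texts.join(' ')
}
```

Note that chunking at fixed boundaries can cut words in half; the RealtimeTranscriber's auto-slicing handles this more gracefully.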
Optimizing for Mobile
const MAX_AUDIO_DURATION_SEC = 30
const MAX_FILE_SIZE_MB = 10

async function validateAudioSize(filePath: string) {
  const stats = await RNFS.stat(filePath)
  const sizeMB = stats.size / (1024 * 1024)

  if (sizeMB > MAX_FILE_SIZE_MB) {
    throw new Error(
      `Audio file too large: ${sizeMB.toFixed(2)} MB. ` +
      `Maximum: ${MAX_FILE_SIZE_MB} MB`
    )
  }

  // Estimate duration (16 kHz mono 16-bit WAV = 32,000 bytes per second)
  const estimatedDuration = stats.size / (16000 * 2)
  if (estimatedDuration > MAX_AUDIO_DURATION_SEC) {
    console.warn(
      `Long audio detected: ~${estimatedDuration.toFixed(0)}s. ` +
      'Consider using RealtimeTranscriber for better memory management.'
    )
  }
}
Troubleshooting
Issue: Garbled Transcription
Cause: Incorrect sample rate or channels
Solution:
// ❌ Wrong: Using 44.1 kHz stereo audio directly
const result = await context.transcribe('44100hz-stereo.wav').promise
// Output: Garbled or nonsense text

// ✅ Correct: Convert to 16 kHz mono first
// ffmpeg -i input.wav -ar 16000 -ac 1 output.wav
const result = await context.transcribe('16000hz-mono.wav').promise
Issue: Silent Audio / No Transcription
Cause: Audio levels too low or format mismatch
Solution:
// Check audio amplitude
function checkAudioLevel(samples: Int16Array) {
  // Scan with a loop: spreading a large array into Math.max can overflow
  // the call stack, and Int16Array.map would wrap Math.abs(-32768) back
  // to a negative value
  let maxAmplitude = 0
  for (let i = 0; i < samples.length; i++) {
    const amp = Math.abs(samples[i])
    if (amp > maxAmplitude) maxAmplitude = amp
  }
  const threshold = 1000 // Minimum amplitude
  if (maxAmplitude < threshold) {
    console.warn(
      `Audio level too low: ${maxAmplitude}. ` +
      'Recording may be silent or gain too low.'
    )
  }
}
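If the signal is quiet rather than silent, a simple peak normalization before transcription can help. This is a sketch; `normalizePeak` is a hypothetical helper, not a whisper.rn API:

```typescript
// Sketch: scale samples so the loudest one reaches ~90% of int16 full scale.
function normalizePeak(samples: Int16Array, target = 0.9): Int16Array {
  let peak = 0
  for (let i = 0; i < samples.length; i++) {
    const amp = Math.abs(samples[i])
    if (amp > peak) peak = amp
  }
  if (peak === 0) return samples // all-zero input: nothing to scale
  const gain = (target * 32767) / peak
  const out = new Int16Array(samples.length)
  for (let i = 0; i < samples.length; i++) {
    out[i] = Math.round(samples[i] * gain)
  }
  return out
}
```

Peak normalization amplifies noise along with speech, so it is a stopgap; fixing recording gain at the source is preferable.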
Issue: MP3/M4A Not Working
Cause: Only WAV format is supported
Solution:
// Convert MP3 to WAV before transcription
import { FFmpegKit } from 'ffmpeg-kit-react-native'

async function convertToWav(inputPath: string, outputPath: string) {
  // Quote the paths in case they contain spaces
  const command = `-i "${inputPath}" -ar 16000 -ac 1 -sample_fmt s16 "${outputPath}"`
  const session = await FFmpegKit.execute(command)
  const returnCode = await session.getReturnCode()

  if (returnCode.isValueSuccess()) {
    console.log('Conversion successful')
    return outputPath
  } else {
    throw new Error('Conversion failed')
  }
}
Best Practices
1. Record at the Correct Settings
Configure audio recording to use 16 kHz mono from the start:
const recordingOptions = {
  sampleRate: 16000,
  numberOfChannels: 1,
  bitRate: 128000,
  linearPCMBitDepth: 16,
}
2. Validate Before Processing
async function transcribeWithValidation(
  context: WhisperContext,
  filePath: string,
  options: TranscribeOptions
) {
  // Validate file exists
  const exists = await RNFS.exists(filePath)
  if (!exists) {
    throw new Error('Audio file not found')
  }

  // Validate format
  detectAudioFormat(filePath)

  // Validate size
  await validateAudioSize(filePath)

  // Transcribe
  return context.transcribe(filePath, options)
}
3. Use ArrayBuffer for Real-time
For real-time transcription, use ArrayBuffer to avoid serialization overhead:
// ❌ Slower: Base64 encoding (the base64 path also expects float32 samples)
const base64 = Buffer.from(float32Samples.buffer).toString('base64')
const result = await context.transcribeData(base64, options)

// ✅ Faster: Direct ArrayBuffer of 16-bit PCM
const buffer = new Int16Array(pcmData).buffer
const result = await context.transcribeData(buffer, options)
4. Chunk Long Audio
For audio longer than 30 seconds, process in chunks:
import { RealtimeTranscriber } from 'whisper.rn/realtime-transcription'

const transcriber = new RealtimeTranscriber(
  { whisperContext, audioStream, fs: RNFS },
  {
    audioSliceSec: 25, // Process in 25-second chunks
    maxSlicesInMemory: 3, // Keep only 3 chunks in memory
  },
  {
    onTranscribe: (event) => {
      console.log('Chunk result:', event.data?.result)
    },
  }
)
iOS
Audio session must be properly configured for recording
Use AudioSessionIos utilities for session management
Core Audio handles resampling automatically
import { AudioSessionIos } from 'whisper.rn'

await AudioSessionIos.setCategory(
  AudioSessionIos.Category.PlayAndRecord,
  [AudioSessionIos.CategoryOption.DefaultToSpeaker]
)
Android
Ensure RECORD_AUDIO permission is granted
MediaRecorder settings affect audio quality
Some devices may have hardware limitations on sample rates
import { PermissionsAndroid } from 'react-native'

const granted = await PermissionsAndroid.request(
  PermissionsAndroid.PERMISSIONS.RECORD_AUDIO
)
Next Steps
Models: Learn about GGML models and quantization
Performance: Optimize transcription performance