Voice Activity Detection (VAD) uses the Silero VAD model to identify speech segments in audio, filtering out silence and background noise.

Overview

VAD is useful for:
  • Pre-processing audio before transcription to reduce processing time
  • Detecting speech boundaries for segmentation
  • Filtering out silent portions of recordings
  • Triggering transcription only when speech is present

Initialize VAD Context

import { initWhisperVad } from 'whisper.rn'

const vadContext = await initWhisperVad({
  filePath: require('./assets/ggml-silero-v6.2.0.bin'), // VAD model file
  useGpu: true,        // Use GPU acceleration (iOS only)
  nThreads: 4,         // Number of threads for processing
})
Download the Silero VAD model from the whisper.cpp releases or Hugging Face.

Initialization Options

type VadContextOptions = {
  filePath: string | number       // Path to VAD model or require() asset
  isBundleAsset?: boolean         // Is filePath a bundle asset (for string paths)
  useGpu?: boolean                // Use GPU acceleration (iOS only, default: true)
  nThreads?: number               // Processing threads (default: 2 for 4-core, 4 for 8+)
}

Detect Speech Segments

From Audio Files

Detect speech in WAV files, base64 audio, or bundled assets:
const segments = await vadContext.detectSpeech(
  'file:///path/to/audio.wav',
  {
    threshold: 0.5,              // Speech probability threshold (0.0-1.0)
    minSpeechDurationMs: 250,    // Minimum speech duration in ms
    minSilenceDurationMs: 100,   // Minimum silence duration in ms
    maxSpeechDurationS: 30,      // Maximum speech duration in seconds
    speechPadMs: 30,             // Padding around speech segments in ms
    samplesOverlap: 0.1,         // Overlap between analysis windows
  }
)

for (const segment of segments) {
  console.log(`Speech: ${segment.t0}s - ${segment.t1}s`)
}

From Raw Audio Data

Detect speech in base64-encoded PCM or ArrayBuffer:
// Most efficient: uses JSI bindings for in-memory audio
declare const audioBuffer: ArrayBuffer // 16-bit PCM, mono, 16kHz

const segments = await vadContext.detectSpeechData(
  audioBuffer,
  {
    threshold: 0.5,
    minSpeechDurationMs: 250,
  }
)
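To exercise detectSpeechData without a real recording, you can synthesize a conforming buffer. The helper below is a sketch of ours, not part of the whisper.rn API; it builds a 16-bit PCM, mono, 16kHz ArrayBuffer filled with a sine tone (pass amplitude 0 for silence):

```typescript
// Sketch: synthesize a 16-bit PCM, mono, 16 kHz ArrayBuffer of a given
// duration. Useful for feeding detectSpeechData in tests.
function makePcmBuffer(
  durationMs: number,
  freqHz = 440,
  amplitude = 0.5,
  sampleRate = 16000,
): ArrayBuffer {
  const sampleCount = Math.round((durationMs / 1000) * sampleRate)
  const buffer = new ArrayBuffer(sampleCount * 2) // 2 bytes per 16-bit sample
  const view = new DataView(buffer)
  for (let i = 0; i < sampleCount; i += 1) {
    const sample = amplitude * Math.sin((2 * Math.PI * freqHz * i) / sampleRate)
    view.setInt16(i * 2, Math.round(sample * 32767), true) // little-endian
  }
  return buffer
}

// 500 ms at 16 kHz = 8000 samples = 16000 bytes
const testBuffer = makePcmBuffer(500)
```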

VAD Options

type VadOptions = {
  threshold?: number              // Probability threshold (0.0-1.0, default: 0.5)
  minSpeechDurationMs?: number    // Min speech duration in ms (default: 250)
  minSilenceDurationMs?: number   // Min silence to end speech in ms (default: 100)
  maxSpeechDurationS?: number     // Max continuous speech in seconds (default: 30)
  speechPadMs?: number            // Padding before/after speech in ms (default: 30)
  samplesOverlap?: number         // Analysis window overlap (0.0-1.0, default: 0.1)
}

Option Explanations

| Option | Description | Typical Range |
| --- | --- | --- |
| threshold | Confidence level to classify as speech (lower = more sensitive) | 0.3-0.8 |
| minSpeechDurationMs | Ignore detections shorter than this | 100-500 ms |
| minSilenceDurationMs | How much silence ends a speech segment | 50-300 ms |
| maxSpeechDurationS | Force a segment split after this duration | 15-60 s |
| speechPadMs | Extra audio before/after detected speech | 10-100 ms |
| samplesOverlap | Overlap between analysis windows (higher = smoother) | 0.05-0.3 |
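The effect of a larger minSilenceDurationMs can also be approximated after the fact by merging segments separated by short gaps. The helper below is an illustration of ours, not part of the whisper.rn API:

```typescript
type VadSegment = { t0: number; t1: number } // times in seconds

// Sketch: merge adjacent segments whose gap is below maxGapS, mirroring
// in post-processing what a larger minSilenceDurationMs does inside the
// detector. Assumes segments are sorted by start time; input is not mutated.
function mergeSegments(segments: VadSegment[], maxGapS: number): VadSegment[] {
  const merged: VadSegment[] = []
  for (const seg of segments) {
    const last = merged[merged.length - 1]
    if (last && seg.t0 - last.t1 < maxGapS) {
      last.t1 = Math.max(last.t1, seg.t1) // extend the previous segment
    } else {
      merged.push({ ...seg })
    }
  }
  return merged
}
```

For example, merging with a 0.2 s gap turns `[{ t0: 0.5, t1: 3.2 }, { t0: 3.3, t1: 4.0 }]` into a single segment spanning 0.5 s to 4.0 s.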

VAD Result

type VadSegment = {
  t0: number  // Start time in seconds
  t1: number  // End time in seconds
}
Example result:
[
  { t0: 0.5, t1: 3.2 },    // First speech segment: 0.5s - 3.2s
  { t0: 4.1, t1: 7.8 },    // Second speech segment: 4.1s - 7.8s
  { t0: 9.0, t1: 12.5 }    // Third speech segment: 9.0s - 12.5s
]
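Segments in this shape make aggregate statistics easy to compute. A small sketch (the helper names are ours, not part of the API):

```typescript
type VadSegment = { t0: number; t1: number } // times in seconds, as above

// Total seconds of detected speech across all segments
function totalSpeechS(segments: VadSegment[]): number {
  return segments.reduce((sum, s) => sum + (s.t1 - s.t0), 0)
}

// Fraction of the recording that contains speech
function speechRatio(segments: VadSegment[], totalDurationS: number): number {
  return totalDurationS > 0 ? totalSpeechS(segments) / totalDurationS : 0
}

const segments = [
  { t0: 0.5, t1: 3.2 },
  { t0: 4.1, t1: 7.8 },
  { t0: 9.0, t1: 12.5 },
]
// totalSpeechS(segments) is about 9.9 s of speech
```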

Processing Results

1. Detect segments

const segments = await vadContext.detectSpeech(
  audioPath,
  { threshold: 0.5 }
)
2. Analyze segments

segments.forEach((segment, index) => {
  const duration = segment.t1 - segment.t0
  console.log(`Segment ${index + 1}:`)
  console.log(`  Start: ${segment.t0.toFixed(2)}s`)
  console.log(`  End: ${segment.t1.toFixed(2)}s`)
  console.log(`  Duration: ${duration.toFixed(2)}s`)
})
3. Filter by duration

// Only keep segments longer than 1 second
const longSegments = segments.filter(
  segment => (segment.t1 - segment.t0) > 1.0
)

Common Use Cases

Pre-filter Audio for Transcription

import { initWhisper, initWhisperVad } from 'whisper.rn'

const whisperContext = await initWhisper({ filePath: modelPath })
const vadContext = await initWhisperVad({ filePath: vadModelPath })

// Detect speech segments
const segments = await vadContext.detectSpeech(audioPath, { threshold: 0.5 })

// Transcribe only speech segments
for (const segment of segments) {
  const { promise } = whisperContext.transcribe(audioPath, {
    language: 'en',
    offset: segment.t0 * 1000,        // Convert to ms
    duration: (segment.t1 - segment.t0) * 1000
  })
  
  const { result } = await promise
  console.log(`Segment ${segment.t0}s-${segment.t1}s: ${result}`)
}
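The seconds-to-milliseconds conversion in the loop above can be factored into a small helper (a sketch; the name is ours):

```typescript
type VadSegment = { t0: number; t1: number } // times in seconds

// Convert a VAD segment (seconds) into the millisecond offset/duration
// pair expected by transcribe()'s options.
function toTranscribeWindow(seg: VadSegment): { offset: number; duration: number } {
  return {
    offset: Math.round(seg.t0 * 1000),
    duration: Math.round((seg.t1 - seg.t0) * 1000),
  }
}

// toTranscribeWindow({ t0: 0.5, t1: 3.2 }) → { offset: 500, duration: 2700 }
```

Rounding avoids passing fractional milliseconds that floating-point subtraction would otherwise produce.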

Real-time Speech Detection

See the Realtime Transcription guide for VAD integration with live audio.

Optimize for Different Environments

In quiet environments, use a lower threshold to catch softer speech:
const segments = await vadContext.detectSpeech(audioPath, {
  threshold: 0.3,              // More sensitive
  minSpeechDurationMs: 100,    // Catch short utterances
  speechPadMs: 50              // More padding
})
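Conversely, in noisy environments a higher threshold and longer minimum durations help reject background sounds. The presets below are suggested starting points of ours, not library defaults; tune them against your own recordings:

```typescript
// Suggested starting points, not whisper.rn defaults
const quietRoomOptions = {
  threshold: 0.3,            // more sensitive: catch soft speech
  minSpeechDurationMs: 100,  // keep short utterances
  speechPadMs: 50,           // extra padding around segments
}

const noisyEnvironmentOptions = {
  threshold: 0.7,            // less sensitive: reject background noise
  minSpeechDurationMs: 400,  // ignore short noise bursts
  minSilenceDurationMs: 200, // require clearer pauses between segments
}
```

Either object can be passed as the second argument to detectSpeech or detectSpeechData.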

Memory Management

// Release single VAD context
await vadContext.release()

// Release all VAD contexts
import { releaseAllWhisperVad } from 'whisper.rn'
await releaseAllWhisperVad()
Always release VAD contexts when done to free native resources.

Audio Format Requirements

Like Whisper transcription, VAD requires:
| Property | Requirement |
| --- | --- |
| Sample Rate | 16kHz |
| Channels | Mono (1 channel) |
| Bit Depth | 16-bit PCM |
| Format | WAV file or raw PCM data |
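If your audio pipeline produces Float32 samples (as Web Audio and many recorders do), they must be converted to 16-bit PCM before calling detectSpeechData. A conversion sketch (the helper is ours; it assumes the input is already mono and resampled to 16kHz):

```typescript
// Sketch: convert Float32 samples in [-1, 1] to a 16-bit PCM ArrayBuffer.
function floatTo16BitPcm(samples: Float32Array): ArrayBuffer {
  const buffer = new ArrayBuffer(samples.length * 2)
  const view = new DataView(buffer)
  for (let i = 0; i < samples.length; i += 1) {
    const s = Math.max(-1, Math.min(1, samples[i])) // clamp out-of-range values
    // Scale asymmetrically so both -1 and +1 fit the Int16 range
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true) // little-endian
  }
  return buffer
}
```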

Performance Tips

  1. Enable GPU on iOS - Set useGpu: true for better performance
  2. Adjust thread count - More threads for faster processing on multi-core devices
  3. Use ArrayBuffer - JSI bindings provide the best performance for in-memory audio
  4. Tune threshold - Start with 0.5 and adjust based on your audio environment
  5. Balance segment length - Shorter maxSpeechDurationS means more segments but better granularity

Error Handling

try {
  const segments = await vadContext.detectSpeech(audioPath, options)
  
  if (segments.length === 0) {
    console.log('No speech detected in audio')
  } else {
    console.log(`Found ${segments.length} speech segments`)
  }
} catch (error) {
  console.error('VAD failed:', error)
  // Handle errors (invalid file, unsupported format, etc.)
}
