Voice Activity Detection (VAD) identifies speech segments in audio files, filtering out silence and background noise. whisper.rn integrates the Silero VAD model for accurate speech detection.

Overview

VAD is useful for:
  • Pre-processing audio before transcription
  • Splitting long audio into speech segments
  • Reducing transcription costs by skipping silence
  • Improving transcription accuracy

Quick Start

1. Initialize VAD Context

First, initialize a VAD context with the Silero VAD model:

```typescript
import { initWhisperVad } from 'whisper.rn';

const vadContext = await initWhisperVad({
  filePath: require('../assets/ggml-silero-v6.2.0.bin'),
  useGpu: true,
  nThreads: 4,
});

console.log('VAD model loaded, ID:', vadContext.id);
```
The Silero VAD model is only ~350KB, much smaller than Whisper models.
2. Detect Speech in Audio

Use detectSpeech() to find speech segments in an audio file:

```typescript
const sampleFile = require('../assets/jfk.wav');

const segments = await vadContext.detectSpeech(sampleFile, {
  threshold: 0.5,
  minSpeechDurationMs: 250,
  minSilenceDurationMs: 100,
  maxSpeechDurationS: 30,
  speechPadMs: 30,
});

console.log(`Detected ${segments.length} speech segments`);
segments.forEach((segment, i) => {
  // t0/t1 are in centiseconds (10ms units)
  console.log(`Segment ${i + 1}: ${segment.t0 * 10}ms - ${segment.t1 * 10}ms`);
});
```
3. Process Results

Each segment contains start (t0) and end (t1) timestamps in centiseconds (10ms units):

```typescript
function toTimestamp(t: number) {
  let msec = t * 10;
  const hr = Math.floor(msec / (1000 * 60 * 60));
  msec -= hr * (1000 * 60 * 60);
  const min = Math.floor(msec / (1000 * 60));
  msec -= min * (1000 * 60);
  const sec = Math.floor(msec / 1000);
  msec -= sec * 1000;

  return `${String(hr).padStart(2, '0')}:${String(min).padStart(2, '0')}:${String(sec).padStart(2, '0')}.${String(msec).padStart(3, '0')}`;
}

segments.forEach((segment, i) => {
  const duration = (segment.t1 - segment.t0) / 100; // Convert to seconds
  console.log(
    `${i + 1}. [${toTimestamp(segment.t0)} --> ${toTimestamp(segment.t1)}] ` +
    `Duration: ${duration.toFixed(2)}s`
  );
});
```
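Since timestamps are plain centisecond values, summarizing the results is simple arithmetic. As a sketch, a small helper (not part of whisper.rn) that totals the detected speech time:

```typescript
interface VadSegment {
  t0: number; // segment start, centiseconds
  t1: number; // segment end, centiseconds
}

// Total speech duration and segment count from VAD results
function speechStats(segments: VadSegment[]) {
  const speechSeconds = segments.reduce(
    (sum, s) => sum + (s.t1 - s.t0) / 100, // centiseconds -> seconds
    0
  );
  return { count: segments.length, speechSeconds };
}
```

Call `speechStats(segments)` on the array returned by detectSpeech(), for example to decide whether an audio file is worth transcribing at all.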
4. Clean Up

Release the VAD context when done to free native resources:

```typescript
await vadContext.release();
```
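Because release() frees native memory, it is worth guaranteeing it runs even when detection throws. One pattern is a small try/finally wrapper; `withRelease` below is a sketch, not part of the whisper.rn API:

```typescript
// Run work against a context, always releasing it afterwards,
// whether the work resolves or throws
async function withRelease<C extends { release(): Promise<void> }, T>(
  ctx: C,
  fn: (ctx: C) => Promise<T>
): Promise<T> {
  try {
    return await fn(ctx);
  } finally {
    await ctx.release();
  }
}
```

Usage would look like `await withRelease(vadContext, (ctx) => ctx.detectSpeech(sampleFile, options))`.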

VAD Configuration Options

The detectSpeech() method supports several options to tune detection sensitivity:

Default Settings

```typescript
const segments = await vadContext.detectSpeech(audioFile, {
  threshold: 0.5,              // Speech probability threshold (0.0 - 1.0)
  minSpeechDurationMs: 250,    // Minimum speech duration to keep
  minSilenceDurationMs: 100,   // Minimum silence to split segments
  maxSpeechDurationS: 30,      // Maximum segment length
  speechPadMs: 30,             // Padding around speech segments
  samplesOverlap: 0.1,         // Sample overlap ratio
});
```

Sensitive Detection

For detecting quiet or short speech:
```typescript
const segments = await vadContext.detectSpeech(audioFile, {
  threshold: 0.3,              // Lower threshold = more sensitive
  minSpeechDurationMs: 100,    // Detect shorter utterances
  minSilenceDurationMs: 50,    // Less silence required to split
  maxSpeechDurationS: 15,      // Shorter max segments
  speechPadMs: 50,             // More padding for safety
  samplesOverlap: 0.2,         // More overlap for accuracy
});
```

Conservative Detection

For reducing false positives:
```typescript
const segments = await vadContext.detectSpeech(audioFile, {
  threshold: 0.7,              // Higher threshold = less sensitive
  minSpeechDurationMs: 500,    // Only longer speech segments
  minSilenceDurationMs: 200,   // More silence required to split
  maxSpeechDurationS: 60,      // Longer max segments
  speechPadMs: 10,             // Minimal padding
  samplesOverlap: 0.05,        // Less overlap
});
```
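If tuning the options alone still lets borderline segments through, the results can also be post-processed. Below is a sketch of an illustrative helper (not part of the whisper.rn API) that merges segments separated by short gaps and then drops very short ones:

```typescript
interface VadSegment {
  t0: number; // centiseconds
  t1: number; // centiseconds
}

// Merge segments whose gap is below mergeGapMs, then drop segments
// shorter than minLenMs (timestamps are 10ms units, hence the * 10)
function tidySegments(
  segments: VadSegment[],
  mergeGapMs = 150,
  minLenMs = 200
): VadSegment[] {
  const merged: VadSegment[] = [];
  for (const s of segments) {
    const last = merged[merged.length - 1];
    if (last && (s.t0 - last.t1) * 10 < mergeGapMs) {
      last.t1 = s.t1; // close the small gap
    } else {
      merged.push({ ...s });
    }
  }
  return merged.filter((s) => (s.t1 - s.t0) * 10 >= minLenMs);
}
```

This assumes detectSpeech() returns segments in chronological order.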

Complete Example

Here's a complete component with VAD detection:

```typescript
import React, { useCallback, useEffect, useRef, useState } from 'react';
import { View, Text, Button, ScrollView } from 'react-native';
import { initWhisperVad } from 'whisper.rn';
import type { WhisperVadContext, VadSegment } from 'whisper.rn';

const sampleFile = require('../assets/jfk.wav');

export default function VadDetection() {
  const vadContextRef = useRef<WhisperVadContext | null>(null);
  // Track readiness in state so the buttons re-render when the
  // context becomes available (a ref change alone does not re-render)
  const [isReady, setIsReady] = useState(false);
  const [logs, setLogs] = useState<string[]>([]);
  const [segments, setSegments] = useState<VadSegment[]>([]);

  const log = useCallback((...messages: any[]) => {
    setLogs((prev) => [...prev, messages.join(' ')]);
  }, []);

  useEffect(() => {
    return () => {
      vadContextRef.current?.release();
    };
  }, []);

  const initialize = async () => {
    if (vadContextRef.current) {
      await vadContextRef.current.release();
      log('Released previous VAD context');
    }

    log('Initializing VAD...');
    const startTime = Date.now();
    const ctx = await initWhisperVad({
      filePath: require('../assets/ggml-silero-v6.2.0.bin'),
      useGpu: true,
      nThreads: 4,
    });
    const endTime = Date.now();

    log(`VAD loaded in ${endTime - startTime}ms`);
    vadContextRef.current = ctx;
    setIsReady(true);
  };

  const detectSpeech = async (preset: 'default' | 'sensitive' | 'conservative') => {
    if (!vadContextRef.current) {
      log('VAD not initialized');
      return;
    }

    const options = {
      default: {
        threshold: 0.5,
        minSpeechDurationMs: 250,
        minSilenceDurationMs: 100,
        maxSpeechDurationS: 30,
        speechPadMs: 30,
        samplesOverlap: 0.1,
      },
      sensitive: {
        threshold: 0.3,
        minSpeechDurationMs: 100,
        minSilenceDurationMs: 50,
        maxSpeechDurationS: 15,
        speechPadMs: 50,
        samplesOverlap: 0.2,
      },
      conservative: {
        threshold: 0.7,
        minSpeechDurationMs: 500,
        minSilenceDurationMs: 200,
        maxSpeechDurationS: 60,
        speechPadMs: 10,
        samplesOverlap: 0.05,
      },
    }[preset];

    log(`Detecting speech (${preset} mode)...`);
    const startTime = Date.now();

    const detectedSegments = await vadContextRef.current.detectSpeech(
      sampleFile,
      options
    );

    const endTime = Date.now();
    log(`Found ${detectedSegments.length} segments in ${endTime - startTime}ms`);
    setSegments(detectedSegments);
  };

  return (
    <ScrollView style={{ padding: 20 }}>
      <Button title="Initialize VAD" onPress={initialize} />

      <View style={{ marginTop: 10 }}>
        <Button
          title="Detect (Default)"
          onPress={() => detectSpeech('default')}
          disabled={!isReady}
        />
        <Button
          title="Detect (Sensitive)"
          onPress={() => detectSpeech('sensitive')}
          disabled={!isReady}
        />
        <Button
          title="Detect (Conservative)"
          onPress={() => detectSpeech('conservative')}
          disabled={!isReady}
        />
      </View>

      <View style={{ marginTop: 20 }}>
        <Text>Logs:</Text>
        {logs.map((log, i) => (
          <Text key={i}>{log}</Text>
        ))}
      </View>

      {segments.length > 0 && (
        <View style={{ marginTop: 20 }}>
          <Text>Detected Speech Segments:</Text>
          {segments.map((segment, i) => {
            const duration = ((segment.t1 - segment.t0) / 100).toFixed(2);
            return (
              <Text key={i}>
                {i + 1}. {segment.t0 * 10}ms - {segment.t1 * 10}ms ({duration}s)
              </Text>
            );
          })}
        </View>
      )}
    </ScrollView>
  );
}
```

Using VAD with Recorded Audio

You can also detect speech in recorded audio data:
```typescript
import { Buffer } from 'buffer';
import LiveAudioStream from '@fugood/react-native-audio-pcm-stream';

// Configure and start the stream first (LiveAudioStream.init() and
// LiveAudioStream.start(); see that library's documentation), then
// collect the base64-encoded PCM chunks it emits:
const chunks: Buffer[] = [];
LiveAudioStream.on('data', (chunk: string) => {
  chunks.push(Buffer.from(chunk, 'base64'));
});

// After recording stops, concatenate the chunks and re-encode as base64
const base64Data = Buffer.concat(chunks).toString('base64');

// Detect speech in the recorded data
const segments = await vadContext.detectSpeechData(base64Data, {
  threshold: 0.5,
  minSpeechDurationMs: 250,
  minSilenceDurationMs: 100,
  maxSpeechDurationS: 30,
  speechPadMs: 30,
  samplesOverlap: 0.1,
});

console.log(`Detected ${segments.length} speech segments in recorded audio`);
```
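If your audio source produces Float32 samples rather than raw PCM bytes, they need converting before encoding. A sketch, assuming detectSpeechData() expects base64-encoded 16-bit little-endian PCM (check the API reference for your version):

```typescript
import { Buffer } from 'buffer';

// Clamp Float32 samples to [-1, 1], scale to int16, and base64-encode
function floatTo16BitBase64(samples: Float32Array): string {
  const buf = Buffer.alloc(samples.length * 2);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    buf.writeInt16LE(Math.round(s < 0 ? s * 0x8000 : s * 0x7fff), i * 2);
  }
  return buf.toString('base64');
}
```

The result can be passed directly as the first argument to detectSpeechData().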

VAD + Transcription Workflow

Combine VAD with transcription for efficient processing:
```typescript
import { initWhisper, initWhisperVad } from 'whisper.rn';

// Initialize both contexts
const whisperCtx = await initWhisper({
  filePath: require('../assets/ggml-base.bin'),
});

const vadCtx = await initWhisperVad({
  filePath: require('../assets/ggml-silero-v6.2.0.bin'),
  useGpu: true,
  nThreads: 4,
});

// Detect speech segments
const segments = await vadCtx.detectSpeech(audioFile, {
  threshold: 0.5,
  minSpeechDurationMs: 250,
  minSilenceDurationMs: 100,
});

if (segments.length === 0) {
  console.log('No speech detected');
} else {
  // Transcribe the full audio (VAD results can guide post-processing)
  const { promise } = whisperCtx.transcribe(audioFile, {
    language: 'en',
  });

  const { result } = await promise;
  console.log('Transcription:', result);
  console.log(`Speech detected in ${segments.length} segments`);
}

// Cleanup
await vadCtx.release();
await whisperCtx.release();
```
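One way to use the VAD results for post-processing is to keep only the transcript segments that fall within detected speech. A sketch, using whisper.cpp-style timestamps in 10ms units (the helper name is illustrative, not part of the API):

```typescript
interface TimeRange {
  t0: number; // centiseconds
  t1: number; // centiseconds
}

// True if the range intersects at least one detected speech segment
function overlapsSpeech(range: TimeRange, speech: TimeRange[]): boolean {
  return speech.some((s) => range.t0 < s.t1 && range.t1 > s.t0);
}
```

For example, `transcriptSegments.filter((seg) => overlapsSpeech(seg, segments))`, assuming the transcription result exposes per-segment timestamps in the same units.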

Performance Tips

  • Model Size: The Silero VAD model is only ~350KB and loads very quickly.
  • GPU Acceleration: Enable useGpu: true for faster processing on iOS devices.
  • Thread Count: Use 4 threads for optimal VAD performance on most devices.

Next Steps

Realtime Streaming

Use VAD with realtime transcription for automatic speech detection

Basic Transcription

Learn basic audio file transcription

File Handling

Work with different audio formats and data sources

API Reference

Full API documentation for WhisperVadContext
