This feature is coming in version 0.7.0 and is not yet available in the current release.

Overview

Voice Activity Detection (VAD) will enable real-time detection of speech vs. silence in audio streams. This is essential for:
  • Automatic silence removal in recordings
  • Speech segmentation before transcription
  • Reducing unnecessary processing during silent periods
  • Triggering speech recognition only when needed

Planned Features

Real-time Detection

Detect voice activity as audio streams in

Low Latency

Minimal processing delay for responsive apps

Silence Removal

Automatically skip non-speech segments

Speech Segmentation

Split audio into speech and non-speech regions

Expected API (Preview)

While the API is not finalized, the expected interface will be:
import { createVAD } from 'react-native-sherpa-onnx/vad';

// Create VAD engine
const vad = await createVAD({
  modelPath: { type: 'asset', path: 'models/silero-vad' },
  sampleRate: 16000,
  windowSize: 512,
});

// Process audio chunks
const isSpeech = await vad.detectSpeech(samples);

if (isSpeech) {
  // Forward to STT or other processing
  processAudio(samples);
}

// Cleanup
await vad.destroy();

Use Cases

1. Efficient Recording

Only save or process audio segments containing speech:
// Planned API
const recorder = startRecording();

recorder.on('chunk', async (samples) => {
  const isSpeech = await vad.detectSpeech(samples);
  
  if (isSpeech) {
    // Only process speech segments
    await processAudioChunk(samples);
  }
});

2. Pre-processing for STT

Segment continuous audio before transcription:
// Planned API
const segments = await vad.segmentAudio(audioFile);

for (const segment of segments) {
  if (segment.isSpeech) {
    const result = await stt.transcribeSamples(
      segment.samples,
      segment.sampleRate
    );
    console.log(result.text);
  }
}

3. Wake Word Detection

Trigger STT only when speech is detected:
// Planned API
const stream = await createAudioStream();

stream.on('data', async (samples) => {
  const isSpeech = await vad.detectSpeech(samples);
  
  if (isSpeech) {
    // Start transcription
    await sttStream.acceptWaveform(samples, 16000);
  }
});

Planned Configuration

// Expected configuration options
interface VADConfig {
  modelPath: ModelPathConfig;
  sampleRate: number;        // 8000, 16000 (default), 32000, 48000
  windowSize: number;        // Samples per window (e.g., 512, 1024)
  threshold: number;         // Speech confidence threshold (0..1)
  minSpeechDuration: number; // Minimum speech length (ms)
  minSilenceDuration: number; // Minimum silence to split (ms)
}
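As a sketch of how these options might fit together, here is an illustrative 16 kHz configuration. The types are defined locally so the snippet stands alone; `ModelPathConfig` and the values chosen (threshold 0.5, 512-sample windows) are assumptions, not finalized defaults.

```typescript
// Illustrative only: these types mirror the planned interface above and are
// declared locally so the example compiles before v0.7.0 ships.
type ModelPathConfig = { type: 'asset' | 'file'; path: string };

interface VADConfig {
  modelPath: ModelPathConfig;
  sampleRate: number;         // 8000, 16000 (default), 32000, 48000
  windowSize: number;         // samples per window (e.g., 512, 1024)
  threshold: number;          // speech confidence threshold (0..1)
  minSpeechDuration: number;  // minimum speech length (ms)
  minSilenceDuration: number; // minimum silence to split (ms)
}

// A typical 16 kHz setup: 512-sample windows (32 ms each), with segments
// shorter than 250 ms treated as noise rather than speech.
const config: VADConfig = {
  modelPath: { type: 'asset', path: 'models/silero-vad' },
  sampleRate: 16000,
  windowSize: 512,
  threshold: 0.5,
  minSpeechDuration: 250,
  minSilenceDuration: 500,
};
```

At 16 kHz, the window size directly sets detection granularity: 512 samples is 32 ms, so shorter windows react faster but see less context per decision.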

Expected Models

Models likely to be supported:
  • Silero VAD - Lightweight, efficient, ONNX-based
  • WebRTC VAD - Classic algorithm
  • Custom models - Via sherpa-onnx framework

Timeline

VAD support is planned for:
  1. Version 0.7.0 - Initial VAD implementation with basic detection
  2. Future versions - Advanced features like adaptive thresholds and multi-language support

Stay Updated

To track progress or contribute, watch the project repository for updates.

Current Workarounds

While VAD is not available, you can:
  1. Use streaming STT with endpoint detection - The streaming STT API already includes basic endpoint detection
  2. External libraries - Use JavaScript audio analysis libraries
  3. Manual silence detection - Implement simple amplitude-based detection

Simple Amplitude Detection

function detectSilence(samples: number[], threshold: number = 0.01): boolean {
  const rms = Math.sqrt(
    samples.reduce((sum, val) => sum + val * val, 0) / samples.length
  );
  return rms < threshold;
}

// Usage
const samples = getPcmSamples();
const isSilent = detectSilence(samples);

if (!isSilent) {
  // Process audio
}

Streaming STT

Real-time transcription with endpoint detection

Speech Enhancement

Noise reduction (coming in v0.5.0)
