This feature is coming in version 0.7.0 and is not yet available in the current release.

Overview

Voice Activity Detection (VAD) will enable real-time detection of speech vs. silence in audio streams. This is essential for:
  • Automatic silence removal in recordings
  • Speech segmentation before transcription
  • Reducing unnecessary processing during silent periods
  • Triggering speech recognition only when needed

Planned Features

Real-time Detection

Detect voice activity as audio streams in

Low Latency

Minimal processing delay for responsive apps

Silence Removal

Automatically skip non-speech segments

Speech Segmentation

Split audio into speech and non-speech regions

Expected API (Preview)

While the API is not finalized, the expected interface will be:
import { createVAD } from 'react-native-sherpa-onnx/vad';

// Create VAD engine
const vad = await createVAD({
  modelPath: { type: 'asset', path: 'models/silero-vad' },
  sampleRate: 16000,
  windowSize: 512,
});

// Process audio chunks
const isSpeech = await vad.detectSpeech(samples);

if (isSpeech) {
  // Forward to STT or other processing
  processAudio(samples);
}

// Cleanup
await vad.destroy();

Use Cases

1. Efficient Recording

Only save or process audio segments containing speech:
// Planned API
const recorder = startRecording();

recorder.on('chunk', async (samples) => {
  const isSpeech = await vad.detectSpeech(samples);
  
  if (isSpeech) {
    // Only process speech segments
    await processAudioChunk(samples);
  }
});

2. Pre-processing for STT

Segment continuous audio before transcription:
// Planned API
const segments = await vad.segmentAudio(audioFile);

for (const segment of segments) {
  if (segment.isSpeech) {
    const result = await stt.transcribeSamples(
      segment.samples,
      segment.sampleRate
    );
    console.log(result.text);
  }
}

3. Wake Word Detection

Trigger STT only when speech is detected:
// Planned API
const stream = await createAudioStream();

stream.on('data', async (samples) => {
  const isSpeech = await vad.detectSpeech(samples);
  
  if (isSpeech) {
    // Start transcription
    await sttStream.acceptWaveform(samples, 16000);
  }
});

Planned Configuration

// Expected configuration options
interface VADConfig {
  modelPath: ModelPathConfig;
  sampleRate: number;        // 8000, 16000 (default), 32000, 48000
  windowSize: number;        // Samples per window (e.g., 512, 1024)
  threshold: number;         // Speech confidence threshold (0..1)
  minSpeechDuration: number; // Minimum speech length (ms)
  minSilenceDuration: number; // Minimum silence to split (ms)
}
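As a sketch of how these options might fit together, here is an illustrative 16 kHz configuration. The types are defined locally so the snippet stands alone; `ModelPathConfig` and the values chosen (threshold 0.5, 512-sample windows) are assumptions, not finalized defaults.

```typescript
// Illustrative only: these types mirror the planned interface above and are
// declared locally so the example compiles before v0.7.0 ships.
type ModelPathConfig = { type: 'asset' | 'file'; path: string };

interface VADConfig {
  modelPath: ModelPathConfig;
  sampleRate: number;         // 8000, 16000 (default), 32000, 48000
  windowSize: number;         // samples per window (e.g., 512, 1024)
  threshold: number;          // speech confidence threshold (0..1)
  minSpeechDuration: number;  // minimum speech length (ms)
  minSilenceDuration: number; // minimum silence to split (ms)
}

// A typical 16 kHz setup: 512-sample windows (32 ms each), with segments
// shorter than 250 ms treated as noise rather than speech.
const config: VADConfig = {
  modelPath: { type: 'asset', path: 'models/silero-vad' },
  sampleRate: 16000,
  windowSize: 512,
  threshold: 0.5,
  minSpeechDuration: 250,
  minSilenceDuration: 500,
};
```

At 16 kHz, the window size directly sets detection granularity: 512 samples is 32 ms, so shorter windows react faster but see less context per decision.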

Expected Models

Models likely to be supported:
  • Silero VAD - Lightweight, efficient, ONNX-based
  • WebRTC VAD - Classic algorithm
  • Custom models - Via sherpa-onnx framework

Timeline

VAD support is planned for:
  1. Version 0.7.0 - Initial VAD implementation with basic detection
  2. Future versions - Advanced features like adaptive thresholds and multi-language support

Stay Updated

To track progress or contribute, watch the project repository for updates.

Current Workarounds

While VAD is not available, you can:
  1. Use streaming STT with endpoint detection - The streaming STT API already includes basic endpoint detection
  2. External libraries - Use JavaScript audio analysis libraries
  3. Manual silence detection - Implement simple amplitude-based detection

Simple Amplitude Detection

function detectSilence(samples: number[], threshold: number = 0.01): boolean {
  const rms = Math.sqrt(
    samples.reduce((sum, val) => sum + val * val, 0) / samples.length
  );
  return rms < threshold;
}

// Usage
const samples = getPcmSamples();
const isSilent = detectSilence(samples);

if (!isSilent) {
  // Process audio
}

Streaming STT

Real-time transcription with endpoint detection

Speech Enhancement

Noise reduction (coming in v0.5.0)
