> **Note:** This feature is planned for version 0.7.0 and is not yet available in the current release.
## Overview
Voice Activity Detection (VAD) will enable real-time detection of speech vs. silence in audio streams. This is essential for:
- Automatic silence removal in recordings
- Speech segmentation before transcription
- Reducing unnecessary processing during silent periods
- Triggering speech recognition only when needed
## Planned Features

- **Real-time Detection**: Detect voice activity as audio streams in
- **Low Latency**: Minimal processing delay for responsive apps
- **Silence Removal**: Automatically skip non-speech segments
- **Speech Segmentation**: Split audio into speech and non-speech regions
## Expected API (Preview)

The API is not finalized, but the interface is expected to look like this:
```typescript
import { createVAD } from 'react-native-sherpa-onnx/vad';

// Create VAD engine
const vad = await createVAD({
  modelPath: { type: 'asset', path: 'models/silero-vad' },
  sampleRate: 16000,
  windowSize: 512,
});

// Process audio chunks
const isSpeech = await vad.detectSpeech(samples);
if (isSpeech) {
  // Forward to STT or other processing
  processAudio(samples);
}

// Cleanup
await vad.destroy();
```
## Use Cases

### 1. Efficient Recording

Only save or process audio segments containing speech:
```typescript
// Planned API
const recorder = startRecording();
recorder.on('chunk', async (samples) => {
  const isSpeech = await vad.detectSpeech(samples);
  if (isSpeech) {
    // Only process speech segments
    await processAudioChunk(samples);
  }
});
```
### 2. Pre-processing for STT

Segment continuous audio before transcription:
```typescript
// Planned API
const segments = await vad.segmentAudio(audioFile);
for (const segment of segments) {
  if (segment.isSpeech) {
    const result = await stt.transcribeSamples(
      segment.samples,
      segment.sampleRate
    );
    console.log(result.text);
  }
}
```
### 3. Wake Word Detection

Trigger STT only when speech is detected:
```typescript
// Planned API
const stream = await createAudioStream();
stream.on('data', async (samples) => {
  const isSpeech = await vad.detectSpeech(samples);
  if (isSpeech) {
    // Start transcription
    await sttStream.acceptWaveform(samples, 16000);
  }
});
```
## Planned Configuration
```typescript
// Expected configuration options
interface VADConfig {
  modelPath: ModelPathConfig;
  sampleRate: number;         // 8000, 16000 (default), 32000, 48000
  windowSize: number;         // Samples per window (e.g., 512, 1024)
  threshold: number;          // Speech confidence threshold (0..1)
  minSpeechDuration: number;  // Minimum speech length (ms)
  minSilenceDuration: number; // Minimum silence to split (ms)
}
```
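Since these options have defined ranges, it can be useful to validate a config object before handing it to the engine. Below is a hypothetical sketch of such a check; the field names mirror the expected interface above, and the accepted sample rates come from its comments, but all of it may change once the API is finalized:

```typescript
// Hypothetical sketch: validate a draft VAD config before engine creation.
// Field names and accepted sample rates mirror the preview interface above.
interface VADConfigDraft {
  sampleRate: number;
  windowSize: number;
  threshold: number;
}

const SUPPORTED_RATES = [8000, 16000, 32000, 48000];

function validateVADConfig(cfg: VADConfigDraft): string[] {
  const errors: string[] = [];
  if (!SUPPORTED_RATES.includes(cfg.sampleRate)) {
    errors.push(`unsupported sampleRate: ${cfg.sampleRate}`);
  }
  if (cfg.windowSize <= 0) {
    errors.push('windowSize must be a positive sample count');
  }
  if (cfg.threshold < 0 || cfg.threshold > 1) {
    errors.push('threshold must be between 0 and 1');
  }
  return errors;
}
```

Returning a list of errors (rather than throwing on the first one) makes it easy to surface all configuration problems at once during development.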
## Expected Models

Likely model support:

- **Silero VAD**: Lightweight, efficient, ONNX-based
- **WebRTC VAD**: Classic algorithm
- **Custom models**: Via the sherpa-onnx framework
## Timeline

VAD support is planned for:

- **Version 0.7.0**: Initial VAD implementation with basic detection
- **Future versions**: Advanced features such as adaptive thresholds and multi-language support
## Current Workarounds

While VAD is not yet available, you can:

- **Streaming STT with endpoint detection**: The streaming STT API already includes basic endpoint detection
- **External libraries**: Use JavaScript audio-analysis libraries
- **Manual silence detection**: Implement simple amplitude-based detection
### Simple Amplitude Detection
```typescript
function detectSilence(samples: number[], threshold: number = 0.01): boolean {
  // Root-mean-square (RMS) energy of the frame
  const rms = Math.sqrt(
    samples.reduce((sum, val) => sum + val * val, 0) / samples.length
  );
  return rms < threshold;
}

// Usage
const samples = getPcmSamples(); // your PCM audio source
const isSilent = detectSilence(samples);
if (!isSilent) {
  // Process audio
}
```
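A frame-by-frame amplitude check like this can clip word endings, because energy often dips briefly mid-utterance. A common refinement is a short "hangover": keep emitting audio for a few frames after the last speech frame. A minimal sketch, using the same RMS helper (repeated so the block is self-contained) with illustrative frame and hangover sizes, none of which come from the library:

```typescript
// RMS-based silence check, as above
function detectSilence(samples: number[], threshold: number = 0.01): boolean {
  const rms = Math.sqrt(
    samples.reduce((sum, val) => sum + val * val, 0) / samples.length
  );
  return rms < threshold;
}

// Hypothetical sketch: drop silent frames, but keep a few trailing frames
// ("hangover") after speech ends to avoid clipping word endings.
function trimSilence(
  samples: number[],
  frameSize: number = 512,
  hangoverFrames: number = 3
): number[] {
  const out: number[] = [];
  let hangover = 0;
  for (let i = 0; i < samples.length; i += frameSize) {
    const frame = samples.slice(i, i + frameSize);
    if (!detectSilence(frame)) {
      hangover = hangoverFrames; // speech detected: reset the hangover
    }
    if (hangover > 0) {
      out.push(...frame);
      if (detectSilence(frame)) hangover--; // silent frame inside hangover
    }
  }
  return out;
}
```

Tuning `hangoverFrames` trades a little retained silence for fewer chopped syllables; a symmetric "look-ahead" before speech onset can be added the same way with a small buffer.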
## Related

- **Streaming STT**: Real-time transcription with endpoint detection
- **Speech Enhancement**: Noise reduction (coming in v0.5.0)