Overview
The NativeVad class provides Voice Activity Detection (VAD) for analyzing audio frames and detecting speech activity. It’s useful for building voice-activated features, speech detection systems, and audio bots that need to identify when someone is speaking.
VAD is commonly used in conversational AI applications to determine when a user has finished speaking, enabling natural turn-taking in voice interactions.
Creation
Create a NativeVad instance using the static method Daily.create_native_vad():
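A minimal sketch of creating an instance. The keyword argument names (`reset_period_ms`, `sample_rate`, `channels`) follow the parameter descriptions below but should be checked against the SDK reference; the `try/except` stub only exists so the sketch runs where the SDK is not installed:

```python
try:
    from daily import Daily
except ImportError:
    # Stand-in so this sketch runs without the daily-python SDK installed.
    class _StubVad:
        def analyze_frames(self, frames):
            return 0.0

    class Daily:
        @staticmethod
        def init():
            pass

        @staticmethod
        def create_native_vad(reset_period_ms, sample_rate, channels):
            return _StubVad()

Daily.init()  # initialize the SDK once per process

vad = Daily.create_native_vad(
    reset_period_ms=1000,  # reset internal state after 1 s
    sample_rate=16000,     # must match the audio you will analyze
    channels=1,            # mono
)
```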
Parameters
The period in milliseconds after which the VAD resets its internal state.
The audio sample rate in Hz. Must match the sample rate of the audio frames being analyzed. Common values are 8000, 16000, 24000, or 48000.
The number of audio channels. Use 1 for mono, 2 for stereo.
Properties
The configured reset period in milliseconds.
The configured audio sample rate in Hz.
The configured number of audio channels.
Methods
analyze_frames()
Parameters
Raw audio frame data as bytes. The frame should match the configured sample rate and number of channels.
Returns
A confidence score between 0.0 and 1.0 indicating the likelihood of speech being present in the audio frame. Higher values indicate greater confidence that speech is detected.
- 0.0 - 0.5: Likely silence or background noise
- 0.5 - 0.8: Possible speech or ambiguous audio
- 0.8 - 1.0: High confidence speech detected
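The bands above can be captured in a small helper. The function name and labels here are illustrative, not part of the SDK:

```python
def classify_confidence(confidence: float) -> str:
    """Map an analyze_frames() score to the rough bands described above."""
    if confidence >= 0.8:
        return "speech"
    if confidence >= 0.5:
        return "ambiguous"
    return "silence/noise"

# In real code the score would come from: confidence = vad.analyze_frames(frame)
print(classify_confidence(0.92))  # speech
print(classify_confidence(0.30))  # silence/noise
```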
Usage Example
Here’s a complete example demonstrating speech detection using NativeVad:
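Since a live run needs real audio and the SDK, the self-contained sketch below substitutes an energy-based fake for NativeVad and synthetic frames (silence, a tone, silence) for microphone input; only the detection loop at the bottom reflects real usage:

```python
import math
import struct

SAMPLE_RATE = 16000
FRAME_SAMPLES = 160  # 10 ms of mono 16-bit audio

class FakeVad:
    """Energy-based stand-in for NativeVad so this example runs offline."""
    def analyze_frames(self, frames: bytes) -> float:
        samples = struct.unpack(f"<{len(frames) // 2}h", frames)
        rms = math.sqrt(sum(s * s for s in samples) / len(samples))
        return min(1.0, rms / 3000.0)  # crude energy -> confidence mapping

def tone_frame(freq: int = 440, amplitude: int = 10000) -> bytes:
    """One frame of a sine tone, standing in for speech audio."""
    return struct.pack(
        f"<{FRAME_SAMPLES}h",
        *(int(amplitude * math.sin(2 * math.pi * freq * i / SAMPLE_RATE))
          for i in range(FRAME_SAMPLES)),
    )

silence = b"\x00\x00" * FRAME_SAMPLES
frames = [silence] * 5 + [tone_frame()] * 5 + [silence] * 5

vad = FakeVad()  # in real code: vad = Daily.create_native_vad(...)
speaking = False
events = []
for i, frame in enumerate(frames):
    confidence = vad.analyze_frames(frame)
    if confidence >= 0.8 and not speaking:
        speaking = True
        events.append(("start", i))
    elif confidence < 0.5 and speaking:
        speaking = False
        events.append(("stop", i))

print(events)  # speech begins at frame 5 and ends at frame 10
```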
Advanced Example
For a more sophisticated implementation with configurable thresholds and state management, see the native_vad.py demo in the Daily Python SDK repository. The demo includes:

- Configurable speech and silence thresholds
- Time-based state transitions for more accurate detection
- Command-line arguments for tuning VAD parameters
- Integration with Daily’s virtual speaker device
Common Use Cases
Voice Bot Turn-Taking
Use VAD to detect when a user has finished speaking, allowing your bot to respond at natural breaks in conversation:
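A sketch of end-of-turn detection driven by per-frame confidence scores. The class name, thresholds, and frame counts are illustrative (30 consecutive low-confidence 10 ms frames ≈ 300 ms of silence); in practice each score would come from vad.analyze_frames():

```python
class TurnDetector:
    """Flag the end of a user's turn after sustained silence follows speech."""

    def __init__(self, speech_threshold=0.8, silence_threshold=0.5,
                 end_silence_frames=30):
        self.speech_threshold = speech_threshold
        self.silence_threshold = silence_threshold
        self.end_silence_frames = end_silence_frames
        self.in_turn = False
        self.silent_frames = 0

    def on_confidence(self, confidence: float) -> bool:
        """Feed one VAD score; returns True when the user's turn just ended."""
        if confidence >= self.speech_threshold:
            self.in_turn = True
            self.silent_frames = 0
        elif self.in_turn and confidence < self.silence_threshold:
            self.silent_frames += 1
            if self.silent_frames >= self.end_silence_frames:
                self.in_turn = False
                self.silent_frames = 0
                return True
        return False

detector = TurnDetector()
scores = [0.9] * 10 + [0.1] * 40  # speech, then silence
turn_ends = [i for i, s in enumerate(scores) if detector.on_confidence(s)]
print(turn_ends)  # the turn ends 30 silent frames after speech stops
```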
Audio Recording Optimization
Start and stop recording based on voice activity to save storage and processing:
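One way to sketch this is a voice-gated recorder that keeps a short pre-roll buffer so the onset of speech is not clipped, and stops after a trailing-silence "hangover". All names and thresholds are illustrative; scores would come from vad.analyze_frames():

```python
from collections import deque

class VoiceGatedRecorder:
    """Keep audio only around speech, with a short pre-roll buffer."""

    def __init__(self, start_threshold=0.8, stop_threshold=0.5,
                 preroll_frames=5, hangover_frames=10):
        self.start_threshold = start_threshold
        self.stop_threshold = stop_threshold
        self.hangover_frames = hangover_frames
        self.preroll = deque(maxlen=preroll_frames)
        self.recording = False
        self.silent = 0
        self.current = []
        self.utterances = []  # completed recordings, one bytes object each

    def on_frame(self, frame: bytes, confidence: float) -> None:
        if self.recording:
            self.current.append(frame)
            if confidence < self.stop_threshold:
                self.silent += 1
                if self.silent >= self.hangover_frames:
                    # Enough trailing silence: close out this utterance.
                    self.utterances.append(b"".join(self.current))
                    self.current = []
                    self.recording = False
            else:
                self.silent = 0
        else:
            self.preroll.append(frame)
            if confidence >= self.start_threshold:
                self.current = list(self.preroll)  # include the pre-roll
                self.preroll.clear()
                self.recording = True
                self.silent = 0

SILENCE, SPEECH = b"\x00\x00", b"\x7f\x7f"
rec = VoiceGatedRecorder()
for frame, score in [(SILENCE, 0.1)] * 5 + [(SPEECH, 0.9)] * 10 + [(SILENCE, 0.1)] * 12:
    rec.on_frame(frame, score)
print(len(rec.utterances))  # one utterance captured
```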
Transcription Triggering
Only send audio to transcription services when speech is detected:
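A sketch of gating audio before transcription. The `transcribe` function here is a hypothetical stand-in for your transcription client, and in real code each frame's confidence would come from vad.analyze_frames():

```python
def gate_for_transcription(scored_frames, threshold=0.8):
    """Yield only frames whose VAD score indicates speech."""
    for frame, confidence in scored_frames:
        if confidence >= threshold:
            yield frame

def transcribe(audio: bytes) -> str:
    # Hypothetical stand-in for a real transcription client.
    return f"<transcribed {len(audio)} bytes>"

stream = [
    (b"\x00\x00", 0.10),  # silence: dropped
    (b"\x10\x10", 0.90),  # speech: kept
    (b"\x11\x11", 0.95),  # speech: kept
    (b"\x00\x00", 0.20),  # silence: dropped
]
speech_audio = b"".join(gate_for_transcription(stream))
print(transcribe(speech_audio))  # <transcribed 4 bytes>
```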
Tips
The NativeVad class analyzes each frame independently. For production applications, combine VAD confidence scores with time-based thresholds to avoid rapid state changes from brief noise or pauses.

Related
- VirtualSpeakerDevice - Read audio from meetings
- Audio Processing Guide - Complete guide to audio handling
- VAD Demo - Full example implementation