The VadOptions interface defines the parameters that configure voice activity detection (VAD) when using the Silero VAD model.
Interface
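The shape of the interface can be sketched as follows. minSpeechDurationMs, maxSpeechDurationS and speechPadMs are named elsewhere on this page. The names threshold, minSilenceDurationMs and samplesOverlap are assumptions for the remaining properties. Marking every field optional is also an assumption, so check the library's own type definitions.

```typescript
// Sketch of the VadOptions shape. The names threshold, minSilenceDurationMs
// and samplesOverlap are assumed; optionality of every field is assumed too.
interface VadOptions {
  /** Speech probability threshold, 0.0 to 1.0. */
  threshold?: number;
  /** Minimum length of a valid speech segment, in milliseconds. */
  minSpeechDurationMs?: number;
  /** Minimum silence that ends a speech segment, in milliseconds. */
  minSilenceDurationMs?: number;
  /** Maximum segment length before a forced split, in seconds. */
  maxSpeechDurationS?: number;
  /** Padding added before and after each segment, in milliseconds. */
  speechPadMs?: number;
  /** Overlap used when copying segment samples, in seconds. */
  samplesOverlap?: number;
}

// With every field optional (as assumed here), a partial object type-checks.
const options: VadOptions = { threshold: 0.6, speechPadMs: 50 };
console.log(options);
```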
Properties
Probability threshold to consider audio as speech.
- Range: 0.0 to 1.0
- Higher values = more conservative (fewer false positives)
- Lower values = more sensitive (may detect more speech)
- Recommended: 0.4 to 0.7
Minimum duration (in milliseconds) of a valid speech segment. Segments shorter than this are filtered out as noise.
Minimum silence duration (in milliseconds) required to end a speech segment. Pauses shorter than this do not split a segment.
Maximum duration (in seconds) of a speech segment before a new segment is forced. Long continuous speech is split at this duration to avoid oversized segments.
Padding (in milliseconds) added before and after each detected speech segment. This helps capture the start and end of speech whose probability may sit near the detection threshold.
Overlap (in seconds) applied when copying audio samples from speech segments. Used internally to preserve continuity between processed audio chunks.
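The next sketch shows how these properties interact. It is a simplified, hypothetical post-processing pass that turns per-frame speech probabilities into segments. It is not the Silero VAD or whisper.cpp implementation. The frame-probability input, the helper names (framesToSegments, SegmentOptions, SegmentMs) and the property names beyond those confirmed above are assumptions. samplesOverlap is not modeled.

```typescript
// Simplified, hypothetical illustration of how VAD options shape segments.
// Not the Silero VAD / whisper.cpp algorithm; samplesOverlap is not modeled.
interface SegmentOptions {
  threshold: number;            // a frame is speech if probability >= threshold
  minSpeechDurationMs: number;  // drop segments shorter than this
  minSilenceDurationMs: number; // silence needed to close a segment
  maxSpeechDurationS: number;   // force a split after this many seconds
  speechPadMs: number;          // padding added to both ends
}

interface SegmentMs {
  start: number; // milliseconds
  end: number;   // milliseconds
}

function framesToSegments(probs: number[], frameMs: number, o: SegmentOptions): SegmentMs[] {
  const maxSpeechMs = o.maxSpeechDurationS * 1000;
  const totalMs = probs.length * frameMs;
  const raw: SegmentMs[] = [];
  let start: number | null = null;        // start of the open segment
  let silenceStart: number | null = null; // start of the current pause

  for (let i = 0; i < probs.length; i++) {
    const t = i * frameMs;
    if (probs[i] >= o.threshold) {
      silenceStart = null; // the pause (if any) was too short to end the segment
      if (start === null) {
        start = t;
      } else if (t + frameMs - start > maxSpeechMs) {
        raw.push({ start, end: t }); // segment would exceed the cap: split here
        start = t;
      }
    } else if (start !== null) {
      if (silenceStart === null) silenceStart = t;
      if (t + frameMs - silenceStart >= o.minSilenceDurationMs) {
        raw.push({ start, end: silenceStart }); // pause long enough: close segment
        start = null;
        silenceStart = null;
      }
    }
  }
  if (start !== null) raw.push({ start, end: silenceStart ?? totalMs });

  // Filter short segments first, then pad (clamped to the audio bounds).
  return raw
    .filter((s) => s.end - s.start >= o.minSpeechDurationMs)
    .map((s) => ({
      start: Math.max(0, s.start - o.speechPadMs),
      end: Math.min(totalMs, s.end + o.speechPadMs),
    }));
}
```

In this sketch, with 100 ms frames and a minSilenceDurationMs of 200, a single low-probability frame does not split a segment. Two consecutive low frames do.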
Usage Examples
Default Settings
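The following sketch shows how defaults might combine with caller overrides. It assumes the library fills any omitted field with its own default. The numbers in PLACEHOLDER_DEFAULTS are hypothetical placeholders for illustration and are not the library's documented defaults. resolveVadOptions is also hypothetical.

```typescript
// Local stand-in for VadOptions (field names partly assumed).
interface VadOptions {
  threshold?: number;
  minSpeechDurationMs?: number;
  minSilenceDurationMs?: number;
  maxSpeechDurationS?: number;
  speechPadMs?: number;
  samplesOverlap?: number;
}

// Hypothetical placeholder defaults, for illustration only.
const PLACEHOLDER_DEFAULTS: Required<VadOptions> = {
  threshold: 0.5,
  minSpeechDurationMs: 250,
  minSilenceDurationMs: 100,
  maxSpeechDurationS: 30,
  speechPadMs: 30,
  samplesOverlap: 0.1,
};

// Fill every field the caller left undefined with the placeholder default.
function resolveVadOptions(overrides: VadOptions = {}): Required<VadOptions> {
  const resolved = { ...PLACEHOLDER_DEFAULTS };
  for (const key of Object.keys(resolved) as (keyof VadOptions)[]) {
    const value = overrides[key];
    if (value !== undefined) resolved[key] = value;
  }
  return resolved;
}

console.log(resolveVadOptions({ threshold: 0.6 }));
```

The loop is used instead of `{ ...defaults, ...overrides }` because a plain spread would let an explicit `undefined` in the overrides erase a default.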
Custom Configuration
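A custom configuration might look like the object below. The values are illustrative. The validateVadOptions helper is hypothetical. It checks the range stated on this page (threshold between 0.0 and 1.0) and also assumes that durations must not be negative. The library itself may enforce different rules.

```typescript
// Local stand-in for VadOptions (field names partly assumed).
interface VadOptions {
  threshold?: number;
  minSpeechDurationMs?: number;
  minSilenceDurationMs?: number;
  maxSpeechDurationS?: number;
  speechPadMs?: number;
  samplesOverlap?: number;
}

// Hypothetical checks based on the ranges described above.
function validateVadOptions(o: VadOptions): string[] {
  const errors: string[] = [];
  if (o.threshold !== undefined && (o.threshold < 0 || o.threshold > 1)) {
    errors.push("threshold must be between 0.0 and 1.0");
  }
  const nonNegative: (keyof VadOptions)[] = [
    "minSpeechDurationMs", "minSilenceDurationMs", "maxSpeechDurationS", "speechPadMs", "samplesOverlap",
  ];
  for (const key of nonNegative) {
    const v = o[key];
    if (v !== undefined && v < 0) errors.push(`${key} must not be negative`);
  }
  return errors;
}

// Illustrative custom values: stricter threshold, longer minimum segment.
const customOptions: VadOptions = {
  threshold: 0.6,
  minSpeechDurationMs: 300,
  minSilenceDurationMs: 200,
  maxSpeechDurationS: 20,
  speechPadMs: 50,
};

console.log(validateVadOptions(customOptions)); // []
```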
Sensitive Detection
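A sensitive configuration can follow the guidance above. It lowers the threshold, accepts shorter segments and pads more generously so that faint onsets are kept. The values are illustrative assumptions, not library presets. A threshold of 0.3 deliberately sits below the recommended 0.4 to 0.7 band, which trades more false positives for fewer missed words.

```typescript
// Local stand-in for VadOptions (field names partly assumed).
type VadOptions = Partial<Record<
  "threshold" | "minSpeechDurationMs" | "minSilenceDurationMs" | "maxSpeechDurationS" | "speechPadMs",
  number
>>;

// Illustrative sensitive settings (values are assumptions).
const sensitiveOptions: VadOptions = {
  threshold: 0.3,           // lower = more speech detected, more false positives
  minSpeechDurationMs: 100, // keep short utterances instead of discarding them
  speechPadMs: 100,         // extra context around faint starts and ends
};

console.log(sensitiveOptions);
```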
Preset Configurations
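Named presets can be kept in a small lookup table, as the sketch below shows. The preset names, their values and the getVadPreset helper are hypothetical and do not describe presets shipped with the library. The helper returns a fresh object so callers can adjust it without mutating the shared preset.

```typescript
// Local stand-in for VadOptions (field names partly assumed).
type VadOptions = Partial<Record<
  "threshold" | "minSpeechDurationMs" | "minSilenceDurationMs" | "maxSpeechDurationS" | "speechPadMs",
  number
>>;

type VadPresetName = "balanced" | "sensitive" | "conservative";

// Hypothetical preset table; names and values are illustrative only.
const VAD_PRESETS: Record<VadPresetName, Readonly<VadOptions>> = {
  balanced: { threshold: 0.5, minSpeechDurationMs: 250, minSilenceDurationMs: 100, speechPadMs: 30 },
  sensitive: { threshold: 0.3, minSpeechDurationMs: 100, minSilenceDurationMs: 100, speechPadMs: 100 },
  conservative: { threshold: 0.7, minSpeechDurationMs: 500, minSilenceDurationMs: 300, speechPadMs: 30 },
};

// Copy a preset and apply per-call overrides on top of it.
function getVadPreset(name: VadPresetName, overrides: VadOptions = {}): VadOptions {
  return { ...VAD_PRESETS[name], ...overrides };
}

console.log(getVadPreset("conservative", { speechPadMs: 80 }));
```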
VadSegment Return Type
The detectSpeech and detectSpeechData methods return an array of VadSegment objects:
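A sketch of the segment shape follows. It assumes each segment carries a start time t0 and an end time t1, both in seconds. The field names and the unit are assumptions to verify against the library's type definitions. segmentDuration is a hypothetical helper.

```typescript
// Assumed shape of a VadSegment: start (t0) and end (t1), in seconds.
interface VadSegment {
  t0: number;
  t1: number;
}

// Hypothetical helper: length of a segment in seconds.
function segmentDuration(segment: VadSegment): number {
  return segment.t1 - segment.t0;
}

console.log(segmentDuration({ t0: 1.5, t1: 3 })); // 1.5
```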
Example
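The example below processes a returned segment array by printing each segment and totalling the detected speech time. It uses hard-coded sample segments in place of real detectSpeech or detectSpeechData output. The t0/t1 field names and the seconds unit are assumptions. summarizeSegments is a hypothetical helper.

```typescript
// Assumed shape of a VadSegment: start (t0) and end (t1), in seconds.
interface VadSegment {
  t0: number;
  t1: number;
}

// Format each segment and total the detected speech time.
function summarizeSegments(segments: VadSegment[]): { lines: string[]; totalSpeechS: number } {
  const lines = segments.map(
    (s, i) => `Segment ${i + 1}: ${s.t0.toFixed(2)}s - ${s.t1.toFixed(2)}s`,
  );
  const totalSpeechS = segments.reduce((sum, s) => sum + (s.t1 - s.t0), 0);
  return { lines, totalSpeechS };
}

// Hard-coded sample segments standing in for detectSpeech output.
const sample: VadSegment[] = [
  { t0: 0.5, t1: 2.25 },
  { t0: 3, t1: 4 },
];
const summary = summarizeSegments(sample);
summary.lines.forEach((line) => console.log(line));
console.log(`Total speech: ${summary.totalSpeechS.toFixed(2)}s`);
// Segment 1: 0.50s - 2.25s
// Segment 2: 3.00s - 4.00s
// Total speech: 2.75s
```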
Tuning Guidelines
For Noisy Environments
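For noisy audio, the property descriptions above point toward a higher threshold (fewer false positives) and a longer minimum speech duration (so short noise bursts are filtered out). The values in this sketch are illustrative assumptions to tune against your own recordings.

```typescript
// Local stand-in for VadOptions (field names partly assumed).
type VadOptions = Partial<Record<
  "threshold" | "minSpeechDurationMs" | "minSilenceDurationMs" | "maxSpeechDurationS" | "speechPadMs",
  number
>>;

// Illustrative values for noisy environments.
const noisyOptions: VadOptions = {
  threshold: 0.7,            // top of the recommended 0.4 to 0.7 range
  minSpeechDurationMs: 500,  // discard brief noise bursts
  minSilenceDurationMs: 300, // require a clearer pause before ending a segment
  speechPadMs: 30,           // small padding so less background noise is included
};

console.log(noisyOptions);
```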
For Quiet, Clear Speech
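With quiet, clean recordings, background noise is less likely to trigger detection. That makes a lower threshold and a shorter minimum segment reasonable starting points. These values are illustrative assumptions, not recommendations from the library.

```typescript
// Local stand-in for VadOptions (field names partly assumed).
type VadOptions = Partial<Record<
  "threshold" | "minSpeechDurationMs" | "minSilenceDurationMs" | "maxSpeechDurationS" | "speechPadMs",
  number
>>;

// Illustrative values for quiet, clear speech.
const clearSpeechOptions: VadOptions = {
  threshold: 0.4,            // bottom of the recommended range: more sensitive
  minSpeechDurationMs: 150,  // short words are less likely to be noise here
  minSilenceDurationMs: 100, // split on brief pauses
  speechPadMs: 30,
};

console.log(clearSpeechOptions);
```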
For Continuous Speech (Lectures, Podcasts)
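For lectures and podcasts, the main levers are the minimum silence duration and the maximum segment length. A longer minimum silence keeps breaths and short pauses from splitting segments. A cap on segment length keeps long stretches of speech manageable. The specific values are illustrative assumptions.

```typescript
// Local stand-in for VadOptions (field names partly assumed).
type VadOptions = Partial<Record<
  "threshold" | "minSpeechDurationMs" | "minSilenceDurationMs" | "maxSpeechDurationS" | "speechPadMs",
  number
>>;

// Illustrative values for continuous speech.
const continuousOptions: VadOptions = {
  threshold: 0.5,
  minSpeechDurationMs: 250,
  minSilenceDurationMs: 800, // tolerate breaths and short pauses without splitting
  maxSpeechDurationS: 30,    // cap segment length during long monologues
  speechPadMs: 100,          // keep sentence edges that trail off quietly
};

console.log(continuousOptions);
```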
For Command Words (Short Utterances)
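Short commands need a small minimum speech duration so that they survive the noise filter. A short minimum silence lets the segment end soon after the word. The values in this sketch are illustrative assumptions.

```typescript
// Local stand-in for VadOptions (field names partly assumed).
type VadOptions = Partial<Record<
  "threshold" | "minSpeechDurationMs" | "minSilenceDurationMs" | "maxSpeechDurationS" | "speechPadMs",
  number
>>;

// Illustrative values for short command words.
const commandOptions: VadOptions = {
  threshold: 0.5,
  minSpeechDurationMs: 100,  // keep one-word commands such as "stop"
  minSilenceDurationMs: 150, // end the segment soon after the command
  maxSpeechDurationS: 5,     // commands are short; split anything longer
  speechPadMs: 50,           // avoid clipping the start and end of the word
};

console.log(commandOptions);
```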
Performance Considerations
- Lower threshold: More segments detected, more processing time
- Higher minSpeechDurationMs: Fewer segments, faster processing
- speechPadMs: Adds to segment duration, increases data to process
- maxSpeechDurationS: Limits segment size, helps memory management
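The effect of speechPadMs and maxSpeechDurationS on workload can be estimated roughly. The sketch assumes segments with t0/t1 fields in seconds and hypothetical helper names. Padding is counted on both ends of every segment. This is an upper bound, because real padding may be clamped at the audio boundaries or overlapping padded segments may merge. The split count assumes splits land exactly at the cap.

```typescript
// Assumed shape of a VadSegment: start (t0) and end (t1), in seconds.
interface VadSegment {
  t0: number;
  t1: number;
}

// Upper-bound estimate of audio (seconds) passed downstream after padding
// is added to both ends of every segment.
function paddedAudioSeconds(segments: VadSegment[], speechPadMs: number): number {
  const padS = (2 * speechPadMs) / 1000;
  return segments.reduce((sum, s) => sum + (s.t1 - s.t0) + padS, 0);
}

// Number of pieces a continuous stretch of speech is cut into when each
// segment is capped at maxSpeechDurationS seconds.
function piecesAfterSplit(speechS: number, maxSpeechDurationS: number): number {
  return Math.ceil(speechS / maxSpeechDurationS);
}

console.log(paddedAudioSeconds([{ t0: 1, t1: 2 }, { t0: 5, t1: 6 }], 250)); // 3
console.log(piecesAfterSplit(95, 30)); // 4
```

With 250 ms of padding, two one-second segments grow to three seconds of audio to process. That is a 50% increase, which illustrates why speechPadMs adds processing cost.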
Related
- WhisperVadContext.detectSpeech - Detect speech in files
- WhisperVadContext.detectSpeechData - Detect speech in raw data
- initWhisperVad - Initialize VAD context