The useVAD hook provides Voice Activity Detection (VAD) to identify when speech is present in audio streams. It returns precise start and end timestamps for each speech segment, making it essential for voice-activated features and efficient audio processing.

Basic Usage

import { useVAD } from 'react-native-executorch';
import { View, Text, Button } from 'react-native';

// audioBuffer is 16kHz mono audio provided by your recording pipeline
function VoiceDetector({ audioBuffer }: { audioBuffer: Float32Array }) {
  const { forward, isReady, error } = useVAD({
    model: {
      modelSource: require('./models/vad-model.pte'),
    },
  });

  const detectSpeech = async () => {
    if (!isReady) return;

    const segments = await forward(audioBuffer);

    console.log(`Found ${segments.length} speech segments`);
    segments.forEach((segment) => {
      console.log(`Speech from ${segment.start}s to ${segment.end}s`);
      console.log(`Duration: ${segment.end - segment.start}s`);
    });
  };

  return (
    <View>
      {error && <Text>Error: {error.message}</Text>}
      <Button onPress={detectSpeech} title="Detect Speech" disabled={!isReady} />
    </View>
  );
}

Hook Signature

useVAD(props)

function useVAD(props: VADProps): VADType;

Parameters

  • model (object, required): Model configuration object.
  • preventLoad (boolean, default: false): Prevents automatic model loading on mount. Useful for lazy loading scenarios; see the sketch below.
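
One way to use preventLoad is to defer the model download until the user actually needs voice detection. A minimal sketch, assuming loading begins once preventLoad flips from true to false:

import { useState } from 'react';
import { View, Text, Button } from 'react-native';
import { useVAD } from 'react-native-executorch';

// Sketch: defer the model download until the user opts in.
// Assumes loading starts when preventLoad switches from true to false.
function LazyVAD() {
  const [shouldLoad, setShouldLoad] = useState(false);
  const { isReady, downloadProgress } = useVAD({
    model: { modelSource: require('./models/vad-model.pte') },
    preventLoad: !shouldLoad,
  });

  return (
    <View>
      <Button
        title="Load VAD model"
        onPress={() => setShouldLoad(true)}
        disabled={shouldLoad}
      />
      {shouldLoad && !isReady && (
        <Text>Loading: {Math.round(downloadProgress * 100)}%</Text>
      )}
    </View>
  );
}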

Returns

  • error (RnExecutorchError | null): Contains error details if model loading or inference fails.
  • isReady (boolean): Indicates whether the model has loaded successfully and is ready for detection.
  • isGenerating (boolean): Indicates whether a detection is currently in progress.
  • downloadProgress (number): Download progress as a value between 0 and 1 (see the status example below).
  • forward ((waveform: Float32Array) => Promise<Segment[]>): Detects speech segments in the provided audio waveform and returns an array of segments with start and end times.
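
The status fields combine naturally into a small readiness indicator. A minimal sketch using only the fields documented above:

import { Text } from 'react-native';
import { useVAD } from 'react-native-executorch';

// Sketch: surface loading, error, and busy states from the hook's return values
function VADStatus() {
  const { error, isReady, isGenerating, downloadProgress } = useVAD({
    model: { modelSource: require('./models/vad-model.pte') },
  });

  if (error) return <Text>Error: {error.message}</Text>;
  if (!isReady) {
    return <Text>Downloading model: {Math.round(downloadProgress * 100)}%</Text>;
  }
  return <Text>{isGenerating ? 'Detecting speech...' : 'Ready'}</Text>;
}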

Detection Method

forward(waveform)

Detect speech activity in audio:
const { forward, isReady } = useVAD({ model });

// Audio must be 16kHz mono
const segments = await forward(audioBuffer);

segments.forEach((segment) => {
  console.log(`Speech: ${segment.start}s - ${segment.end}s`);
});

Types

Segment

Represents a detected speech segment:
interface Segment {
  start: number; // Start time in seconds
  end: number; // End time in seconds
}

VADProps

Configuration for the VAD hook:
interface VADProps {
  model: {
    modelSource: ResourceSource;
  };
  preventLoad?: boolean;
}

VADType

Return type of the useVAD hook:
interface VADType {
  error: RnExecutorchError | null;
  isReady: boolean;
  isGenerating: boolean;
  downloadProgress: number;
  forward(waveform: Float32Array): Promise<Segment[]>;
}

Audio Format Requirements

Audio must be in the correct format or detection will fail.
  • Sample rate: 16kHz (16,000 samples per second)
  • Channels: Mono (single channel)
  • Data type: Float32Array
  • Value range: -1.0 to 1.0 (normalized)
  • Buffer layout: Contiguous samples in time order
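
A quick pre-flight check against these requirements can catch format problems before inference. A minimal sketch (isValidVADInput is a hypothetical helper, not part of the library):

// Hypothetical helper: verify a waveform meets the requirements listed above
function isValidVADInput(waveform: Float32Array, minDurationSec = 1): boolean {
  const sampleRate = 16000;

  // Require at least minDurationSec of audio for reliable detection
  if (waveform.length < minDurationSec * sampleRate) return false;

  // Every sample must be a finite number within [-1.0, 1.0]
  for (let i = 0; i < waveform.length; i++) {
    const s = waveform[i];
    if (!Number.isFinite(s) || s < -1 || s > 1) return false;
  }

  return true;
}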

Converting Audio

Example of preparing audio for VAD:
// resampleAudio is an app-provided helper that returns an AudioBuffer at the target sample rate
function prepareAudioForVAD(audioBuffer: AudioBuffer): Float32Array {
  // Resample to 16kHz
  const targetSampleRate = 16000;
  const resampled = resampleAudio(audioBuffer, targetSampleRate);

  // Convert to mono by averaging channels
  let mono: Float32Array;
  if (resampled.numberOfChannels === 1) {
    mono = resampled.getChannelData(0);
  } else {
    const left = resampled.getChannelData(0);
    const right = resampled.getChannelData(1);
    mono = new Float32Array(left.length);
    for (let i = 0; i < left.length; i++) {
      mono[i] = (left[i] + right[i]) / 2;
    }
  }

  // Clamp samples to [-1.0, 1.0]
  const normalized = new Float32Array(mono.length);
  for (let i = 0; i < mono.length; i++) {
    normalized[i] = Math.max(-1, Math.min(1, mono[i]));
  }

  return normalized;
}

Advanced Usage

Processing Long Audio Files

For long recordings, process in chunks to manage memory:
const { forward } = useVAD({ model });

const detectInLongAudio = async (
  audioBuffer: Float32Array,
  chunkDuration: number = 30 // 30 seconds per chunk
) => {
  const sampleRate = 16000;
  const chunkSize = chunkDuration * sampleRate;
  const allSegments: Segment[] = [];

  for (let i = 0; i < audioBuffer.length; i += chunkSize) {
    const chunk = audioBuffer.slice(i, Math.min(i + chunkSize, audioBuffer.length));
    const segments = await forward(chunk);

    // Adjust timestamps to account for chunk offset
    const offset = i / sampleRate;
    const adjustedSegments = segments.map((seg) => ({
      start: seg.start + offset,
      end: seg.end + offset,
    }));

    allSegments.push(...adjustedSegments);
  }

  return allSegments;
};

Real-Time Stream Processing

Detect speech in live audio streams:
function LiveVAD() {
  const { forward, isReady } = useVAD({ model });
  const [activeSpeech, setActiveSpeech] = useState(false);
  const bufferRef = useRef<Float32Array[]>([]);

  const processAudioChunk = async (chunk: Float32Array) => {
    if (!isReady) return;

    // Accumulate chunks
    bufferRef.current.push(chunk);

    // Process every second of audio
    const totalSamples = bufferRef.current.reduce((sum, buf) => sum + buf.length, 0);
    if (totalSamples >= 16000) {
      // Concatenate buffers
      const combined = new Float32Array(totalSamples);
      let offset = 0;
      for (const buf of bufferRef.current) {
        combined.set(buf, offset);
        offset += buf.length;
      }

      // Detect speech
      const segments = await forward(combined);
      setActiveSpeech(segments.length > 0);

      // Clear buffer
      bufferRef.current = [];
    }
  };

  return (
    <View>
      <Text>Speech Active: {activeSpeech ? 'Yes' : 'No'}</Text>
    </View>
  );
}

Extracting Speech Segments

Extract only the speech portions from audio:
const extractSpeechSegments = async (
  audioBuffer: Float32Array,
  segments: Segment[]
): Promise<Float32Array[]> => {
  const sampleRate = 16000;
  const speechChunks: Float32Array[] = [];

  for (const segment of segments) {
    const startSample = Math.floor(segment.start * sampleRate);
    const endSample = Math.floor(segment.end * sampleRate);
    const chunk = audioBuffer.slice(startSample, endSample);
    speechChunks.push(chunk);
  }

  return speechChunks;
};

// Usage
const { forward } = useVAD({ model });
const segments = await forward(audioBuffer);
const speechOnly = await extractSpeechSegments(audioBuffer, segments);

// Process only speech segments
for (const speechChunk of speechOnly) {
  await transcribe(speechChunk);
}

Filtering Short Segments

Remove brief noise detections:
const filterShortSegments = (
  segments: Segment[],
  minDuration: number = 0.3 // 300ms minimum
): Segment[] => {
  return segments.filter((seg) => seg.end - seg.start >= minDuration);
};

// Usage
const { forward } = useVAD({ model });
const allSegments = await forward(audioBuffer);
const speechSegments = filterShortSegments(allSegments, 0.5); // Only segments >= 500ms

Merging Adjacent Segments

Combine segments with small gaps:
const mergeAdjacentSegments = (
  segments: Segment[],
  maxGap: number = 0.5 // 500ms maximum gap
): Segment[] => {
  if (segments.length === 0) return [];

  const merged: Segment[] = [];
  let current = { ...segments[0] };

  for (let i = 1; i < segments.length; i++) {
    const gap = segments[i].start - current.end;

    if (gap <= maxGap) {
      // Merge with current segment
      current.end = segments[i].end;
    } else {
      // Save current and start new segment
      merged.push(current);
      current = { ...segments[i] };
    }
  }

  merged.push(current);
  return merged;
};

// Usage
const segments = await forward(audioBuffer);
const cleanSegments = mergeAdjacentSegments(
  filterShortSegments(segments, 0.3),
  0.5
);

Integration Examples

VAD + Speech to Text

Optimize transcription by processing only speech:
import { useVAD, useSpeechToText } from 'react-native-executorch';

function SmartTranscription() {
  const vad = useVAD({ model: vadModel });
  const stt = useSpeechToText({ model: sttModel });

  const transcribeWithVAD = async (audioBuffer: Float32Array) => {
    // Detect speech segments
    const segments = await vad.forward(audioBuffer);
    console.log(`Found ${segments.length} speech segments`);

    // Extract and transcribe only speech portions
    const sampleRate = 16000;
    const transcriptions: string[] = [];

    for (const segment of segments) {
      const startSample = Math.floor(segment.start * sampleRate);
      const endSample = Math.floor(segment.end * sampleRate);
      const speechChunk = audioBuffer.slice(startSample, endSample);

      const result = await stt.transcribe(speechChunk);
      transcriptions.push(result.text);
    }

    return transcriptions.join(' ');
  };

  return <TranscriptionUI onTranscribe={transcribeWithVAD} />;
}

Voice Command Detection

Trigger actions when speech is detected:
function VoiceCommandListener() {
  const { forward, isReady } = useVAD({ model });
  const [listening, setListening] = useState(false);

  const startListening = async () => {
    setListening(true);

    // Continuously monitor audio
    const audioStream = await startAudioCapture();

    for await (const chunk of audioStream) {
      const segments = await forward(chunk);

      if (segments.length > 0) {
        // Speech detected - process command
        await handleVoiceCommand(chunk);
      }
    }
  };

  return (
    <Button
      onPress={startListening}
      title={listening ? 'Listening...' : 'Start Listening'}
      disabled={!isReady}
    />
  );
}

Audio Visualization

Visualize speech activity:
function SpeechVisualizer({ audioBuffer }: { audioBuffer: Float32Array }) {
  const { forward, isReady } = useVAD({ model });
  const [segments, setSegments] = useState<Segment[]>([]);

  useEffect(() => {
    // Wait for the model to load before running detection
    if (!isReady) return;
    const detectSegments = async () => {
      const detected = await forward(audioBuffer);
      setSegments(detected);
    };
    detectSegments();
  }, [audioBuffer, isReady]);

  const duration = audioBuffer.length / 16000; // Total duration in seconds

  return (
    <View style={{ flexDirection: 'row', height: 50 }}>
      {segments.map((seg, idx) => {
        const left = (seg.start / duration) * 100;
        const width = ((seg.end - seg.start) / duration) * 100;

        return (
          <View
            key={idx}
            style={{
              position: 'absolute',
              left: `${left}%`,
              width: `${width}%`,
              height: '100%',
              backgroundColor: 'green',
              opacity: 0.5,
            }}
          />
        );
      })}
    </View>
  );
}

Error Handling

const { forward, error, isReady } = useVAD({ model });

if (error) {
  console.error('VAD Error:', error.message);
}

try {
  const segments = await forward(audioBuffer);
} catch (err) {
  if (err.code === 'MODULE_NOT_LOADED') {
    console.error('Model not ready yet');
  } else if (err.code === 'MODEL_GENERATING') {
    console.error('Already processing audio');
  } else {
    console.error('Detection failed:', err.message);
  }
}

Best Practices

  1. Audio Quality: Clean audio with minimal background noise produces better results.
  2. Segment Filtering: Always filter out very short segments (< 300ms), which are often noise.
  3. Segment Merging: Merge segments with small gaps to avoid fragmenting continuous speech.
  4. Buffer Size: Process at least 1-2 seconds of audio for reliable detection.
  5. Memory Management: For long recordings, process in chunks and clear buffers regularly.
  6. Real-Time Processing: Accumulate small chunks (100-200ms) before running VAD to reduce overhead.
  7. Combined Workflows: Use VAD before STT to reduce computational cost and improve accuracy.

Performance Tips

  • Batch Processing: Process multiple seconds at once rather than very small chunks.
  • Async Processing: Run VAD asynchronously to avoid blocking the UI thread.
  • Cache Model: The model is cached after first load, making subsequent uses faster.
  • Threshold Tuning: Experiment with minimum segment duration for your use case.

Common Use Cases

Meeting Recorder

function MeetingRecorder() {
  const { forward } = useVAD({ model });
  const [speakers, setSpeakers] = useState<Segment[]>([]);

  const analyzeMeeting = async (recording: Float32Array) => {
    const segments = await forward(recording);
    const filtered = filterShortSegments(segments, 1.0); // 1s minimum
    const merged = mergeAdjacentSegments(filtered, 2.0); // 2s max gap

    setSpeakers(merged);
    return merged;
  };

  return (
    <View>
      <Text>Speech Segments: {speakers.length}</Text>
      {speakers.map((seg, idx) => (
        <Text key={idx}>
          Speaker {idx + 1}: {seg.start.toFixed(1)}s - {seg.end.toFixed(1)}s
        </Text>
      ))}
    </View>
  );
}

Silence Detection

const detectSilence = async (
  audioBuffer: Float32Array,
  vad: VADType
): Promise<Segment[]> => {
  const segments = await vad.forward(audioBuffer);
  const duration = audioBuffer.length / 16000;
  const silenceSegments: Segment[] = [];

  // No speech detected: the entire buffer is silence
  if (segments.length === 0) {
    return [{ start: 0, end: duration }];
  }

  // Before first speech
  if (segments[0].start > 0) {
    silenceSegments.push({ start: 0, end: segments[0].start });
  }

  // Between speech segments
  for (let i = 0; i < segments.length - 1; i++) {
    silenceSegments.push({
      start: segments[i].end,
      end: segments[i + 1].start,
    });
  }

  // After last speech
  if (segments.length > 0 && segments[segments.length - 1].end < duration) {
    silenceSegments.push({
      start: segments[segments.length - 1].end,
      end: duration,
    });
  }

  return silenceSegments;
};
