Realtime transcription enables live speech-to-text from microphone input with automatic audio slicing, VAD-based speech detection, and memory management.

Overview

The RealtimeTranscriber provides:
  • Live microphone capture and transcription
  • Voice Activity Detection (VAD) for speech/silence detection
  • Automatic audio slicing at configurable intervals
  • Memory-efficient circular buffer management
  • Event-based architecture for transcription updates
  • Audio recording to WAV files

Quick Start

1. Initialize Contexts

First, initialize both Whisper and VAD contexts:
import { initWhisper, initWhisperVad } from 'whisper.rn';

// Initialize Whisper context
const whisperContext = await initWhisper({
  filePath: require('../assets/ggml-base.bin'),
});

// Initialize VAD context
const vadContext = await initWhisperVad({
  filePath: require('../assets/ggml-silero-v6.2.0.bin'),
  useGpu: true,
  nThreads: 4,
});

console.log('Contexts initialized');

2. Create RealtimeTranscriber

Set up the transcriber with dependencies, options, and callbacks:
import RNFS from 'react-native-fs';
import {
  RealtimeTranscriber,
  RingBufferVad,
  VAD_PRESETS,
  AudioPcmStreamAdapter,
} from 'whisper.rn/realtime-transcription';

// Create VAD wrapper with preset
const vadWrapper = new RingBufferVad(vadContext, {
  vadOptions: VAD_PRESETS.default,
  vadPreset: 'default',
  logger: (msg) => console.log(msg),
});

// Create audio stream adapter
const audioStream = new AudioPcmStreamAdapter();

// Create transcriber
const transcriber = new RealtimeTranscriber(
  // Dependencies
  {
    whisperContext,
    vadContext: vadWrapper,
    audioStream,
    fs: RNFS,
  },
  // Options
  {
    logger: (msg) => console.log(msg),
    audioSliceSec: 30,
    audioMinSec: 0.5,
    maxSlicesInMemory: 3,
    transcribeOptions: {
      language: 'en',
      maxLen: 1,
    },
    audioOutputPath: `${RNFS.DocumentDirectoryPath}/recording.wav`,
  },
  // Callbacks
  {
    onTranscribe: (event) => {
      console.log('Transcription:', event.data?.result);
    },
    onVad: (event) => {
      console.log('VAD:', event.type, event.confidence);
    },
    onError: (error) => {
      console.error('Error:', error);
    },
    onStatusChange: (isActive) => {
      console.log('Status:', isActive ? 'ACTIVE' : 'INACTIVE');
    },
    onStatsUpdate: (stats) => {
      console.log('Stats:', stats.data);
    },
  }
);

3. Start Transcription

Start realtime transcription:
await transcriber.start();
console.log('Realtime transcription started');

4. Stop and Cleanup

Stop transcription and release resources:
await transcriber.stop();
await transcriber.release();

// Release contexts
await whisperContext.release();
await vadContext.release();

VAD Presets

The library includes pre-configured VAD presets for different use cases. The default preset provides balanced settings for general use:
VAD_PRESETS.default = {
  threshold: 0.5,
  minSpeechDurationMs: 250,
  minSilenceDurationMs: 100,
  maxSpeechDurationS: 30,
  speechPadMs: 30,
  samplesOverlap: 0.1,
}
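If none of the presets fits, a custom option set can be derived by overriding individual fields of a preset. The snippet below copies the default values shown above into a local object to keep the example self-contained; in real code you would spread `VAD_PRESETS.default` instead. The overridden values are illustrative assumptions, not library recommendations:

```typescript
// Default preset values, as documented above.
const defaultVad = {
  threshold: 0.5,
  minSpeechDurationMs: 250,
  minSilenceDurationMs: 100,
  maxSpeechDurationS: 30,
  speechPadMs: 30,
  samplesOverlap: 0.1,
};

// Derive a more sensitive variant for quiet rooms: a lower
// threshold fires on fainter speech, and a shorter
// minSpeechDurationMs accepts briefer utterances.
const sensitiveVad = {
  ...defaultVad,
  threshold: 0.3,
  minSpeechDurationMs: 150,
};
```

The resulting object can then be passed as `vadOptions` when constructing `RingBufferVad`.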

Event Callbacks

The transcriber provides several event callbacks:

onTranscribe

Receives transcription results:
onTranscribe: (event: RealtimeTranscribeEvent) => {
  const { data, sliceIndex, processTime } = event;
  
  if (data?.result) {
    console.log(`Slice ${sliceIndex}: ${data.result}`);
    console.log(`Processed in ${processTime}ms`);
    
    // Access all segments
    data.segments.forEach((segment) => {
      console.log(`[${segment.t0} --> ${segment.t1}] ${segment.text}`);
    });
  }
}
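Segment times `t0`/`t1` come from the underlying whisper.cpp engine, which reports them in 10 ms ticks (an assumption worth verifying against your version of the library); this hypothetical helper formats a tick count as `mm:ss.cc` for display:

```typescript
// Format a whisper.cpp segment time (assumed to be in
// 10 ms ticks, i.e. centiseconds) as mm:ss.cc.
function formatTick(t: number): string {
  const totalSec = Math.floor(t / 100); // whole seconds
  const cs = t % 100;                   // leftover centiseconds
  const min = Math.floor(totalSec / 60);
  const sec = totalSec % 60;
  const pad = (n: number) => String(n).padStart(2, '0');
  return `${pad(min)}:${pad(sec)}.${pad(cs)}`;
}

// e.g. formatTick(segment.t0) inside the forEach above
```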

onVad

Receives VAD events (speech start, speech end, silence):
onVad: (event: RealtimeVadEvent) => {
  console.log(`VAD: ${event.type} (confidence: ${event.confidence.toFixed(2)})`);
  
  if (event.type === 'speech_start') {
    console.log('Speech detected!');
  } else if (event.type === 'speech_end') {
    console.log('Speech ended, triggering slice...');
  }
}

onSliceTranscriptionStabilized

Receives the most recent stabilized transcription:
onSliceTranscriptionStabilized: (text: string) => {
  console.log('Stabilized text:', text);
  // Update UI with current transcription
  setCurrentTranscription(text);
}

onStatsUpdate

Receives statistics about memory usage and processing:
onStatsUpdate: (stats: RealtimeStatsEvent) => {
  const { data } = stats;
  console.log('Slices in memory:', data.sliceStats?.memoryUsage?.slicesInMemory);
  console.log('Memory usage:', data.sliceStats?.memoryUsage?.estimatedMB, 'MB');
  console.log('Is transcribing:', data.isTranscribing);
}

Complete Example

Here’s a complete React Native component with realtime transcription:
import React, { useCallback, useEffect, useRef, useState } from 'react';
import { View, Text, Button, ScrollView, Switch } from 'react-native';
import RNFS from 'react-native-fs';
import { initWhisper, initWhisperVad } from 'whisper.rn';
import type { WhisperContext, WhisperVadContext } from 'whisper.rn';
import {
  RealtimeTranscriber,
  RingBufferVad,
  VAD_PRESETS,
  AudioPcmStreamAdapter,
  type RealtimeTranscribeEvent,
  type RealtimeVadEvent,
} from 'whisper.rn/realtime-transcription';

export default function RealtimeTranscription() {
  const whisperContextRef = useRef<WhisperContext | null>(null);
  const vadContextRef = useRef<WhisperVadContext | null>(null);
  const transcriberRef = useRef<RealtimeTranscriber | null>(null);

  const [logs, setLogs] = useState<string[]>([]);
  const [currentText, setCurrentText] = useState<string>('');
  const [isTranscribing, setIsTranscribing] = useState(false);
  const [vadPreset, setVadPreset] = useState<keyof typeof VAD_PRESETS>('default');

  const log = useCallback((...messages: any[]) => {
    const timestamp = new Date().toLocaleTimeString();
    setLogs((prev) => [...prev, `${timestamp}: ${messages.join(' ')}`]);
  }, []);

  useEffect(() => {
    return () => {
      // Release the transcriber first, then the contexts it depends on
      transcriberRef.current?.release();
      whisperContextRef.current?.release();
      vadContextRef.current?.release();
    };
  }, []);

  const initialize = async () => {
    try {
      log('Initializing contexts...');
      
      // Initialize Whisper
      const whisperCtx = await initWhisper({
        filePath: require('../assets/ggml-base.bin'),
      });
      whisperContextRef.current = whisperCtx;
      log('Whisper initialized');

      // Initialize VAD
      const vadCtx = await initWhisperVad({
        filePath: require('../assets/ggml-silero-v6.2.0.bin'),
        useGpu: true,
        nThreads: 4,
      });
      vadContextRef.current = vadCtx;
      log('VAD initialized');
    } catch (error) {
      log('Error initializing:', error);
    }
  };

  const startTranscription = async () => {
    if (!whisperContextRef.current || !vadContextRef.current) {
      log('Contexts not initialized');
      return;
    }

    try {
      const audioStream = new AudioPcmStreamAdapter();
      
      const vadWrapper = new RingBufferVad(vadContextRef.current, {
        vadOptions: VAD_PRESETS[vadPreset],
        vadPreset,
        logger: (msg) => console.log(msg),
      });

      const transcriber = new RealtimeTranscriber(
        {
          whisperContext: whisperContextRef.current,
          vadContext: vadWrapper,
          audioStream,
          fs: RNFS,
        },
        {
          logger: (msg) => log(msg),
          audioSliceSec: 30,
          audioMinSec: 0.5,
          maxSlicesInMemory: 3,
          transcribeOptions: {
            language: 'en',
            maxLen: 1,
          },
          audioOutputPath: `${RNFS.DocumentDirectoryPath}/realtime.wav`,
        },
        {
          onTranscribe: (event: RealtimeTranscribeEvent) => {
            if (event.data?.result) {
              log(`Transcribed: "${event.data.result.substring(0, 50)}..."`);
            }
          },
          onVad: (event: RealtimeVadEvent) => {
            if (event.type !== 'silence') {
              log(`VAD: ${event.type}`);
            }
          },
          onError: (error) => log('Error:', error),
          onStatusChange: (isActive) => setIsTranscribing(isActive),
          onSliceTranscriptionStabilized: (text) => setCurrentText(text),
        }
      );

      transcriberRef.current = transcriber;
      await transcriber.start();
      log('Realtime transcription started');
    } catch (error) {
      log('Error starting transcription:', error);
    }
  };

  const stopTranscription = async () => {
    if (!transcriberRef.current) return;

    try {
      await transcriberRef.current.stop();
      log('Transcription stopped');
    } catch (error) {
      log('Error stopping:', error);
    }
  };

  return (
    <ScrollView style={{ padding: 20 }}>
      <Button title="Initialize" onPress={initialize} />
      
      <View style={{ marginTop: 10 }}>
        <Text>VAD Preset: {vadPreset}</Text>
        <Button
          title="Change VAD Preset"
          onPress={() => {
            const presets = Object.keys(VAD_PRESETS) as Array<keyof typeof VAD_PRESETS>;
            const currentIndex = presets.indexOf(vadPreset);
            const nextPreset = presets[(currentIndex + 1) % presets.length];
            setVadPreset(nextPreset);
            log(`Changed VAD preset to: ${nextPreset}`);
          }}
        />
      </View>

      <View style={{ marginTop: 10 }}>
        <Button
          title={isTranscribing ? 'Stop' : 'Start Realtime'}
          onPress={isTranscribing ? stopTranscription : startTranscription}
          disabled={!whisperContextRef.current}
        />
      </View>

      {!!currentText && (
        <View style={{ marginTop: 20, padding: 10, backgroundColor: '#e8f5e8' }}>
          <Text style={{ fontWeight: 'bold' }}>Current Transcription:</Text>
          <Text>{currentText}</Text>
        </View>
      )}

      <View style={{ marginTop: 20 }}>
        <Text style={{ fontWeight: 'bold' }}>Logs:</Text>
        {logs.slice(-10).map((log, i) => (
          <Text key={i} style={{ fontSize: 12 }}>{log}</Text>
        ))}
      </View>
    </ScrollView>
  );
}

File Simulation Mode

Test realtime transcription using pre-recorded audio files:
import { SimulateFileAudioStreamAdapter } from 'whisper.rn/realtime-transcription/adapters';

const audioStream = new SimulateFileAudioStreamAdapter({
  fs: RNFS,
  filePath: '/path/to/audio.wav',
  playbackSpeed: 1.0, // 1x speed, can go faster for testing
  chunkDurationMs: 100,
  loop: false,
  onEndOfFile: () => {
    console.log('File playback complete');
  },
  logger: (msg) => console.log(msg),
});

// Use with RealtimeTranscriber
const transcriber = new RealtimeTranscriber(
  {
    whisperContext,
    vadContext: vadWrapper,
    audioStream, // File simulation adapter
    fs: RNFS,
  },
  { /* options */ },
  { /* callbacks */ }
);

Advanced Features

Force Next Slice

Manually trigger a slice during transcription:
await transcriber.nextSlice();
console.log('Forced next slice');

Update VAD Options

Change VAD settings during transcription:
transcriber.updateVadOptions(VAD_PRESETS.sensitive);

Reset Transcriber

Clear all state without stopping:
transcriber.reset();
console.log('Transcriber reset');

Get Transcription Results

Retrieve all transcription results:
const results = transcriber.getTranscriptionResults();
results.forEach(({ slice, transcribeEvent }) => {
  console.log(`Slice ${slice.index}: ${transcribeEvent.data?.result}`);
});
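The per-slice results can also be stitched into a single transcript. The sample data below is hypothetical and only mirrors the `{ slice, transcribeEvent }` shape shown above:

```typescript
// Hypothetical results array mirroring the shape returned by
// getTranscriptionResults(); assumes slices arrive in order.
const results = [
  { slice: { index: 0 }, transcribeEvent: { data: { result: 'Hello there,' } } },
  { slice: { index: 1 }, transcribeEvent: { data: { result: 'how are you?' } } },
];

// Join non-empty per-slice texts into one transcript string.
const fullText = results
  .map(({ transcribeEvent }) => transcribeEvent.data?.result ?? '')
  .filter((t) => t.length > 0)
  .join(' ');
// fullText === 'Hello there, how are you?'
```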

Performance Tips

  • Slice Duration: 30 seconds is optimal for most cases. Shorter slices mean more frequent processing; longer slices mean higher memory usage.
  • Memory Management: Set maxSlicesInMemory: 3 to keep memory usage low. Older slices are automatically discarded.
  • VAD Preset: Start with 'default'; switch to 'sensitive' for quiet environments or 'conservative' for noisy environments.
  • Model Selection: Use 'tiny' or 'base' models for realtime. Larger models may cause lag on lower-end devices.
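As a rough sketch of the slice-duration/memory tradeoff, using the option names documented on this page (the concrete values, sample rate, and sample format are assumptions, not library recommendations):

```typescript
// Hypothetical tuning for a lower-end device.
const lowEndOptions = {
  audioSliceSec: 15,     // shorter slices: more frequent, smaller transcriptions
  audioMinSec: 0.5,      // skip fragments shorter than half a second
  maxSlicesInMemory: 2,  // bound how much PCM stays in RAM
};

// Rough per-slice memory estimate, assuming 16 kHz mono
// float32 PCM (4 bytes per sample):
const bytesPerSlice = lowEndOptions.audioSliceSec * 16000 * 4;
// 15 * 16000 * 4 = 960,000 bytes, i.e. just under 1 MB per slice,
// so two slices in memory hold roughly 2 MB of raw audio.
```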

Next Steps

Basic Transcription

Learn basic audio file transcription

VAD Detection

Understand Voice Activity Detection

File Handling

Work with different audio formats

API Reference

Full RealtimeTranscriber API documentation
