Coming in v0.4.0 - This feature is planned for the next release. The API is subject to change.

Overview

Speaker Diarization identifies “who spoke when” in an audio recording, segmenting the audio by speaker. This is useful for:
  • Meeting transcription with speaker labels
  • Call center analytics
  • Interview transcription
  • Multi-speaker content analysis
  • Podcast and video production

Installation

Diarization will be included in the main package:
npm install react-native-sherpa-onnx

Basic Usage

import { initializeDiarization, diarizeAudio } from 'react-native-sherpa-onnx/diarization';

// Initialize diarization with model
await initializeDiarization({
  modelPath: {
    type: 'auto',
    path: 'models/diarization-model'
  }
});

// Diarize audio
const segments = await diarizeAudio('path/to/conversation.wav');

console.log('Speaker segments:', segments);
// [
//   { speakerId: 'speaker_0', start: 0.0, end: 3.5 },
//   { speakerId: 'speaker_1', start: 3.5, end: 7.2 },
//   { speakerId: 'speaker_0', start: 7.2, end: 10.1 }
// ]

API Reference

initializeDiarization()

Initialize the Speaker Diarization model.
await initializeDiarization(options: DiarizationInitializeOptions): Promise<void>

Parameters

options
DiarizationInitializeOptions
required
Configuration options for diarization initialization

Returns

Promise that resolves when diarization is initialized.

Example

await initializeDiarization({
  modelPath: {
    type: 'auto',
    path: 'models/pyannote-diarization'
  }
});

diarizeAudio()

Perform speaker diarization on an audio file.
await diarizeAudio(filePath: string): Promise<SpeakerSegment[]>

Parameters

filePath
string
required
Path to the audio file to diarize

Returns

Promise that resolves to an array of speaker segments.
SpeakerSegment[]
array

Example

const segments = await diarizeAudio('/path/to/meeting.wav');

// Group by speaker
const speakers = segments.reduce((acc, segment) => {
  if (!acc[segment.speakerId]) {
    acc[segment.speakerId] = [];
  }
  acc[segment.speakerId].push(segment);
  return acc;
}, {} as Record<string, SpeakerSegment[]>);

console.log(`Found ${Object.keys(speakers).length} speakers`);

unloadDiarization()

Release diarization model resources.
await unloadDiarization(): Promise<void>

Returns

Promise that resolves when resources are released.

Example

// When done with diarization
await unloadDiarization();

Types

DiarizationInitializeOptions

interface DiarizationInitializeOptions {
  modelPath: ModelPathConfig;
  // Additional options will be added in v0.4.0
}

SpeakerSegment

interface SpeakerSegment {
  speakerId: string;  // Unique speaker identifier
  start: number;      // Start time in seconds
  end: number;        // End time in seconds
  // Additional fields will be added in v0.4.0
}

ModelPathConfig

interface ModelPathConfig {
  type: 'auto' | 'file';
  path: string;
}

Best Practices

Diarization accuracy depends heavily on audio quality:
  • Minimize background noise: Clean audio produces better results
  • Avoid overlapping speech: Speakers talking simultaneously are harder to separate
  • Use appropriate microphones: Individual mics per speaker are ideal
  • Maintain consistent volume: Normalize audio levels across speakers
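The level-normalization point can be sketched with a small, generic DSP helper. This is an illustration only, not part of the react-native-sherpa-onnx API; loading PCM samples into a `Float32Array` is assumed to happen elsewhere.

```typescript
// Peak-normalize a buffer of mono PCM samples to a target peak (0..1)
// before passing the audio to diarization.
function normalizePeak(samples: Float32Array, targetPeak = 0.9): Float32Array {
  let peak = 0;
  for (const s of samples) {
    const a = Math.abs(s);
    if (a > peak) peak = a;
  }
  if (peak === 0) return samples; // silence: nothing to scale

  const gain = targetPeak / peak;
  const out = new Float32Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    out[i] = samples[i] * gain;
  }
  return out;
}
```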
Diarization is most powerful when combined with speech recognition:
// 1. Diarize the audio
const segments = await diarizeAudio('meeting.wav');

// 2. Transcribe each speaker segment
// (transcribe() stands in for your speech recognition call;
// see the Speech Recognition docs for the actual API)
for (const segment of segments) {
  const text = await transcribe({
    file: 'meeting.wav',
    start: segment.start,
    end: segment.end
  });
  console.log(`${segment.speakerId}: ${text}`);
}
Most diarization models automatically detect the number of speakers:
  • Don’t assume a fixed number of speakers
  • Handle cases with 1 speaker (monologue)
  • Consider maximum speaker limits for your use case
  • Post-process to merge or split segments if needed
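The post-processing point above can be sketched as a single merge pass. `SpeakerSegment` matches the interface in the Types section below; the `maxGap` threshold is an illustrative choice, not a library default.

```typescript
interface SpeakerSegment {
  speakerId: string;
  start: number; // seconds
  end: number;   // seconds
}

// Merge consecutive segments from the same speaker when the silence
// between them is shorter than maxGap (in seconds).
function mergeSegments(segments: SpeakerSegment[], maxGap = 0.5): SpeakerSegment[] {
  const merged: SpeakerSegment[] = [];
  for (const seg of segments) {
    const last = merged[merged.length - 1];
    if (last && last.speakerId === seg.speakerId && seg.start - last.end <= maxGap) {
      last.end = Math.max(last.end, seg.end); // extend the previous segment
    } else {
      merged.push({ ...seg }); // copy so the input array stays untouched
    }
  }
  return merged;
}
```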

Common Use Cases

Meeting Transcription

import { initializeDiarization, diarizeAudio } from 'react-native-sherpa-onnx/diarization';

async function transcribeMeeting(audioPath: string) {
  // Initialize diarization (initialize your speech recognizer separately)
  await initializeDiarization({
    modelPath: { type: 'auto', path: 'models/diarization' }
  });

  // Get speaker segments
  const segments = await diarizeAudio(audioPath);

  // Create transcript with speaker labels
  const transcript = [];
  for (const segment of segments) {
    // transcribeSegment() is a helper you implement on top of the
    // speech recognition API; see the Speech Recognition docs
    const text = await transcribeSegment(audioPath, segment.start, segment.end);
    transcript.push({
      speaker: segment.speakerId,
      text: text,
      timestamp: `${segment.start.toFixed(1)}s - ${segment.end.toFixed(1)}s`
    });
  }

  return transcript;
}

Speaker Timeline Visualization

const segments = await diarizeAudio('conversation.wav');

// Create timeline representation
const timeline = segments.map(segment => ({
  speaker: segment.speakerId,
  duration: segment.end - segment.start,
  startTime: segment.start
}));

// Calculate speaker talk time
const talkTime = segments.reduce((acc, segment) => {
  const duration = segment.end - segment.start;
  acc[segment.speakerId] = (acc[segment.speakerId] || 0) + duration;
  return acc;
}, {} as Record<string, number>);

console.log('Talk time per speaker:', talkTime);

Error Handling

try {
  await initializeDiarization({
    modelPath: {
      type: 'auto',
      path: 'models/diarization'
    }
  });
  
  const segments = await diarizeAudio('audio.wav');
  
  if (segments.length === 0) {
    console.log('No speakers detected in audio');
  } else {
    const numSpeakers = new Set(segments.map(s => s.speakerId)).size;
    console.log(`Detected ${numSpeakers} speaker(s)`);
  }
} catch (error) {
  console.error('Diarization error:', error);
} finally {
  await unloadDiarization();
}

Performance Considerations

Speaker diarization is computationally intensive:
  • Processing time scales with audio length
  • Expect processing to take roughly 0.1x - 0.5x of the audio's duration, depending on the model
  • Consider processing in chunks for long recordings
  • Use VAD preprocessing to skip silent segments
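If you split a long recording into chunks yourself, each chunk's segment timestamps are relative to that chunk and need shifting back onto the full recording's timeline. A minimal sketch, reusing `SpeakerSegment` from the Types section; note it assumes speaker IDs are consistent across chunks, which in practice they usually are not and would require re-clustering:

```typescript
interface SpeakerSegment {
  speakerId: string;
  start: number; // seconds
  end: number;   // seconds
}

// Shift chunk-local segment times by the chunk's offset (in seconds)
// within the full recording.
function toGlobalTimeline(segments: SpeakerSegment[], chunkOffset: number): SpeakerSegment[] {
  return segments.map(s => ({
    ...s,
    start: s.start + chunkOffset,
    end: s.end + chunkOffset,
  }));
}
```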

See Also

  • Voice Activity Detection: Preprocess audio to remove silence
  • Speech Recognition: Transcribe speaker segments
  • Speech Enhancement: Improve audio quality before diarization
