> **Coming in v0.4.0**: This feature is planned for the next release. The API is subject to change.
## Overview
Speaker diarization identifies "who spoke when" in an audio recording, segmenting the audio by speaker. This is useful for:

- Meeting transcription with speaker labels
- Call center analytics
- Interview transcription
- Multi-speaker content analysis
- Podcast and video production
## Installation

Diarization will be included in the main package:

```bash
npm install react-native-sherpa-onnx
```
## Basic Usage

```typescript
import { initializeDiarization, diarizeAudio } from 'react-native-sherpa-onnx/diarization';

// Initialize diarization with a model
await initializeDiarization({
  modelPath: {
    type: 'auto',
    path: 'models/diarization-model'
  }
});

// Diarize audio
const segments = await diarizeAudio('path/to/conversation.wav');

console.log('Speaker segments:', segments);
// [
//   { speakerId: 'speaker_0', start: 0.0, end: 3.5 },
//   { speakerId: 'speaker_1', start: 3.5, end: 7.2 },
//   { speakerId: 'speaker_0', start: 7.2, end: 10.1 }
// ]
```
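Given segments in the shape shown above, a small formatting helper (hypothetical, not part of the package) can turn the raw output into readable labels for logging or UI display:

```typescript
interface SpeakerSegment {
  speakerId: string;
  start: number;
  end: number;
}

// Format each segment as "speaker_0: 0.0s - 3.5s".
function formatSegments(segments: SpeakerSegment[]): string[] {
  return segments.map(
    (s) => `${s.speakerId}: ${s.start.toFixed(1)}s - ${s.end.toFixed(1)}s`
  );
}

const labels = formatSegments([
  { speakerId: 'speaker_0', start: 0.0, end: 3.5 },
  { speakerId: 'speaker_1', start: 3.5, end: 7.2 },
]);
// labels[0] === 'speaker_0: 0.0s - 3.5s'
```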
## API Reference
### initializeDiarization()

Initialize the speaker diarization model.

```typescript
initializeDiarization(options: DiarizationInitializeOptions): Promise<void>
```
#### Parameters

- `options` (`DiarizationInitializeOptions`, required): Configuration options for diarization initialization.
  - `modelPath` (`ModelPathConfig`): Path configuration for the diarization model.
    - `type`: Type of model path resolution.
    - `path`: Path to the model directory.
#### Returns

A promise that resolves when diarization is initialized.
#### Example

```typescript
await initializeDiarization({
  modelPath: {
    type: 'auto',
    path: 'models/pyannote-diarization'
  }
});
```
### diarizeAudio()

Perform speaker diarization on an audio file.

```typescript
diarizeAudio(filePath: string): Promise<SpeakerSegment[]>
```
#### Parameters

- `filePath` (`string`, required): Path to the audio file to diarize.
#### Returns

A promise that resolves to an array of `SpeakerSegment` objects:

- `speakerId` (`string`): Unique identifier for the speaker (e.g., `"speaker_0"`, `"speaker_1"`)
- `start` (`number`): Start time of the speaker segment in seconds
- `end` (`number`): End time of the speaker segment in seconds
#### Example

```typescript
const segments = await diarizeAudio('/path/to/meeting.wav');

// Group segments by speaker
const speakers = segments.reduce((acc, segment) => {
  if (!acc[segment.speakerId]) {
    acc[segment.speakerId] = [];
  }
  acc[segment.speakerId].push(segment);
  return acc;
}, {});

console.log(`Found ${Object.keys(speakers).length} speakers`);
```
### unloadDiarization()

Release diarization model resources.

```typescript
unloadDiarization(): Promise<void>
```

#### Returns

A promise that resolves when resources are released.

#### Example

```typescript
// When done with diarization
await unloadDiarization();
```
## Types

### DiarizationInitializeOptions

```typescript
interface DiarizationInitializeOptions {
  modelPath: ModelPathConfig;
  // Additional options will be added in v0.4.0
}
```

### SpeakerSegment

```typescript
interface SpeakerSegment {
  speakerId: string;  // Unique speaker identifier
  start: number;      // Start time in seconds
  end: number;        // End time in seconds
  // Additional fields will be added in v0.4.0
}
```

### ModelPathConfig

```typescript
interface ModelPathConfig {
  type: 'auto' | 'file';
  path: string;
}
```
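Because the API is still subject to change before v0.4.0, a defensive runtime check can guard against shape drift in returned segments. This is a sketch, not part of the package:

```typescript
interface SpeakerSegment {
  speakerId: string;
  start: number;
  end: number;
}

// Runtime guard: verifies an unknown value matches the SpeakerSegment shape
// and that its time range is well-formed (start <= end).
function isSpeakerSegment(value: unknown): value is SpeakerSegment {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.speakerId === 'string' &&
    typeof v.start === 'number' &&
    typeof v.end === 'number' &&
    v.start <= v.end
  );
}
```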
## Best Practices

Diarization accuracy depends heavily on audio quality:

- **Minimize background noise**: Clean audio produces better results
- **Avoid overlapping speech**: Speakers talking simultaneously are harder to separate
- **Use appropriate microphones**: Individual mics per speaker are ideal
- **Maintain consistent volume**: Normalize audio levels across speakers
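For the last point, a simple peak-normalization pass over raw PCM samples looks like the sketch below. This is generic DSP code, not part of this package's API:

```typescript
// Peak-normalize PCM samples so the loudest sample reaches `targetPeak`.
// Returns a new array; silent input is returned unchanged.
function normalizePeak(samples: Float32Array, targetPeak = 0.95): Float32Array {
  let peak = 0;
  for (const s of samples) peak = Math.max(peak, Math.abs(s));
  if (peak === 0) return samples.slice();

  const gain = targetPeak / peak;
  const out = new Float32Array(samples.length);
  for (let i = 0; i < samples.length; i++) out[i] = samples[i] * gain;
  return out;
}
```

Applying the same target peak to each speaker's recording keeps levels consistent before diarization.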
### Combine with transcription

Diarization is most powerful when combined with speech recognition:

```typescript
// 1. Diarize the audio
const speakers = await diarizeAudio('meeting.wav');

// 2. Transcribe each speaker segment
for (const segment of speakers) {
  const text = await transcribe({
    file: 'meeting.wav',
    start: segment.start,
    end: segment.end
  });
  console.log(`${segment.speakerId}: ${text}`);
}
```
### Handle unknown speaker counts

Most diarization models automatically detect the number of speakers:

- Don't assume a fixed number of speakers
- Handle cases with one speaker (monologue)
- Consider maximum speaker limits for your use case
- Post-process to merge or split segments if needed
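As an example of the last point, merging consecutive segments from the same speaker (optionally bridging small gaps) can be sketched as follows; the `maxGap` threshold is an arbitrary illustration value:

```typescript
interface SpeakerSegment {
  speakerId: string;
  start: number;
  end: number;
}

// Merge consecutive segments from the same speaker when the silence
// between them is at most `maxGap` seconds. Assumes input is sorted by start.
function mergeAdjacent(segments: SpeakerSegment[], maxGap = 0.5): SpeakerSegment[] {
  const merged: SpeakerSegment[] = [];
  for (const seg of segments) {
    const last = merged[merged.length - 1];
    if (last && last.speakerId === seg.speakerId && seg.start - last.end <= maxGap) {
      last.end = Math.max(last.end, seg.end);  // extend the previous segment
    } else {
      merged.push({ ...seg });  // start a new segment
    }
  }
  return merged;
}
```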
## Common Use Cases

### Meeting Transcription

```typescript
import { initializeDiarization, diarizeAudio } from 'react-native-sherpa-onnx/diarization';

async function transcribeMeeting(audioPath: string) {
  // Initialize diarization
  await initializeDiarization({
    modelPath: { type: 'auto', path: 'models/diarization' }
  });

  // Get speaker segments
  const segments = await diarizeAudio(audioPath);

  // Create a transcript with speaker labels.
  // transcribeSegment is a helper you provide that transcribes the given
  // time range of the file (see "Combine with transcription" above).
  const transcript = [];
  for (const segment of segments) {
    const text = await transcribeSegment(audioPath, segment.start, segment.end);
    transcript.push({
      speaker: segment.speakerId,
      text: text,
      timestamp: `${segment.start.toFixed(1)}s - ${segment.end.toFixed(1)}s`
    });
  }

  return transcript;
}
```
### Speaker Timeline Visualization

```typescript
const segments = await diarizeAudio('conversation.wav');

// Create a timeline representation
const timeline = segments.map(segment => ({
  speaker: segment.speakerId,
  duration: segment.end - segment.start,
  startTime: segment.start
}));

// Calculate talk time per speaker
const talkTime = segments.reduce((acc, segment) => {
  const duration = segment.end - segment.start;
  acc[segment.speakerId] = (acc[segment.speakerId] || 0) + duration;
  return acc;
}, {});

console.log('Talk time per speaker:', talkTime);
```
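Building on talk-time totals of that shape, relative speaking share per speaker can be derived with a small helper (a sketch, not a package API):

```typescript
// Convert absolute talk time (seconds per speaker) into percentages.
function talkTimeShare(talkTime: Record<string, number>): Record<string, number> {
  const total = Object.values(talkTime).reduce((sum, t) => sum + t, 0);
  const share: Record<string, number> = {};
  for (const [speaker, t] of Object.entries(talkTime)) {
    share[speaker] = total > 0 ? (t / total) * 100 : 0;
  }
  return share;
}

// talkTimeShare({ speaker_0: 30, speaker_1: 10 })
// → { speaker_0: 75, speaker_1: 25 }
```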
## Error Handling

```typescript
try {
  await initializeDiarization({
    modelPath: {
      type: 'auto',
      path: 'models/diarization'
    }
  });

  const segments = await diarizeAudio('audio.wav');

  if (segments.length === 0) {
    console.log('No speakers detected in audio');
  } else {
    const numSpeakers = new Set(segments.map(s => s.speakerId)).size;
    console.log(`Detected ${numSpeakers} speaker(s)`);
  }
} catch (error) {
  console.error('Diarization error:', error);
} finally {
  await unloadDiarization();
}
```
## Performance

Speaker diarization is computationally intensive:

- Processing time scales with audio length
- Expect 0.1x - 0.5x real-time performance depending on the model
- Consider processing in chunks for long recordings
- Use VAD preprocessing to skip silent segments
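For chunked processing of long recordings, chunk boundaries can be planned with a small overlap so that segments spanning a cut are not lost. The chunk and overlap durations below are arbitrary illustration values, not library defaults:

```typescript
interface Chunk {
  start: number;
  end: number;
}

// Split a recording of `totalSec` seconds into chunks of `chunkSec` seconds
// with `overlapSec` seconds of overlap between consecutive chunks.
function planChunks(totalSec: number, chunkSec = 60, overlapSec = 5): Chunk[] {
  const chunks: Chunk[] = [];
  let start = 0;
  while (start < totalSec) {
    const end = Math.min(start + chunkSec, totalSec);
    chunks.push({ start, end });
    if (end >= totalSec) break;
    start = end - overlapSec;  // back up so boundary speech appears in both chunks
  }
  return chunks;
}
```

Each chunk can then be diarized independently; speaker IDs from different chunks must still be reconciled, since the model assigns labels per call.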
## Related

- **Voice Activity Detection**: Preprocess audio to remove silence
- **Speech Recognition**: Transcribe speaker segments
- **Speech Enhancement**: Improve audio quality before diarization