These metrics measure similarity between audio samples using deep-learning embeddings from speaker-verification and emotion-recognition models.

Speaker Similarity

Model Setup

speaker_model_setup(
    model_tag="default",
    model_path=None,
    model_config=None,
    use_gpu=False
)

Parameters

model_tag
str
default:"default"
Model identifier. Options:
  • "default" or "espnet/voxcelebs12_rawnet3" - RawNet3 trained on VoxCeleb
  • Any ESPnet speaker model tag
model_path
str
Path to custom model file. If provided, model_config must also be specified
model_config
str
Path to model configuration file (required if using model_path)
use_gpu
bool
default:False
Whether to use GPU for inference

Returns

model
Speech2Embedding
ESPnet speaker embedding model

Metric Calculation

speaker_metric(
    model,
    pred_x,
    gt_x,
    fs
)

Parameters

model
Speech2Embedding
required
Speaker model from speaker_model_setup()
pred_x
numpy.ndarray
required
Predicted audio signal (1D array)
gt_x
numpy.ndarray
required
Reference audio signal (1D array)
fs
int
required
Sampling rate in Hz. Audio is resampled to 16 kHz if needed

Returns

spk_similarity
float
Cosine similarity between speaker embeddings, ranging from -1 to 1 (higher is better)
  • Values close to 1: Same speaker
  • Values close to 0: Unrelated speakers
  • Values close to -1: Opposite characteristics (rare)
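The returned score is a cosine similarity of the two speaker embeddings (computed, per the Technical Details section, on L2-normalized vectors). A minimal numpy sketch of that comparison, using random stand-in vectors rather than real embeddings:

```python
import numpy as np

def cosine_similarity(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity of two embedding vectors, in [-1, 1]."""
    a = emb_a / np.linalg.norm(emb_a)  # L2-normalize each embedding
    b = emb_b / np.linalg.norm(emb_b)
    return float(np.dot(a, b))

v = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(v, v))                          # ≈ 1.0 (identical)
print(cosine_similarity(np.array([1.0, 0.0]),
                        np.array([0.0, 1.0])))          # ≈ 0.0 (orthogonal)
```

Identical embeddings score 1, unrelated (orthogonal) embeddings score 0, matching the value ranges described above.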

Usage Example

import numpy as np
from versa import speaker_model_setup, speaker_metric

# Setup speaker model
model = speaker_model_setup(
    model_tag="espnet/voxcelebs12_rawnet3",
    use_gpu=True
)

# Load audio signals
reference_audio = np.random.random(16000)  # Replace with actual audio
test_audio = np.random.random(16000)       # Replace with actual audio
fs = 16000

# Calculate speaker similarity
result = speaker_metric(
    model=model,
    pred_x=test_audio,
    gt_x=reference_audio,
    fs=fs
)

print(f"Speaker Similarity: {result['spk_similarity']:.3f}")

Emotion Similarity

Model Setup

emo2vec_setup(
    model_tag="default",
    model_path=None,
    use_gpu=False
)

Parameters

model_tag
str
default:"default"
Model identifier. Options:
  • "default" or "base" - emotion2vec base model
model_path
str
Path to custom model checkpoint. Overrides model_tag if provided
use_gpu
bool
default:False
Whether to use GPU for inference

Returns

model
EMO2VEC
emotion2vec model for emotion embedding extraction

Raises

  • ImportError: If emo2vec_versa is not installed
  • ValueError: If model_tag is unknown
  • FileNotFoundError: If model file is not found

Metric Calculation

emo_sim(
    model,
    pred_x,
    gt_x,
    fs
)

Parameters

model
EMO2VEC
required
Emotion model from emo2vec_setup()
pred_x
numpy.ndarray
required
Predicted audio signal (1D array)
gt_x
numpy.ndarray
required
Reference audio signal (1D array)
fs
int
required
Sampling rate in Hz. Audio is resampled to 16 kHz if needed

Returns

emotion_similarity
float
Cosine similarity between emotion embeddings, ranging from -1 to 1 (higher is better)
  • Values close to 1: Similar emotional content
  • Values close to 0: Unrelated emotions

Usage Example

import numpy as np
from versa import emo2vec_setup, emo_sim

# Setup emotion model
model = emo2vec_setup(
    model_tag="default",
    use_gpu=True
)

# Load audio signals
reference_audio = np.random.random(16000)  # Replace with actual audio
test_audio = np.random.random(16000)       # Replace with actual audio
fs = 16000

# Calculate emotion similarity
result = emo_sim(
    model=model,
    pred_x=test_audio,
    gt_x=reference_audio,
    fs=fs
)

print(f"Emotion Similarity: {result['emotion_similarity']:.3f}")

Installation

pip install espnet

The emotion similarity metric additionally requires the emo2vec_versa package; emo2vec_setup() raises ImportError when it is missing (see Raises above).

Technical Details

Sampling Rate: Both metrics require 16 kHz audio. Automatic resampling is performed if needed.
Embedding Method: Both use cosine similarity of L2-normalized embeddings for comparison.
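Both metrics resample internally, but you may want to resample once up front when scoring many pairs. One way to do this with scipy's polyphase resampler (an assumption on my part; the library's internal resampling method may differ):

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

def to_16k(audio: np.ndarray, fs: int) -> np.ndarray:
    """Resample a 1D signal to 16 kHz using polyphase filtering."""
    if fs == 16000:
        return audio
    g = gcd(fs, 16000)  # reduce the up/down ratio before filtering
    return resample_poly(audio, 16000 // g, fs // g)

audio_44k = np.random.random(44100)   # one second at 44.1 kHz (placeholder)
audio_16k = to_16k(audio_44k, 44100)
print(audio_16k.shape)                # (16000,)
```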

Use Cases

Speaker Similarity

  • Voice conversion quality assessment
  • Speaker verification systems
  • Voice cloning evaluation
  • Multi-speaker TTS validation

Emotion Similarity

  • Emotional TTS evaluation
  • Affective computing validation
  • Speech emotion transfer quality
  • Expressive voice synthesis assessment

Model Details

Metric  | Model              | Sampling Rate | Embedding Dim
Speaker | RawNet3 (VoxCeleb) | 16 kHz        | Varies
Emotion | emotion2vec base   | 16 kHz        | Varies

Interpretation Guidelines

Speaker Similarity

  • > 0.8: Very likely same speaker
  • 0.5 - 0.8: Possibly same speaker or very similar voice
  • 0.3 - 0.5: Different speakers with some similarities
  • < 0.3: Clearly different speakers

Emotion Similarity

  • > 0.7: Very similar emotional expression
  • 0.4 - 0.7: Related emotions
  • 0.2 - 0.4: Somewhat different emotions
  • < 0.2: Very different emotional content
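The bands above can be turned into a small lookup when reporting results. A sketch with a hypothetical helper; the cut-offs and labels are exactly the ones listed in the guidelines:

```python
# (threshold, label) pairs, highest band first; scores at or below the
# lowest threshold fall through to the fallback label.
SPEAKER_BANDS = [
    (0.8, "very likely same speaker"),
    (0.5, "possibly same speaker or very similar voice"),
    (0.3, "different speakers with some similarities"),
]
EMOTION_BANDS = [
    (0.7, "very similar emotional expression"),
    (0.4, "related emotions"),
    (0.2, "somewhat different emotions"),
]

def interpret(score: float, bands, fallback: str) -> str:
    """Map a cosine-similarity score onto its guideline band."""
    for threshold, label in bands:
        if score > threshold:
            return label
    return fallback

print(interpret(0.85, SPEAKER_BANDS, "clearly different speakers"))
print(interpret(0.10, EMOTION_BANDS, "very different emotional content"))
```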
