Speaker Similarity
Model Setup
Parameters
Model identifier. Options:
- "default" or "espnet/voxcelebs12_rawnet3": RawNet3 trained on VoxCeleb
- Any ESPnet speaker model tag
Path to custom model file. If provided, model_config must also be specified
Path to model configuration file (required if using model_path)
Whether to use GPU for inference
Returns
ESPnet speaker embedding model
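The pairing rule between the custom model file and its configuration file can be checked before any model is loaded. The sketch below only illustrates that rule. `resolve_speaker_model_source` is a hypothetical helper, not part of the library. The parameter name `model_tag` is assumed by analogy with the emotion setup, and raising `ValueError` is an assumption about how the error would be reported.

```python
from typing import Optional, Tuple

# Tag that the "default" option stands for, per the options listed above.
DEFAULT_SPEAKER_TAG = "espnet/voxcelebs12_rawnet3"


def resolve_speaker_model_source(
    model_tag: str = "default",
    model_path: Optional[str] = None,
    model_config: Optional[str] = None,
) -> Tuple[str, ...]:
    """Hypothetical helper: decide whether to load from files or from a tag."""
    if model_path is not None:
        # A custom model file is only usable together with its config file.
        if model_config is None:
            raise ValueError("model_config is required when model_path is given")
        return ("files", model_path, model_config)
    # No custom file: fall back to a model tag, expanding the "default" alias.
    tag = DEFAULT_SPEAKER_TAG if model_tag == "default" else model_tag
    return ("tag", tag)
```

Calling the helper with no arguments selects the default tag, while passing a model file without a configuration file fails early instead of at load time.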
Metric Calculation
Parameters
Speaker model from speaker_model_setup()
Predicted audio signal (1D array)
Reference audio signal (1D array)
Sampling rate in Hz. Audio is resampled to 16 kHz if needed
Returns
Cosine similarity between speaker embeddings, ranging from -1 to 1 (higher is better)
- Values close to 1: Same speaker
- Values close to 0: Unrelated speakers
- Values close to -1: Opposite characteristics (rare)
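To make the three regimes above concrete, here is a minimal sketch of cosine similarity on toy vectors. It uses only the standard library. The 3-dimensional vectors are invented stand-ins for embeddings, and `cosine_similarity` is an illustrative function, not the library's own implementation.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional stand-ins; real embeddings have many more dimensions.
reference = [1.0, 2.0, 2.0]
same_direction = [2.0, 4.0, 4.0]  # scaled copy of the reference
orthogonal = [2.0, -1.0, 0.0]     # dot product with the reference is 0
opposite = [-1.0, -2.0, -2.0]     # negated reference

print(cosine_similarity(reference, same_direction))  # 1.0
print(cosine_similarity(reference, orthogonal))      # 0.0
print(cosine_similarity(reference, opposite))        # -1.0
```

Because the score depends only on the angle between the vectors, scaling an embedding (as in `same_direction`) does not change the result.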
Usage Example
Emotion Similarity
Model Setup
Parameters
Model identifier. Options:
- "default" or "base": emotion2vec base model
Path to custom model checkpoint. Overrides model_tag if provided
Whether to use GPU for inference
Returns
emotion2vec model for emotion embedding extraction
Raises
- ImportError: If emo2vec_versa is not installed
- ValueError: If model_tag is unknown
- FileNotFoundError: If model file is not found
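The override rule and two of the error conditions above can be pictured with a small resolver. This is a sketch under stated assumptions, not the library's code. `resolve_emotion_checkpoint` and `KNOWN_TAGS` are hypothetical names, and `model_path` is an assumed parameter name for the custom checkpoint path. The real setup function may perform these checks in a different order.

```python
import os
from typing import Optional

# Tags accepted in this sketch, taken from the options listed above.
KNOWN_TAGS = {"default": "base", "base": "base"}


def resolve_emotion_checkpoint(
    model_tag: str = "default",
    model_path: Optional[str] = None,
) -> str:
    """Hypothetical helper: return a checkpoint path or a canonical tag name."""
    if model_path is not None:
        # A custom checkpoint overrides model_tag entirely.
        if not os.path.isfile(model_path):
            raise FileNotFoundError(f"model file not found: {model_path}")
        return model_path
    if model_tag not in KNOWN_TAGS:
        raise ValueError(f"unknown model_tag: {model_tag!r}")
    return KNOWN_TAGS[model_tag]
```

A caller can catch `ValueError` and `FileNotFoundError` separately to tell a mistyped tag from a missing checkpoint file.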
Metric Calculation
Parameters
Emotion model from emo2vec_setup()
Predicted audio signal (1D array)
Reference audio signal (1D array)
Sampling rate in Hz. Audio is resampled to 16 kHz if needed
Returns
Cosine similarity between emotion embeddings, ranging from -1 to 1 (higher is better)
- Values close to 1: Similar emotional content
- Values close to 0: Unrelated emotions
Usage Example
Installation
Technical Details
Sampling Rate: Both metrics require 16 kHz audio. Automatic resampling is performed if needed.
Embedding Method: Both use cosine similarity of L2-normalized embeddings for comparison.
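The two steps named above, bringing audio to 16 kHz and L2-normalizing embeddings, can be sketched in plain Python. This is an illustration only. `resample_linear` is a deliberately naive linear-interpolation resampler with no anti-aliasing filter, and it is not the resampler the library uses. Both function names are hypothetical.

```python
import math
from typing import List, Sequence

TARGET_FS = 16000  # both metrics operate on 16 kHz audio


def resample_linear(
    audio: Sequence[float], fs: int, target_fs: int = TARGET_FS
) -> List[float]:
    """Naive linear-interpolation resampler (illustration only)."""
    if fs == target_fs or len(audio) == 0:
        return list(audio)  # already at the target rate, or nothing to do
    n_out = max(1, round(len(audio) * target_fs / fs))
    out = []
    for i in range(n_out):
        pos = i * fs / target_fs  # fractional index into the input
        left = min(int(pos), len(audio) - 1)
        right = min(left + 1, len(audio) - 1)
        frac = pos - left
        out.append((1.0 - frac) * audio[left] + frac * audio[right])
    return out


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


# One millisecond of a 48 kHz ramp becomes 16 samples at 16 kHz.
ramp_48k = [float(i) for i in range(48)]
ramp_16k = resample_linear(ramp_48k, fs=48000)
print(len(ramp_16k))

# After normalization, a plain dot product is the cosine similarity.
u = l2_normalize([3.0, 4.0])
v = l2_normalize([4.0, 3.0])
print(sum(x * y for x, y in zip(u, v)))
```

In practice a band-limited resampler is preferable when downsampling, since linear interpolation alone lets high-frequency content alias into the 16 kHz signal.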
Use Cases
Speaker Similarity
- Voice conversion quality assessment
- Speaker verification systems
- Voice cloning evaluation
- Multi-speaker TTS validation
Emotion Similarity
- Emotional TTS evaluation
- Affective computing validation
- Speech emotion transfer quality
- Expressive voice synthesis assessment
Model Details
| Metric | Model | Sampling Rate | Embedding Dim |
|---|---|---|---|
| Speaker | RawNet3 (VoxCeleb) | 16 kHz | Varies |
| Emotion | emotion2vec base | 16 kHz | Varies |
Interpretation Guidelines
Speaker Similarity
- > 0.8: Very likely same speaker
- 0.5 - 0.8: Possibly same speaker or very similar voice
- 0.3 - 0.5: Different speakers with some similarities
- < 0.3: Clearly different speakers
Emotion Similarity
- > 0.7: Very similar emotional expression
- 0.4 - 0.7: Related emotions
- 0.2 - 0.4: Somewhat different emotions
- < 0.2: Very different emotional content
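The bands above can be turned into a small lookup for reporting. The sketch below copies the thresholds from the guidelines. The helper name `interpret` and the constant names are hypothetical. How a score that falls exactly on a boundary is classified is an assumption: here 0.8 counts as part of the 0.5 - 0.8 band, and any other boundary value goes to the band above it.

```python
from typing import List, Tuple

# (threshold, label) pairs, highest band first, copied from the guidelines.
SPEAKER_BANDS: List[Tuple[float, str]] = [
    (0.8, "very likely same speaker"),
    (0.5, "possibly same speaker or very similar voice"),
    (0.3, "different speakers with some similarities"),
]
SPEAKER_FLOOR = "clearly different speakers"

EMOTION_BANDS: List[Tuple[float, str]] = [
    (0.7, "very similar emotional expression"),
    (0.4, "related emotions"),
    (0.2, "somewhat different emotions"),
]
EMOTION_FLOOR = "very different emotional content"


def interpret(score: float, bands: List[Tuple[float, str]], floor_label: str) -> str:
    """Map a similarity score to its guideline band."""
    top_threshold, top_label = bands[0]
    if score > top_threshold:  # top band is strictly above its threshold
        return top_label
    for lower_bound, label in bands[1:]:
        if score >= lower_bound:  # assumption: lower bounds are inclusive
            return label
    return floor_label  # everything below the last threshold


print(interpret(0.92, SPEAKER_BANDS, SPEAKER_FLOOR))
print(interpret(0.45, EMOTION_BANDS, EMOTION_FLOOR))
```

Keeping the thresholds in data rather than in branching logic makes it easy to adjust them for a particular dataset or model.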