## Overview

VERSA provides a flexible framework for implementing custom audio evaluation metrics. This guide walks you through the process of creating, registering, and testing your own metrics.
## Metric Types

VERSA supports three types of metrics based on their computational scope:

- `utterance_metrics`: Utterance-level metrics for individual audio samples
- `sequence_metrics`: Metrics comparing two feature sequences (will be merged with `utterance_metrics` in future versions)
- `corpus_metrics`: Metrics requiring the entire corpus (e.g., FAD, WER)
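The three scopes differ mainly in their call signatures. The sketch below is purely illustrative — the function names and bodies are hypothetical stand-ins, not VERSA's actual interfaces:

```python
import numpy as np

# Hypothetical signatures illustrating the three scopes; VERSA's real
# interfaces may differ in detail.

def utterance_level_metric(model, pred_x, fs, gt_x=None):
    # One score per audio sample
    return {"dummy_utt_score": float(np.mean(np.abs(pred_x)))}

def sequence_level_metric(model, pred_feats, gt_feats):
    # Compares two feature sequences frame by frame
    min_len = min(len(pred_feats), len(gt_feats))
    diff = np.abs(pred_feats[:min_len] - gt_feats[:min_len])
    return {"dummy_seq_distance": float(diff.mean())}

def corpus_level_metric(model, pred_corpus, gt_corpus):
    # Needs every sample at once (e.g., distribution-level statistics)
    pred_mean = np.mean([x.mean() for x in pred_corpus])
    gt_mean = np.mean([x.mean() for x in gt_corpus])
    return {"dummy_corpus_gap": float(abs(pred_mean - gt_mean))}
```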
## Implementation Structure

### Two-Function Pattern
Most metrics follow a two-function pattern: one for model setup and one for inference.
### Create a setup function

The setup function initializes models and loads resources (from `versa/utterance_metrics/speaker.py`):

```python
def speaker_model_setup(
    model_tag="default", model_path=None, model_config=None, use_gpu=False
):
    if use_gpu:
        device = "cuda"
    else:
        device = "cpu"

    if model_path is not None and model_config is not None:
        model = Speech2Embedding(
            model_file=model_path, train_config=model_config, device=device
        )
    else:
        if model_tag == "default":
            model_tag = "espnet/voxcelebs12_rawnet3"
        model = Speech2Embedding.from_pretrained(model_tag=model_tag, device=device)
    return model
```
The `use_gpu` parameter is required by convention and defaults to `False`.
### Create an inference function

The inference function computes the metric (from `versa/utterance_metrics/speaker.py`):

```python
def speaker_metric(model, pred_x, gt_x, fs):
    # Resample if needed
    if fs != 16000:
        gt_x = librosa.resample(gt_x, orig_sr=fs, target_sr=16000)
        pred_x = librosa.resample(pred_x, orig_sr=fs, target_sr=16000)

    # Compute embeddings
    embedding_gen = model(pred_x).squeeze(0).cpu().numpy()
    embedding_gt = model(gt_x).squeeze(0).cpu().numpy()

    # Calculate similarity
    similarity = np.dot(embedding_gen, embedding_gt) / (
        np.linalg.norm(embedding_gen) * np.linalg.norm(embedding_gt)
    )
    return {"spk_similarity": similarity}
```
The inference function receives:

- `model`: The inference model from setup
- `pred_x`: Audio signal to evaluate
- `fs`: Audio sampling rate
- `gt_x`: (Optional) Reference audio signal
- `ref_text`: (Optional) Text transcription or description
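The cosine-similarity step in `speaker_metric` can be sanity-checked in isolation with plain NumPy; a small sketch using random vectors as stand-ins for the model's embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    # Same formula as in speaker_metric: dot product over the product of norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
emb = rng.standard_normal(192)  # stand-in for a speaker embedding
print(cosine_similarity(emb, emb))   # ~1.0 for identical vectors
print(cosine_similarity(emb, -emb))  # ~-1.0 for opposite vectors
```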
### Add a test function

Include a simple test at the end of your file:

```python
if __name__ == "__main__":
    a = np.random.random(16000)
    b = np.random.random(16000)
    model = speaker_model_setup()
    print("metrics: {}".format(speaker_metric(model, a, b, 16000)))
```
### Simplified Single-Function Pattern

For metrics without model setup, use a single inference function (from `versa/utterance_metrics/stoi.py`):

```python
from pystoi import stoi


def stoi_metric(pred_x, gt_x, fs):
    if pred_x.shape[0] != gt_x.shape[0]:
        min_length = min(pred_x.shape[0], gt_x.shape[0])
        pred_x = pred_x[:min_length]
        gt_x = gt_x[:min_length]
    score = stoi(gt_x, pred_x, fs, extended=False)
    return {"stoi": score}
```
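The length-alignment step generalizes to any reference-based metric. A dependency-free sketch of the same pattern, with mean squared error as a placeholder for the real STOI computation:

```python
import numpy as np

def aligned_metric(pred_x, gt_x, fs):
    # Truncate both signals to the shorter length, as in stoi_metric
    if pred_x.shape[0] != gt_x.shape[0]:
        min_length = min(pred_x.shape[0], gt_x.shape[0])
        pred_x = pred_x[:min_length]
        gt_x = gt_x[:min_length]
    # Placeholder score: MSE stands in for the actual metric computation
    return {"mse": float(np.mean((pred_x - gt_x) ** 2))}
```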
## Metric Registration

### Register in `scorer_shared.py`
Add your metric to both `load_score_modules()` and `use_score_modules()` in `versa/scorer_shared.py`:

```python
# In load_score_modules()
elif config["name"] == "my_metric":
    logging.info("Loading My Metric evaluation...")
    from versa.utterance_metrics import my_metric

    score_modules["my_metric"] = {
        "model": my_metric.my_metric_setup(use_gpu=use_gpu),
        "module": my_metric,
    }
    logging.info("Initiated My Metric successfully.")
```

```python
# In use_score_modules()
elif score_type == "my_metric":
    scores = score_modules["my_metric"]["module"].my_metric_metric(
        model=score_modules["my_metric"]["model"],
        pred_x=pred_x,
        gt_x=gt_x,
        fs=pred_fs,
    )
```
Choose a unique key for your metric to avoid conflicts with existing metrics.
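The two hooks boil down to a build step and a dispatch step. A self-contained sketch of the same pattern, with an in-file dummy `my_metric` class standing in for `versa/utterance_metrics/my_metric.py` (not VERSA's actual code):

```python
import logging
import numpy as np

class my_metric:
    """Dummy stand-in for a metric module."""

    @staticmethod
    def my_metric_setup(use_gpu=False):
        return {"device": "cuda" if use_gpu else "cpu"}

    @staticmethod
    def my_metric_metric(model, pred_x, gt_x, fs):
        return {"my_metric_score": float(np.mean(np.abs(pred_x - gt_x)))}

def load_score_modules(configs, use_gpu=False):
    # Build step: instantiate each configured metric once
    score_modules = {}
    for config in configs:
        if config["name"] == "my_metric":
            logging.info("Loading My Metric evaluation...")
            score_modules["my_metric"] = {
                "model": my_metric.my_metric_setup(use_gpu=use_gpu),
                "module": my_metric,
            }
    return score_modules

def use_score_modules(score_modules, pred_x, gt_x, pred_fs):
    # Dispatch step: run every loaded metric on one utterance
    scores = {}
    for score_type, entry in score_modules.items():
        if score_type == "my_metric":
            scores.update(entry["module"].my_metric_metric(
                model=entry["model"], pred_x=pred_x, gt_x=gt_x, fs=pred_fs,
            ))
    return scores
```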
### Register in `metrics.py`

Declare whether your metric returns numerical or string values in `versa/metrics.py`:

```python
NUM_METRIC = [
    "spk_similarity",
    "my_metric_score",  # Add your metric key here
    # ...
]

STR_METRIC = [
    "language",
    # Add here if your metric returns strings
]
```
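The distinction matters for aggregation: numerical keys can be averaged across utterances while string keys can only be collected. A sketch of that routing (the `aggregate` helper is hypothetical, not VERSA code):

```python
NUM_METRIC = ["spk_similarity", "my_metric_score"]
STR_METRIC = ["language"]

def aggregate(results):
    # Hypothetical helper: average numerical metrics, collect string metrics
    summary = {}
    for key in results[0]:
        values = [r[key] for r in results]
        if key in NUM_METRIC:
            summary[key] = sum(values) / len(values)
        elif key in STR_METRIC:
            summary[key] = values
    return summary
```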
### Update documentation

Add your metric to `docs/supported_metrics.md`:

- Mark the Auto-Install column with `x` if the metric is included in the default installation
- Leave it blank if the metric requires external tools from the `tools/` directory
- Provide the metric name, config key, report key, code source, and reference paper
## Testing

### Local testing

Test your metric implementation directly:

```bash
python versa/utterance_metrics/my_metric.py
```
### Integration testing

For metrics included in the default installation, add expected test values and create a dedicated test file for CI:

```
test/test_metrics/test_my_metric.py
```

For optional metrics, add a pipeline test instead:

```
test/test_pipeline/test_my_metric.py
```
### Create an example configuration

Add a YAML configuration file in `egs/separate_metrics/`:

```yaml
score:
  - name: my_metric
    fs: 16000
    use_gpu: true
```
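Once parsed (for instance with PyYAML's `yaml.safe_load`), that file becomes a plain list of dictionaries; a dependency-free sketch with the parsed structure inlined:

```python
# Parsed equivalent of the YAML config above
config = {
    "score": [
        {"name": "my_metric", "fs": 16000, "use_gpu": True},
    ]
}

# Each entry names a metric; the remaining keys are its options
for entry in config["score"]:
    options = {k: v for k, v in entry.items() if k != "name"}
    print(entry["name"], options)
```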
Before submitting, format your code:

```bash
black versa/utterance_metrics/my_metric.py
isort versa/utterance_metrics/my_metric.py
```

VERSA enforces code style checks in CI tests. Using `black` and `isort` ensures your code passes these checks.
## External Dependencies

For metrics requiring specific package versions:

### Fork the repository

Create a fork of the external repository.

### Add a custom interface

Modify the fork to match VERSA's interface requirements.

### Create an installation script

Add a localized install script to the `tools/` directory, e.g., `tools/install_my_metric.sh`:

```bash
#!/bin/bash
git clone https://github.com/yourfork/metric-repo.git
cd metric-repo
pip install -e .
```
## Best Practices

- Naming conventions: Use consistent function names ending with `_setup` and `_metric`
- Error handling: Check for required dependencies with try/except blocks
- Resampling: Always verify and resample audio to the required sample rate
- Return format: Return a dictionary with descriptive keys
- Documentation: Include docstrings explaining parameters and return values
- GPU support: Implement the `use_gpu` parameter even if not strictly needed
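The error-handling practice above usually takes the form of a guarded import. A sketch where `some_optional_lib` is a hypothetical stand-in for the metric's real dependency:

```python
import logging

try:
    import some_optional_lib  # hypothetical dependency; replace with the real package
except ImportError:
    some_optional_lib = None
    logging.warning("some_optional_lib is not installed; my_metric will be unavailable")


def my_metric_setup(use_gpu=False):
    # Fail with a clear message instead of a bare NameError later on
    if some_optional_lib is None:
        raise ImportError("my_metric requires some_optional_lib; see tools/ for install scripts")
    return {"lib": some_optional_lib, "device": "cuda" if use_gpu else "cpu"}
```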
## Example: Complete Custom Metric

A complete custom metric in `versa/utterance_metrics/example_metric.py`:

```python
#!/usr/bin/env python3
import librosa
import numpy as np


def example_metric_setup(model_path=None, use_gpu=False):
    """Setup function for example metric.

    Args:
        model_path: Optional path to custom model
        use_gpu: Whether to use GPU acceleration

    Returns:
        Initialized model or configuration
    """
    device = "cuda" if use_gpu else "cpu"
    # Initialize your model here (load_model is a placeholder for your own loader)
    model = load_model(model_path, device=device)
    return model


def example_metric(model, pred_x, fs, gt_x=None, ref_text=None):
    """Compute example metric.

    Args:
        model: Model from setup function
        pred_x: Predicted audio signal
        fs: Sample rate
        gt_x: Optional ground truth audio
        ref_text: Optional reference text

    Returns:
        Dictionary with metric scores
    """
    # Resample if needed
    target_fs = 16000
    if fs != target_fs:
        pred_x = librosa.resample(pred_x, orig_sr=fs, target_sr=target_fs)
        if gt_x is not None:
            gt_x = librosa.resample(gt_x, orig_sr=fs, target_sr=target_fs)

    # Compute your metric
    score = model.compute(pred_x)
    return {"example_score": score}


if __name__ == "__main__":
    # Simple test
    test_audio = np.random.random(16000)
    model = example_metric_setup()
    result = example_metric(model, test_audio, 16000)
    print(f"Test result: {result}")
```
## Next Steps

- Review existing metrics in `versa/utterance_metrics/` for more examples
- Join the VERSA community to discuss your metric implementation
- Consider contributing your metric back to the main repository