
Overview

The metrics module defines two primary categories of evaluation metrics: string-based metrics (STR_METRIC) and numerical metrics (NUM_METRIC). These constants are used throughout VERSA to properly handle different metric types during aggregation and summarization.

Metric Categories

String Metrics (STR_METRIC)

String metrics return textual values rather than numerical scores. These are typically used for:
  • Transcriptions and hypothesis text
  • Categorical classifications
  • Descriptive evaluations from language models
Location: versa/metrics.py:6-57
from versa.metrics import STR_METRIC

# String metrics are excluded from numerical aggregation
if metric_name in STR_METRIC:
    # Handle as text output
    pass
else:
    # Compute numerical average
    avg = sum(scores) / len(scores)

Numerical Metrics (NUM_METRIC)

Numerical metrics return scalar values that can be averaged or summed across utterances.
Location: versa/metrics.py:59-219
from versa.metrics import NUM_METRIC

# Only numerical metrics are included in summary statistics
if metric_name in NUM_METRIC:
    average_score = sum(values) / len(values)

String Metric List

VAD and Speech Detection

vad_info
string
Voice Activity Detection information indicating speech segments.

Language and Speaker Characteristics

language
string
Detected or predicted language.
qwen_speaker_count
string
Number of speakers detected (from Qwen2-Audio model).
qwen_speaker_gender
string
Speaker gender classification (e.g., “male”, “female”).
qwen_speaker_age
string
Speaker age estimation (e.g., “young”, “middle-aged”, “elderly”).
qwen_speech_impairment
string
Detection of speech impairments or disorders.

Voice Properties

qwen_pitch_range
string
Pitch range classification (e.g., “narrow”, “wide”).
qwen_voice_pitch
string
Voice pitch level (e.g., “low”, “medium”, “high”).
qwen_voice_type
string
Voice type or quality descriptor.
qwen_speech_volume_level
string
Speech volume level (e.g., “quiet”, “normal”, “loud”).

Speech Content

qwen_language
string
Language detected by Qwen2-Audio model.
qwen_speech_register
string
Speech register or formality level (e.g., “formal”, “casual”).
qwen_vocabulary_complexity
string
Vocabulary complexity assessment (e.g., “simple”, “complex”).
qwen_speech_purpose
string
Purpose or intent of the speech (e.g., “informative”, “persuasive”).

Speech Delivery

qwen_speech_emotion
string
Emotional tone detected (e.g., “happy”, “sad”, “neutral”).
qwen_speech_clarity
string
Speech clarity assessment (e.g., “clear”, “unclear”).
qwen_speech_rate
string
Speech rate classification (e.g., “slow”, “normal”, “fast”).
qwen_speaking_style
string
Speaking style descriptor (e.g., “monotone”, “expressive”).
qwen_laughter_crying
string
Detection of non-speech vocalizations.

Recording Environment

qwen_speech_background_environment
string
Background environment description (e.g., “quiet”, “noisy”, “outdoor”).
qwen_overlapping_speech
string
Detection of overlapping speech from multiple speakers.
qwen_recording_quality
string
Overall recording quality assessment.
qwen_channel_type
string
Audio channel type (e.g., “mono”, “stereo”).

Transcription Outputs

ref_text
string
Reference transcription text.
espnet_hyp_text
string
Hypothesis transcription from ESPnet ASR model.
owsm_hyp_text
string
Hypothesis transcription from OWSM (Open Whisper-style Speech Model).
whisper_hyp_text
string
Hypothesis transcription from Whisper ASR model.

ARECHO String Metrics

ARECHO (Audio REcording CHaracterization with Ontologies) provides comprehensive audio analysis:
arecho_qwen_*
string
ARECHO variants of all Qwen2-Audio string metrics (vocabulary_complexity, speaker_age, voice_pitch, etc.).
arecho_rir_room_size
string
Room size estimation from room impulse response analysis.
arecho_real_language
string
Real language detection (ground truth comparison).
arecho_language
string
Predicted language from ARECHO analysis.
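Because all ARECHO metrics share the `arecho_` prefix, a simple prefix check is enough to separate them from the base Qwen2-Audio outputs. A minimal sketch (the `split_arecho` helper is hypothetical, not part of VERSA):

```python
def split_arecho(scores):
    """Partition a score dict into ARECHO-prefixed metrics and base metrics."""
    arecho = {k: v for k, v in scores.items() if k.startswith("arecho_")}
    base = {k: v for k, v in scores.items() if not k.startswith("arecho_")}
    return arecho, base

# Example with a toy score dict
scores = {"qwen_language": "en", "arecho_language": "en", "utmos": 3.8}
arecho, base = split_arecho(scores)
# arecho == {"arecho_language": "en"}
```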

Numerical Metric List

MOS and Quality Prediction

dnsmos_overall
float
Overall quality score from DNSMOS (Deep Noise Suppression MOS predictor).
dnsmos_p808
float
P.808 standard MOS score from DNSMOS.
nisqa
float
Non-Intrusive Speech Quality Assessment score.
utmos
float
UTMOS (UTokyo-SaruLab MOS) predictor score (range: 1-5).
utmosv2
float
UTMOS version 2 score with improved accuracy.
plcmos
float
Packet Loss Concealment MOS predictor.
singmos
float
Singing voice MOS predictor.
sheet_ssqa
float
Self-Supervised Speech Quality Assessment score.

Reference-Based Quality Metrics

pesq
float
Perceptual Evaluation of Speech Quality (range: -0.5 to 4.5).
stoi
float
Short-Time Objective Intelligibility (range: 0 to 1).
visqol
float
Virtual Speech Quality Objective Listener.
scoreq_ref
float
ScoreQ reference-based quality score.
scoreq_nr
float
ScoreQ no-reference quality score.

Signal-Based Metrics

mcd
float
Mel-Cepstral Distortion (dB) - lower is better.
f0_corr
float
F0 (fundamental frequency) correlation.
f0_rmse
float
F0 root mean squared error.
sir
float
Signal-to-Interference Ratio (dB).
sar
float
Signal-to-Artifact Ratio (dB).
sdr
float
Signal-to-Distortion Ratio (dB).
ci-sdr
float
Convolutive transfer function Invariant Signal-to-Distortion Ratio.
si-snr
float
Scale-invariant Signal-to-Noise Ratio (dB).
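SI-SNR is scale-invariant because the target component is taken as the projection of the estimate onto the reference, so rescaling the estimate does not change the score. A minimal pure-Python sketch of the standard definition (VERSA's own implementation may differ in details such as mean removal):

```python
import math

def si_snr(est, ref):
    """Scale-invariant SNR in dB between an estimated and a reference signal."""
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    # Project the estimate onto the reference: the "target" component
    s_target = [dot / ref_energy * r for r in ref]
    # Everything left over counts as noise/distortion
    e_noise = [e - t for e, t in zip(est, s_target)]
    target_energy = sum(t * t for t in s_target)
    noise_energy = sum(n * n for n in e_noise)
    return 10 * math.log10(target_energy / noise_energy)
```

Multiplying `est` by any nonzero constant leaves the result unchanged, which is the point of the scale-invariant formulation.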

Sequence Alignment Metrics

warpq
float
WARP-Q sequence alignment quality metric.
speech_bert
float
Speech BERT similarity score.
speech_bleu
float
Speech BLEU score computed over discrete speech representations.
speech_token_distance
float
Distance between discrete speech token sequences.

WER/CER Metrics

espnet_wer
float
Word Error Rate from ESPnet ASR (sum across utterances).
espnet_cer
float
Character Error Rate from ESPnet ASR.
owsm_wer
float
Word Error Rate from OWSM model.
owsm_cer
float
Character Error Rate from OWSM model.
whisper_wer
float
Word Error Rate from Whisper ASR.
whisper_cer
float
Character Error Rate from Whisper ASR.
WER and CER metrics are summed (not averaged) in the summary to compute total errors across the corpus.
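Summing matters because corpus-level WER is the ratio of total errors to total reference words, which generally differs from the mean of per-utterance WERs. An illustrative sketch (the `corpus_wer` function and its input format are hypothetical, not VERSA's API):

```python
def corpus_wer(utts):
    """Corpus WER: total edit errors divided by total reference words."""
    total_errors = sum(errors for errors, _ in utts)
    total_words = sum(ref_len for _, ref_len in utts)
    return total_errors / total_words

# (errors, reference_word_count) per utterance
utts = [(2, 10), (0, 5)]
corpus_wer(utts)  # 2 / 15, about 0.133
# versus the mean of per-utterance WERs: (0.2 + 0.0) / 2 = 0.1
```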

Speaker and Emotion Similarity

spk_similarity
float
Speaker embedding cosine similarity (range: -1 to 1).
emotion_similarity
float
Emotion embedding similarity score.
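spk_similarity is a cosine similarity between speaker embeddings, so values near 1 indicate the same speaker and values near 0 (or negative) indicate different speakers. The cosine formula itself, as a minimal sketch (the toy vectors stand in for embeddings that VERSA derives from a speaker verification model):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (range: -1 to 1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cosine_similarity([1.0, 2.0], [2.0, 4.0])  # 1.0: parallel vectors
cosine_similarity([1.0, 0.0], [0.0, 1.0])  # 0.0: orthogonal vectors
```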

Enhancement and Separation Metrics

torch_squim_pesq
float
PESQ score from TorchAudio SQUIM.
torch_squim_stoi
float
STOI score from TorchAudio SQUIM.
torch_squim_si_sdr
float
SI-SDR from TorchAudio SQUIM.
torch_squim_mos
float
MOS prediction from TorchAudio SQUIM.
se_snr
float
Signal-to-Noise Ratio from speech enhancement model.
se_si_snr
float
Scale-invariant SNR from enhancement model.
se_sdr
float
SDR from speech enhancement model.
se_ci_sdr
float
Convolutive transfer function invariant SDR from enhancement model.
se_sar
float
SAR from speech enhancement model.

Perceptual Audio Metrics

pam
float
Perceptual Audio Metric score.
cdpam
float
CDPAM: contrastive learning-based Deep Perceptual Audio Metric.
dpam
float
Deep Perceptual Audio Metric.

Additional Metrics

srmr
float
Speech-to-Reverberation Modulation energy Ratio.
speaking_rate
float
Speaking rate in words per minute or phonemes per second.
asvspoof_score
float
Anti-spoofing score for deepfake detection.
nomad
float
NOMAD (Non-Matching Audio Distance) score.
clap_score
float
CLAP (Contrastive Language-Audio Pretraining) score.
apa
float
Audio Prompt Adherence score.
asr_match_error_rate
float
Error rate when matching ASR output between generated and reference audio.
log_wmse
float
Log-weighted Mean Squared Error.
noresqa
float
NORESQA: speech quality assessment using non-matching references.

PySepm Metrics

Multiple metrics from the Python Speech Enhancement and Perception Metrics library:
pysepm_fwsegsnr
float
Frequency-weighted segmental SNR.
pysepm_wss
float
Weighted Spectral Slope distance.
pysepm_llr
float
Log-Likelihood Ratio.
pysepm_cd
float
Cepstral Distance.
pysepm_c_sig
float
Signal distortion composite measure.
pysepm_c_bak
float
Background noise composite measure.
pysepm_c_ovl
float
Overall quality composite measure.
pysepm_csii_high
float
Coherence-based Speech Intelligibility Index (high frequency).
pysepm_csii_mid
float
CSII mid frequency band.
pysepm_csii_low
float
CSII low frequency band.
pysepm_ncm
float
Normalized Covariance Measure.

Audiobox Aesthetics

audiobox_aesthetics_CE
float
Clarity and Expressiveness score.
audiobox_aesthetics_CU
float
Cleanliness and Understandability score.
audiobox_aesthetics_PC
float
Prosody and Coherence score.
audiobox_aesthetics_PQ
float
Perceptual Quality score.

ARECHO Numerical Metrics

ARECHO provides 60+ additional numerical metrics covering all aspects of audio quality:
arecho_*
float
ARECHO variants include: srmr, voicemos_real_mos, rt60, mcd, f0rmse, f0corr, stoi, pesq, visqol, nomad, nisqa metrics, DNS MOS scores, and all standard quality metrics.

Usage Examples

Filtering by Metric Type

from versa.metrics import STR_METRIC, NUM_METRIC

def process_scores(scores):
    """Separate numerical and string metrics."""
    numerical_scores = {}
    text_outputs = {}
    
    for key, value in scores.items():
        if key in STR_METRIC:
            text_outputs[key] = value
        elif key in NUM_METRIC:
            numerical_scores[key] = value
    
    return numerical_scores, text_outputs

Computing Averages

from versa.metrics import NUM_METRIC, STR_METRIC

def compute_summary(score_list):
    """Compute summary scores across utterances."""
    summary = {}
    
    # Get all metric keys from the first utterance
    for key in score_list[0].keys():
        if key == 'key' or key in STR_METRIC:
            continue  # Skip utterance IDs and string metrics
        
        # Collect values from utterances that report this metric
        values = [utt[key] for utt in score_list if key in utt]
        
        # Sum for WER/CER (total errors); average for everything else
        if '_wer' in key or '_cer' in key:
            summary[key] = sum(values)
        else:
            summary[key] = sum(values) / len(values)
    
    return summary

Checking Metric Availability

from versa.metrics import NUM_METRIC, STR_METRIC

def validate_metric(metric_name):
    """Check if a metric is recognized."""
    if metric_name in NUM_METRIC:
        return True, "numerical"
    elif metric_name in STR_METRIC:
        return True, "string"
    else:
        return False, "unknown"

is_valid, metric_type = validate_metric("utmos")
print(f"UTMOS is {metric_type}: {is_valid}")
# Output: UTMOS is numerical: True

Notes

The metrics module is primarily used internally by the scorer to determine how to aggregate results. String metrics are excluded from numerical summaries.
WER and CER metrics are intentionally summed rather than averaged to compute total word/character errors across the corpus.
