
Overview

The metrics module defines two primary categories of evaluation metrics: string-based metrics (STR_METRIC) and numerical metrics (NUM_METRIC). These constants are used throughout VERSA to properly handle different metric types during aggregation and summarization.

Metric Categories

String Metrics (STR_METRIC)

String metrics return textual values rather than numerical scores. These are typically used for:
  • Transcriptions and hypothesis text
  • Categorical classifications
  • Descriptive evaluations from language models
Location: versa/metrics.py:6-57
from versa.metrics import STR_METRIC

# String metrics are excluded from numerical aggregation
if metric_name in STR_METRIC:
    # Handle as text output
    pass
else:
    # Compute numerical average
    avg = sum(scores) / len(scores)

Numerical Metrics (NUM_METRIC)

Numerical metrics return scalar values that can be averaged or summed across utterances.
Location: versa/metrics.py:59-219
from versa.metrics import NUM_METRIC

# Only numerical metrics are included in summary statistics
if metric_name in NUM_METRIC:
    average_score = sum(values) / len(values)

String Metric List

VAD and Speech Detection

vad_info
string
Voice Activity Detection information indicating speech segments.

Language and Speaker Characteristics

language
string
Detected or predicted language.
qwen_speaker_count
string
Number of speakers detected (from Qwen2-Audio model).
qwen_speaker_gender
string
Speaker gender classification (e.g., “male”, “female”).
qwen_speaker_age
string
Speaker age estimation (e.g., “young”, “middle-aged”, “elderly”).
qwen_speech_impairment
string
Detection of speech impairments or disorders.

Voice Properties

qwen_pitch_range
string
Pitch range classification (e.g., “narrow”, “wide”).
qwen_voice_pitch
string
Voice pitch level (e.g., “low”, “medium”, “high”).
qwen_voice_type
string
Voice type or quality descriptor.
qwen_speech_volume_level
string
Speech volume level (e.g., “quiet”, “normal”, “loud”).

Speech Content

qwen_language
string
Language detected by Qwen2-Audio model.
qwen_speech_register
string
Speech register or formality level (e.g., “formal”, “casual”).
qwen_vocabulary_complexity
string
Vocabulary complexity assessment (e.g., “simple”, “complex”).
qwen_speech_purpose
string
Purpose or intent of the speech (e.g., “informative”, “persuasive”).

Speech Delivery

qwen_speech_emotion
string
Emotional tone detected (e.g., “happy”, “sad”, “neutral”).
qwen_speech_clarity
string
Speech clarity assessment (e.g., “clear”, “unclear”).
qwen_speech_rate
string
Speech rate classification (e.g., “slow”, “normal”, “fast”).
qwen_speaking_style
string
Speaking style descriptor (e.g., “monotone”, “expressive”).
qwen_laughter_crying
string
Detection of non-speech vocalizations.

Recording Environment

qwen_speech_background_environment
string
Background environment description (e.g., “quiet”, “noisy”, “outdoor”).
qwen_overlapping_speech
string
Detection of overlapping speech from multiple speakers.
qwen_recording_quality
string
Overall recording quality assessment.
qwen_channel_type
string
Audio channel type (e.g., “mono”, “stereo”).

Transcription Outputs

ref_text
string
Reference transcription text.
espnet_hyp_text
string
Hypothesis transcription from ESPnet ASR model.
owsm_hyp_text
string
Hypothesis transcription from OWSM (Open Whisper-style Speech Model).
whisper_hyp_text
string
Hypothesis transcription from Whisper ASR model.

ARECHO String Metrics

ARECHO (Audio REcording CHaracterization with Ontologies) provides comprehensive audio analysis:
arecho_qwen_*
string
ARECHO variants of all Qwen2-Audio string metrics (vocabulary_complexity, speaker_age, voice_pitch, etc.).
arecho_rir_room_size
string
Room size estimation from room impulse response analysis.
arecho_real_language
string
Real language detection (ground truth comparison).
arecho_language
string
Predicted language from ARECHO analysis.
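Because all ARECHO metrics share the `arecho_` prefix, a simple prefix check is enough to separate them from the base Qwen2-Audio outputs. A minimal sketch (the `split_arecho` helper is hypothetical, not part of VERSA):

```python
def split_arecho(scores):
    """Partition a score dict into ARECHO-prefixed metrics and base metrics."""
    arecho = {k: v for k, v in scores.items() if k.startswith("arecho_")}
    base = {k: v for k, v in scores.items() if not k.startswith("arecho_")}
    return arecho, base

# Example with a toy score dict
scores = {"qwen_language": "en", "arecho_language": "en", "utmos": 3.8}
arecho, base = split_arecho(scores)
# arecho == {"arecho_language": "en"}
```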

Numerical Metric List

MOS and Quality Prediction

dnsmos_overall
float
Overall quality score from DNSMOS (Deep Noise Suppression MOS predictor).
dnsmos_p808
float
P.808 standard MOS score from DNSMOS.
nisqa
float
Non-Intrusive Speech Quality Assessment score.
utmos
float
UTMOS (UTokyo-SaruLab MOS) predictor score (range: 1-5).
utmosv2
float
UTMOS version 2 score with improved accuracy.
plcmos
float
Packet Loss Concealment MOS predictor.
singmos
float
Singing voice MOS predictor.
sheet_ssqa
float
Self-Supervised Speech Quality Assessment score.

Reference-Based Quality Metrics

pesq
float
Perceptual Evaluation of Speech Quality (range: -0.5 to 4.5).
stoi
float
Short-Time Objective Intelligibility (range: 0 to 1).
visqol
float
Virtual Speech Quality Objective Listener.
scoreq_ref
float
ScoreQ reference-based quality score.
scoreq_nr
float
ScoreQ no-reference quality score.

Signal-Based Metrics

mcd
float
Mel-Cepstral Distortion (dB) - lower is better.
f0_corr
float
F0 (fundamental frequency) correlation.
f0_rmse
float
F0 root mean squared error.
sir
float
Signal-to-Interference Ratio (dB).
sar
float
Signal-to-Artifact Ratio (dB).
sdr
float
Signal-to-Distortion Ratio (dB).
ci-sdr
float
Convolutive transfer function Invariant Signal-to-Distortion Ratio.
si-snr
float
Scale-invariant Signal-to-Noise Ratio (dB).
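SI-SNR is scale-invariant because the target component is taken as the projection of the estimate onto the reference, so rescaling the estimate does not change the score. A minimal pure-Python sketch of the standard definition (VERSA's own implementation may differ in details such as mean removal):

```python
import math

def si_snr(est, ref):
    """Scale-invariant SNR in dB between an estimated and a reference signal."""
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    # Project the estimate onto the reference: the "target" component
    s_target = [dot / ref_energy * r for r in ref]
    # Everything left over counts as noise/distortion
    e_noise = [e - t for e, t in zip(est, s_target)]
    target_energy = sum(t * t for t in s_target)
    noise_energy = sum(n * n for n in e_noise)
    return 10 * math.log10(target_energy / noise_energy)
```

Multiplying `est` by any nonzero constant leaves the result unchanged, which is the point of the scale-invariant formulation.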

Sequence Alignment Metrics

warpq
float
WARP-Q sequence alignment quality metric.
speech_bert
float
Speech BERT similarity score.
speech_bleu
float
Speech BLEU score computed over discrete speech representations.
speech_token_distance
float
Distance between discrete speech token sequences.

WER/CER Metrics

espnet_wer
float
Word Error Rate from ESPnet ASR (sum across utterances).
espnet_cer
float
Character Error Rate from ESPnet ASR.
owsm_wer
float
Word Error Rate from OWSM model.
owsm_cer
float
Character Error Rate from OWSM model.
whisper_wer
float
Word Error Rate from Whisper ASR.
whisper_cer
float
Character Error Rate from Whisper ASR.
WER and CER metrics are summed (not averaged) in the summary to compute total errors across the corpus.
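Summing matters because corpus-level WER is the ratio of total errors to total reference words, which generally differs from the mean of per-utterance WERs. An illustrative sketch (the `corpus_wer` function and its input format are hypothetical, not VERSA's API):

```python
def corpus_wer(utts):
    """Corpus WER: total edit errors divided by total reference words."""
    total_errors = sum(errors for errors, _ in utts)
    total_words = sum(ref_len for _, ref_len in utts)
    return total_errors / total_words

# (errors, reference_word_count) per utterance
utts = [(2, 10), (0, 5)]
corpus_wer(utts)  # 2 / 15, about 0.133
# versus the mean of per-utterance WERs: (0.2 + 0.0) / 2 = 0.1
```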

Speaker and Emotion Similarity

spk_similarity
float
Speaker embedding cosine similarity (range: -1 to 1).
emotion_similarity
float
Emotion embedding similarity score.
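spk_similarity is a cosine similarity between speaker embeddings, so values near 1 indicate the same speaker and values near 0 (or negative) indicate different speakers. The cosine formula itself, as a minimal sketch (the toy vectors stand in for embeddings that VERSA derives from a speaker verification model):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (range: -1 to 1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cosine_similarity([1.0, 2.0], [2.0, 4.0])  # 1.0: parallel vectors
cosine_similarity([1.0, 0.0], [0.0, 1.0])  # 0.0: orthogonal vectors
```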

Enhancement and Separation Metrics

torch_squim_pesq
float
PESQ score from TorchAudio SQUIM.
torch_squim_stoi
float
STOI score from TorchAudio SQUIM.
torch_squim_si_sdr
float
SI-SDR from TorchAudio SQUIM.
torch_squim_mos
float
MOS prediction from TorchAudio SQUIM.
se_snr
float
Signal-to-Noise Ratio from speech enhancement model.
se_si_snr
float
Scale-invariant SNR from enhancement model.
se_sdr
float
SDR from speech enhancement model.
se_ci_sdr
float
Convolutive transfer function invariant SDR from enhancement model.
se_sar
float
SAR from speech enhancement model.

Perceptual Audio Metrics

pam
float
Perceptual Audio Metric score.
cdpam
float
CDPAM: contrastive learning-based Deep Perceptual Audio Metric.
dpam
float
Deep Perceptual Audio Metric.

Additional Metrics

srmr
float
Speech-to-Reverberation Modulation energy Ratio.
speaking_rate
float
Speaking rate in words per minute or phonemes per second.
asvspoof_score
float
Anti-spoofing score for deepfake detection.
nomad
float
NOMAD (Non-Matching Audio Distance) score.
clap_score
float
CLAP (Contrastive Language-Audio Pretraining) score.
apa
float
Audio Prompt Adherence score.
asr_match_error_rate
float
Error rate when matching ASR output between generated and reference audio.
log_wmse
float
Log-weighted Mean Squared Error.
noresqa
float
NORESQA: speech quality assessment using non-matching references.

PySepm Metrics

Multiple metrics from the Python Speech Enhancement and Perception Metrics library:
pysepm_fwsegsnr
float
Frequency-weighted segmental SNR.
pysepm_wss
float
Weighted Spectral Slope distance.
pysepm_llr
float
Log-Likelihood Ratio.
pysepm_cd
float
Cepstral Distance.
pysepm_c_sig
float
Signal distortion composite measure.
pysepm_c_bak
float
Background noise composite measure.
pysepm_c_ovl
float
Overall quality composite measure.
pysepm_csii_high
float
Coherence-based Speech Intelligibility Index (high frequency).
pysepm_csii_mid
float
CSII mid frequency band.
pysepm_csii_low
float
CSII low frequency band.
pysepm_ncm
float
Normalized Covariance Measure.

Audiobox Aesthetics

audiobox_aesthetics_CE
float
Clarity and Expressiveness score.
audiobox_aesthetics_CU
float
Cleanliness and Understandability score.
audiobox_aesthetics_PC
float
Prosody and Coherence score.
audiobox_aesthetics_PQ
float
Perceptual Quality score.

ARECHO Numerical Metrics

ARECHO provides 60+ additional numerical metrics covering all aspects of audio quality:
arecho_*
float
ARECHO variants include: srmr, voicemos_real_mos, rt60, mcd, f0rmse, f0corr, stoi, pesq, visqol, nomad, nisqa metrics, DNS MOS scores, and all standard quality metrics.

Usage Examples

Filtering by Metric Type

from versa.metrics import STR_METRIC, NUM_METRIC

def process_scores(scores):
    """Separate numerical and string metrics."""
    numerical_scores = {}
    text_outputs = {}
    
    for key, value in scores.items():
        if key in STR_METRIC:
            text_outputs[key] = value
        elif key in NUM_METRIC:
            numerical_scores[key] = value
    
    return numerical_scores, text_outputs

Computing Averages

from versa.metrics import NUM_METRIC, STR_METRIC

def compute_summary(score_list):
    """Compute summary scores across utterances."""
    summary = {}
    
    # Get all metric keys from the first utterance
    for key in score_list[0].keys():
        if key == 'key' or key in STR_METRIC:
            continue  # Skip utterance IDs and string metrics
        
        # Collect values from utterances that report this metric
        values = [utt[key] for utt in score_list if key in utt]
        
        # Sum for WER/CER (total errors); average for everything else
        if '_wer' in key or '_cer' in key:
            summary[key] = sum(values)
        else:
            summary[key] = sum(values) / len(values)
    
    return summary

Checking Metric Availability

from versa.metrics import NUM_METRIC, STR_METRIC

def validate_metric(metric_name):
    """Check if a metric is recognized."""
    if metric_name in NUM_METRIC:
        return True, "numerical"
    elif metric_name in STR_METRIC:
        return True, "string"
    else:
        return False, "unknown"

is_valid, metric_type = validate_metric("utmos")
print(f"UTMOS is {metric_type}: {is_valid}")
# Output: UTMOS is numerical: True

Notes

The metrics module is primarily used internally by the scorer to determine how to aggregate results. String metrics are excluded from numerical summaries.
WER and CER metrics are intentionally summed rather than averaged to compute total word/character errors across the corpus.
