Overview
The metrics module defines two primary categories of evaluation metrics: string-based metrics (STR_METRIC) and numerical metrics (NUM_METRIC). These constants are used throughout VERSA to handle each metric type correctly during aggregation and summarization.
Metric Categories
String Metrics (STR_METRIC)
String metrics return textual values rather than numerical scores. These are typically used for:
- Transcriptions and hypothesis text
- Categorical classifications
- Descriptive evaluations from language models
Location: versa/metrics.py:6-57
Numerical Metrics (NUM_METRIC)
Numerical metrics return scalar values that can be averaged or summed across utterances.
Location: versa/metrics.py:59-219
String Metric List
VAD and Speech Detection
Voice Activity Detection information indicating speech segments.
Language and Speaker Characteristics
Detected or predicted language.
Number of speakers detected (from Qwen2-Audio model).
Speaker gender classification (e.g., “male”, “female”).
Speaker age estimation (e.g., “young”, “middle-aged”, “elderly”).
Detection of speech impairments or disorders.
Voice Properties
Pitch range classification (e.g., “narrow”, “wide”).
Voice pitch level (e.g., “low”, “medium”, “high”).
Voice type or quality descriptor.
Speech volume level (e.g., “quiet”, “normal”, “loud”).
Speech Content
Language detected by Qwen2-Audio model.
Speech register or formality level (e.g., “formal”, “casual”).
Vocabulary complexity assessment (e.g., “simple”, “complex”).
Purpose or intent of the speech (e.g., “informative”, “persuasive”).
Speech Delivery
Emotional tone detected (e.g., “happy”, “sad”, “neutral”).
Speech clarity assessment (e.g., “clear”, “unclear”).
Speech rate classification (e.g., “slow”, “normal”, “fast”).
Speaking style descriptor (e.g., “monotone”, “expressive”).
Detection of non-speech vocalizations.
Recording Environment
Background environment description (e.g., “quiet”, “noisy”, “outdoor”).
Detection of overlapping speech from multiple speakers.
Overall recording quality assessment.
Audio channel type (e.g., “mono”, “stereo”).
Transcription Outputs
Reference transcription text.
Hypothesis transcription from ESPnet ASR model.
Hypothesis transcription from OWSM (Open Whisper-Style Model).
Hypothesis transcription from Whisper ASR model.
ARECHO String Metrics
ARECHO (Autoregressive Evaluation via Chain-based Hypothesis Optimization) provides comprehensive audio analysis:
ARECHO variants of all Qwen2-Audio string metrics (vocabulary_complexity, speaker_age, voice_pitch, etc.).
Room size estimation from room impulse response analysis.
Real language detection (ground truth comparison).
Predicted language from ARECHO analysis.
Numerical Metric List
MOS and Quality Prediction
Overall quality score from DNSMOS (Deep Noise Suppression MOS predictor).
P.808 standard MOS score from DNSMOS.
Non-Intrusive Speech Quality Assessment score.
Universal MOS predictor score (range: 1-5).
UTMOS version 2 score with improved accuracy.
Packet Loss Concealment MOS predictor.
Singing voice MOS predictor.
Self-Supervised Speech Quality Assessment score.
Reference-Based Quality Metrics
Perceptual Evaluation of Speech Quality (range: -0.5 to 4.5).
Short-Time Objective Intelligibility (range: 0 to 1).
Virtual Speech Quality Objective Listener.
ScoreQ reference-based quality score.
ScoreQ no-reference quality score.
Signal-Based Metrics
Mel-Cepstral Distortion (dB) - lower is better.
F0 (fundamental frequency) correlation.
F0 root mean squared error.
Signal-to-Interference Ratio (dB).
Signal-to-Artifact Ratio (dB).
Signal-to-Distortion Ratio (dB).
Scale-invariant Signal-to-Distortion Ratio (dB).
Scale-invariant Signal-to-Noise Ratio (dB).
Sequence Alignment Metrics
WARP-Q sequence alignment quality metric.
Speech BERT similarity score.
Speech BLEU score for discrete speech representation.
Distance between discrete speech token sequences.
WER/CER Metrics
Word Error Rate from ESPnet ASR (sum across utterances).
Character Error Rate from ESPnet ASR.
Word Error Rate from OWSM model.
Character Error Rate from OWSM model.
Word Error Rate from Whisper ASR.
Character Error Rate from Whisper ASR.
WER and CER metrics are summed (not averaged) in the summary to compute total errors across the corpus.
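The difference between summing and averaging can be sketched as follows. This is a minimal illustration, not the scorer's actual code: the metric names (`example_wer`, `example_cer`, `example_mos`), the `SUMMED` set, and the `summarize` helper are all hypothetical stand-ins.

```python
# Hypothetical names: the real WER/CER metric names live in versa/metrics.py.
SUMMED = {"example_wer", "example_cer"}


def summarize(results, metric_names):
    """Aggregate per-utterance scores: sum error counts, average everything else."""
    summary = {}
    for name in metric_names:
        values = [utt[name] for utt in results if name in utt]
        if not values:
            continue  # metric was not computed for any utterance
        total = sum(values)
        summary[name] = total if name in SUMMED else total / len(values)
    return summary


# Two utterances with per-utterance error counts and a MOS-like score.
results = [
    {"example_wer": 2, "example_cer": 5, "example_mos": 3.0},
    {"example_wer": 1, "example_cer": 4, "example_mos": 4.0},
]
summary = summarize(results, ["example_wer", "example_cer", "example_mos"])
print(summary)  # {'example_wer': 3, 'example_cer': 9, 'example_mos': 3.5}
```

Summing suits error counts because a corpus-level error rate is total errors divided by total reference length, which is not the same as the mean of per-utterance rates when utterances differ in length.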
Speaker and Emotion Similarity
Speaker embedding cosine similarity (range: -1 to 1).
Emotion embedding similarity score.
Enhancement and Separation Metrics
PESQ score from TorchAudio SQUIM.
STOI score from TorchAudio SQUIM.
SI-SDR from TorchAudio SQUIM.
MOS prediction from TorchAudio SQUIM.
Signal-to-Noise Ratio from speech enhancement model.
Scale-invariant SNR from enhancement model.
SDR from speech enhancement model.
Scale-invariant SDR from enhancement model.
SAR from speech enhancement model.
Perceptual Audio Metrics
Perceptual Audio Metric score.
Contrastive learning-based Deep Perceptual Audio Metric (CDPAM).
Deep Perceptual Audio Metric.
Additional Metrics
Speech-to-Reverberation Modulation energy Ratio.
Speaking rate in words per minute or phonemes per second.
Anti-spoofing score for deepfake detection.
NOMAD (Non-Matching Audio Distance) score.
CLAP (Contrastive Language-Audio Pretraining) score.
Audio Prompt Adherence score.
Error rate when matching ASR output between generated and reference audio.
Log-weighted Mean Squared Error.
Non-intrusive Reference-free Speech Quality Assessment.
PySepm Metrics
Multiple metrics from the pysepm library (Python Speech Enhancement Performance Measures):
Frequency-weighted segmental SNR.
Weighted Spectral Slope distance.
Log-Likelihood Ratio.
Cepstral Distance.
Signal distortion composite measure.
Background noise composite measure.
Overall quality composite measure.
Coherence and Speech Intelligibility Index (CSII), high-level segments.
CSII, mid-level segments.
CSII, low-level segments.
Normalized Covariance Measure.
Audiobox Aesthetics
Content Enjoyment (CE) score.
Content Usefulness (CU) score.
Production Complexity (PC) score.
Production Quality (PQ) score.
ARECHO Numerical Metrics
ARECHO provides 60+ additional numerical metrics covering all aspects of audio quality:
ARECHO variants include: srmr, voicemos_real_mos, rt60, mcd, f0rmse, f0corr, stoi, pesq, visqol, nomad, nisqa metrics, DNS MOS scores, and all standard quality metrics.
Usage Examples
Filtering by Metric Type
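A minimal sketch of splitting one utterance's results by metric type. The two lists below are small illustrative stand-ins for STR_METRIC and NUM_METRIC (with VERSA installed they would presumably be imported from versa/metrics.py), and `split_by_type` is a hypothetical helper, not part of VERSA.

```python
# Illustrative subsets only; the full lists are defined in versa/metrics.py.
STR_METRIC = ["ref_text", "whisper_hyp_text", "language"]
NUM_METRIC = ["pesq", "stoi", "utmos"]


def split_by_type(scores):
    """Split one utterance's scores into (string, numerical, other) dicts."""
    str_part = {k: v for k, v in scores.items() if k in STR_METRIC}
    num_part = {k: v for k, v in scores.items() if k in NUM_METRIC}
    other = {
        k: v
        for k, v in scores.items()
        if k not in STR_METRIC and k not in NUM_METRIC
    }
    return str_part, num_part, other


utt = {"key": "utt1", "pesq": 3.1, "stoi": 0.92, "whisper_hyp_text": "hello world"}
str_part, num_part, other = split_by_type(utt)
print(str_part)  # {'whisper_hyp_text': 'hello world'}
print(num_part)  # {'pesq': 3.1, 'stoi': 0.92}
print(other)     # {'key': 'utt1'}
```

Keeping an explicit "other" bucket makes bookkeeping fields such as the utterance key visible instead of silently dropping them.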
Computing Averages
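A minimal sketch of averaging numerical metrics across utterances while skipping string metrics. The `NUM_METRIC` list here is an illustrative stand-in for the constant in versa/metrics.py, and `average_numeric` is a hypothetical helper rather than the scorer's own function.

```python
from statistics import mean

# Illustrative subset only; the full list is defined in versa/metrics.py.
NUM_METRIC = ["pesq", "stoi", "utmos"]


def average_numeric(results):
    """Average every numerical metric over the utterances that report it."""
    collected = {}
    for utt in results:
        for name, value in utt.items():
            if name in NUM_METRIC:
                collected.setdefault(name, []).append(value)
    return {name: mean(values) for name, values in collected.items()}


results = [
    {"key": "utt1", "pesq": 3.0, "stoi": 0.5, "ref_text": "a"},
    {"key": "utt2", "pesq": 4.0, "stoi": 1.0, "ref_text": "b"},
]
averages = average_numeric(results)
print(averages)  # {'pesq': 3.5, 'stoi': 0.75}
```

String fields such as `ref_text` never enter the average because membership in `NUM_METRIC` is checked before a value is collected.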
Checking Metric Availability
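A minimal sketch of checking whether a metric name is known and which category it belongs to. Both lists are illustrative stand-ins for the constants in versa/metrics.py, and `metric_kind` is a hypothetical helper.

```python
# Illustrative subsets only; the full lists are defined in versa/metrics.py.
STR_METRIC = ["ref_text", "whisper_hyp_text", "language"]
NUM_METRIC = ["pesq", "stoi", "utmos"]


def metric_kind(name):
    """Return 'string', 'numerical', or None when the name is not registered."""
    if name in STR_METRIC:
        return "string"
    if name in NUM_METRIC:
        return "numerical"
    return None


kinds = {name: metric_kind(name) for name in ["pesq", "language", "not_a_metric"]}
print(kinds)  # {'pesq': 'numerical', 'language': 'string', 'not_a_metric': None}
```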
Notes
The metrics module is primarily used internally by the scorer to determine how to aggregate results. String metrics are excluded from numerical summaries.