Overview
The four metric categories are:

- Independent: no reference required; evaluate audio directly
- Dependent: require matching reference audio for comparison
- Non-match: use non-matching references such as text or embeddings
- Distributional: evaluate distributions across entire datasets
Independent Metrics
Independent metrics evaluate audio quality without requiring any reference. These are ideal for quick assessments and scenarios where ground truth is unavailable.

Common Use Cases
- MOS (Mean Opinion Score) prediction
- Speech quality assessment
- Audio property detection
- Speaker characteristic analysis
Example Metrics
- DNSMOS
- NISQA
- UTMOS
- VAD
Deep Noise Suppression MOS Score - predicts perceived quality without a reference.

Output Keys: dnsmos_overall, dnsmos_p808

Most independent metrics in VERSA are auto-installed and ready to use without additional setup.
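Metrics are typically selected through a YAML configuration (see Configuration). A minimal sketch for enabling DNSMOS could look like the following; the metric identifier and field below are illustrative assumptions, so check the Metrics Reference for the exact schema.

```yaml
# Illustrative only -- verify the metric identifier and fields
# against the Metrics Reference before use.
- name: dnsmos   # assumed identifier for the DNSMOS metric
  fs: 16000      # assumed sampling-rate field
```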
Advanced Independent Metrics
VERSA includes sophisticated independent metrics powered by audio-language models.

Qwen2 Audio Metrics
The Qwen2 Audio model provides detailed analysis across multiple dimensions:

- Speaker Characteristics: gender, age, count, speech impairment
- Voice Properties: pitch, pitch range, voice type, volume level
- Speech Content: language, register, vocabulary complexity, purpose
- Speech Delivery: emotion, clarity, rate, style, emotional vocalizations
- Environment: background, recording quality, channel type

Example configuration:
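As a sketch of what such a configuration entry could look like (the identifier below is an illustrative assumption, not the verified schema):

```yaml
# Illustrative entry -- confirm the exact name and options
# in the Metrics Reference.
- name: qwen2_audio   # assumed identifier for the Qwen2 Audio analysis metric
```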
Speech Enhancement Metrics
Evaluate quality using reference SE models.

Output Keys: se_si_snr, se_ci_sdr, se_sar, se_sdr

PAM (Prompting Audio-Language Models)
Quality assessment using audio-language model prompting.

Output Key: pam

Dependent Metrics
Dependent metrics require a matching reference audio file for comparison. These provide precise measurements of differences between generated and reference audio.

Requirements
Dependent metrics need both:
- --pred: your generated/predicted audio
- --gt: ground truth reference audio with matching file IDs
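As a sketch, a dependent-metric run could be invoked along these lines. The script path, config name, and output flag are assumptions based on the typical VERSA layout; --pred and --gt are the flags described above.

```shell
# Paths and config name are placeholders; --pred and --gt pair files
# by matching file IDs.
python versa/bin/scorer.py \
  --score_config egs/universal.yaml \
  --pred path/to/generated_audio \
  --gt path/to/reference_audio \
  --output_file result.txt
```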
Example Metrics
- MCD & F0
- Signal Metrics
- PESQ & STOI
- Discrete Speech
Mel Cepstral Distortion and F0 analysis for voice conversion and TTS.

Output Keys: mcd, f0_corr, f0_rmse

Set dtw: true for TTS evaluation, dtw: false for codec evaluation.

Perceptual Audio Metrics
DPAM - Deep Perceptual Audio Metric
Learned perceptual distance between audio pairs.

Output Key: dpam_distance

CDPAM - Contrastive DPAM
Contrastive learning-based perceptual metric.

Output Key: cdpam_distance

Log-WMSE
Frequency-weighted mean squared error.

Output Key: log_wmse

Non-match Metrics
Non-match metrics use references that don't need to match the evaluated audio exactly. These include text transcriptions, speaker embeddings, or semantic descriptions.

Reference Types
- Text: transcription or description
- Embeddings: speaker or emotion vectors
- Semantic: content-based comparison
Example Metrics
- WER Metrics
- Speaker Similarity
- TorchAudio SQUIM
- ASR Match
Word/Character Error Rate using ASR systems.

Output Keys: {model}_wer_delete, {model}_wer_insert, {model}_wer_replace, {model}_wer_equal

Advanced Non-match Metrics
Uni-VERSA with References
Versatile assessment with different reference types.

Output Key: universa_score

CLAP Score
Contrastive Language-Audio Pretraining similarity.

Output Key: clap_score

Emotion Similarity
Compare emotional content using emotion2vec.

Output Key: emotion_similarity

Distributional Metrics
Distributional metrics evaluate entire datasets rather than individual samples. They measure statistical properties and distribution characteristics.

Example Metrics
- FAD
- KL Divergence
- Density & Coverage
Frechet Audio Distance - measures distribution similarity.

Available Embeddings:

- default (clap-laion-audio)
- clap-2023, clap-laion-music
- vggish
- mert-{layer_num} (1-12)
- wav2vec2-base-{layer_num} (1-12)
- wav2vec2-large-{layer_num} (1-24)
- hubert-base/large-{layer_num}
- wavlm-base/large-{layer_num}
- whisper-{size} (tiny/small/base/medium/large)
- dac, encodec-24k, encodec-48k
- cdpam-acoustic, cdpam-music
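A sketch of a FAD entry selecting one of these embeddings; the field names are illustrative assumptions, and only the embedding identifiers come from the list above.

```yaml
# Illustrative -- verify field names in the Metrics Reference.
- name: fad              # assumed identifier
  embedding: vggish      # any embedding from the list above (assumed field name)
```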
Output Keys: fad_overall, fad_r2

Choosing the Right Metric Type
Use this decision guide to select appropriate metrics:

Quick Quality Check
Use Independent metrics like DNSMOS, UTMOS, or NISQA
Precise Comparison
Use Dependent metrics like MCD, PESQ, or STOI
Content Verification
Use Non-match metrics like WER or Speaker Similarity
Dataset Analysis
Use Distributional metrics like FAD
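Putting the guide together, a single configuration can mix metrics from all four categories. The entry names below are illustrative assumptions; the dtw switch is the documented MCD option.

```yaml
# Illustrative combined config -- confirm identifiers in the Metrics Reference.
- name: dnsmos        # independent: quick quality check (assumed identifier)
- name: mcd_f0        # dependent: precise comparison (assumed identifier)
  dtw: true           # true for TTS evaluation, false for codec evaluation
- name: whisper_wer   # non-match: content verification (assumed identifier)
- name: fad           # distributional: dataset analysis (assumed identifier)
```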
Next Steps
Configuration
Learn how to configure metrics in YAML
Input Formats
Understand supported input formats
Metrics Reference
Browse all available metrics
Quickstart
See metric configurations in action