VERSA organizes metrics into four distinct categories based on their reference requirements and evaluation approach. Understanding these types helps you choose the right metrics for your evaluation needs.

Overview

The four metric categories are:

  • Independent: no reference required; audio is evaluated directly
  • Dependent: require a matching reference audio file for comparison
  • Non-match: use non-matching references such as text or embeddings
  • Distributional: evaluate distributions across entire datasets

Independent Metrics

Independent metrics evaluate audio quality without requiring any reference. These are ideal for quick assessments and scenarios where ground truth is unavailable.

Common Use Cases

  • MOS (Mean Opinion Score) prediction
  • Speech quality assessment
  • Audio property detection
  • Speaker characteristic analysis

Example Metrics

Deep Noise Suppression MOS (DNSMOS) predicts perceived quality without a reference:
- name: pseudo_mos
  predictor_types: ["dnsmos"]
  predictor_args:
    dnsmos:
      fs: 16000
Output Keys: dnsmos_overall, dnsmos_p808
Most independent metrics in VERSA are auto-installed and ready to use without additional setup.
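
Because independent metrics need no reference, several can run from a single configuration file. A minimal sketch, assuming the utmos predictor is installed alongside dnsmos (predictor names other than dnsmos are assumptions; check the metrics reference for what is available in your environment):
- name: pseudo_mos
  predictor_types: ["dnsmos", "utmos"]  # "utmos" is an assumed extra predictor
  predictor_args:
    dnsmos:
      fs: 16000
    utmos:
      fs: 16000
- name: pam  # quality via audio-language model prompting (shown below)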

Advanced Independent Metrics

VERSA includes sophisticated independent metrics powered by audio-language models:
The Qwen2 Audio model provides detailed analysis across multiple dimensions:
  • Speaker Characteristics: gender, age, count, speech impairment
  • Voice Properties: pitch, pitch range, voice type, volume level
  • Speech Content: language, register, vocabulary complexity, purpose
  • Speech Delivery: emotion, clarity, rate, style, emotional vocalizations
  • Environment: background, recording quality, channel type
Example configuration:
- name: qwen2_speaker_gender_metric
- name: qwen2_voice_pitch_metric
- name: qwen2_speech_emotion_metric
Estimate signal quality using pretrained speech enhancement (SE) models:
- name: se_snr
  model_tag: default
Output Keys: se_si_snr, se_ci_sdr, se_sar, se_sdr
Quality assessment using audio-language model prompting:
- name: pam
Output Key: pam

Dependent Metrics

Dependent metrics require a matching reference audio file for comparison. These provide precise measurements of differences between generated and reference audio.

Requirements

Dependent metrics need both:
  • --pred: Your generated/predicted audio
  • --gt: Ground truth reference audio with file IDs matching the predictions (see the example below)
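
With kaldi-style file lists (see the io: kaldi option in the FAD example later on this page), pairing is done by utterance ID: each line carries an ID and a path, and the IDs in the two lists must match. A hypothetical sketch:

pred.scp (generated audio):
utt_0001 /data/pred/utt_0001.wav
utt_0002 /data/pred/utt_0002.wav

gt.scp (references, same IDs):
utt_0001 /data/gt/utt_0001.wav
utt_0002 /data/gt/utt_0002.wav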

Example Metrics

Mel Cepstral Distortion (MCD) and F0 analysis for voice conversion and TTS:
- name: mcd_f0
  f0min: 40
  f0max: 800
  mcep_shift: 5
  mcep_fftl: 1024
  mcep_dim: 39
  mcep_alpha: 0.466
  seq_mismatch_tolerance: 0.1
  power_threshold: -20
  dtw: true
Output Keys: mcd, f0_corr, f0_rmse
Set dtw: true for TTS evaluation, where outputs are not time-aligned with the reference, and dtw: false for codec evaluation, where outputs are sample-aligned.

Perceptual Audio Metrics

Learned perceptual distance between audio pairs:
- name: dpam
Output Key: dpam_distance
Contrastive learning-based perceptual metric:
- name: cdpam
Output Key: cdpam_distance
Frequency-weighted mean squared error:
- name: log_wmse
Output Key: log_wmse

Non-match Metrics

Non-match metrics use references that don’t need to match the evaluated audio exactly. This includes text transcriptions, speaker embeddings, or semantic descriptions.

Reference Types

Text

Transcription or description

Embeddings

Speaker or emotion vectors

Semantic

Content-based comparison

Example Metrics

Word/Character Error Rate using ASR systems.
# ESPnet ASR
- name: espnet_wer
  model_tag: default
  beam_size: 5
  text_cleaner: whisper_basic

# OWSM (Open Whisper-style Speech Model)
- name: owsm_wer
  model_tag: default
  beam_size: 5
  text_cleaner: whisper_basic

# OpenAI Whisper
- name: whisper_wer
  model_tag: default
  beam_size: 1
  text_cleaner: whisper_basic
Output Keys: {model}_wer_delete, {model}_wer_insert, {model}_wer_replace, {model}_wer_equal
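
These output keys are edit-operation counts rather than a single rate. Assuming they map to the usual operations (replace = S, delete = D, insert = I, equal = C), the standard word error rate follows as

WER = (S + D + I) / (S + D + C)

where the denominator S + D + C is the length of the reference transcript.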

Advanced Non-match Metrics

Versatile assessment with different reference types:
# Text reference
- name: universa_textref

# Audio + Text (full reference)
- name: universa_fullref
Output Key: universa_score
Contrastive Language-Audio Pretraining (CLAP) similarity between audio and a text reference:
- name: clap_score
Output Key: clap_score
Compare emotional content using emotion2vec:
- name: emo2vec_similarity
Output Key: emotion_similarity

Distributional Metrics

Distributional metrics evaluate entire datasets rather than individual samples. They measure statistical properties and distribution characteristics.
Distributional metrics require a corpus of audio files and are currently marked with “in verifying” status; use them with caution in production.

Example Metrics

Fréchet Audio Distance (FAD) measures the similarity between the embedding distributions of two audio sets:
- name: fad
  fad_embedding: default
  cache_dir: versa_cache/fad
  use_inf: true
  io: kaldi
Available Embeddings:
  • default / clap-laion-audio
  • clap-2023, clap-laion-music
  • vggish
  • mert-{layer_num} (1-12)
  • wav2vec2-base-{layer_num} (1-12)
  • wav2vec2-large-{layer_num} (1-24)
  • hubert-base/large-{layer_num}
  • wavlm-base/large-{layer_num}
  • whisper-{size} (tiny/small/base/medium/large)
  • dac, encodec-24k, encodec-48k
  • cdpam-acoustic, cdpam-music
Output Keys: fad_overall, fad_r2
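
The bracketed entries are templates that take a concrete layer or size. A hypothetical variant using a mid-layer MERT embedding (the layer choice is illustrative, not a recommendation):
- name: fad
  fad_embedding: mert-6  # instance of the mert-{layer_num} template
  cache_dir: versa_cache/fad_mert
  use_inf: true
  io: kaldi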

Choosing the Right Metric Type

Use this decision guide to select appropriate metrics:

  • Quick quality check: Independent metrics such as DNSMOS, UTMOS, or NISQA
  • Precise comparison: Dependent metrics such as MCD, PESQ, or STOI
  • Content verification: Non-match metrics such as WER or speaker similarity
  • Dataset analysis: Distributional metrics such as FAD
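
Sample-level metric types can also be mixed in one configuration file. A sketch combining one metric from each of the first three categories (the field subsets are illustrative and assume defaults for anything omitted; distributional metrics such as FAD score a whole corpus and are typically configured separately):
# Independent: no reference needed
- name: pseudo_mos
  predictor_types: ["dnsmos"]
  predictor_args:
    dnsmos:
      fs: 16000

# Dependent: requires matching --gt audio
- name: mcd_f0
  f0min: 40
  f0max: 800
  dtw: true

# Non-match: text reference scored via ASR
- name: whisper_wer
  model_tag: default
  beam_size: 1
  text_cleaner: whisper_basic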

Next Steps

  • Configuration: learn how to configure metrics in YAML
  • Input Formats: understand supported input formats
  • Metrics Reference: browse all available metrics
  • Quickstart: see metric configurations in action
