VERSA organizes metrics into four distinct categories based on their reference requirements and evaluation approach. Understanding these types helps you choose the right metrics for your evaluation needs.

Overview

The four metric categories are:

  • Independent: no reference required; audio is evaluated directly
  • Dependent: require a matching reference audio file for comparison
  • Non-match: use non-matching references such as text or embeddings
  • Distributional: evaluate distributions across entire datasets

Independent Metrics

Independent metrics evaluate audio quality without requiring any reference. These are ideal for quick assessments and scenarios where ground truth is unavailable.

Common Use Cases

  • MOS (Mean Opinion Score) prediction
  • Speech quality assessment
  • Audio property detection
  • Speaker characteristic analysis

Example Metrics

Deep Noise Suppression MOS (DNSMOS) predicts perceived quality without a reference:
- name: pseudo_mos
  predictor_types: ["dnsmos"]
  predictor_args:
    dnsmos:
      fs: 16000
Output Keys: dnsmos_overall, dnsmos_p808
Most independent metrics in VERSA are auto-installed and ready to use without additional setup.
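
Because independent metrics need no reference, several can run from a single configuration file. A minimal sketch, assuming the utmos predictor is installed alongside dnsmos (predictor names other than dnsmos are assumptions; check the metrics reference for what is available in your environment):
- name: pseudo_mos
  predictor_types: ["dnsmos", "utmos"]  # "utmos" is an assumed extra predictor
  predictor_args:
    dnsmos:
      fs: 16000
    utmos:
      fs: 16000
- name: pam  # quality via audio-language model prompting (shown below)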

Advanced Independent Metrics

VERSA includes sophisticated independent metrics powered by audio-language models:
The Qwen2 Audio model provides detailed analysis across multiple dimensions:
  • Speaker Characteristics: gender, age, count, speech impairment
  • Voice Properties: pitch, pitch range, voice type, volume level
  • Speech Content: language, register, vocabulary complexity, purpose
  • Speech Delivery: emotion, clarity, rate, style, emotional vocalizations
  • Environment: background, recording quality, channel type
Example configuration:
- name: qwen2_speaker_gender_metric
- name: qwen2_voice_pitch_metric
- name: qwen2_speech_emotion_metric
Estimate signal quality using pretrained speech enhancement (SE) models:
- name: se_snr
  model_tag: default
Output Keys: se_si_snr, se_ci_sdr, se_sar, se_sdr
Quality assessment using audio-language model prompting:
- name: pam
Output Key: pam

Dependent Metrics

Dependent metrics require a matching reference audio file for comparison. These provide precise measurements of differences between generated and reference audio.

Requirements

Dependent metrics need both:
  • --pred: Your generated/predicted audio
  • --gt: Ground truth reference audio with file IDs matching the predictions (see the example below)
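
With kaldi-style file lists (see the io: kaldi option in the FAD example later on this page), pairing is done by utterance ID: each line carries an ID and a path, and the IDs in the two lists must match. A hypothetical sketch:

pred.scp (generated audio):
utt_0001 /data/pred/utt_0001.wav
utt_0002 /data/pred/utt_0002.wav

gt.scp (references, same IDs):
utt_0001 /data/gt/utt_0001.wav
utt_0002 /data/gt/utt_0002.wav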

Example Metrics

Mel Cepstral Distortion (MCD) and F0 analysis for voice conversion and TTS:
- name: mcd_f0
  f0min: 40
  f0max: 800
  mcep_shift: 5
  mcep_fftl: 1024
  mcep_dim: 39
  mcep_alpha: 0.466
  seq_mismatch_tolerance: 0.1
  power_threshold: -20
  dtw: true
Output Keys: mcd, f0_corr, f0_rmse
Set dtw: true for TTS evaluation, where outputs are not time-aligned with the reference, and dtw: false for codec evaluation, where outputs are sample-aligned.

Perceptual Audio Metrics

Learned perceptual distance between audio pairs:
- name: dpam
Output Key: dpam_distance
Contrastive learning-based perceptual metric:
- name: cdpam
Output Key: cdpam_distance
Frequency-weighted mean squared error:
- name: log_wmse
Output Key: log_wmse

Non-match Metrics

Non-match metrics use references that don’t need to match the evaluated audio exactly. This includes text transcriptions, speaker embeddings, or semantic descriptions.

Reference Types

Text

Transcription or description

Embeddings

Speaker or emotion vectors

Semantic

Content-based comparison

Example Metrics

Word/Character Error Rate using ASR systems.
# ESPnet ASR
- name: espnet_wer
  model_tag: default
  beam_size: 5
  text_cleaner: whisper_basic

# OWSM (Open Whisper-style Speech Model)
- name: owsm_wer
  model_tag: default
  beam_size: 5
  text_cleaner: whisper_basic

# OpenAI Whisper
- name: whisper_wer
  model_tag: default
  beam_size: 1
  text_cleaner: whisper_basic
Output Keys: {model}_wer_delete, {model}_wer_insert, {model}_wer_replace, {model}_wer_equal
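
These output keys are edit-operation counts rather than a single rate. Assuming they map to the usual operations (replace = S, delete = D, insert = I, equal = C), the standard word error rate follows as

WER = (S + D + I) / (S + D + C)

where the denominator S + D + C is the length of the reference transcript.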

Advanced Non-match Metrics

Versatile assessment with different reference types:
# Text reference
- name: universa_textref

# Audio + Text (full reference)
- name: universa_fullref
Output Key: universa_score
Contrastive Language-Audio Pretraining (CLAP) similarity between audio and a text reference:
- name: clap_score
Output Key: clap_score
Compare emotional content using emotion2vec:
- name: emo2vec_similarity
Output Key: emotion_similarity

Distributional Metrics

Distributional metrics evaluate entire datasets rather than individual samples. They measure statistical properties and distribution characteristics.
Distributional metrics require a corpus of audio files and are currently marked with “in verifying” status; use them with caution in production.

Example Metrics

Fréchet Audio Distance (FAD) measures the similarity between the embedding distributions of two audio sets:
- name: fad
  fad_embedding: default
  cache_dir: versa_cache/fad
  use_inf: true
  io: kaldi
Available Embeddings:
  • default / clap-laion-audio
  • clap-2023, clap-laion-music
  • vggish
  • mert-{layer_num} (1-12)
  • wav2vec2-base-{layer_num} (1-12)
  • wav2vec2-large-{layer_num} (1-24)
  • hubert-base/large-{layer_num}
  • wavlm-base/large-{layer_num}
  • whisper-{size} (tiny/small/base/medium/large)
  • dac, encodec-24k, encodec-48k
  • cdpam-acoustic, cdpam-music
Output Keys: fad_overall, fad_r2
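
The bracketed entries are templates that take a concrete layer or size. A hypothetical variant using a mid-layer MERT embedding (the layer choice is illustrative, not a recommendation):
- name: fad
  fad_embedding: mert-6  # instance of the mert-{layer_num} template
  cache_dir: versa_cache/fad_mert
  use_inf: true
  io: kaldi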

Choosing the Right Metric Type

Use this decision guide to select appropriate metrics:

  • Quick quality check: Independent metrics such as DNSMOS, UTMOS, or NISQA
  • Precise comparison: Dependent metrics such as MCD, PESQ, or STOI
  • Content verification: Non-match metrics such as WER or speaker similarity
  • Dataset analysis: Distributional metrics such as FAD
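
Sample-level metric types can also be mixed in one configuration file. A sketch combining one metric from each of the first three categories (the field subsets are illustrative and assume defaults for anything omitted; distributional metrics such as FAD score a whole corpus and are typically configured separately):
# Independent: no reference needed
- name: pseudo_mos
  predictor_types: ["dnsmos"]
  predictor_args:
    dnsmos:
      fs: 16000

# Dependent: requires matching --gt audio
- name: mcd_f0
  f0min: 40
  f0max: 800
  dtw: true

# Non-match: text reference scored via ASR
- name: whisper_wer
  model_tag: default
  beam_size: 1
  text_cleaner: whisper_basic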

Next Steps

  • Configuration: learn how to configure metrics in YAML
  • Input Formats: understand supported input formats
  • Metrics Reference: browse all available metrics
  • Quickstart: see metric configurations in action
