VERSA uses YAML configuration files to define which metrics to compute and how to configure them. This page explains the structure and options available.

Configuration Basics

File Structure

A VERSA configuration file is a YAML list where each item defines a metric:
# Basic structure
- name: metric_name
  parameter1: value1
  parameter2: value2

- name: another_metric
  parameter: value
All configuration examples in this guide are taken from real configs in the egs/ directory.

Required Fields

Every metric configuration must have:
  • name: The metric identifier (e.g., pseudo_mos, mcd_f0, pesq)
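The requirement above can be checked before a scoring run. The following is a minimal sketch, assuming the YAML file has already been parsed into Python objects (for example with PyYAML's `yaml.safe_load`). `find_missing_names` is a hypothetical helper written for illustration and is not part of VERSA.

```python
def find_missing_names(config):
    """Return the positions of entries that lack a usable `name` field."""
    if not isinstance(config, list):
        raise TypeError("expected a list of metric entries")
    bad = []
    for index, entry in enumerate(config):
        # Each entry should be a mapping with a non-empty string `name`.
        name = entry.get("name") if isinstance(entry, dict) else None
        if not isinstance(name, str) or not name.strip():
            bad.append(index)
    return bad


# Parsed form of a small config: a list of mappings, one per metric.
config = [
    {"name": "pesq"},
    {"beam_size": 1},  # missing `name`
    {"name": "whisper_wer", "beam_size": 1},
]
print(find_missing_names(config))
```

An empty result means every entry carries a `name`. Whether that name is a metric VERSA recognizes is a separate check.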

Optional Fields

Depending on the metric, you can specify:
  • Model parameters
  • Processing options
  • Model tags/versions
  • Cache directories
  • Device settings

Complete Configuration Examples

Full configuration for Text-to-Speech evaluation from egs/demo/tts.yaml:
# TTS Metrics Configuration for VERSA

# mcd f0 related metrics
#  -- mcd: mel cepstral distortion
#  -- f0_corr: f0 correlation
#  -- f0_rmse: f0 root mean square error
- name: mcd_f0
  f0min: 40
  f0max: 800
  mcep_shift: 5
  mcep_fftl: 1024
  mcep_dim: 39
  mcep_alpha: 0.466
  seq_mismatch_tolerance: 0.1
  power_threshold: -20
  dtw: true

# discrete speech metrics
# -- speech_bert: speech bert score
# -- speech_bleu: speech bleu score
# -- speech_token_distance: speech token distance score
- name: discrete_speech

# nomad (reference-based) metric
# -- nomad: nomad reference-based model
- name: nomad
  model_cache: versa_cache/nomad_pt-models

# An overall model on MOS-bench from Sheet toolkit
# --sheet_ssqa: the mos prediction from sheet_ssqa
- name: sheet_ssqa

# pseudo subjective metrics
# -- utmos: UT-MOS score
- name: pseudo_mos
  predictor_types: ["utmos"]
  predictor_args:
    utmos:
      fs: 16000

# Word error rate with OpenAI-Whisper model
# -- whisper_wer: word error rate of openai-whisper
- name: whisper_wer
  model_tag: default
  beam_size: 1
  text_cleaner: whisper_basic

# Audiobox Aesthetics
- name: audiobox_aesthetics
  batch_size: 1
  cache_dir: versa_cache/audiobox

# ASR-match calculating
# --asr_match_error_rate: correct matching words/character counts
- name: asr_match
  model_tag: default
  beam_size: 1
  text_cleaner: whisper_basic

# speaker related metrics
# -- spk_similarity: speaker cosine similarity
- name: speaker
  model_tag: default

# asvspoof related metrics
# -- asvspoof_score: evaluate deepfake likelihood
- name: asvspoof_score

Metric-Specific Configuration

MCD & F0 Metrics

For voice conversion and TTS evaluation:
- name: mcd_f0
  f0min: 40              # Minimum F0 in Hz
  f0max: 800             # Maximum F0 in Hz
  mcep_shift: 5          # Frame shift in ms
  mcep_fftl: 1024        # FFT length
  mcep_dim: 39           # MCEP dimension
  mcep_alpha: 0.466      # All-pass constant (0.466 for 16kHz)
  seq_mismatch_tolerance: 0.1  # Length mismatch tolerance
  power_threshold: -20   # Power threshold in dB
  dtw: true              # Use Dynamic Time Warping
F0 Range: Set based on the expected voice range (typically 40-800 Hz for speech).
mcep_alpha: Frequency warping parameter
  • 0.466 for 16 kHz sampling
  • 0.41 for 12 kHz sampling
  • 0.35 for 8 kHz sampling
DTW:
  • true for TTS (allows length differences)
  • false for codec (expects exact alignment)
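The DTW guidance above can be captured in a small helper that builds the `mcd_f0` entry for a given task. This is only a sketch: `mcd_f0_entry` and the task labels `"tts"` and `"codec"` are hypothetical names chosen for illustration, and the remaining values are copied from the example config above rather than tuned.

```python
def mcd_f0_entry(task):
    """Build an mcd_f0 config entry, choosing `dtw` by task."""
    # Assumption: only the two cases named in the guide are handled.
    dtw_by_task = {"tts": True, "codec": False}
    if task not in dtw_by_task:
        raise ValueError(f"unknown task: {task!r}")
    return {
        "name": "mcd_f0",
        "f0min": 40,
        "f0max": 800,
        "mcep_shift": 5,
        "mcep_fftl": 1024,
        "mcep_dim": 39,
        "mcep_alpha": 0.466,
        "seq_mismatch_tolerance": 0.1,
        "power_threshold": -20,
        "dtw": dtw_by_task[task],
    }


print(mcd_f0_entry("codec")["dtw"])
```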

Pseudo MOS Metrics

Configure multiple MOS predictors:
- name: pseudo_mos
  predictor_types: ["utmos", "dnsmos", "plcmos"]
  predictor_args:
    utmos:
      fs: 16000
    dnsmos:
      fs: 16000
    plcmos:
      fs: 16000
Available predictors: utmos, dnsmos, dnsmos_p808, plcmos, singmos, singmos_pro, dnsmos_pro_bvcc, dnsmos_pro_nisqa, dnsmos_pro_vcc2018
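When the predictor list is assembled programmatically, a typo in a predictor name is easy to miss. The sketch below builds a `pseudo_mos` entry and rejects names that are not in the list above. `pseudo_mos_entry` is a hypothetical helper, and the sketch does not check which arguments each predictor accepts.

```python
# Predictor names as listed in this guide.
AVAILABLE_PREDICTORS = {
    "utmos", "dnsmos", "dnsmos_p808", "plcmos", "singmos", "singmos_pro",
    "dnsmos_pro_bvcc", "dnsmos_pro_nisqa", "dnsmos_pro_vcc2018",
}


def pseudo_mos_entry(predictor_args):
    """Build a pseudo_mos entry from a {predictor: args} mapping."""
    unknown = sorted(set(predictor_args) - AVAILABLE_PREDICTORS)
    if unknown:
        raise ValueError(f"unknown predictors: {unknown}")
    return {
        "name": "pseudo_mos",
        # Dict insertion order is preserved, so the order given is kept.
        "predictor_types": list(predictor_args),
        "predictor_args": {k: dict(v) for k, v in predictor_args.items()},
    }


entry = pseudo_mos_entry({"utmos": {"fs": 16000}, "dnsmos": {"fs": 16000}})
print(entry["predictor_types"])
```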

ASR-based Metrics

Configure word error rate and transcription metrics:
# ESPnet ASR
- name: espnet_wer
  model_tag: default
  beam_size: 5
  text_cleaner: whisper_basic

# OWSM
- name: owsm_wer
  model_tag: default
  beam_size: 5
  text_cleaner: whisper_basic

# Whisper
- name: whisper_wer
  model_tag: default
  beam_size: 1
  text_cleaner: whisper_basic
ESPnet models: Check ESPnet HuggingFace for available models
Whisper models: tiny, base, small, medium, large, large-v2, large-v3
OWSM models: Multilingual speech models from ESPnet

Speaker/Singer Metrics

Configure embedding-based similarity:
# Speaker similarity
- name: speaker
  model_tag: default

# Singer similarity  
- name: singer
  model_tag: default

Distributional Metrics

Configure FAD with different embeddings:
- name: fad
  fad_embedding: clap-laion-audio
  cache_dir: versa_cache/fad
  use_inf: true
  io: kaldi
Options for fad_embedding:
  • clap-laion-audio (default)
  • clap-2023
  • clap-laion-music

Audio-Language Models

Configure large models with caching:
# Audiobox Aesthetics
- name: audiobox_aesthetics
  batch_size: 1
  cache_dir: versa_cache/audiobox

# Qwen2 Audio metrics
- name: qwen2_speaker_gender_metric
- name: qwen2_voice_pitch_metric
- name: qwen2_speech_emotion_metric

Using Configuration Files

Command Line Usage

versa-scorer \
  --pred generated.scp \
  --gt reference.scp \
  --score_config configs/tts.yaml \
  --output_file results.json
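When scoring is driven from a Python script, for example to loop over several systems, the same command can be assembled as an argument list. The sketch below uses only the flags that appear in this guide. `build_scorer_cmd` is a hypothetical helper, and it only constructs the command without running it.

```python
import shlex


def build_scorer_cmd(pred, score_config, output_file, gt=None, use_gpu=False):
    """Assemble a versa-scorer invocation as an argument list."""
    cmd = ["versa-scorer", "--pred", pred]
    if gt is not None:
        # The reference list is optional; dependent metrics need it.
        cmd += ["--gt", gt]
    cmd += ["--score_config", score_config, "--output_file", output_file]
    if use_gpu:
        cmd += ["--use_gpu", "true"]
    return cmd


cmd = build_scorer_cmd(
    "generated.scp", "configs/tts.yaml", "results.json", gt="reference.scp"
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once `versa-scorer` is installed and on the `PATH`.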

Referencing Configs

You can use the provided example configs directly:
# Use a demo config
versa-scorer \
  --pred outputs/pred.scp \
  --gt data/gt.scp \
  --score_config egs/demo/tts.yaml \
  --output_file results.json

# Use a specific metric config
versa-scorer \
  --pred outputs/pred.scp \
  --score_config egs/separate_metrics/pseudo_mos.yaml \
  --output_file results.json

Configuration Tips

Use the configs in egs/demo/ as starting points:
  • tts.yaml - Text-to-Speech
  • se.yaml - Speech Enhancement
  • svs.yaml - Singing Voice Synthesis
  • codec.yaml - Audio Codec
Combine metrics from different categories:
# Independent metrics (no reference needed)
- name: pseudo_mos
  predictor_types: ["utmos"]

# Dependent metrics (with reference)
- name: pesq
- name: stoi

# Non-match metrics (with text)
- name: whisper_wer
  model_tag: default
  beam_size: 1
Set cache directories to avoid re-downloading models:
- name: fad
  cache_dir: versa_cache/fad

- name: audiobox_aesthetics
  cache_dir: versa_cache/audiobox

- name: nomad
  model_cache: versa_cache/nomad_pt-models
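Because the cache key differs between metrics (`cache_dir` for most, `model_cache` for nomad), it can help to collect every cache location from a parsed config. That makes it easier to persist or pre-create the directories, for instance in a CI job. The sketch below assumes those two keys are the only ones in use, which holds for the examples in this guide but may not hold for every metric. `cache_dirs` is a hypothetical helper.

```python
def cache_dirs(config):
    """Collect cache locations named by `cache_dir` or `model_cache`."""
    found = []
    for entry in config:
        for key in ("cache_dir", "model_cache"):
            path = entry.get(key)
            if path is not None and path not in found:
                found.append(path)
    return found


config = [
    {"name": "fad", "cache_dir": "versa_cache/fad"},
    {"name": "audiobox_aesthetics", "cache_dir": "versa_cache/audiobox"},
    {"name": "nomad", "model_cache": "versa_cache/nomad_pt-models"},
    {"name": "pesq"},  # no cache needed
]
print(cache_dirs(config))
```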
VERSA provides separate configs optimized for different hardware:
  • GPU: egs/speech_gpu.yaml - Includes heavy models
  • CPU: egs/speech_cpu.yaml - Lighter, CPU-friendly models
Use the --use_gpu true flag when running GPU configs.

Validation and Debugging

Check Configuration

VERSA will validate your config and warn you about:
  • Missing reference files when dependent metrics are specified
  • Unsupported parameter combinations
  • Missing required fields

Common Issues

Reference Required: If you configure dependent metrics (like mcd_f0, pesq, stoi) without providing --gt, VERSA will skip those metrics and log a warning.
# This will skip MCD/F0 metrics
versa-scorer \
  --pred outputs.scp \
  --score_config config_with_mcd.yaml  # Contains mcd_f0 but no --gt provided
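Since skipped metrics are only reported in the log, a pre-flight check can surface the problem earlier. The sketch below lists the dependent metrics in a parsed config when no reference is supplied. `DEPENDENT_METRICS` contains only the three metrics this page names as dependent, so it is deliberately incomplete. `metrics_needing_reference` is a hypothetical helper, not a VERSA function.

```python
# Assumption: only the dependent metrics named in this guide.
DEPENDENT_METRICS = {"mcd_f0", "pesq", "stoi"}


def metrics_needing_reference(config, gt_provided):
    """Return names of configured metrics that would be skipped without --gt."""
    if gt_provided:
        return []
    return [
        entry["name"]
        for entry in config
        if entry.get("name") in DEPENDENT_METRICS
    ]


config = [
    {"name": "pseudo_mos", "predictor_types": ["utmos"]},
    {"name": "mcd_f0", "dtw": True},
    {"name": "pesq"},
]
print(metrics_needing_reference(config, gt_provided=False))
```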

Advanced Configuration

Custom Sampling Rates

Some metrics need specific sampling rates:
- name: pseudo_mos
  predictor_types: ["utmos"]
  predictor_args:
    utmos:
      fs: 16000  # Force 16kHz resampling

Model Selection

Many metrics support custom model tags:
# Use specific ESPnet model
- name: speaker
  model_tag: espnet/spkrec_model_name

# Use specific Whisper size
- name: whisper_wer
  model_tag: large-v3

Batch Processing

Control batch sizes for large models:
- name: audiobox_aesthetics
  batch_size: 4  # Process 4 samples at once
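Settings such as `model_tag` and `batch_size` can also be overridden on a parsed config, which avoids keeping several near-identical YAML files. The sketch below returns a modified copy and leaves the original untouched. `override_metric` is a hypothetical helper, and it does not check that the metric accepts the given parameter.

```python
import copy


def override_metric(config, name, **params):
    """Return a copy of `config` with `params` set on entries called `name`."""
    updated = copy.deepcopy(config)
    matched = False
    for entry in updated:
        if entry.get("name") == name:
            entry.update(params)
            matched = True
    if not matched:
        raise KeyError(f"no metric named {name!r} in config")
    return updated


base = [
    {"name": "whisper_wer", "model_tag": "default", "beam_size": 1},
    {"name": "audiobox_aesthetics", "batch_size": 1},
]
tuned = override_metric(base, "whisper_wer", model_tag="large-v3")
tuned = override_metric(tuned, "audiobox_aesthetics", batch_size=4)
print(tuned)
```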
