Configuration

VERSA uses YAML configuration files to define which metrics to compute and how to configure them. This page explains the structure and options available.

Configuration Basics

File Structure

A VERSA configuration file is a YAML list where each item defines a metric:

# Basic structure
- name: metric_name
  parameter1: value1
  parameter2: value2

- name: another_metric
  parameter: value

All configuration examples in this guide are taken from real configs in the egs/ directory.

Required Fields

Every metric configuration must have:

name: The metric identifier (e.g., pseudo_mos, mcd_f0, pesq)
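
As a rough illustration of the two rules above (the file is a list, and every item carries a name), the following Python sketch checks a config that has already been parsed. The function name find_invalid_entries is hypothetical and not part of VERSA; the input is assumed to be whatever a YAML loader returns for the file.

```python
def find_invalid_entries(config):
    """Return (index, reason) pairs for entries that break the basic rules.

    `config` is assumed to be the parsed YAML document: a list of mappings.
    This helper is illustrative only and is not part of VERSA.
    """
    if not isinstance(config, list):
        return [(-1, "top level must be a list")]
    problems = []
    for index, entry in enumerate(config):
        if not isinstance(entry, dict):
            problems.append((index, "entry must be a mapping"))
        elif not isinstance(entry.get("name"), str) or not entry["name"].strip():
            problems.append((index, "missing or empty 'name'"))
    return problems


# Parsed form of a small config: two valid metrics and one without a name.
example = [
    {"name": "pesq"},
    {"name": "pseudo_mos", "predictor_types": ["utmos"]},
    {"beam_size": 1},
]
print(find_invalid_entries(example))
```

Running a check like this before a long scoring job can catch a forgotten name field early.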

Optional Fields

Depending on the metric, you can specify:

Model parameters
Processing options
Model tags/versions
Cache directories
Device settings

Complete Configuration Examples

Full configuration for Text-to-Speech evaluation from egs/demo/tts.yaml:

# TTS Metrics Configuration for VERSA

# mcd f0 related metrics
#  -- mcd: mel cepstral distortion
#  -- f0_corr: f0 correlation
#  -- f0_rmse: f0 root mean square error
- name: mcd_f0
  f0min: 40
  f0max: 800
  mcep_shift: 5
  mcep_fftl: 1024
  mcep_dim: 39
  mcep_alpha: 0.466
  seq_mismatch_tolerance: 0.1
  power_threshold: -20
  dtw: true

# discrete speech metrics
# -- speech_bert: speech bert score
# -- speech_bleu: speech bleu score
# -- speech_token_distance: speech token distance score
- name: discrete_speech

# nomad (reference-based) metric
# -- nomad: nomad reference-based model
- name: nomad
  model_cache: versa_cache/nomad_pt-models

# An overall model on MOS-bench from Sheet toolkit
# --sheet_ssqa: the mos prediction from sheet_ssqa
- name: sheet_ssqa

# pseudo subjective metrics
# -- utmos: UT-MOS score
- name: pseudo_mos
  predictor_types: ["utmos"]
  predictor_args:
    utmos:
      fs: 16000

# Word error rate with OpenAI-Whisper model
# -- whisper_wer: word error rate of openai-whisper
- name: whisper_wer
  model_tag: default
  beam_size: 1
  text_cleaner: whisper_basic

# Audiobox Aesthetics
- name: audiobox_aesthetics
  batch_size: 1
  cache_dir: versa_cache/audiobox

# ASR-match calculating
# --asr_match_error_rate: correct matching words/character counts
- name: asr_match
  model_tag: default
  beam_size: 1
  text_cleaner: whisper_basic

# speaker related metrics
# -- spk_similarity: speaker cosine similarity
- name: speaker
  model_tag: default

# asvspoof related metrics
# -- asvspoof_score: evaluate deepfake likelihood
- name: asvspoof_score

Configuration for audio codec evaluation from egs/speech.yaml:

# codec example yaml config

# mcd f0 related metrics
- name: mcd_f0
  f0min: 40
  f0max: 800
  mcep_shift: 5
  mcep_fftl: 1024
  mcep_dim: 39
  mcep_alpha: 0.466
  seq_mismatch_tolerance: 0.1
  power_threshold: -20
  dtw: false

# signal related metrics
# -- sir: signal to interference ratio
# -- sar: signal to artifact ratio
# -- sdr: signal to distortion ratio
# -- ci-sdr: convolutive transfer function invariant signal to distortion ratio
# -- si-snri: scale-invariant signal to noise ratio improvement
- name: signal_metric

# pesq related metrics
- name: pesq

# stoi related metrics
- name: stoi

# discrete speech metrics
- name: discrete_speech

# pseudo subjective metrics
- name: pseudo_mos
  predictor_types: ["utmos", "dnsmos", "plcmos", "dnsmos_pro_bvcc", "dnsmos_pro_nisqa", "dnsmos_pro_vcc2018"]
  predictor_args:
    utmos:
      fs: 16000
    dnsmos:
      fs: 16000
    plcmos:
      fs: 16000

# speaker related metrics
- name: speaker
  model_tag: default

# torchaudio-squim
- name: squim_ref
- name: squim_no_ref

# Sheet SSQA model
- name: sheet_ssqa

# Speech Enhancement-based Metrics
- name: se_snr
  model_tag: default

# DPAM and CDPAM distance metrics
- name: dpam
- name: cdpam

Configuration for speech enhancement from egs/demo/se.yaml:

# Speech Enhancement Metrics

- name: signal_metric
- name: pesq
- name: stoi

- name: pseudo_mos
  predictor_types: ["dnsmos"]
  predictor_args:
    dnsmos:
      fs: 16000

- name: squim_ref
- name: squim_no_ref

Configuration for singing voice synthesis from egs/demo/svs.yaml:

# Singing Voice Synthesis Metrics

- name: mcd_f0
  f0min: 40
  f0max: 800
  mcep_shift: 5
  mcep_fftl: 1024
  mcep_dim: 39
  mcep_alpha: 0.466
  seq_mismatch_tolerance: 0.1
  power_threshold: -20
  dtw: true

- name: pseudo_mos
  predictor_types: ["singmos", "singmos_pro"]

- name: sheet_ssqa

- name: singer
  model_tag: default

Metric-Specific Configuration

MCD & F0 Metrics

For voice conversion and TTS evaluation:

- name: mcd_f0
  f0min: 40              # Minimum F0 in Hz
  f0max: 800             # Maximum F0 in Hz
  mcep_shift: 5          # Frame shift in ms
  mcep_fftl: 1024        # FFT length
  mcep_dim: 39           # MCEP dimension
  mcep_alpha: 0.466      # All-pass constant (0.466 for 16kHz)
  seq_mismatch_tolerance: 0.1  # Length mismatch tolerance
  power_threshold: -20   # Power threshold in dB
  dtw: true              # Use Dynamic Time Warping

Parameter Details

F0 Range: Set based on expected voice range (typically 40-800 Hz for speech)

mcep_alpha: Frequency warping parameter

0.466 for 16 kHz sampling
0.41 for 12 kHz sampling
0.35 for 8 kHz sampling

DTW:

true for TTS (allows length differences)
false for codec (expects exact alignment)
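
The parameter choices above can be sketched as a small Python helper that fills in mcep_alpha from the sampling rate and dtw from the task. The helper name make_mcd_f0_entry is hypothetical, the alpha values are copied from the list above, and the remaining fields simply reuse the example values from this page (whether they should also change with sampling rate is not covered here).

```python
# mcep_alpha values as listed above, keyed by sampling rate in Hz.
MCEP_ALPHA_BY_FS = {16000: 0.466, 12000: 0.41, 8000: 0.35}


def make_mcd_f0_entry(fs, task):
    """Build an mcd_f0 entry for `task` ("tts" or "codec") at sampling rate `fs`.

    Hypothetical helper; it only assembles a dict that mirrors the YAML entry.
    """
    if fs not in MCEP_ALPHA_BY_FS:
        raise ValueError(f"no mcep_alpha listed for {fs} Hz")
    if task not in ("tts", "codec"):
        raise ValueError(f"unknown task: {task!r}")
    return {
        "name": "mcd_f0",
        "f0min": 40,
        "f0max": 800,
        "mcep_shift": 5,
        "mcep_fftl": 1024,
        "mcep_dim": 39,
        "mcep_alpha": MCEP_ALPHA_BY_FS[fs],
        "seq_mismatch_tolerance": 0.1,
        "power_threshold": -20,
        "dtw": task == "tts",  # DTW on for TTS, off for codec
    }


print(make_mcd_f0_entry(16000, "codec"))
```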

Pseudo MOS Metrics

Configure multiple MOS predictors:

- name: pseudo_mos
  predictor_types: ["utmos", "dnsmos", "plcmos"]
  predictor_args:
    utmos:
      fs: 16000
    dnsmos:
      fs: 16000
    plcmos:
      fs: 16000

Available predictors: utmos, dnsmos, dnsmos_p808, plcmos, singmos, singmos_pro, dnsmos_pro_bvcc, dnsmos_pro_nisqa, dnsmos_pro_vcc2018
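
A typo in predictor_types is easy to make, so a pre-check against the list above may be useful. The following Python sketch builds a pseudo_mos entry and rejects unknown predictor names. The helper make_pseudo_mos_entry is hypothetical, and the predictor set is copied from the list above rather than queried from VERSA.

```python
# Predictor names copied from the "Available predictors" list above.
AVAILABLE_PREDICTORS = {
    "utmos", "dnsmos", "dnsmos_p808", "plcmos", "singmos", "singmos_pro",
    "dnsmos_pro_bvcc", "dnsmos_pro_nisqa", "dnsmos_pro_vcc2018",
}


def make_pseudo_mos_entry(predictors, fs_by_predictor=None):
    """Build a pseudo_mos entry; `fs_by_predictor` maps predictor -> sampling rate."""
    unknown = [p for p in predictors if p not in AVAILABLE_PREDICTORS]
    if unknown:
        raise ValueError(f"unknown predictors: {unknown}")
    entry = {"name": "pseudo_mos", "predictor_types": list(predictors)}
    if fs_by_predictor:
        # Arguments for a predictor that is not enabled are likely a mistake.
        extra = sorted(set(fs_by_predictor) - set(predictors))
        if extra:
            raise ValueError(f"args given for predictors not enabled: {extra}")
        entry["predictor_args"] = {
            name: {"fs": fs} for name, fs in fs_by_predictor.items()
        }
    return entry


print(make_pseudo_mos_entry(["utmos", "dnsmos"], {"utmos": 16000, "dnsmos": 16000}))
```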

ASR-based Metrics

Configure word error rate and transcription metrics:

# ESPnet ASR
- name: espnet_wer
  model_tag: default
  beam_size: 5
  text_cleaner: whisper_basic

# OWSM
- name: owsm_wer
  model_tag: default
  beam_size: 5
  text_cleaner: whisper_basic

# Whisper
- name: whisper_wer
  model_tag: default
  beam_size: 1
  text_cleaner: whisper_basic

Model Tags

ESPnet models: Check ESPnet HuggingFace for available models
Whisper models: tiny, base, small, medium, large, large-v2, large-v3
OWSM models: Multilingual speech models from ESPnet
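
For the Whisper case, the tag list above is short enough to check up front. This Python sketch assembles a whisper_wer entry and accepts either default or one of the listed Whisper sizes. The helper name make_whisper_wer_entry is hypothetical, and treating any other tag as an error is an assumption of this sketch, not documented VERSA behavior.

```python
# Whisper model tags copied from the list above.
WHISPER_TAGS = ("tiny", "base", "small", "medium", "large", "large-v2", "large-v3")


def make_whisper_wer_entry(model_tag="default", beam_size=1):
    """Build a whisper_wer entry, rejecting tags not listed on this page."""
    if model_tag != "default" and model_tag not in WHISPER_TAGS:
        raise ValueError(f"unrecognized Whisper model tag: {model_tag!r}")
    if beam_size < 1:
        raise ValueError("beam_size must be at least 1")
    return {
        "name": "whisper_wer",
        "model_tag": model_tag,
        "beam_size": beam_size,
        "text_cleaner": "whisper_basic",
    }


print(make_whisper_wer_entry("large-v3"))
```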

Speaker/Singer Metrics

Configure embedding-based similarity:

# Speaker similarity
- name: speaker
  model_tag: default

# Singer similarity  
- name: singer
  model_tag: default

Distributional Metrics

Configure FAD with different embeddings:

- name: fad
  fad_embedding: clap-laion-audio
  cache_dir: versa_cache/fad
  use_inf: true
  io: kaldi

CLAP Models

fad_embedding: clap-laion-audio  # Default
fad_embedding: clap-2023
fad_embedding: clap-laion-music

SSL Models

fad_embedding: wav2vec2-base-6
fad_embedding: wav2vec2-large-12
fad_embedding: hubert-base-9
fad_embedding: hubert-large-18
fad_embedding: wavlm-base-7
fad_embedding: wavlm-large-15

Other Models

fad_embedding: vggish
fad_embedding: whisper-small
fad_embedding: dac
fad_embedding: encodec-24k
fad_embedding: cdpam-acoustic
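
The embedding options above can be kept in a small lookup table so that a misspelled fad_embedding is caught before scoring starts. This is a minimal Python sketch: the helper names embedding_family and make_fad_entry are hypothetical, the embedding names are copied from this page, and the default values mirror the fad example above.

```python
# Embedding names copied from this page, grouped by model family.
FAD_EMBEDDINGS = {
    "clap": ["clap-laion-audio", "clap-2023", "clap-laion-music"],
    "ssl": [
        "wav2vec2-base-6", "wav2vec2-large-12", "hubert-base-9",
        "hubert-large-18", "wavlm-base-7", "wavlm-large-15",
    ],
    "other": ["vggish", "whisper-small", "dac", "encodec-24k", "cdpam-acoustic"],
}


def embedding_family(name):
    """Return the family ("clap", "ssl", "other") of an embedding name."""
    for family, names in FAD_EMBEDDINGS.items():
        if name in names:
            return family
    raise ValueError(f"unknown fad_embedding: {name!r}")


def make_fad_entry(embedding="clap-laion-audio", cache_dir="versa_cache/fad"):
    """Build a fad entry after checking the embedding name."""
    embedding_family(embedding)  # raises ValueError on an unknown name
    return {
        "name": "fad",
        "fad_embedding": embedding,
        "cache_dir": cache_dir,
        "use_inf": True,
        "io": "kaldi",
    }


print(embedding_family("hubert-base-9"), make_fad_entry("vggish"))
```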

Audio-Language Models

Configure large models with caching:

# Audiobox Aesthetics
- name: audiobox_aesthetics
  batch_size: 1
  cache_dir: versa_cache/audiobox

# Qwen2 Audio metrics
- name: qwen2_speaker_gender_metric
- name: qwen2_voice_pitch_metric
- name: qwen2_speech_emotion_metric

Using Configuration Files

Command Line Usage

versa-scorer \
  --pred generated.scp \
  --gt reference.scp \
  --score_config configs/tts.yaml \
  --output_file results.json

Referencing Configs

You can use the provided example configs directly:

# Use a demo config
versa-scorer \
  --pred outputs/pred.scp \
  --gt data/gt.scp \
  --score_config egs/demo/tts.yaml \
  --output_file results.json

# Use a specific metric config
versa-scorer \
  --pred outputs/pred.scp \
  --score_config egs/separate_metrics/pseudo_mos.yaml \
  --output_file results.json
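
When scoring is driven from a script, the same commands can be assembled programmatically. The Python sketch below only builds the argument list from the flags shown on this page (--pred, --gt, --score_config, --output_file, --use_gpu); the function name build_scorer_command is hypothetical, and the command is printed rather than executed.

```python
import shlex


def build_scorer_command(pred, score_config, output_file, gt=None, use_gpu=False):
    """Return the versa-scorer argument list; `gt` is omitted when None."""
    cmd = ["versa-scorer", "--pred", pred]
    if gt is not None:
        cmd += ["--gt", gt]
    cmd += ["--score_config", score_config, "--output_file", output_file]
    if use_gpu:
        cmd += ["--use_gpu", "true"]
    return cmd


cmd = build_scorer_command(
    "outputs/pred.scp",
    "egs/demo/tts.yaml",
    "results.json",
    gt="data/gt.scp",
)
# Show the command as it would be typed in a shell.
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once versa-scorer is installed and on the PATH.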

Configuration Tips

Start with Demo Configs

Use the configs in egs/demo/ as starting points:

tts.yaml - Text-to-Speech
se.yaml - Speech Enhancement
svs.yaml - Singing Voice Synthesis
codec.yaml - Audio Codec

Mix and Match Metrics

Combine metrics from different categories:

# Independent metrics (no reference needed)
- name: pseudo_mos
  predictor_types: ["utmos"]

# Dependent metrics (with reference)
- name: pesq
- name: stoi

# Non-match metrics (with text)
- name: whisper_wer
  model_tag: default
  beam_size: 1

Cache Management

Set cache directories to avoid re-downloading models:

- name: fad
  cache_dir: versa_cache/fad

- name: audiobox_aesthetics
  cache_dir: versa_cache/audiobox

- name: nomad
  model_cache: versa_cache/nomad_pt-models
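
To see where a config will store models, the cache settings can be gathered in one pass. This Python sketch assumes the config has already been parsed into a list of dicts and looks only for the two key names used on this page, cache_dir and model_cache; the function name collect_cache_paths is hypothetical.

```python
# Key names used for cache locations in the examples on this page.
CACHE_KEYS = ("cache_dir", "model_cache")


def collect_cache_paths(config):
    """Map metric name -> cache path for entries that set a cache key."""
    paths = {}
    for entry in config:
        for key in CACHE_KEYS:
            if key in entry:
                paths[entry["name"]] = entry[key]
    return paths


# Parsed form of the cache example above, plus one metric without a cache.
config = [
    {"name": "fad", "cache_dir": "versa_cache/fad"},
    {"name": "audiobox_aesthetics", "cache_dir": "versa_cache/audiobox"},
    {"name": "nomad", "model_cache": "versa_cache/nomad_pt-models"},
    {"name": "pesq"},
]
for metric, path in collect_cache_paths(config).items():
    print(f"{metric}: {path}")
```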

GPU vs CPU Configs

VERSA provides separate configs optimized for different hardware:

GPU: egs/speech_gpu.yaml - Includes heavy models
CPU: egs/speech_cpu.yaml - Lighter, CPU-friendly models

Use the --use_gpu true flag when running GPU configs.

Validation and Debugging

Check Configuration

VERSA will validate your config and warn you about:

Missing reference files when dependent metrics are specified
Unsupported parameter combinations
Missing required fields

Common Issues

Reference Required: If you configure dependent metrics (like mcd_f0, pesq, stoi) without providing --gt, VERSA will skip those metrics and log a warning.

# This will skip MCD/F0 metrics
versa-scorer \
  --pred outputs.scp \
  --score_config config_with_mcd.yaml  # Contains mcd_f0 but no --gt provided
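
A quick pre-flight check can reveal which metrics would be skipped before a run starts. The Python sketch below is illustrative only: the function name metrics_skipped_without_gt is hypothetical, and the set of reference-dependent metrics contains just the three examples named above, not VERSA's full list.

```python
# Only the dependent metrics named on this page; the real list is longer.
REFERENCE_METRICS = {"mcd_f0", "pesq", "stoi"}


def metrics_skipped_without_gt(config, gt_provided):
    """Return names of configured metrics that need --gt when none is given."""
    if gt_provided:
        return []
    return [
        entry["name"]
        for entry in config
        if entry.get("name") in REFERENCE_METRICS
    ]


# Parsed form of a config mixing dependent and independent metrics.
config = [
    {"name": "mcd_f0", "dtw": True},
    {"name": "pseudo_mos", "predictor_types": ["utmos"]},
    {"name": "stoi"},
]
print(metrics_skipped_without_gt(config, gt_provided=False))
```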

Advanced Configuration

Custom Sampling Rates

Some metrics need specific sampling rates:

- name: pseudo_mos
  predictor_types: ["utmos"]
  predictor_args:
    utmos:
      fs: 16000  # Force 16kHz resampling

Model Selection

Many metrics support custom model tags:

# Use specific ESPnet model
- name: speaker
  model_tag: espnet/spkrec_model_name

# Use specific Whisper size
- name: whisper_wer
  model_tag: large-v3

Batch Processing

Control batch sizes for large models:

- name: audiobox_aesthetics
  batch_size: 4  # Process 4 samples at once
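
Settings such as model_tag and batch_size often vary between runs while the rest of the config stays fixed. One way to handle this, sketched in Python below, is to apply overrides to a parsed config without modifying the original. The function name override_metric is hypothetical and not part of VERSA.

```python
import copy


def override_metric(config, name, **params):
    """Return a copy of `config` with `params` applied to every entry called `name`."""
    updated = copy.deepcopy(config)
    matched = False
    for entry in updated:
        if entry.get("name") == name:
            entry.update(params)
            matched = True
    if not matched:
        raise KeyError(f"metric not found in config: {name}")
    return updated


base = [
    {"name": "audiobox_aesthetics", "batch_size": 1},
    {"name": "whisper_wer", "model_tag": "default", "beam_size": 1},
]
tuned = override_metric(base, "audiobox_aesthetics", batch_size=4)
tuned = override_metric(tuned, "whisper_wer", model_tag="large-v3")
print(tuned)
```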

Next Steps

Metric Types - Understand metric categories
Input Formats - Learn about supported input formats
Example Configs - Browse all example configurations
Metrics Reference - Detailed metric documentation
