This guide walks you through running your first audio evaluation with VERSA. You’ll learn the core workflow and see real results in minutes.

Your First Evaluation

Let’s evaluate audio quality using the built-in test samples.

Step 1: Prepare Your Audio Files

VERSA supports three input formats: matched directories, Kaldi-style SCP files, and soundfile lists. The simplest option is to place audio files with matching filenames in two directories:
reference_audio/
├── sample1.wav
├── sample2.wav
└── sample3.wav

generated_audio/
├── sample1.wav
├── sample2.wav
└── sample3.wav
Use --io dir when running the scorer.
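
With --io dir, VERSA pairs reference and generated audio by filename, so the two directories must contain the same basenames. A quick sanity check before scoring (a standalone sketch, not part of VERSA; check_paired is a hypothetical helper name):

```python
from pathlib import Path

def check_paired(ref_dir: str, gen_dir: str) -> set[str]:
    """Return the basenames present in one directory but not the other."""
    ref = {p.name for p in Path(ref_dir).glob("*.wav")}
    gen = {p.name for p in Path(gen_dir).glob("*.wav")}
    return ref ^ gen  # symmetric difference: unpaired files

# Example: mismatched = check_paired("reference_audio/", "generated_audio/")
```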

Step 2: Choose Your Metrics

Create or use an existing configuration file. Start with the speech example:
egs/speech.yaml
# Mel Cepstral Distortion and F0 metrics
- name: mcd_f0
  f0min: 40
  f0max: 800
  mcep_shift: 5
  mcep_fftl: 1024
  mcep_dim: 39
  mcep_alpha: 0.466

# Signal quality metrics (SDR, SAR, SIR, CI-SDR, SI-SNR)
- name: signal_metric

# PESQ (Perceptual Evaluation of Speech Quality)
- name: pesq

# STOI (Short-Time Objective Intelligibility)
- name: stoi

# Speech BERT Score and BLEU
- name: discrete_speech

# MOS predictors (UTMOS, DNSMOS, PLCMOS)
- name: pseudo_mos
  predictor_types: ["utmos", "dnsmos", "plcmos"]

# Speaker similarity
- name: speaker
  model_tag: default
Explore other configurations in the egs/ directory for different use cases: singing.yaml, general.yaml, or individual metrics in egs/separate_metrics/.
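
Because configuration files are plain YAML lists, you can inspect one programmatically before a run. A small sketch using PyYAML, which the Python API examples in this guide already import:

```python
import yaml

# An inline excerpt of a metric configuration, in the same shape as egs/speech.yaml
config_text = """
- name: pesq
- name: stoi
- name: pseudo_mos
  predictor_types: ["utmos", "dnsmos"]
"""

config = yaml.safe_load(config_text)
metric_names = [entry["name"] for entry in config]
print(metric_names)  # ['pesq', 'stoi', 'pseudo_mos']
```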

Step 3: Run the Evaluation

Execute the scorer with your configuration:
python versa/bin/scorer.py \
    --score_config egs/speech.yaml \
    --gt test/test_samples/test1 \
    --pred test/test_samples/test2 \
    --output_file my_results \
    --io dir
You’ll see progress output:
INFO: The number of utterances = 1
INFO: Processing metrics...
INFO: Summary: {
  'mcd': 5.045226506332897,
  'f0rmse': 20.281004489942777,
  'f0corr': -0.07540903652440145,
  'pesq': 1.5722705125808716,
  'stoi': 0.007625108859647406,
  'utmos': 1.9074374437332153,
  'spk_similarity': 0.895357072353363
}

Step 4: Review the Results

Results are saved to my_results.txt with detailed per-utterance scores:
my_results.txt
utterance_id mcd=5.045 f0rmse=20.281 pesq=1.572 stoi=0.008 utmos=1.907
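
Each line is an utterance ID followed by space-separated key=value pairs, which makes post-processing straightforward. A minimal parser for that layout (parse_result_line is a hypothetical helper, not part of VERSA):

```python
def parse_result_line(line: str) -> tuple[str, dict[str, float]]:
    """Split 'utt_id k1=v1 k2=v2 ...' into (utt_id, {k: float(v)})."""
    utt_id, *pairs = line.split()
    scores = {k: float(v) for k, v in (p.split("=") for p in pairs)}
    return utt_id, scores

utt, scores = parse_result_line("utterance_id mcd=5.045 pesq=1.572 stoi=0.008")
print(utt, scores["pesq"])  # utterance_id 1.572
```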

Common Evaluation Scenarios

Evaluating Without Reference Audio

Some metrics don’t require reference audio (independent metrics):
python versa/bin/scorer.py \
    --score_config egs/separate_metrics/utmos2.yaml \
    --pred generated_audio/ \
    --output_file results \
    --io dir
# No reference needed - evaluates predicted audio only
- name: pseudo_mos
  predictor_types: ["utmos", "dnsmos"]
  
- name: nisqa

- name: vad

- name: speaking_rate

Evaluating with Text Transcriptions

Include text information for ASR-based metrics:

Step 1: Prepare Text File

Create a text file with transcriptions:
transcriptions.txt
sample1 THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG
sample2 SPEECH RECOGNITION EVALUATION METRICS
sample3 VERSATILE EVALUATION OF SPEECH AND AUDIO
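
For intuition, word error rate is the word-level edit distance between hypothesis and reference, normalized by the reference length. A minimal sketch of that computation (VERSA's wer.yaml pipeline additionally runs ASR to obtain the hypothesis, which is not shown here):

```python
def word_error_rate(ref: str, hyp: str) -> float:
    """Levenshtein distance over words, normalized by reference length."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edits needed to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[-1][-1] / len(r)

print(word_error_rate("THE QUICK BROWN FOX", "THE QUICK BROWN DOG"))  # 0.25
```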

Step 2: Run with Text Input

python versa/bin/scorer.py \
    --score_config egs/separate_metrics/wer.yaml \
    --pred generated_audio.scp \
    --gt reference_audio.scp \
    --text transcriptions.txt \
    --output_file wer_results \
    --io soundfile

Using the Command-Line Interface

VERSA installs a console command for convenience:
versa-score \
    --score_config egs/speech.yaml \
    --pred audio.scp \
    --gt reference.scp \
    --output_file results

Python API Usage

For programmatic evaluation, use the Python API directly:
import yaml
from versa.scorer_shared import (
    audio_loader_setup,
    list_scoring,
    load_score_modules,
    load_summary
)

# Load your audio files
gen_files = audio_loader_setup("generated_audio/", io="dir")
gt_files = audio_loader_setup("reference_audio/", io="dir")

# Load configuration
with open("egs/speech.yaml", "r") as f:
    score_config = yaml.full_load(f)

# Initialize scoring modules
score_modules = load_score_modules(
    score_config,
    use_gt=True,
    use_gpu=False
)

# Run evaluation
score_info = list_scoring(
    gen_files,
    score_modules,
    gt_files,
    output_file="results",
    io="dir"
)

# Get summary statistics
summary = load_summary(score_info)
print(f"Average PESQ: {summary['pesq']:.3f}")
print(f"Average STOI: {summary['stoi']:.3f}")
print(f"Average UTMOS: {summary['utmos']:.3f}")

Using Individual Metrics

Import and use metrics directly:
import soundfile as sf
from versa import pesq_metric, stoi_metric, pseudo_mos_setup, pseudo_mos_metric

# Load audio
ref_audio, sr = sf.read("reference.wav")
gen_audio, sr = sf.read("generated.wav")

# Calculate PESQ (requires reference)
pesq_score = pesq_metric(gen_audio, ref_audio, sr)
print(f"PESQ: {pesq_score:.3f}")

# Calculate STOI (requires reference)
stoi_score = stoi_metric(gen_audio, ref_audio, sr)
print(f"STOI: {stoi_score:.3f}")

# Calculate UTMOS (no reference needed)
predictor = pseudo_mos_setup({"predictor_types": ["utmos"]})
utmos_score = pseudo_mos_metric(gen_audio, sr, predictor, "utmos")
print(f"UTMOS: {utmos_score:.3f}")

Batch Evaluation Examples

Evaluate Multiple Configurations

# Test different metric combinations
for config in egs/separate_metrics/*.yaml; do
    echo "Evaluating with $(basename $config)"
    python versa/bin/scorer.py \
        --score_config "$config" \
        --pred audio.scp \
        --gt reference.scp \
        --output_file "results_$(basename $config .yaml)"
done
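
The same sweep can be driven from Python, which makes it easier to collect output paths programmatically. A sketch that only builds the commands without executing them (build_commands is a hypothetical helper):

```python
from pathlib import Path

def build_commands(config_dir: str, pred: str, gt: str) -> list[list[str]]:
    """One scorer invocation per YAML config in config_dir."""
    cmds = []
    for config in sorted(Path(config_dir).glob("*.yaml")):
        out = f"results_{config.stem}"
        cmds.append([
            "python", "versa/bin/scorer.py",
            "--score_config", str(config),
            "--pred", pred,
            "--gt", gt,
            "--output_file", out,
        ])
    return cmds

# Example: for cmd in build_commands("egs/separate_metrics", "audio.scp", "reference.scp"):
#              subprocess.run(cmd, check=True)
```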

GPU-Accelerated Evaluation

python versa/bin/scorer.py \
    --score_config egs/speech_gpu.yaml \
    --pred audio.scp \
    --gt reference.scp \
    --output_file gpu_results \
    --use_gpu true \
    --rank 0
GPU Memory: Some metrics (especially LLM-based ones like Qwen2-Audio) require significant GPU memory. Monitor your GPU usage with nvidia-smi.

Understanding Configuration Files

Each metric in the YAML configuration can have custom parameters:
# Basic metric (no parameters)
- name: pesq

# Metric with parameters
- name: mcd_f0
  f0min: 40          # Minimum F0 in Hz
  f0max: 800         # Maximum F0 in Hz
  mcep_dim: 39       # Number of MCEP dimensions
  dtw: false         # Disable dynamic time warping

# Metric with multiple variants
- name: pseudo_mos
  predictor_types: 
    - utmos
    - dnsmos
    - plcmos
    - dnsmos_pro_bvcc
  predictor_args:
    utmos:
      fs: 16000      # Sample rate
    dnsmos:
      fs: 16000
Explore configurations: Check the egs/ directory for pre-configured setups:
  • speech.yaml - Comprehensive speech evaluation
  • singing.yaml - Singing voice synthesis
  • speech_cpu.yaml - CPU-only metrics
  • speech_gpu.yaml - GPU-accelerated metrics
  • separate_metrics/ - Individual metric examples

Distributed Evaluation (Advanced)

For large datasets, use Slurm-based distributed evaluation:

Step 1: Prepare SCP Files

Ensure your audio is indexed in SCP format:
# List all files with IDs (utterance ID = filename without the .wav extension)
find /path/to/audio -name "*.wav" | \
    awk -F/ '{id=$NF; sub(/\.wav$/, "", id); print id, $0}' > audio.scp
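
The same SCP file can be generated from Python with pathlib, avoiding shell quoting pitfalls (write_scp is a hypothetical helper; adjust the extension to your data):

```python
from pathlib import Path

def write_scp(audio_dir: str, scp_path: str) -> int:
    """Write '<utt_id> <path>' lines for every .wav under audio_dir."""
    entries = sorted(Path(audio_dir).rglob("*.wav"))
    with open(scp_path, "w") as f:
        for p in entries:
            f.write(f"{p.stem} {p}\n")
    return len(entries)

# Example: n = write_scp("/path/to/audio", "audio.scp")
```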

Step 2: Launch Distributed Jobs

./launch_slurm.sh \
  generated_audio.scp \
  reference_audio.scp \
  results_dir/ \
  32  # Number of parallel jobs

Step 3: Aggregate Results

# Combine CPU metric results
cat results_dir/result/*.result.cpu.txt > final_results_cpu.txt

# Combine GPU metric results
cat results_dir/result/*.result.gpu.txt > final_results_gpu.txt
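
After concatenation you can also compute corpus-level averages yourself; a minimal aggregator assuming the space-separated key=value result format shown earlier (average_scores is a hypothetical helper, not part of VERSA):

```python
from collections import defaultdict

def average_scores(lines: list[str]) -> dict[str, float]:
    """Average each metric over all utterance result lines."""
    totals, counts = defaultdict(float), defaultdict(int)
    for line in lines:
        _utt_id, *pairs = line.split()
        for pair in pairs:
            key, value = pair.split("=")
            totals[key] += float(value)
            counts[key] += 1
    return {k: totals[k] / counts[k] for k in totals}

avg = average_scores([
    "utt1 pesq=1.5 stoi=0.5",
    "utt2 pesq=2.5 stoi=1.0",
])
print(avg)  # {'pesq': 2.0, 'stoi': 0.75}
```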

Step 4: Visualize Results

python scripts/show_result.py final_results_cpu.txt
python scripts/show_result.py final_results_gpu.txt

Next Steps

Now that you’ve run your first evaluation, explore:

Metrics Reference

Learn about all 90+ available metrics and their use cases

Configuration Guide

Deep dive into metric configuration and customization

Visualization

Create interactive visualizations of your results

API Reference

Explore the complete Python API documentation
Try the interactive demo: experiment with VERSA in Google Colab, no local installation required: VERSA Colab Demo
