This guide walks you through running your first audio evaluation with VERSA. You’ll learn the core workflow and see real results in minutes.
## Your First Evaluation

Let's evaluate audio quality using the built-in test samples.
### Prepare Your Audio Files

VERSA supports three input formats:

**Directory (simple).** Place audio files in directories:

```
reference_audio/
├── sample1.wav
├── sample2.wav
└── sample3.wav

generated_audio/
├── sample1.wav
├── sample2.wav
└── sample3.wav
```

Use `--io dir` when running the scorer.

**SCP files (Kaldi-style).** Create `.scp` files mapping utterance IDs to file paths:

```
# reference_audio.scp
sample1 /path/to/reference/sample1.wav
sample2 /path/to/reference/sample2.wav
sample3 /path/to/reference/sample3.wav

# generated_audio.scp
sample1 /path/to/generated/sample1.wav
sample2 /path/to/generated/sample2.wav
sample3 /path/to/generated/sample3.wav
```

Use `--io soundfile` when running the scorer.

**Kaldi ARK (ESPnet).** Use Kaldi-style archive files (compatible with ESPnet):

```
# Your .scp files point to .ark archives
sample1 /path/to/feats.ark:12345
sample2 /path/to/feats.ark:67890
```

Use `--io kaldi` when running the scorer.
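Whichever format you use, VERSA pairs reference and generated audio by utterance ID, so both sides must cover the same IDs. A quick pre-flight check for the directory layout (a hypothetical helper, not part of VERSA):

```python
from pathlib import Path

def unpaired_ids(ref_dir: str, gen_dir: str) -> set:
    """Return utterance IDs present in only one of the two directories."""
    ref_ids = {p.stem for p in Path(ref_dir).glob("*.wav")}
    gen_ids = {p.stem for p in Path(gen_dir).glob("*.wav")}
    return ref_ids ^ gen_ids  # symmetric difference

# Report any unpaired utterances before scoring
if Path("reference_audio").is_dir() and Path("generated_audio").is_dir():
    missing = unpaired_ids("reference_audio", "generated_audio")
    if missing:
        print(f"Unpaired utterance IDs: {sorted(missing)}")
```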
### Choose Your Metrics

Create a configuration file or use an existing one. Start with the speech example:

```yaml
# Mel Cepstral Distortion and F0 metrics
- name: mcd_f0
  f0min: 40
  f0max: 800
  mcep_shift: 5
  mcep_fftl: 1024
  mcep_dim: 39
  mcep_alpha: 0.466

# Signal quality metrics (SDR, SAR, SIR, CI-SDR, SI-SNR)
- name: signal_metric

# PESQ (Perceptual Evaluation of Speech Quality)
- name: pesq

# STOI (Short-Time Objective Intelligibility)
- name: stoi

# Speech BERT Score and BLEU
- name: discrete_speech

# MOS predictors (UTMOS, DNSMOS, PLCMOS)
- name: pseudo_mos
  predictor_types: ["utmos", "dnsmos", "plcmos"]

# Speaker similarity
- name: speaker
  model_tag: default
```

Explore other configurations in the `egs/` directory for different use cases: `singing.yaml`, `general.yaml`, or individual metrics in `egs/separate_metrics/`.
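As a starting point for your own setup, a minimal custom config can mix reference-based and reference-free metrics from the example above (the file name here is illustrative):

```yaml
# my_config.yaml - a minimal custom selection
- name: pesq
- name: stoi
- name: pseudo_mos
  predictor_types: ["utmos"]
```

Pass it to the scorer with `--score_config my_config.yaml`.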
### Run the Evaluation

Execute the scorer with your configuration:

```bash
python versa/bin/scorer.py \
    --score_config egs/speech.yaml \
    --gt test/test_samples/test1 \
    --pred test/test_samples/test2 \
    --output_file my_results \
    --io dir
```

You'll see progress output:

```
INFO: The number of utterances = 1
INFO: Processing metrics...
INFO: Summary: {
    'mcd': 5.045226506332897,
    'f0rmse': 20.281004489942777,
    'f0corr': -0.07540903652440145,
    'pesq': 1.5722705125808716,
    'stoi': 0.007625108859647406,
    'utmos': 1.9074374437332153,
    'spk_similarity': 0.895357072353363
}
```

### Review the Results

Results are saved to `my_results.txt` with detailed per-utterance scores:

```
utterance_id mcd=5.045 f0rmse=20.281 pesq=1.572 stoi=0.008 utmos=1.907
```
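Because each per-utterance line is an utterance ID followed by `key=value` pairs, the results file is easy to post-process. A sketch that averages each metric across utterances (assuming this plain-text layout; details may vary by VERSA version):

```python
from collections import defaultdict

def average_scores(lines):
    """Average metric values across per-utterance result lines."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for line in lines:
        # First token is the utterance ID; the rest are metric=value pairs
        for token in line.split()[1:]:
            key, _, value = token.partition("=")
            totals[key] += float(value)
            counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}

averages = average_scores([
    "utt1 mcd=5.0 pesq=1.5 stoi=0.8",
    "utt2 mcd=6.0 pesq=1.7 stoi=0.9",
])
print(averages)
```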
## Common Evaluation Scenarios

### Evaluating Without Reference Audio

Some metrics (independent metrics) don't require reference audio:

```bash
python versa/bin/scorer.py \
    --score_config egs/separate_metrics/utmos2.yaml \
    --pred generated_audio/ \
    --output_file results \
    --io dir
```

An independent-metrics config runs without ground truth:

```yaml
# No reference needed - evaluates predicted audio only
- name: pseudo_mos
  predictor_types: ["utmos", "dnsmos"]
- name: nisqa
- name: vad
- name: speaking_rate
```
### Evaluating with Text Transcriptions

Include text information for ASR-based metrics.

First, create a text file with transcriptions:

```
sample1 THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG
sample2 SPEECH RECOGNITION EVALUATION METRICS
sample3 VERSATILE EVALUATION OF SPEECH AND AUDIO
```

Then run with text input:

```bash
python versa/bin/scorer.py \
    --score_config egs/separate_metrics/wer.yaml \
    --pred generated_audio.scp \
    --gt reference_audio.scp \
    --text transcriptions.txt \
    --output_file wer_results \
    --io soundfile
```
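If your transcriptions already live in Python, the Kaldi-style text file can be generated directly (a hypothetical helper, not part of VERSA):

```python
def write_kaldi_text(transcripts: dict, path: str) -> None:
    """Write a {utt_id: text} mapping as 'utt_id TEXT' lines, sorted by ID."""
    with open(path, "w") as f:
        for utt_id in sorted(transcripts):
            f.write(f"{utt_id} {transcripts[utt_id]}\n")

write_kaldi_text(
    {
        "sample1": "THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG",
        "sample2": "SPEECH RECOGNITION EVALUATION METRICS",
    },
    "transcriptions.txt",
)
```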
## Using the Command-Line Interface

VERSA installs a console command for convenience:

```bash
versa-score \
    --score_config egs/speech.yaml \
    --pred audio.scp \
    --gt reference.scp \
    --output_file results
```
## Python API Usage

For programmatic evaluation, use the Python API directly:

```python
import yaml

from versa.scorer_shared import (
    audio_loader_setup,
    list_scoring,
    load_score_modules,
    load_summary,
)

# Load your audio files
gen_files = audio_loader_setup("generated_audio/", io="dir")
gt_files = audio_loader_setup("reference_audio/", io="dir")

# Load configuration
with open("egs/speech.yaml", "r") as f:
    score_config = yaml.full_load(f)

# Initialize scoring modules
score_modules = load_score_modules(
    score_config,
    use_gt=True,
    use_gpu=False,
)

# Run evaluation
score_info = list_scoring(
    gen_files,
    score_modules,
    gt_files,
    output_file="results",
    io="dir",
)

# Get summary statistics
summary = load_summary(score_info)
print(f"Average PESQ: {summary['pesq']:.3f}")
print(f"Average STOI: {summary['stoi']:.3f}")
print(f"Average UTMOS: {summary['utmos']:.3f}")
```
## Using Individual Metrics

Import and use metrics directly:

```python
import soundfile as sf

from versa import pesq_metric, stoi_metric, pseudo_mos_setup, pseudo_mos_metric

# Load audio
ref_audio, sr = sf.read("reference.wav")
gen_audio, sr = sf.read("generated.wav")

# Calculate PESQ (requires reference)
pesq_score = pesq_metric(gen_audio, ref_audio, sr)
print(f"PESQ: {pesq_score:.3f}")

# Calculate STOI (requires reference)
stoi_score = stoi_metric(gen_audio, ref_audio, sr)
print(f"STOI: {stoi_score:.3f}")

# Calculate UTMOS (no reference needed)
predictor = pseudo_mos_setup({"predictor_types": ["utmos"]})
utmos_score = pseudo_mos_metric(gen_audio, sr, predictor, "utmos")
print(f"UTMOS: {utmos_score:.3f}")
```
## Batch Evaluation Examples

### Evaluate Multiple Configurations

```bash
# Test different metric combinations
for config in egs/separate_metrics/*.yaml; do
    echo "Evaluating with $(basename "$config")"
    python versa/bin/scorer.py \
        --score_config "$config" \
        --pred audio.scp \
        --gt reference.scp \
        --output_file "results_$(basename "$config" .yaml)"
done
```

### GPU-Accelerated Evaluation

```bash
python versa/bin/scorer.py \
    --score_config egs/speech_gpu.yaml \
    --pred audio.scp \
    --gt reference.scp \
    --output_file gpu_results \
    --use_gpu true \
    --rank 0
```
**GPU memory:** some metrics (especially LLM-based ones like Qwen2-Audio) require significant GPU memory. Monitor your GPU usage with `nvidia-smi`.
## Understanding Configuration Files

Each metric in the YAML configuration can have custom parameters:

```yaml
# Basic metric (no parameters)
- name: pesq

# Metric with parameters
- name: mcd_f0
  f0min: 40       # Minimum F0 in Hz
  f0max: 800      # Maximum F0 in Hz
  mcep_dim: 39    # Number of MCEP dimensions
  dtw: false      # Disable dynamic time warping

# Metric with multiple variants
- name: pseudo_mos
  predictor_types:
    - utmos
    - dnsmos
    - plcmos
    - dnsmos_pro_bvcc
  predictor_args:
    utmos:
      fs: 16000   # Sample rate
    dnsmos:
      fs: 16000
```

**Explore configurations:** check the `egs/` directory for pre-configured setups:

- `speech.yaml` - comprehensive speech evaluation
- `singing.yaml` - singing voice synthesis
- `speech_cpu.yaml` - CPU-only metrics
- `speech_gpu.yaml` - GPU-accelerated metrics
- `separate_metrics/` - individual metric examples
## Distributed Evaluation (Advanced)

For large datasets, use Slurm-based distributed evaluation.

### Prepare SCP Files

Ensure your audio is indexed in SCP format:

```bash
# List all files with IDs
find /path/to/audio -name "*.wav" | \
    awk -F/ '{print $NF, $0}' | \
    sed 's/\.wav//' > audio.scp
```
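The shell pipeline above recovers each utterance ID by stripping `.wav` from the file name; if you prefer Python, the same indexing is explicit and easier to adapt (a sketch, not part of VERSA):

```python
from pathlib import Path

def write_scp(audio_dir: str, scp_path: str, pattern: str = "*.wav") -> int:
    """Write 'utt_id /abs/path' lines for every matching file; return count."""
    files = sorted(Path(audio_dir).glob(pattern))
    with open(scp_path, "w") as f:
        for wav in files:
            f.write(f"{wav.stem} {wav.resolve()}\n")
    return len(files)
```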
### Launch Distributed Jobs

```bash
./launch_slurm.sh \
    generated_audio.scp \
    reference_audio.scp \
    results_dir/ \
    32  # Number of parallel jobs
```

### Aggregate Results

```bash
# Combine CPU metric results
cat results_dir/result/*.result.cpu.txt > final_results_cpu.txt

# Combine GPU metric results
cat results_dir/result/*.result.gpu.txt > final_results_gpu.txt
```

### Visualize Results

```bash
python scripts/show_result.py final_results_cpu.txt
python scripts/show_result.py final_results_gpu.txt
```
## Next Steps

Now that you've run your first evaluation, explore:

- **Metrics Reference** - learn about all 90+ available metrics and their use cases
- **Configuration Guide** - deep dive into metric configuration and customization
- **Visualization** - create interactive visualizations of your results
- **API Reference** - explore the complete Python API documentation

**Try the interactive demo:** experiment with VERSA in Google Colab without any local installation: VERSA Colab Demo