Overview

The scorer module is the main interface for evaluating speech synthesis quality. It supports three I/O formats (kaldi, soundfile, and dir), optional GPU acceleration, and YAML-based configuration of the metrics to compute.

Command-Line Interface

Main Function

The scorer can be invoked from the command line using the versa-scorer command or by running the scorer module directly with python -m:
python -m versa.bin.scorer \
  --pred generated_audio.scp \
  --score_config config.yaml \
  --output_file results.jsonl

CLI Arguments

pred
string
required
Path to the wav.scp file listing the generated waveforms. How the path is interpreted depends on the --io parameter.
score_config
string
required
Path to YAML configuration file specifying which metrics to compute and their parameters.
gt
string
default:"None"
Path to wav.scp file for ground truth waveforms. Required for reference-based metrics like MCD, PESQ, and speaker similarity.
text
string
default:"None"
Path to ground truth transcription file. Required for text-based metrics like WER/CER. Format: utterance_id transcription
output_file
string
default:"None"
Path to write evaluation results. Results are written in JSONL format (one JSON object per line).
cache_folder
string
default:"None"
Directory path for caching model weights and intermediate results.
use_gpu
bool
default:"false"
Enable GPU acceleration for compatible metrics. Significantly speeds up neural network-based evaluations.
io
string
default:"kaldi"
I/O interface to use for loading audio files. Options:
  • kaldi: Kaldi-style wav.scp with pipes (e.g., sox file.wav -t wav - |)
  • soundfile: Simple wav.scp with file paths
  • dir: Directory containing audio files
verbose
int
default:"1"
Logging verbosity level. Levels:
  • 0: Warnings only
  • 1: Info messages (default)
  • 2: Debug messages
rank
int
default:"0"
Process rank for batch processing. Used to specify GPU device when use_gpu=True. GPU device is calculated as rank % num_gpus.
no_match
bool
default:"false"
Skip matching between generated and ground truth files. Useful when files have different naming schemes.
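
The rank-to-device rule described for the rank argument can be sketched in plain Python. This is only an illustration of the arithmetic: the helper name device_for_rank is hypothetical and not part of VERSA, and in practice num_gpus would typically come from torch.cuda.device_count().

```python
def device_for_rank(rank: int, num_gpus: int) -> int:
    """Return the GPU index for a process rank (hypothetical helper, not a VERSA API)."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return rank % num_gpus


# Four processes sharing two GPUs alternate between device 0 and device 1.
assignments = [device_for_rank(rank, num_gpus=2) for rank in range(4)]
print(assignments)
```

With more processes than GPUs, several ranks land on the same device, so memory use per device should be checked before launching many workers.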

Core Functions

audio_loader_setup

Sets up audio file loading based on the specified I/O interface.
from versa.scorer_shared import audio_loader_setup

audio_files = audio_loader_setup(
    audio="path/to/wav.scp",
    io="kaldi"
)
audio
string
required
Path to audio file list or directory.
io
string
required
I/O interface type: kaldi, soundfile, or dir.
Returns: Dictionary mapping utterance IDs to file paths or audio data.
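
To make the returned mapping concrete, here is a minimal sketch of how a simple wav.scp (the soundfile style) could be parsed into a dictionary. It assumes each line has the form "utterance_id path" separated by whitespace. The function read_simple_scp is a hypothetical stand-in and does not reproduce what audio_loader_setup does internally.

```python
import io


def read_simple_scp(lines):
    """Parse 'utterance_id path' lines into a dict (illustrative only)."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        utt_id, path = line.split(maxsplit=1)
        mapping[utt_id] = path
    return mapping


# In-memory stand-in for a wav.scp file; the paths are made up.
example = io.StringIO("utt1 /data/utt1.wav\nutt2 /data/utt2.wav\n")
files = read_simple_scp(example)
print(files)
```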

load_score_modules

Loads and initializes scoring modules based on configuration.
from versa.scorer_shared import load_score_modules
import yaml

with open("config.yaml") as f:
    score_config = yaml.full_load(f)

modules = load_score_modules(
    score_config=score_config,
    use_gt=True,
    use_gt_text=False,
    use_gpu=True
)
score_config
list
required
List of metric configurations loaded from YAML file. Each config is a dictionary with name and metric-specific parameters.
use_gt
bool
default:"true"
Whether ground truth audio is available. Reference-based metrics will be skipped if False.
use_gt_text
bool
default:"false"
Whether ground truth text transcriptions are available. Text-based metrics will be skipped if False.
use_gpu
bool
default:"false"
Enable GPU acceleration for model-based metrics.
Returns: Dictionary of initialized scoring modules with their configurations.
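
Since score_config is described as a list of dictionaries that each carry a name, a quick structural check before calling load_score_modules can catch malformed YAML early. The sketch below is an assumption-based helper (check_score_config is not a VERSA function), and the metric names in the example are placeholders rather than real VERSA metric identifiers.

```python
def check_score_config(score_config):
    """Return the metric names, raising if an entry lacks a 'name' key."""
    if not isinstance(score_config, list):
        raise TypeError("score_config should be a list of dictionaries")
    names = []
    for index, entry in enumerate(score_config):
        if not isinstance(entry, dict) or "name" not in entry:
            raise ValueError(f"entry {index} has no 'name' key")
        names.append(entry["name"])
    return names


# Placeholder entries; real names and parameters come from the YAML config.
example_config = [
    {"name": "metric_a"},
    {"name": "metric_b", "some_parameter": 16000},
]
metric_names = check_score_config(example_config)
print(metric_names)
```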

list_scoring

Performs utterance-level scoring across all generated files.
from versa.scorer_shared import list_scoring

score_info = list_scoring(
    gen_files=gen_audio_dict,
    score_modules=modules,
    gt_files=gt_audio_dict,
    text_info=text_dict,
    output_file="results.jsonl",
    io="kaldi",
    batch_size=1
)
gen_files
dict
required
Dictionary mapping utterance IDs to generated audio files.
score_modules
dict
required
Initialized scoring modules from load_score_modules().
gt_files
dict
default:"None"
Dictionary mapping utterance IDs to ground truth audio files.
text_info
dict
default:"None"
Dictionary mapping utterance IDs to reference transcriptions.
output_file
string
default:"None"
Path to output file. If provided, results are written incrementally in JSONL format.
io
string
default:"kaldi"
I/O interface type for loading audio.
batch_size
int
default:"1"
Number of utterances to process before writing to output file.
Returns: List of dictionaries containing scores for each utterance.
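
Because the output file is JSONL (one JSON object per line), results can be read back with the standard library alone. The following sketch assumes nothing about which fields VERSA writes. The field names key and metric_a in the sample data are invented for illustration.

```python
import io
import json


def read_jsonl(handle):
    """Load one JSON object per non-empty line from an open text handle."""
    return [json.loads(line) for line in handle if line.strip()]


# In-memory stand-in for a results file; field names are hypothetical.
sample = io.StringIO(
    '{"key": "utt1", "metric_a": 3.5}\n'
    '{"key": "utt2", "metric_a": 4.0}\n'
)
records = read_jsonl(sample)
print(len(records), records[0])
```

For a real file, pass an open handle instead, for example `with open("results.jsonl") as f: records = read_jsonl(f)`.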

load_corpus_modules

Loads corpus-level scoring modules (e.g., FAD, KID).
from versa.scorer_shared import load_corpus_modules

corpus_modules = load_corpus_modules(
    score_config=score_config,
    cache_folder="versa_cache",
    use_gpu=True,
    io="kaldi"
)
score_config
list
required
List of corpus metric configurations.
cache_folder
string
default:"versa_cache"
Directory for caching embeddings and model weights.
use_gpu
bool
default:"false"
Enable GPU acceleration.
io
string
default:"kaldi"
I/O interface type.
Returns: Dictionary of initialized corpus-level scoring modules.

corpus_scoring

Performs corpus-level evaluation.
from versa.scorer_shared import corpus_scoring

corpus_scores = corpus_scoring(
    gen_files="generated.scp",
    score_modules=corpus_modules,
    base_files="baseline.scp",
    text_info=None,
    output_file="corpus_results.yaml"
)
gen_files
string
required
Path to generated audio file list.
score_modules
dict
required
Initialized corpus scoring modules.
base_files
string
default:"None"
Path to baseline/reference audio for distributional metrics like FAD.
text_info
dict
default:"None"
Text information (currently unused for corpus metrics).
output_file
string
default:"None"
Path to save corpus-level results in YAML format.
Returns: Dictionary containing corpus-level metric scores.

load_summary

Computes summary statistics from utterance-level scores.
from versa.scorer_shared import load_summary

summary = load_summary(score_info)
print(summary)
# {'utmos': 3.45, 'pesq': 2.87, 'whisper_wer': 150, ...}
score_info
list
required
List of score dictionaries returned by list_scoring().
Returns: Dictionary with averaged metrics (or summed for WER/CER).
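
The distinction between averaged and summed metrics can be illustrated with a small stand-alone aggregation. This is a sketch under stated assumptions, not VERSA's implementation: the function summarize is hypothetical, and it assumes that count-style error metrics can be recognized by "_wer" or "_cer" in the key name.

```python
def summarize(score_info):
    """Average numeric scores, but sum keys that look like WER/CER counts."""
    summary = {}
    for key in score_info[0]:
        values = [scores[key] for scores in score_info]
        if not all(isinstance(v, (int, float)) for v in values):
            continue  # skip non-numeric fields such as utterance IDs
        total = sum(values)
        is_error_count = "_wer" in key or "_cer" in key  # assumed naming rule
        summary[key] = total if is_error_count else total / len(values)
    return summary


# Made-up per-utterance scores for illustration.
example_scores = [
    {"key": "utt1", "utmos": 3.0, "whisper_wer": 2},
    {"key": "utt2", "utmos": 4.0, "whisper_wer": 1},
]
example_summary = summarize(example_scores)
print(example_summary)
```

Summing makes sense for error counts because a corpus-level error rate is computed from total errors over total reference length, not from an average of per-utterance rates.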

Usage Examples

Basic Evaluation

import yaml
from versa.scorer_shared import (
    audio_loader_setup,
    load_score_modules,
    list_scoring,
    load_summary
)

# Load audio files
gen_files = audio_loader_setup("generated.scp", io="kaldi")
gt_files = audio_loader_setup("ground_truth.scp", io="kaldi")

# Load configuration
with open("config.yaml") as f:
    config = yaml.full_load(f)

# Initialize metrics
modules = load_score_modules(
    score_config=config,
    use_gt=True,
    use_gpu=True
)

# Evaluate
scores = list_scoring(
    gen_files=gen_files,
    score_modules=modules,
    gt_files=gt_files,
    output_file="results.jsonl"
)

# Get summary
summary = load_summary(scores)
print(f"Average UTMOS: {summary.get('utmos', 'N/A')}")

Evaluation with Text Reference

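# Reuses gen_files, gt_files, and config from the Basic Evaluation example
# Load text references (one "utterance_id transcription" pair per line)
text_info = {}
with open("transcriptions.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.strip().split(maxsplit=1)
        if not parts:
            continue  # skip blank lines
        # An ID with no transcription maps to an empty string
        text_info[parts[0]] = parts[1] if len(parts) > 1 else ""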

# Initialize with text support
modules = load_score_modules(
    score_config=config,
    use_gt=True,
    use_gt_text=True,
    use_gpu=True
)

# Evaluate with text
scores = list_scoring(
    gen_files=gen_files,
    score_modules=modules,
    gt_files=gt_files,
    text_info=text_info,
    output_file="results_with_wer.jsonl"
)

Corpus-Level Evaluation

from versa.scorer_shared import load_corpus_modules, corpus_scoring

# Load corpus modules (e.g., FAD); config is the parsed YAML from Basic Evaluation
corpus_modules = load_corpus_modules(
    score_config=config,
    cache_folder="versa_cache",
    use_gpu=True
)

# Compute corpus metrics
corpus_scores = corpus_scoring(
    gen_files="generated.scp",
    score_modules=corpus_modules,
    base_files="real_data.scp",
    output_file="fad_results.yaml"
)

print(f"FAD Score: {corpus_scores.get('fad_default', 'N/A')}")

Notes

When using GPU acceleration with multiple processes, ensure each process gets assigned to a different GPU using the --rank parameter. The GPU device is automatically calculated as rank % torch.cuda.device_count().
Ground truth and generated audio files must have matching keys in their respective .scp files unless --no_match is specified.
If the generated and ground truth audio have different sampling rates, the audio with the higher rate is automatically downsampled to match the lower rate.
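
The key-matching and sampling-rate notes above can be checked ahead of a run with two small helpers. Both are hypothetical sketches (unmatched_keys and target_sample_rate are not VERSA functions). They only mirror the rules stated in these notes, using made-up utterance IDs and paths.

```python
def unmatched_keys(gen_files, gt_files):
    """Return (IDs missing from ground truth, IDs missing from generated)."""
    gen_keys, gt_keys = set(gen_files), set(gt_files)
    return sorted(gen_keys - gt_keys), sorted(gt_keys - gen_keys)


def target_sample_rate(gen_rate, gt_rate):
    """Pick the lower of two sampling rates, following the downsampling rule."""
    return min(gen_rate, gt_rate)


# Made-up utterance IDs and paths for illustration.
gen_example = {"utt1": "gen/utt1.wav", "utt2": "gen/utt2.wav"}
gt_example = {"utt1": "gt/utt1.wav", "utt3": "gt/utt3.wav"}
missing_in_gt, missing_in_gen = unmatched_keys(gen_example, gt_example)
print(missing_in_gt, missing_in_gen)
print(target_sample_rate(24000, 16000))
```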
