Overview
The scorer module provides the main interface for evaluating speech synthesis quality. It supports multiple I/O formats, GPU acceleration, and extensive configuration options for various metrics.
Command-Line Interface
Main Function
The scorer can be invoked from the command line using the versa-scorer command or by running the scorer script directly.
CLI Arguments
Path to wav.scp file containing generated waveforms. Supports multiple I/O formats based on the --io parameter.
Path to YAML configuration file specifying which metrics to compute and their parameters.
Path to wav.scp file for ground truth waveforms. Required for reference-based metrics like MCD, PESQ, and speaker similarity.
Path to ground truth transcription file. Required for text-based metrics like WER/CER. Format: utterance_id transcription
Path to write evaluation results. Results are written in JSONL format (one JSON object per line).
Directory path for caching model weights and intermediate results.
Enable GPU acceleration for compatible metrics. Significantly speeds up neural network-based evaluations.
I/O interface to use for loading audio files. Options:
kaldi: Kaldi-style wav.scp with pipes (e.g., sox file.wav -t wav - |)
soundfile: Simple wav.scp with file paths
dir: Directory containing audio files
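To make the difference between the kaldi and soundfile layouts concrete, here is a minimal sketch of how a single wav.scp line might be split into an utterance ID and either a file path or a piped command. The names ScpEntry and parse_wav_scp_line are hypothetical and are not part of the scorer. The sketch only assumes the usual convention that the first whitespace-separated token is the utterance ID and that a trailing | marks a command.

```python
from typing import NamedTuple


class ScpEntry(NamedTuple):
    utt_id: str
    source: str    # file path, or the command without the trailing pipe
    is_pipe: bool  # True when the entry is a Kaldi-style piped command


def parse_wav_scp_line(line: str) -> ScpEntry:
    """Split one wav.scp line into (utt_id, source, is_pipe).

    Hypothetical helper for illustration; not the scorer's own parser.
    """
    parts = line.strip().split(maxsplit=1)
    if len(parts) != 2:
        raise ValueError(f"malformed wav.scp line: {line!r}")
    utt_id, rest = parts[0], parts[1].strip()
    if rest.endswith("|"):
        # Kaldi-style entry: everything before the pipe is a shell command.
        return ScpEntry(utt_id, rest[:-1].strip(), True)
    # Plain entry: the remainder is a file path.
    return ScpEntry(utt_id, rest, False)


if __name__ == "__main__":
    print(parse_wav_scp_line("utt1 /data/utt1.wav"))
    print(parse_wav_scp_line("utt2 sox file.wav -t wav - |"))
```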
Logging verbosity level. Levels:
0: Warnings only
1: Info messages (default)
2: Debug messages
Process rank for batch processing. Used to specify GPU device when use_gpu=True. GPU device is calculated as rank % num_gpus.
Skip matching between generated and ground truth files. Useful when files have different naming schemes.
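The rank-to-device rule above (rank % num_gpus) can be sketched as follows. The function name gpu_index_for_rank is hypothetical, and the GPU count is passed in as a plain integer so the example runs without any GPU library installed.

```python
def gpu_index_for_rank(rank: int, num_gpus: int) -> int:
    """Map a process rank to a GPU index using rank % num_gpus.

    Hypothetical helper; it mirrors the rule described in the text only.
    """
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return rank % num_gpus


if __name__ == "__main__":
    # Six processes sharing four GPUs wrap around after rank 3.
    for rank in range(6):
        print(f"rank {rank} -> cuda:{gpu_index_for_rank(rank, 4)}")
```

With more processes than GPUs, ranks wrap around, so several processes end up sharing one device.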
Core Functions
audio_loader_setup
Sets up audio file loading based on the specified I/O interface.
Path to audio file list or directory.
I/O interface type: kaldi, soundfile, or dir.
load_score_modules
Loads and initializes scoring modules based on configuration.
List of metric configurations loaded from YAML file. Each config is a dictionary with name and metric-specific parameters.
Whether ground truth audio is available. Reference-based metrics will be skipped if False.
Whether ground truth text transcriptions are available. Text-based metrics will be skipped if False.
Enable GPU acceleration for model-based metrics.
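The skipping behaviour described above can be illustrated with a small filter over the configuration list. This is only a sketch under stated assumptions: the metric name strings, the two lookup sets, and the parameter names use_gt and use_gt_text are all hypothetical and chosen for illustration. The only thing taken from the text is that each config is a dictionary with a name entry.

```python
from typing import Dict, List

# Hypothetical metric names, used only for this illustration.
NEEDS_REFERENCE_AUDIO = {"mcd", "pesq", "speaker_similarity"}
NEEDS_REFERENCE_TEXT = {"wer", "cer"}


def select_metrics(
    configs: List[Dict], use_gt: bool, use_gt_text: bool
) -> List[str]:
    """Return the names of metrics that can run with the available inputs."""
    selected = []
    for config in configs:
        name = config["name"]
        if name in NEEDS_REFERENCE_AUDIO and not use_gt:
            continue  # reference-based metric, but no ground truth audio
        if name in NEEDS_REFERENCE_TEXT and not use_gt_text:
            continue  # text-based metric, but no transcriptions
        selected.append(name)
    return selected


if __name__ == "__main__":
    configs = [{"name": "mcd"}, {"name": "wer"}, {"name": "reference_free_metric"}]
    print(select_metrics(configs, use_gt=False, use_gt_text=True))
```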
list_scoring
Performs utterance-level scoring across all generated files.
Dictionary mapping utterance IDs to generated audio files.
Initialized scoring modules from load_score_modules().
Dictionary mapping utterance IDs to ground truth audio files.
Dictionary mapping utterance IDs to reference transcriptions.
Path to output file. If provided, results are written incrementally in JSONL format.
I/O interface type for loading audio.
Number of utterances to process before writing to output file.
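To show what incremental JSONL output with a batch size might look like, here is a minimal sketch that buffers score dictionaries and writes them every batch_size utterances. The function write_scores_in_batches and the score keys in the demo are hypothetical. The scorer's actual buffering logic is not shown in this document and may differ.

```python
import io
import json
from typing import Dict, Iterable, TextIO


def write_scores_in_batches(
    scores: Iterable[Dict], out: TextIO, batch_size: int
) -> int:
    """Write score dicts as JSONL, flushing every batch_size entries.

    Returns the number of write operations performed.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    buffer = []
    writes = 0
    for score in scores:
        buffer.append(json.dumps(score))  # one JSON object per line
        if len(buffer) >= batch_size:
            out.write("\n".join(buffer) + "\n")
            out.flush()
            buffer.clear()
            writes += 1
    if buffer:  # write whatever is left after the last full batch
        out.write("\n".join(buffer) + "\n")
        out.flush()
        writes += 1
    return writes


if __name__ == "__main__":
    sink = io.StringIO()
    demo = [{"key": "utt1", "metric": 1.0}, {"key": "utt2", "metric": 2.0}]
    write_scores_in_batches(demo, sink, batch_size=1)
    print(sink.getvalue(), end="")
```

Writing in batches keeps partial results on disk if a long run is interrupted, at the cost of a few extra write calls.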
load_corpus_modules
Loads corpus-level scoring modules (e.g., FAD, KID).
List of corpus metric configurations.
Directory for caching embeddings and model weights.
Enable GPU acceleration.
I/O interface type.
corpus_scoring
Performs corpus-level evaluation.
Path to generated audio file list.
Initialized corpus scoring modules.
Path to baseline/reference audio for distributional metrics like FAD.
Text information (currently unused for corpus metrics).
Path to save corpus-level results in YAML format.
load_summary
Computes summary statistics from utterance-level scores.
List of score dictionaries returned by list_scoring().
Usage Examples
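As a rough illustration of summarizing utterance-level scores, the sketch below averages every numeric field across a list of score dictionaries. It assumes the summary statistic is a per-metric mean, which this document does not confirm. The function name summarize and the keys in the demo data are hypothetical.

```python
from collections import defaultdict
from statistics import fmean
from typing import Dict, List


def summarize(score_list: List[Dict]) -> Dict[str, float]:
    """Average each numeric field over all utterances (assumed statistic)."""
    values = defaultdict(list)
    for entry in score_list:
        for name, value in entry.items():
            if isinstance(value, bool):
                continue  # bool is a subclass of int; treat it as non-numeric
            if isinstance(value, (int, float)):
                values[name].append(float(value))
    return {name: fmean(vals) for name, vals in values.items()}


if __name__ == "__main__":
    demo = [{"key": "utt1", "mcd": 4.0}, {"key": "utt2", "mcd": 6.0}]
    print(summarize(demo))
```

Non-numeric fields such as the utterance ID are ignored, so the summary contains one entry per metric.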
Basic Evaluation
Evaluation with Text Reference
Corpus-Level Evaluation
Notes
When using GPU acceleration with multiple processes, ensure each process is assigned to a different GPU using the --rank parameter. The GPU device is automatically calculated as rank % torch.cuda.device_count().
Audio files are automatically resampled if the sampling rates of the generated and ground truth audio don’t match. The higher-rate audio is downsampled to match the lower rate.
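The resampling rule in the last note can be sketched as a small planning function: pick the lower of the two rates as the target and derive integer up/down factors for the higher-rate signal. The name resample_plan is hypothetical. The factors could be passed to a polyphase resampler such as scipy.signal.resample_poly(x, up, down), but which resampler the scorer actually uses is not stated in this document.

```python
from math import gcd
from typing import Tuple


def resample_plan(gen_sr: int, gt_sr: int) -> Tuple[int, int, int]:
    """Return (target_sr, up, down) for the signal with the higher rate.

    The target is the lower of the two rates, following the note above.
    up/down are the reduced integer factors with source * up / down == target.
    """
    if gen_sr <= 0 or gt_sr <= 0:
        raise ValueError("sampling rates must be positive")
    target = min(gen_sr, gt_sr)
    source = max(gen_sr, gt_sr)
    g = gcd(source, target)
    return target, target // g, source // g


if __name__ == "__main__":
    # Generated audio at 48 kHz, ground truth at 16 kHz.
    print(resample_plan(48000, 16000))
    # Generated audio at 16 kHz, ground truth at 44.1 kHz.
    print(resample_plan(16000, 44100))
```

When both rates are equal, the plan is (rate, 1, 1), so no resampling is needed.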