VERSA's command-line interface is `scorer.py`, used for evaluating speech and audio quality. This guide covers all CLI options and usage patterns.
After installation, you can use either `versa-score` (the installed command) or `python versa/bin/scorer.py` (the direct script). This guide uses the direct-script syntax for clarity.

## Basic Usage
The basic syntax for running VERSA from the command line:
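A minimal invocation looks like this (a sketch: the file paths are placeholders, and the `--pred`, `--gt`, `--score_config`, and `--output_file` flag names follow the VERSA repository README):

```sh
python versa/bin/scorer.py \
    --score_config egs/speech.yaml \
    --gt gt.scp \
    --pred pred.scp \
    --output_file result.jsonl
```

Omit `--output_file` to print results to stdout only.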
## CLI Arguments Reference

### Required Arguments
`--pred`: Path to the generated/predicted waveforms. Supports:

- SCP files (Kaldi-style): `pred.scp`
- Direct audio files: `audio.wav`
- Directory paths when using `--io dir`
`--score_config`: Path to the YAML configuration file specifying which metrics to compute. Examples:

- `egs/speech.yaml`: speech evaluation metrics
- `egs/singing.yaml`: singing voice metrics
- `egs/general.yaml`: general audio metrics
### Optional Arguments
`--gt`: Path to ground truth/reference waveforms. Use `None` for reference-free evaluation. Default: `None`

`--text`: Path to ground truth transcriptions for ASR-based metrics (WER/CER). Format: see "Text Transcription Format" below. Default: `None`

`--output_file`: Path for writing evaluation results. Results are saved in JSONL format. Default: `None` (prints to stdout only)

`--cache_folder`: Directory for caching intermediate results and model outputs. Default: `None`

`--use_gpu`: Whether to use GPU acceleration for neural network-based metrics. Default: `False`

`--io`: I/O interface for loading audio files. Choices:

- `kaldi`: Kaldi-style SCP/ARK files (compatible with ESPnet)
- `soundfile`: direct audio file reading with soundfile
- `dir`: directory-based audio loading

Default: `kaldi`

`--verbose`: Verbosity level for logging output. Levels:

- `0`: warnings only
- `1`: info messages (default)
- `2+`: debug messages

Default: `1`

`--rank`: Overall rank in batch processing, used to specify the GPU device ID. Useful for distributed processing to assign specific GPUs. Default: `0`

`--no_match`: Flag to disable matching between ground truth and generated files. Use when files are pre-aligned or for independent evaluation.
## Usage Examples
### Basic Evaluation with Reference
Evaluate predicted speech against ground truth:
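For example, with Kaldi-style SCP files (a sketch; `gt.scp` and `pred.scp` are placeholder file names):

```sh
python versa/bin/scorer.py \
    --score_config egs/speech.yaml \
    --gt gt.scp \
    --pred pred.scp \
    --output_file result.jsonl
```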
### Reference-Free Evaluation

Evaluate without ground truth (using only independent metrics):
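Here `--gt` is simply left at its default of `None` (a sketch with placeholder paths):

```sh
python versa/bin/scorer.py \
    --score_config egs/speech.yaml \
    --pred pred.scp \
    --output_file result.jsonl
```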
### Evaluation with Transcriptions

Include ASR-based metrics (WER/CER) using text transcriptions:
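For instance (a sketch; the `--text` flag name is an assumption based on the VERSA repository, and `text` is a placeholder transcription file in the format described under "Text Transcription Format" below):

```sh
python versa/bin/scorer.py \
    --score_config egs/speech.yaml \
    --gt gt.scp \
    --pred pred.scp \
    --text text \
    --output_file result.jsonl
```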
### GPU-Accelerated Evaluation

Use GPU for faster processing of neural metrics:
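For example, combining `--use_gpu true` with a GPU-oriented config (a sketch with placeholder paths):

```sh
python versa/bin/scorer.py \
    --score_config egs/speech_gpu.yaml \
    --gt gt.scp \
    --pred pred.scp \
    --output_file result.jsonl \
    --use_gpu true
```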
### Directory-Based Evaluation

Evaluate all audio files in a directory:
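For example, with `--io dir` (a sketch; the directory paths are placeholders):

```sh
python versa/bin/scorer.py \
    --score_config egs/general.yaml \
    --io dir \
    --gt /data/reference_audio \
    --pred /data/generated_audio \
    --output_file result.jsonl
```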
### Kaldi/ESPnet Compatible Evaluation

Use Kaldi-style ARK files (compatible with ESPnet workflows):
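For example, with `--io kaldi`, where the SCP entries may point into ARK archives (a sketch with placeholder paths):

```sh
python versa/bin/scorer.py \
    --score_config egs/speech.yaml \
    --io kaldi \
    --gt gt.scp \
    --pred pred.scp \
    --output_file result.jsonl
```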
## Input File Formats

### SCP Format (Kaldi-style)
SCP files list utterance IDs and paths:
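For example (hypothetical utterance IDs and paths):

```
utt1 /path/to/audio/utt1.wav
utt2 /path/to/audio/utt2.wav
utt3 /path/to/audio/utt3.wav
```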
### Text Transcription Format

Text files map utterance IDs to transcriptions:
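For example (hypothetical utterance IDs and transcriptions):

```
utt1 HELLO WORLD
utt2 THIS IS A TEST
utt3 SPEECH QUALITY EVALUATION
```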
## Output Format

Results are saved in JSONL format (one JSON object per line):
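An illustrative sketch (the exact field names depend on which metrics your config enables; the `key`, `mcd`, and `pesq` fields here are assumptions, not guaranteed output):

```json
{"key": "utt1", "mcd": 3.18, "pesq": 3.42}
{"key": "utt2", "mcd": 2.95, "pesq": 3.61}
```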
## Verbosity Levels

Logging verbosity is controlled by `--verbose`: `0` prints warnings only, `1` adds info messages (the default), and `2` or higher enables debug output.

## Common Workflows
- Speech Synthesis
- Voice Conversion
- Speech Enhancement
- Singing Voice
### Speech Synthesis

Evaluate TTS/speech synthesis outputs:
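A typical run might look like this (a sketch: file names are placeholders, and the `--text` flag name is an assumption based on the VERSA repository):

```sh
python versa/bin/scorer.py \
    --score_config egs/speech.yaml \
    --gt natural.scp \
    --pred synthesized.scp \
    --text transcripts.txt \
    --output_file tts_result.jsonl \
    --use_gpu true
```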
## Best Practices
### Use Appropriate Config
Select the configuration file that matches your use case:
- `speech.yaml` for general speech
- `speech_gpu.yaml` for GPU-accelerated speech metrics
- `singing.yaml` for singing voice
### Enable GPU
Use `--use_gpu true` when evaluating neural metrics like UTMOS, DNSMOS, or speaker similarity for significantly faster processing.
### Set Verbosity

Use `--verbose 1` for progress tracking during long evaluations, or `--verbose 0` when running in batch scripts.
### Cache Results

Use `--cache_folder` to store intermediate results when re-running evaluations with different metric configurations.

## Next Steps
- Learn about Python API usage for programmatic access
- Explore distributed evaluation for large-scale processing
- Check visualization tools for analyzing results