VERSA (Versatile Evaluation of Speech and Audio) is a comprehensive toolkit for evaluating speech and audio quality. It provides unified access to more than 90 evaluation metrics, many with multiple variants, so you can assess audio quality along several dimensions.

Why VERSA?

Evaluating audio quality requires multiple perspectives. A single metric cannot capture perceptual quality, intelligibility, technical accuracy, and statistical properties simultaneously. VERSA solves this by providing:

Comprehensive Coverage

Access 90+ metrics covering perceptual quality, intelligibility, technical measurements, and statistical properties in one unified toolkit.

Production Ready

Widely used in speech toolkits and challenges including ESPnet, with built-in support for distributed evaluation using Slurm.

Flexible Inputs

Support for various input formats including file paths, SCP files, and Kaldi-style ARKs for seamless integration.

Interactive Visualization

Built-in visualization tools to analyze and compare evaluation results across multiple metrics.

Key Features

Multiple Metric Categories

VERSA organizes metrics into four intuitive categories:
1. Independent Metrics

Standalone metrics that don’t require reference audio. Examples include DNSMOS, UTMOS, NISQA, and voice activity detection.
2. Dependent Metrics

Metrics that compare predicted audio against reference audio. Examples include PESQ, STOI, MCD, and signal-to-noise ratios.
3. Non-match Metrics

Metrics that work with non-matching references or information from other modalities, such as ASR-based metrics and speaker similarity.
4. Distributional Metrics

Metrics that evaluate statistical properties of audio collections, including FAD (Fréchet Audio Distance) and KID.
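
To make the categories concrete, a single config could draw one metric from each. This is a hypothetical sketch assuming the YAML list-of-metrics layout used by configs such as egs/speech.yaml; the exact metric names and option keys may differ, so check the shipped egs/*.yaml files:

```yaml
# Hypothetical config mixing the four categories (names illustrative).
- name: dnsmos              # independent: needs no reference audio
- name: pesq                # dependent: compares prediction to reference
- name: speaker_similarity  # non-match: uses a pretrained speaker model
- name: fad                 # distributional: compares audio collections
```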

Real-World Applications

Speech Synthesis

Evaluate TTS and voice conversion systems with metrics for naturalness, similarity, and intelligibility.

Speech Enhancement

Assess denoising and enhancement quality with signal-based and perceptual metrics.

Audio Codecs

Measure codec quality with MCD, PESQ, STOI, and perceptual MOS predictors.

Singing Voice

Specialized metrics for singing voice synthesis and conversion, including SingMOS and chroma alignment.

Quick Example

Evaluate audio quality in just a few lines:
python versa/bin/scorer.py \
    --score_config egs/speech.yaml \
    --gt reference_audio/ \
    --pred generated_audio/ \
    --output_file results \
    --io dir
This command evaluates all metrics defined in speech.yaml and saves detailed results to results.txt.
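Once you have a results file, you will typically want per-metric summaries across utterances. The sketch below assumes one JSON object per line with a `key` field naming the utterance and numeric fields for each metric; this layout is illustrative and may not match VERSA's exact output format:

```python
import json
from collections import defaultdict

def summarize(lines):
    """Aggregate per-utterance scores into per-metric means.

    Assumes one JSON object per line, e.g.
    {"key": "utt1", "pesq": 3.2, "stoi": 0.91} -- an illustrative
    format, not necessarily VERSA's exact output layout.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for line in lines:
        record = json.loads(line)
        for name, value in record.items():
            # Skip the utterance identifier and any non-numeric fields.
            if name == "key" or not isinstance(value, (int, float)):
                continue
            sums[name] += value
            counts[name] += 1
    return {name: sums[name] / counts[name] for name in sums}

demo = [
    '{"key": "utt1", "pesq": 3.2, "stoi": 0.91}',
    '{"key": "utt2", "pesq": 2.8, "stoi": 0.87}',
]
print(summarize(demo))
```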
New to audio evaluation? Start with the Quickstart to run your first evaluation in minutes.

What’s Inside

Core Metrics (Auto-Installed)

  • Perceptual Quality: UTMOS, DNSMOS, NISQA, Sheet-SSQA
  • Intelligibility: PESQ, STOI, TorchAudio-SQUIM
  • Signal Metrics: SDR, SAR, SIR, SI-SNR, CI-SDR
  • Spectral Distance: MCD (Mel Cepstral Distortion), F0 metrics
  • Speaker Similarity: Cosine similarity using ESPnet-SPK models
  • Discrete Speech: Speech BERT Score, Speech BLEU
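
To illustrate what a signal metric computes, here is a textbook formulation of SI-SNR in plain NumPy. This is a sketch for intuition only; VERSA's own implementation may differ in details such as mean removal or epsilon handling:

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB (textbook form).

    Projects the estimate onto the reference to get the target
    component, then measures target power against residual power.
    """
    est = est - est.mean()
    ref = ref - ref.mean()
    # Target is the projection of the estimate onto the reference,
    # which makes the measure invariant to rescaling the estimate.
    target = np.dot(est, ref) / (np.dot(ref, ref) + eps) * ref
    noise = est - target
    return 10 * np.log10((np.dot(target, target) + eps)
                         / (np.dot(noise, noise) + eps))

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
noisy = clean + 0.1 * rng.standard_normal(16000)
print(si_snr(noisy, clean))        # mildly noisy: moderate SI-SNR
print(si_snr(0.5 * clean, clean))  # rescaled copy: very high SI-SNR
```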

Advanced Metrics (Optional Installation)

  • LLM-Based Profiling: Qwen2-Audio for 20+ speech characteristics
  • Perceptual Audio: DPAM, CDPAM distance metrics
  • ASR-Based: Word error rate (WER) with Whisper, ESPnet, OWSM
  • Distributional: FAD, KID for generative model evaluation
  • Music-Specific: Chroma alignment, singing technique assessment
See the full metrics documentation for a complete list with references.

Integration & Scalability

VERSA is designed for both local experimentation and large-scale evaluation:
# Single-process evaluation
python versa/bin/scorer.py \
    --score_config config.yaml \
    --pred audio.scp \
    --gt reference.scp \
    --output_file results
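
For larger jobs, one common pattern is to split the prediction list into chunks and score each chunk concurrently; VERSA's Slurm support handles this orchestration for you. The sketch below is purely illustrative (hypothetical file names and chunk count), with the scorer invocation echoed rather than executed:

```shell
# Illustrative data-parallel evaluation: split the prediction list
# into 4 chunks and score each chunk concurrently.
mkdir -p chunks
# Toy SCP file (utterance-id and path per line) for demonstration.
for i in $(seq 1 8); do echo "utt$i /data/audio/utt$i.wav"; done > pred.scp
split -n l/4 -d pred.scp chunks/pred.
for part in chunks/pred.*; do
  # In a real run, replace 'echo' with the actual command (or srun).
  echo python versa/bin/scorer.py \
      --score_config config.yaml \
      --pred "$part" \
      --gt reference.scp \
      --output_file "$part.result" &
done
wait
```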

Open Source & Widely Adopted

VERSA is used in production by leading speech research groups and has been integrated into major toolkits including ESPnet. See the (necessarily incomplete) list of toolkits and challenges using VERSA.

Get Started

Installation

Install VERSA and configure metric-specific dependencies

Quickstart

Run your first evaluation in minutes with real examples

GitHub

Explore the source code and contribute to the project

Research & Citation

VERSA was presented at NAACL 2025 and has been featured in multiple publications. If you use VERSA in your research, please cite:
@inproceedings{shi2025versa,
  title={{VERSA}: A Versatile Evaluation Toolkit for Speech, Audio, and Music},
  author={Jiatong Shi and Hye-jin Shim and Jinchuan Tian and Siddhant Arora and others},
  booktitle={2025 Annual Conference of the North American Chapter of the Association for Computational Linguistics},
  year={2025},
  url={https://openreview.net/forum?id=zU0hmbnyQm}
}
Want to learn more? Check out the presentation video from NAACL 2025 or try the interactive Colab demo.
