Why VERSA?
Evaluating audio quality requires multiple perspectives. A single metric cannot capture perceptual quality, intelligibility, technical accuracy, and statistical properties simultaneously. VERSA solves this by providing:
Comprehensive Coverage
Access 90+ metrics covering perceptual quality, intelligibility, technical measurements, and statistical properties in one unified toolkit.
Production Ready
Widely used in speech toolkits and challenges including ESPnet, with built-in support for distributed evaluation using Slurm.
Flexible Inputs
Support for various input formats including file paths, SCP files, and Kaldi-style ARKs for seamless integration.
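To illustrate the SCP input format: a Kaldi-style SCP file lists one utterance per line, as an utterance ID followed by a path. The sketch below reads such a file in plain Python. The `read_scp` helper is hypothetical, written only for illustration, and is not VERSA's own loader.

```python
from pathlib import Path


def read_scp(scp_path):
    """Parse a Kaldi-style SCP file into {utterance_id: path}.

    Hypothetical helper for illustration; not part of VERSA's API.
    """
    entries = {}
    for line in Path(scp_path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first run of whitespace only, so that paths
        # containing spaces stay intact.
        utt_id, path = line.split(maxsplit=1)
        entries[utt_id] = path
    return entries
```

Keying the result by utterance ID makes it straightforward to pair predicted and reference audio that share the same ID.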
Interactive Visualization
Built-in visualization tools to analyze and compare evaluation results across multiple metrics.
Key Features
Multiple Metric Categories
VERSA organizes metrics into four intuitive categories:
Independent Metrics
Standalone metrics that don’t require reference audio. Examples include DNSMOS, UTMOS, NISQA, and voice activity detection.
Dependent Metrics
Metrics that compare predicted audio against reference audio. Examples include PESQ, STOI, MCD, and signal-to-noise ratios.
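To make the idea of a dependent metric concrete, here is a minimal sketch of scale-invariant signal-to-noise ratio (SI-SNR), one common signal-to-noise measure. It follows the standard textbook definition and is not taken from VERSA's implementation. The function name `si_snr` and the `eps` parameter are illustrative choices, and the code is plain Python so it runs without extra dependencies.

```python
import math


def si_snr(pred, ref, eps=1e-8):
    """Scale-invariant SNR in dB between a predicted and a reference signal.

    Illustrative sketch of the standard definition; not VERSA's code.
    """
    if len(pred) != len(ref):
        raise ValueError("pred and ref must have the same length")
    n = len(ref)
    # Remove the mean so a DC offset does not affect the score.
    pred_mean = sum(pred) / n
    ref_mean = sum(ref) / n
    pred = [p - pred_mean for p in pred]
    ref = [r - ref_mean for r in ref]
    # Project the prediction onto the reference to get the target component.
    dot = sum(p * r for p, r in zip(pred, ref))
    ref_energy = sum(r * r for r in ref)
    scale = dot / (ref_energy + eps)
    target = [scale * r for r in ref]
    # Whatever is left after removing the target component counts as noise.
    noise = [p - t for p, t in zip(pred, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(e * e for e in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))
```

Because the prediction is projected onto the reference, rescaling the predicted audio leaves the score essentially unchanged. Note that the function needs both signals, which is what distinguishes a dependent metric from an independent one.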
Non-match Metrics
Metrics that work with non-matching references or information from other modalities, such as ASR-based metrics and speaker similarity.
Real-World Applications
Speech Synthesis
Evaluate TTS and voice conversion systems with metrics for naturalness, similarity, and intelligibility.
Speech Enhancement
Assess denoising and enhancement quality with signal-based and perceptual metrics.
Audio Codecs
Measure codec quality with MCD, PESQ, STOI, and perceptual MOS predictors.
Singing Voice
Specialized metrics for singing voice synthesis and conversion, including SingMOS and chroma alignment.
Quick Example
Evaluate audio quality in just a few lines: the evaluation runs the metrics configured in speech.yaml and saves detailed results to results.txt.
New to audio evaluation? Start with the Quickstart to run your first evaluation in minutes.
What’s Inside
Core Metrics (Auto-Installed)
- Perceptual Quality: UTMOS, DNSMOS, NISQA, Sheet-SSQA
- Intelligibility: PESQ, STOI, TorchAudio-SQUIM
- Signal Metrics: SDR, SAR, SIR, SI-SNR, CI-SDR
- Spectral Distance: MCD (Mel Cepstral Distortion), F0 metrics
- Speaker Similarity: Cosine similarity using ESPnet-SPK models
- Discrete Speech: Speech BERT Score, Speech BLEU
Advanced Metrics (Optional Installation)
- LLM-Based Profiling: Qwen2-Audio for 20+ speech characteristics
- Perceptual Audio: DPAM, CDPAM distance metrics
- ASR-Based: Word error rate (WER) with Whisper, ESPnet, OWSM
- Distributional: FAD, KID for generative model evaluation
- Music-Specific: Chroma alignment, singing technique assessment
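The ASR-based entry in the list above reports word error rate (WER). As a rough sketch of what that number means, the code below computes WER from a reference transcript and an ASR hypothesis using word-level edit distance. It illustrates the standard definition only: `word_error_rate` is a hypothetical helper, not VERSA's implementation, and it assumes both transcripts are already normalized (casing, punctuation).

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count.

    Illustrative sketch; assumes both strings are already text-normalized.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref_words)
```

Since insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.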
Integration & Scalability
VERSA is designed for both local experimentation and large-scale evaluation.
Open Source & Widely Adopted
VERSA is used in production by leading speech research groups and has been integrated into major toolkits including ESPnet. See the non-exhaustive list of toolkits and challenges using VERSA.
Get Started
Installation
Install VERSA and configure metric-specific dependencies
Quickstart
Run your first evaluation in minutes with real examples
GitHub
Explore the source code and contribute to the project