Metrics marked with Auto-Install are automatically installed with VERSA. Others require manual installation from their respective code sources.
Non-Matching Reference Quality Assessment
| Number | Auto-Install | Metric Name | Key in Config | Key in Report | Code Source | References |
|---|---|---|---|---|---|---|
| 1 | | NORESQA: A Framework for Speech Quality Assessment using Non-Matching References | noresqa | noresqa | Noresqa | paper |
| 2 | ✓ | MOS in TorchAudio-Squim | squim_ref | torch_squim_mos | torch_squim | paper |
| 8 | | NOMAD: Unsupervised Learning of Perceptual Embeddings For Speech Enhancement and Non-Matching Reference Audio Quality Assessment | nomad | nomad | Nomad | paper |
Speech Recognition Error Rates
Automatic Speech Recognition (ASR) based metrics using text transcripts.

| Number | Auto-Install | Metric Name | Key in Config | Key in Report | Code Source | References |
|---|---|---|---|---|---|---|
| 3 | ✓ | ESPnet Speech Recognition-based Error Rate | espnet_wer | espnet_wer | ESPnet | paper |
| 4 | ✓ | ESPnet-OWSM Speech Recognition-based Error Rate | owsm_wer | owsm_wer | ESPnet | paper |
| 5 | ✓ | OpenAI-Whisper Speech Recognition-based Error Rate | whisper_wer | whisper_wer | Whisper | paper |
| 11 | | Log Likelihood Ratio (LLR) | pysepm | pysepm_llr | pysepm | paper |
Embedding Similarity Metrics
Metrics that compare learned embeddings for speaker, emotion, or singer similarity.

Text & Prompt-Based Metrics

Metrics that use text prompts or descriptions for evaluation.

| Number | Auto-Install | Metric Name | Key in Config | Key in Report | Code Source | References |
|---|---|---|---|---|---|---|
| 9 | | Contrastive Language-Audio Pretraining Score (CLAP Score) | clap_score | clap_score | fadtk | paper |
| 10 | | Accompaniment Prompt Adherence (APA) | apa | apa | Sony-audio-metrics | paper |
Unified Models with Text Reference
| Number | Auto-Install | Metric Name | Key in Config | Key in Report | Code Source | References |
|---|---|---|---|---|---|---|
| 12 | | Uni-VERSA (Versatile Speech Assessment with a Unified Framework) with Text Reference | universa_textref | universa_score | Uni-VERSA | paper |
| 13 | | Uni-VERSA (Versatile Speech Assessment with a Unified Framework) with Full Reference | universa_fullref | universa_score | Uni-VERSA | paper |
| 14 | | ARECHO (Audio Reference Echo Cancellation and Codec Quality Assessment) with Text Reference | arecho_textref | arecho_score | ARECHO | paper |
| 15 | | ARECHO (Audio Reference Echo Cancellation and Codec Quality Assessment) with Full Reference | arecho_fullref | arecho_score | ARECHO | paper |
Non-matching reference metrics are particularly useful when you don't have exact reference audio but do have alternative references such as transcripts, prompts, or non-matching audio samples.
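Metrics are selected via the "Key in Config" values shown in the tables above. A hypothetical YAML fragment is sketched below; the exact schema and per-metric options (e.g. `model_tag`) vary by VERSA version, so check the example configs shipped with your installation:

```yaml
# Hypothetical metric selection using "Key in Config" values from the tables above.
# Field names other than the metric keys are illustrative, not authoritative.
- name: whisper_wer     # OpenAI-Whisper-based error rate
  model_tag: default    # assumed option; verify against your VERSA config examples
- name: squim_ref       # TorchAudio-Squim MOS (auto-installed)
- name: clap_score      # text-prompt adherence
```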
Usage Guidelines
ASR Evaluation
Use Word Error Rate (WER) metrics to evaluate speech recognition quality:
- ESPnet WER: For general-purpose ASR evaluation
- OWSM WER: For multilingual ASR with Open Whisper-style models
- Whisper WER: For OpenAI Whisper-based evaluation
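All three bullets above report Word Error Rate, which is a word-level Levenshtein (edit) distance normalized by the reference length. A minimal, dependency-free sketch for reference (VERSA's actual implementations additionally normalize text and support other error-rate variants):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# 1 substitution ("sat" -> "sit") + 1 deletion ("the") over 6 reference words = 2/6
print(wer("the cat sat on the mat", "the cat sit on mat"))
```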
Speaker Similarity
Evaluate whether generated speech maintains speaker identity:
- Speaker Embedding Similarity: For voice conversion and cloning
- Singer Similarity: Specifically for singing voice synthesis
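Embedding-similarity metrics of this kind typically reduce to a cosine similarity between the reference and generated embeddings. A minimal sketch with placeholder vectors (real speaker embeddings come from a pretrained encoder and are much higher-dimensional):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (e.g. speaker embeddings)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

ref_emb = [0.2, 0.8, 0.1]     # placeholder reference-speaker embedding
gen_emb = [0.25, 0.75, 0.05]  # placeholder embedding of generated speech
print(round(cosine_similarity(ref_emb, gen_emb), 3))
```

Values near 1.0 indicate the generated speech stays close to the reference speaker; the same computation applies to emotion or singer embeddings.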
Emotion & Expression
Assess emotional content and expression:
- Emotion2vec similarity: Compare emotional characteristics between audio samples
Text-Audio Alignment
Evaluate how well audio matches text descriptions or prompts:
- CLAP Score: For general text-to-audio alignment
- Accompaniment Prompt Adherence (APA): For music generation tasks
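Conceptually, CLAP-style scoring embeds the text prompt and the audio into a shared space and averages their cosine similarity over the evaluation set. A sketch with placeholder embeddings standing in for the CLAP encoder outputs (not the fadtk implementation itself):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Placeholder (text_embedding, audio_embedding) pairs; real ones come from a CLAP model.
pairs = [
    ([0.9, 0.1, 0.3], [0.8, 0.2, 0.4]),
    ([0.1, 0.7, 0.2], [0.2, 0.6, 0.3]),
]
clap_score = sum(cosine(t, a) for t, a in pairs) / len(pairs)
print(round(clap_score, 3))
```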
Unified Assessment
For comprehensive evaluation with multiple reference types:
- Uni-VERSA: Supports text and full references
- ARECHO: Specialized for echo cancellation and codec quality
Reference Type Comparison
| Metric Category | Reference Type | Use Case | Example |
|---|---|---|---|
| ASR-based | Text transcript | Intelligibility evaluation | WER, CER |
| Embedding-based | Different audio sample | Speaker/emotion similarity | Speaker similarity, Emotion2vec |
| Prompt-based | Text description | Semantic alignment | CLAP Score, APA |
| Non-matching audio | Clean speech (different content) | Quality without exact match | NORESQA, NOMAD |