Non-match metrics use references that don’t directly correspond to the evaluated audio. These include text transcripts for ASR evaluation, text prompts for semantic similarity, or non-matching audio samples for perceptual quality assessment.
Metrics marked with Auto-Install are automatically installed with VERSA. Others require manual installation from their respective code sources.

Non-Matching Reference Quality Assessment

| Number | Auto-Install | Metric Name | Key in Config | Key in Report | Code Source | References |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | | NORESQA: A Framework for Speech Quality Assessment using Non-Matching References | `noresqa` | `noresqa` | Noresqa | Paper |
| 2 | | MOS in TorchAudio-Squim | `squim_ref` | `torch_squim_mos` | torch_squim | Paper |
| 8 | | NOMAD: Unsupervised Learning of Perceptual Embeddings for Speech Enhancement and Non-Matching Reference Audio Quality Assessment | `nomad` | `nomad` | Nomad | Paper |

Speech Recognition Error Rates

Automatic Speech Recognition (ASR) based metrics using text transcripts.
| Number | Auto-Install | Metric Name | Key in Config | Key in Report | Code Source | References |
| --- | --- | --- | --- | --- | --- | --- |
| 3 | | ESPnet Speech Recognition-based Error Rate | `espnet_wer` | `espnet_wer` | ESPnet | Paper |
| 4 | | ESPnet-OWSM Speech Recognition-based Error Rate | `owsm_wer` | `owsm_wer` | ESPnet | Paper |
| 5 | | OpenAI-Whisper Speech Recognition-based Error Rate | `whisper_wer` | `whisper_wer` | Whisper | Paper |
| 11 | | Log Likelihood Ratio (LLR) | `pysepm` | `pysepm_llr` | pysepm | Paper |

Note that the Log Likelihood Ratio (LLR) from pysepm is a signal-level spectral distance measure rather than an ASR-based error rate.

Embedding Similarity Metrics

Metrics that compare learned embeddings for speaker, emotion, or singer similarity.
| Number | Auto-Install | Metric Name | Key in Config | Key in Report | Code Source | References |
| --- | --- | --- | --- | --- | --- | --- |
| 6 | | Emotion2vec Similarity (emo2vec) | `emo2vec_similarity` | `emotion_similarity` | emo2vec | Paper |
| 7 | | Speaker Embedding Similarity | `speaker` | `spk_similarity` | ESPnet | Paper |
| 16 | | Singer Embedding Similarity | `singer` | `singer_similarity` | SSL-Singer-Identity | Paper |

Text & Prompt-Based Metrics

Metrics that use text prompts or descriptions for evaluation.
| Number | Auto-Install | Metric Name | Key in Config | Key in Report | Code Source | References |
| --- | --- | --- | --- | --- | --- | --- |
| 9 | | Contrastive Language-Audio Pretraining Score (CLAP Score) | `clap_score` | `clap_score` | fadtk | Paper |
| 10 | | Accompaniment Prompt Adherence (APA) | `apa` | `apa` | Sony-audio-metrics | Paper |

Unified Models with Text Reference

| Number | Auto-Install | Metric Name | Key in Config | Key in Report | Code Source | References |
| --- | --- | --- | --- | --- | --- | --- |
| 12 | | Uni-VERSA (Versatile Speech Assessment with a Unified Framework) with Text Reference | `universa_textref` | `universa_score` | Uni-VERSA | Paper |
| 13 | | Uni-VERSA (Versatile Speech Assessment with a Unified Framework) with Full Reference | `universa_fullref` | `universa_score` | Uni-VERSA | Paper |
| 14 | | ARECHO (Audio Reference Echo Cancellation and Codec Quality Assessment) with Text Reference | `arecho_textref` | `arecho_score` | ARECHO | Paper |
| 15 | | ARECHO (Audio Reference Echo Cancellation and Codec Quality Assessment) with Full Reference | `arecho_fullref` | `arecho_score` | ARECHO | Paper |
Non-match metrics are particularly useful when you don’t have exact reference audio but have alternative references like transcripts, prompts, or non-matching audio samples.

Usage Guidelines

Use Word Error Rate (WER) metrics to evaluate speech recognition quality:
  • ESPnet WER: For general-purpose ASR evaluation
  • OWSM WER: For multilingual ASR with Open Whisper-style models
  • Whisper WER: For OpenAI Whisper-based evaluation
Lower WER indicates better intelligibility and clarity.
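The ASR-based metrics above first transcribe the audio with the named recognizer and then score the transcript against the reference text. As a minimal sketch of the final scoring step (not VERSA's actual implementation), WER is the word-level edit distance divided by the reference length:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words.

    Assumes a non-empty reference; real toolkits also normalize text
    (casing, punctuation) before scoring.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # Rolling-array Levenshtein distance over words.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = d[j]
            d[j] = min(cur + 1,              # deletion
                       d[j - 1] + 1,         # insertion
                       prev + (r != h))      # substitution (or match)
            prev = cur
    return d[-1] / len(ref)
```

For example, `wer("the cat sat", "the cat sits")` gives one substitution over three reference words, i.e. about 0.33. A character error rate (CER) would apply the same recurrence to characters instead of words.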
Evaluate whether generated speech maintains speaker identity:
  • Speaker Embedding Similarity: For voice conversion and cloning
  • Singer Similarity: Specifically for singing voice synthesis
Higher similarity scores indicate better preservation of speaker characteristics.
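These metrics typically extract a fixed-size embedding for each utterance with a pretrained encoder and report the cosine similarity between the two vectors. A toy sketch with placeholder vectors (real speaker embeddings are high-dimensional and come from the encoders listed in the table):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Placeholder embeddings standing in for encoder outputs
# (e.g. a reference speaker vs. a voice-conversion output).
enrolled = [0.8, 0.1, 0.6]
generated = [0.7, 0.2, 0.5]
score = cosine_similarity(enrolled, generated)
```

A score near 1.0 means the generated audio's embedding points in nearly the same direction as the reference speaker's, i.e. the speaker identity is well preserved.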
Assess emotional content and expression:
  • Emotion2vec similarity: Compare emotional characteristics between audio samples
Useful for evaluating expressive TTS and voice conversion systems.
Evaluate how well audio matches text descriptions or prompts:
  • CLAP Score: For general text-to-audio alignment
  • Accompaniment Prompt Adherence (APA): For music generation tasks
These metrics help ensure generated audio matches user intentions.
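CLAP-style scores are commonly computed as the cosine similarity between the text-prompt embedding and the audio embedding produced by jointly trained encoders, averaged over the evaluation set. This is a sketch of that common definition with placeholder vectors, not the fadtk implementation:

```python
import math

def clap_style_score(text_embs, audio_embs):
    """Mean cosine similarity over (prompt, audio) embedding pairs.

    Assumes both lists hold embeddings from a shared text-audio space;
    the encoder calls themselves are omitted here.
    """
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))
    return sum(cos(t, a) for t, a in zip(text_embs, audio_embs)) / len(text_embs)
```

Because both modalities are embedded in one space, a higher mean similarity indicates that, on average, each generated clip sits close to its own prompt.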
For comprehensive evaluation with multiple reference types:
  • Uni-VERSA: Supports text and full references
  • ARECHO: Specialized for echo cancellation and codec quality
These models can adapt to different reference types for flexible evaluation.

Reference Type Comparison

| Metric Category | Reference Type | Use Case | Example |
| --- | --- | --- | --- |
| ASR-based | Text transcript | Intelligibility evaluation | WER, CER |
| Embedding-based | Different audio sample | Speaker/emotion similarity | Speaker similarity, Emotion2vec |
| Prompt-based | Text description | Semantic alignment | CLAP Score, APA |
| Non-matching audio | Clean speech (different content) | Quality without exact match | NORESQA, NOMAD |
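The comparison above can be turned into a small selection helper. This is a hypothetical convenience function (not part of VERSA) that maps the reference types you have on hand to the config keys listed in the tables on this page:

```python
# Config keys from the tables above, grouped by the reference type they need.
METRICS_BY_REFERENCE = {
    "text_transcript": ["espnet_wer", "owsm_wer", "whisper_wer"],
    "different_audio_sample": ["speaker", "singer", "emo2vec_similarity"],
    "text_description": ["clap_score", "apa"],
    "non_matching_audio": ["noresqa", "nomad", "squim_ref"],
}

def metrics_for(available_references):
    """Return every config key usable with the given reference types."""
    return sorted(
        key
        for ref in available_references
        for key in METRICS_BY_REFERENCE.get(ref, [])
    )
```

For instance, `metrics_for(["text_transcript", "text_description"])` lists the ASR error rates alongside the prompt-adherence metrics, which is a typical combination for evaluating TTS output against a script and a style prompt.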
When choosing between ASR-based metrics, consider your target domain:
  • ESPnet WER: General purpose, customizable
  • OWSM WER: Best for multilingual content
  • Whisper WER: Robust to accents and noisy conditions
