Installation Issues
Metrics Requiring Manual Installation
Some metrics are not included in the default VERSA installation and require additional setup. Check the Auto-Install column in the Supported Metrics documentation.
UTMOSv2: UTokyo-SaruLab MOS Prediction System
Error: utmosv2 is not installed
Solution: ./tools/install_utmosv2.sh
Additional requirements:
Git LFS must be installed and configured
If you get _pickle.UnpicklingError: invalid load key, 'v'., install Git LFS:
# Ubuntu/Debian
sudo apt-get install git-lfs
git lfs install
# macOS
brew install git-lfs
git lfs install
# Then re-clone or re-pull the model
rm -rf ~/.cache/torch/hub/checkpoints/utmosv2*
./tools/install_utmosv2.sh
Reference: UTMOSv2 GitHub
ScoreQ: Speech Contrastive Regression for Quality Assessment
No Auto-Install: Requires manual installation for both reference and no-reference versions
Solution:
git clone https://github.com/ftshijt/scoreq.git
cd scoreq
pip install -e .
Config keys:
scoreq_nr - No-reference version
scoreq_ref - With-reference version
Reference: ScoreQ Paper
SRMR: Speech-to-Reverberation Modulation energy Ratio
No Auto-Install
Solution:
git clone https://github.com/shimhz/SRMRpy.git
cd SRMRpy
pip install -e .
Reference: SRMR Paper
Audiobox Aesthetics: Multi-Axis Aesthetic Assessment
No Auto-Install
Solution:
git clone https://github.com/facebookresearch/audiobox-aesthetics.git
cd audiobox-aesthetics
pip install -e .
Returns multiple scores:
audiobox_aesthetics_CE - Content Enjoyment
audiobox_aesthetics_CU - Content Usefulness
audiobox_aesthetics_PC - Production Complexity
audiobox_aesthetics_PQ - Production Quality
Reference: Audiobox Paper
Uni-VERSA: Universal Multi-Modal Assessment Models
No Auto-Install: Advanced multi-modal assessment models
Solution:
# Install from HuggingFace
pip install transformers torch
Available variants:
universa_noref - No reference
universa_audioref - With audio reference
universa_textref - With text reference
universa_fullref - With both audio and text reference
arecho_noref - Echo cancellation and codec quality (no reference)
arecho_audioref - With audio reference
arecho_textref - With text reference
arecho_fullref - With full reference
Reference: Uni-VERSA Collection
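The variant names above double as metric names in the score config. A hypothetical fragment, following the `name`/`use_gpu` convention used elsewhere in this guide (verify the exact keys against Supported Metrics):

```yaml
score:
  - name: universa_noref       # no-reference assessment
    use_gpu: true
  - name: universa_fullref     # requires both audio and text reference
    use_gpu: true
```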
WV-MOS: MOS prediction by fine-tuned wav2vec2.0
No Auto-Install
Solution: pip install git+https://github.com/AndreevP/wvmos.git
Reference: WV-MOS Paper
SIG-MOS: Multi-dimensional MOS prediction
No Auto-Install
Solution:
git clone https://github.com/microsoft/SIG-Challenge.git
cd SIG-Challenge/ICASSP2024/sigmos
pip install -e .
Returns multiple scores:
SIGMOS_COL - Coloration
SIGMOS_DISC - Discontinuity
SIGMOS_LOUD - Loudness
SIGMOS_REVERB - Reverberation
SIGMOS_SIG - Signal quality
SIGMOS_OVRL - Overall quality
Reference: SIG-MOS Paper
Emotion2vec: Speech Emotion Representation
No Auto-Install
Solution: pip install git+https://github.com/ftshijt/emotion2vec.git
Reference: Emotion2vec Paper
NOMAD: Perceptual Embeddings for Speech Enhancement
No Auto-Install
Solution:
git clone https://github.com/shimhz/nomad.git
cd nomad
pip install -e .
Reference: NOMAD Paper
CLAP Score: Contrastive Language-Audio Pretraining
No Auto-Install
Solution: pip install git+https://github.com/gudgud96/frechet-audio-distance.git
Reference: CLAP Paper
APA: Accompaniment Prompt Adherence
No Auto-Install
Solution:
git clone https://github.com/SonyCSLParis/audio-metrics.git
cd audio-metrics
pip install -e .
Reference: APA Paper
VISQOL: Virtual Speech Quality Objective Listener
No Auto-Install
Solution:
# Requires Google's VISQOL implementation
git clone https://github.com/google/visqol.git
cd visqol
# Follow build instructions in repository
Reference: VISQOL Paper
WARP-Q: Dynamic Time Warping Cost Metric
No Auto-Install
Solution:
git clone https://github.com/wjassim/WARP-Q.git
cd WARP-Q
pip install -e .
Reference: WARP-Q Paper
NORESQA: Non-Matching Reference Speech Quality Assessment
No Auto-Install
Solution:
git clone https://github.com/shimhz/Noresqa.git
cd Noresqa
pip install -e .
Reference: NORESQA Paper
Distributional Metrics: FAD, KL Divergence, Density, Coverage
No Auto-Install: Require a full corpus for computation
Solution:
# FAD and related metrics
pip install git+https://github.com/microsoft/fadtk.git
# Audio density and coverage
git clone https://github.com/SonyCSLParis/audio-metrics.git
cd audio-metrics
pip install -e .
# KL divergence
pip install git+https://github.com/Stability-AI/stable-audio-metrics.git
Available metrics:
fad - Frechet Audio Distance
kl_embedding - KL Divergence on embeddings
audio_density - Audio Density Score
audio_coverage - Audio Coverage Score
PySepm Metrics: FWSEGSNR, WSS, CD, Composite, CSII, NCM, LLR
No Auto-Install
Solution:
git clone https://github.com/shimhz/pysepm.git
cd pysepm
pip install -e .
Available metrics:
pysepm_fwsegsnr - Frequency-Weighted Segmental SNR
pysepm_wss - Weighted Spectral Slope
pysepm_cd - Cepstrum Distance
pysepm_Csig, pysepm_Cbak, pysepm_Covl - Composite measures
pysepm_csii_high, pysepm_csii_mid, pysepm_csii_low - CSII
pysepm_ncm - Normalized-covariance measure
pysepm_llr - Log Likelihood Ratio
Reference: PySepm Paper
Dependency Conflicts
Some metrics require specific package versions that may conflict with other dependencies.
Recommended Approach: Virtual Environments
Create separate environments for conflicting metrics:
# Main VERSA environment
conda create -n versa python=3.9
conda activate versa
pip install versa-speech-eval
# Separate environment for specific metrics
conda create -n versa-utmosv2 python=3.9
conda activate versa-utmosv2
pip install versa-speech-eval
./tools/install_utmosv2.sh
ONNX Runtime Issues
If metrics using ONNX Runtime (DNSMOS, PLCMOS) fail:
pip uninstall onnxruntime onnxruntime-gpu
pip install onnxruntime==1.15.0  # Try a specific version
Runtime Errors
GPU Out of Memory
RuntimeError: CUDA out of memory
Solutions:
Reduce batch size
Process fewer files at once or use single-file processing mode.
Use CPU fallback
Set use_gpu: false in your configuration:
score:
  - name: pseudo_mos
    predictor: utmos
    use_gpu: false
Clear GPU cache
import torch
torch.cuda.empty_cache()
Use mixed metrics
Run GPU-intensive metrics separately from CPU metrics to avoid loading all models simultaneously.
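In practice this means splitting the metrics across two config files and running the scorer once per file (file names and the exact metric keys are illustrative):

```yaml
# config_gpu.yaml -- GPU-heavy neural predictors
score:
  - name: pseudo_mos
    predictor: utmos
    use_gpu: true

# config_cpu.yaml -- lightweight signal-based metrics
score:
  - name: pesq
  - name: stoi
```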
Sample Rate Mismatches
ValueError: Sample rate mismatch
Most metrics automatically resample, but some require specific sample rates:
16 kHz: UTMOS, speaker similarity, Whisper WER, OWSM WER
48 kHz: Some professional audio metrics
Flexible: PESQ, STOI (but performance varies)
Solution:
Pre-resample your audio files:
import librosa
import soundfile as sf
audio, sr = librosa.load('input.wav', sr=None)
audio_16k = librosa.resample(audio, orig_sr=sr, target_sr=16000)
sf.write('output_16k.wav', audio_16k, 16000)
Missing Model Checkpoints
FileNotFoundError: Model checkpoint not found
Solution:
Clear cache and re-download:
rm -rf ~/.cache/torch/hub
rm -rf versa_cache/
# Re-run your evaluation to trigger download
python -m versa.bin.score --config config.yaml ...
Import Errors
ModuleNotFoundError: No module named 'X'
Solution:
Verify installation:
pip list | grep versa
pip install --upgrade versa-speech-eval
For development installation:
cd versa
pip install -e .
Audio File Format
VERSA expects Kaldi-style wav.scp format:
utterance_001 /path/to/audio1.wav
utterance_002 /path/to/audio2.wav
utterance_003 /path/to/audio3.flac
Each line: <utterance_id> <absolute_or_relative_path>
Supported formats: WAV, FLAC, MP3, OGG
Paths should not end with pipe | unless using Kaldi I/O
Common mistakes:
# Wrong: No utterance ID
/path/to/audio1.wav
# Wrong: Multiple spaces or tabs (use a single space)
utterance_001    /path/to/audio1.wav
# Correct
utterance_001 /path/to/audio1.wav
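The rules above can be checked mechanically before running an evaluation. A minimal sketch of a per-line validator (the function name and messages are illustrative, not part of VERSA):

```python
import re

def check_scp_line(line: str) -> list[str]:
    """Return a list of problems found in one wav.scp line (empty if OK)."""
    stripped = line.rstrip("\n")
    if not stripped.strip():
        return ["empty line"]
    problems = []
    if "\t" in stripped or "  " in stripped:
        problems.append("use a single space between ID and path")
    parts = stripped.split()
    if len(parts) < 2:
        return problems + ["missing utterance ID or path"]
    path = parts[-1]
    if path.endswith("|"):
        problems.append("path ends with '|' (only valid with Kaldi I/O)")
    elif not re.search(r"\.(wav|flac|mp3|ogg)$", path, re.IGNORECASE):
        problems.append("unsupported audio extension")
    return problems

print(check_scp_line("utterance_001 /path/to/audio1.wav"))  # []
print(check_scp_line("/path/to/audio1.wav"))  # ['missing utterance ID or path']
```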
Text File Format
For text-dependent metrics (WER, text-based assessment):
utterance_001 This is the transcription for audio one.
utterance_002 Another transcription here.
utterance_003 Make sure utterance IDs match wav.scp exactly.
Mismatched Utterance IDs
KeyError: 'utterance_001'
Ensure utterance IDs match across all files:
# Check IDs in prediction file
cut -d ' ' -f1 pred.scp | sort > pred_ids.txt
# Check IDs in ground truth file
cut -d ' ' -f1 gt.scp | sort > gt_ids.txt
# Check IDs in text file
cut -d ' ' -f1 text.txt | sort > text_ids.txt
# Compare
diff pred_ids.txt gt_ids.txt
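The same comparison can be done in Python when shell tools aren't handy. A small sketch (the helper names are illustrative; the demo writes a throwaway file standing in for pred.scp):

```python
import os
import tempfile

def read_ids(path: str) -> set:
    """Collect the first whitespace-delimited field of each non-empty line."""
    with open(path) as f:
        return {line.split()[0] for line in f if line.strip()}

def id_mismatches(pred_ids: set, gt_ids: set):
    """Return (IDs missing from pred, IDs extra in pred)."""
    return sorted(gt_ids - pred_ids), sorted(pred_ids - gt_ids)

# Demo with a temporary file in place of a real pred.scp
with tempfile.NamedTemporaryFile("w", suffix=".scp", delete=False) as f:
    f.write("utterance_001 /a.wav\nutterance_002 /b.wav\n")
    tmp = f.name
pred = read_ids(tmp)
os.unlink(tmp)

missing, extra = id_mismatches(pred, {"utterance_001", "utterance_003"})
print("missing from pred:", missing)  # ['utterance_003']
print("extra in pred:", extra)        # ['utterance_002']
```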
Slow Processing
Enable GPU acceleration
score:
  - name: pseudo_mos
    use_gpu: true
Reduce metric count
Run only essential metrics in your initial evaluation:
score:
  # Start with fast, essential metrics
  - name: pseudo_mos
    predictor: utmos
  - name: pesq
  # Add more later as needed
Cache models
Set the cache directory to a persistent location:
export TORCH_HOME=/path/to/persistent/cache
Configuration Issues
Invalid YAML Syntax
YAMLError: mapping values are not allowed here
Common issues:
# Wrong: Missing space after colon
score:
  - name:pseudo_mos

# Correct
score:
  - name: pseudo_mos

# Wrong: Inconsistent indentation
score:
  - name: pseudo_mos
      predictor: utmos  # Should align with 'name'

# Correct
score:
  - name: pseudo_mos
    predictor: utmos
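Note that the missing-space mistake often does not raise a YAMLError at all: `name:pseudo_mos` parses as a plain string instead of a mapping, so the metric silently loses its settings. A quick way to see what the parser actually produces (assumes PyYAML is installed; `check` is an illustrative helper, not part of VERSA):

```python
import yaml

def check(text: str) -> None:
    """Parse a config snippet and show what YAML actually sees."""
    print(yaml.safe_load(text))

check("score:\n  - name:pseudo_mos\n")   # {'score': ['name:pseudo_mos']} -- a bare string!
check("score:\n  - name: pseudo_mos\n")  # {'score': [{'name': 'pseudo_mos'}]} -- a mapping
```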
Metric Not Found
KeyError: 'my_metric' not found in available metrics
Verify metric name in versa/scorer_shared.py and that it’s properly registered.
Available metrics are listed in Supported Metrics .
Getting Help
GitHub Issues Report bugs and request features
Documentation Review guides and API reference
Example Configs Browse working configuration examples
Contributing Guide Learn how to contribute fixes
When reporting issues, include:
VERSA version: pip show versa-speech-eval
Python version: python --version
Operating system and GPU type (if using GPU)
Complete error traceback
Minimal configuration file that reproduces the issue