Overview
The evaluation script assesses a trained model on a test set, reports SROCC, PLCC, and VQualA metrics, and saves predictions in multiple output formats.
Quick Start
Prepare Test Data
Organize your test data:

data/test/
├── test_labels.csv
└── videos/
    ├── test_video001.mp4
    └── ...
Run Evaluation
python scripts/evaluate.py \
--model dover \
--checkpoint models/dover_best.pt \
--data data/test
Review Results
Results are saved in the results/ directory in multiple formats.
Evaluation Commands
DOVER++ Model
python scripts/evaluate.py \
--model dover \
--checkpoint models/dover_best.pt \
--data data/test
V-JEPA2 Model
python scripts/evaluate.py \
--model vjepa \
--checkpoint models/vjepa_best.pt \
--data data/test
Command-Line Arguments
| Argument | Description | Default | Required |
|---|---|---|---|
| --model | Model type: dover or vjepa | - | Yes |
| --checkpoint | Path to model checkpoint | - | Yes |
| --data | Path to test data directory | - | Yes |
| --output | Output directory for results | results | No |
| --batch-size | Batch size for evaluation | 1 | No |
| --device | Device to use: cuda or cpu | cuda | No |
| --csv-name | Name of test CSV file | test_labels.csv | No |
| --video-dir | Name of video directory | videos | No |
A batch size of 1 is recommended for evaluation to ensure consistent memory usage.
Evaluation Metrics
The evaluation computes three key metrics (src/utils/metrics.py:33):
SROCC (Spearman Rank Order Correlation Coefficient)
Measures the monotonic relationship between predicted and ground truth scores. Values range from -1 to 1, where:
- 1.0 = Perfect positive correlation
- 0.0 = No correlation
- -1.0 = Perfect negative correlation
Best for: Ranking quality
PLCC (Pearson Linear Correlation Coefficient)
Measures the linear relationship between predicted and ground truth scores. Values range from -1 to 1.
Best for: Absolute score accuracy
VQualA Score
The official challenge metric:
VQualA_Score = (SROCC + PLCC) / 2
This is the primary metric used for model comparison (scripts/evaluate.py:264).
Higher values are better for all metrics. A VQualA score above 0.80 indicates strong performance.
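As a rough illustration, these metrics can be reproduced from a pair of prediction and ground-truth score arrays with scipy. The function below is a minimal sketch mirroring the definitions above, not the project's own implementation in src/utils/metrics.py:

```python
# Sketch of the evaluation metrics, assuming scipy is installed.
import numpy as np
from scipy.stats import pearsonr, spearmanr

def compute_metrics(predictions, ground_truth):
    """Return SROCC, PLCC, and the VQualA score for two score arrays."""
    predictions = np.asarray(predictions, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)

    srocc, _ = spearmanr(predictions, ground_truth)  # rank (monotonic) correlation
    plcc, _ = pearsonr(predictions, ground_truth)    # linear correlation
    vquala_score = (srocc + plcc) / 2                # official challenge metric
    return {"srocc": srocc, "plcc": plcc, "vquala_score": vquala_score}
```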
Output Files
Evaluation generates multiple output files (scripts/evaluate.py:270):
1. Predictions CSV
File: predictions_{MODEL}_{TIMESTAMP}.csv
video_name,Traditional_MOS,Alignment_MOS,Aesthetic_MOS,Temporal_MOS,Overall_MOS
test_video001.mp4,3.24,4.15,3.82,3.51,3.68
test_video002.mp4,4.52,4.18,4.76,4.13,4.40
Contains predicted MOS scores for all five quality dimensions:
- Traditional MOS (image fidelity)
- Alignment MOS (text-video alignment)
- Aesthetic MOS (visual appeal)
- Temporal MOS (temporal consistency)
- Overall MOS (aggregate quality)
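For downstream analysis, the predictions CSV can be loaded with pandas. The snippet below is a sketch using the example file name and the column names shown above:

```python
# Sketch: inspect the predictions CSV with pandas.
import pandas as pd

preds = pd.read_csv("results/predictions_DOVER_20250304_143022.csv")
print(preds.head())

# Per-dimension summary statistics across the test set
dims = ["Traditional_MOS", "Alignment_MOS", "Aesthetic_MOS", "Temporal_MOS", "Overall_MOS"]
print(preds[dims].describe())
```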
2. Predictions Excel
File: predictions_{MODEL}_{TIMESTAMP}.xlsx
Same data as CSV but in Excel format for easy viewing and analysis.
3. Results JSON
File: results_{MODEL}_{TIMESTAMP}.json
{
"model_type": "dover",
"checkpoint_path": "models/dover_best.pt",
"timestamp": "20250304_143022",
"num_samples": 500,
"config": {
"video_resolution": [640, 640],
"num_frames": 64,
"batch_size": 4
},
"metrics": {
"srocc": 0.8234,
"plcc": 0.8156,
"vquala_score": 0.8195
},
"prediction_stats": {
"min": 1.23,
"max": 4.89,
"mean": 3.45,
"std": 0.87
}
}
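Because the results file is plain JSON, the headline metrics can be read back programmatically. A minimal sketch using the example file name above:

```python
# Sketch: read the headline metrics back out of a results JSON file.
import json

with open("results/results_DOVER_20250304_143022.json") as f:
    results = json.load(f)

print(f"Model: {results['model_type']}")
print(f"VQualA Score: {results['metrics']['vquala_score']:.4f}")
```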
4. Summary Report
File: report_{MODEL}_{TIMESTAMP}.txt
Human-readable text report:
QualiVision Model Evaluation Report
===================================
Model: DOVER
Checkpoint: models/dover_best.pt
Timestamp: 20250304_143022
Samples: 500
Model Configuration:
-------------------
video_resolution: (640, 640)
num_frames: 64
batch_size: 4
learning_rate: 0.0001
Evaluation Metrics:
------------------
srocc: 0.8234
plcc: 0.8156
vquala_score: 0.8195
Prediction Statistics:
---------------------
Min: 1.23
Max: 4.89
Mean: 3.45
Std: 0.87
Interpreting Results
Score Distributions
Check the prediction statistics in the JSON output:
Healthy Distribution:
- Mean: 3.0-4.0 (centered around mid-range)
- Std: 0.5-1.0 (reasonable spread)
- Range: 1.0-5.0 (using full scale)
Warning Signs:
- Mean < 2.0 or > 4.5: Model may be biased
- Std < 0.3: Model may be under-confident
- Std > 1.5: Model may be over-confident
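A quick sanity check can be run directly against the predictions CSV; the thresholds below simply encode the heuristics listed above, and the file name follows the example output:

```python
# Sketch: flag the warning signs listed above from the Overall_MOS predictions.
import pandas as pd

preds = pd.read_csv("results/predictions_DOVER_20250304_143022.csv")
scores = preds["Overall_MOS"]
mean, std = scores.mean(), scores.std()

if mean < 2.0 or mean > 4.5:
    print(f"Warning: mean {mean:.2f} outside 2.0-4.5, model may be biased")
if std < 0.3:
    print(f"Warning: std {std:.2f} < 0.3, model may be under-confident")
elif std > 1.5:
    print(f"Warning: std {std:.2f} > 1.5, model may be over-confident")
```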
Metric Interpretation
| VQualA Score | Interpretation |
|---|---|
| > 0.90 | Excellent correlation |
| 0.80-0.90 | Strong correlation |
| 0.70-0.80 | Good correlation |
| 0.60-0.70 | Moderate correlation |
| < 0.60 | Poor correlation |
SROCC vs PLCC
SROCC > PLCC: Model is good at ranking but may have scale issues
- Solution: Recalibrate output scaling
PLCC > SROCC: Model predicts absolute values well but ranking is off
- Solution: Increase ranking loss weight in training
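When SROCC is noticeably higher than PLCC, a simple post-hoc linear recalibration of the predictions against a labeled calibration set often closes the gap. The sketch below uses numpy's least-squares fit and is not part of the evaluation script:

```python
# Sketch: post-hoc linear recalibration of predictions. This improves PLCC
# while leaving SROCC unchanged (the mapping is monotonic for positive slope).
import numpy as np

def recalibrate(preds, labels):
    """Fit scale and offset on a labeled calibration set; return the mapping."""
    a, b = np.polyfit(preds, labels, deg=1)  # least-squares line: labels ~ a*preds + b
    return lambda x: a * np.asarray(x) + b

# calibrate = recalibrate(val_preds, val_labels)
# test_preds_calibrated = calibrate(test_preds)
```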
Console Output
During evaluation (scripts/evaluate.py:188):
QualiVision Model Evaluation
============================
Model: DOVER
Checkpoint: models/dover_best.pt
Test CSV: data/test/test_labels.csv
Test videos: data/test/videos
Output: results/
Device: cuda
Initializing DOVER Model Evaluator
Checkpoint: models/dover_best.pt
Device: cuda
✓ Model loaded successfully
GPU Memory - Allocated: 8.2GB, Free: 15.8GB, Max Used: 8.2GB
Evaluating on test dataset:
CSV: data/test/test_labels.csv
Videos: data/test/videos
Batch size: 1
Generating predictions...
Predicting: 100%|███████████| 500/500 [15:23<00:00, 1.85s/it]
✓ Generated predictions for 500 samples
✓ Ground truth labels found, computing metrics
Evaluation Results:
------------------
SROCC: 0.8234
PLCC: 0.8156
VQualA Score: 0.8195
✓ Predictions saved:
CSV: results/predictions_DOVER_20250304_143022.csv
Excel: results/predictions_DOVER_20250304_143022.xlsx
✓ Results saved: results/results_DOVER_20250304_143022.json
✓ Summary report saved: results/report_DOVER_20250304_143022.txt
✓ Evaluation completed successfully!
Final VQualA Score: 0.8195
Evaluation Without Ground Truth
If your test CSV doesn’t contain MOS labels (scripts/evaluate.py:173):
python scripts/evaluate.py \
--model dover \
--checkpoint models/dover_best.pt \
--data data/unlabeled_test
Output:
⚠ No ground truth labels found, skipping metrics computation
✓ Predictions saved (metrics not computed)
The predictions CSV/Excel will still be generated for submission.
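Before submitting, it can be worth checking that every test video received a prediction. This sketch assumes both files share a video_name column (the test CSV's schema is not shown above, so treat the column name as an assumption):

```python
# Sketch: verify every test video got a prediction (video_name column assumed).
import pandas as pd

test = pd.read_csv("data/unlabeled_test/test_labels.csv")
preds = pd.read_csv("results/predictions_DOVER_20250304_143022.csv")

missing = set(test["video_name"]) - set(preds["video_name"])
print(f"{len(missing)} videos missing predictions")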
Memory Management
The evaluator includes automatic memory cleanup (scripts/evaluate.py:214):
# Memory cleanup every 10 batches
if i % 10 == 0:
ultra_memory_cleanup()
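The ultra_memory_cleanup() helper belongs to the project itself; a hypothetical stand-in that does a similar job with standard PyTorch calls might look like this:

```python
# Hypothetical stand-in for a periodic memory cleanup helper
# (not the project's actual ultra_memory_cleanup implementation).
import gc
import torch

def cleanup_gpu_memory():
    gc.collect()                      # release unreferenced Python objects
    if torch.cuda.is_available():
        torch.cuda.synchronize()      # wait for pending kernels to finish
        torch.cuda.empty_cache()      # return cached blocks to the driver
```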
OOM Handling: Failed batches receive dummy predictions and a warning:
⚠ Error processing batch 42: CUDA out of memory
Reduce --batch-size to 1 if experiencing memory issues during evaluation.
Comparing Models
Evaluate multiple models and compare VQualA scores:
# Evaluate DOVER++
python scripts/evaluate.py --model dover --checkpoint models/dover_best.pt --data data/test
# Evaluate V-JEPA2
python scripts/evaluate.py --model vjepa --checkpoint models/vjepa_best.pt --data data/test
Compare the VQualA scores in the output:
DOVER++ VQualA Score: 0.8195
V-JEPA2 VQualA Score: 0.8347
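If you run several evaluations, the saved results JSON files make the comparison easy to script. A minimal sketch assuming the default results/ layout:

```python
# Sketch: compare VQualA scores across saved results JSON files.
import glob
import json

for path in sorted(glob.glob("results/results_*.json")):
    with open(path) as f:
        r = json.load(f)
    print(f"{r['model_type']:>8}  VQualA: {r['metrics']['vquala_score']:.4f}  ({path})")
```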
Benchmark Results
Expected performance on VQualA 2025 Challenge:
| Model | SROCC | PLCC | VQualA Score | Memory | Inference Time |
|---|---|---|---|---|---|
| DOVER++ | TBA | TBA | TBA | ~12GB | ~1.8s/video |
| V-JEPA2 | TBA | TBA | TBA | ~16GB | ~2.5s/video |
Troubleshooting
Checkpoint Not Found
Error: Checkpoint not found: models/dover_best.pt
Solution: Verify checkpoint path or train a model first.
CUDA Out of Memory
⚠ OOM during validation, skipping batch...
Solution: Use --batch-size 1 or --device cpu.
Low Correlation Scores
Possible causes:
- Model undertrained (train longer)
- Data distribution mismatch (check test set)
- Wrong checkpoint loaded (verify path)
Next Steps
- Custom Datasets: Adapt QualiVision for your data
- API Reference: Explore the model APIs