
Overview

QualiVision evaluates AI-generated videos across 4 critical quality dimensions, each capturing a distinct aspect of video quality. These dimensions are designed specifically for AI-generated content where traditional quality metrics may fall short.

Temporal Consistency

Coherence and smoothness across video frames

Image Fidelity

Visual quality, sharpness, and technical excellence

Aesthetic Appeal

Artistic quality and visual attractiveness

Text-Video Alignment

Correspondence between prompt and generated content
From the README (README.md:11-15):

```markdown
## 🎯 Overview

Our approach addresses four critical quality dimensions for AI-generated videos:
- **Temporal Consistency**: Coherence across frames
- **Image Fidelity**: Visual quality and sharpness
- **Aesthetic Appeal**: Artistic and visual attractiveness
- **Text-Video Alignment**: Correspondence between prompt and content
```

The 4 Quality Dimensions

1. Temporal Consistency

Temporal Consistency measures how smoothly and coherently a video transitions across frames. This is especially critical for AI-generated videos, which often suffer from temporal artifacts.

Key Aspects:
  • Frame-to-frame coherence
  • Motion smoothness
  • Object permanence (objects don’t appear/disappear)
  • Consistent lighting and colors across time
  • Stable backgrounds
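As a rough intuition only (this is not QualiVision's learned metric), temporal consistency can be proxied by how little adjacent frames differ:

```python
# Illustrative proxy, not the model's learned metric: mean absolute
# frame-to-frame difference mapped to a 0-1 smoothness score.
# Static content scores 1.0; hard flicker scores 0.0.

def temporal_smoothness(frames):
    """frames: list of 2D grayscale frames with values in [0, 1]."""
    if len(frames) < 2:
        return 1.0
    diffs = []
    for prev, cur in zip(frames, frames[1:]):
        total = sum(abs(a - b)
                    for row_p, row_c in zip(prev, cur)
                    for a, b in zip(row_p, row_c))
        diffs.append(total / (len(prev) * len(prev[0])))
    return 1.0 - min(1.0, sum(diffs) / len(diffs))
```

A learned model captures far richer cues (object permanence, lighting drift), but low-level frame differences are the same signal at its simplest.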

2. Image Fidelity

Image Fidelity (also called Traditional MOS) assesses the technical quality of individual frames and the video as a whole.

Key Aspects:
  • Sharpness and clarity
  • Resolution and detail
  • Absence of compression artifacts
  • Color accuracy
  • Noise levels
  • Contrast and brightness
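For intuition, a classic no-reference sharpness cue (again just a sketch, not QualiVision's prediction head) is the variance of a Laplacian filter response: blurry frames have little high-frequency energy.

```python
# Sketch of a classic sharpness cue: variance of the 4-neighbour
# Laplacian. Flat or blurry frames give low variance; crisp edges
# give high variance. Not part of QualiVision's learned pipeline.

def laplacian_variance(img):
    """img: 2D grayscale frame (list of lists), at least 3x3."""
    h, w = len(img), len(img[0])
    vals = [img[y-1][x] + img[y+1][x] + img[y][x-1] + img[y][x+1]
            - 4 * img[y][x]
            for y in range(1, h - 1) for x in range(1, w - 1)]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)
```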

3. Aesthetic Appeal

Aesthetic Appeal measures the artistic and visual attractiveness of the video, beyond pure technical quality.

Key Aspects:
  • Composition and framing
  • Color harmony and palette
  • Visual creativity
  • Artistic style
  • Emotional impact
  • Overall visual appeal

4. Text-Video Alignment

Text-Video Alignment measures how well the generated video matches the input text prompt.

Key Aspects:
  • Semantic correspondence (objects, actions, scenes)
  • Attribute accuracy (colors, sizes, quantities)
  • Action/motion alignment
  • Scene composition
  • Overall prompt fidelity
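Conceptually (a simplified sketch; the real model learns alignment end-to-end with cross-modal attention), prompt fidelity reduces to a similarity between a text embedding and a video embedding:

```python
# Hedged sketch: alignment as cosine similarity between a text
# embedding and a video embedding. The actual embeddings would come
# from learned encoders; here the vectors are placeholders.

def cosine_alignment(text_emb, video_emb):
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    dot = sum(a * b for a, b in zip(text_emb, video_emb))
    norm_t = sum(a * a for a in text_emb) ** 0.5
    norm_v = sum(b * b for b in video_emb) ** 0.5
    return dot / (norm_t * norm_v)
```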

Overall MOS Score

The Overall MOS is a weighted combination of the 4 sub-dimensions:
```python
# From dataset.py:37
MOS_COLS = ['Traditional_MOS', 'Alignment_MOS', 'Aesthetic_MOS', 'Temporal_MOS', 'Overall_MOS']
```
All 5 scores are predicted simultaneously by the model. The Overall MOS is not computed as a manual average, but rather learned end-to-end during training. This allows the model to learn adaptive weighting based on video characteristics.
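A minimal sketch of joint prediction (pure Python with hypothetical toy weights, not the actual architecture): one shared feature vector feeds a single linear head with five outputs, so Overall_MOS gets its own learned weight row rather than being averaged post hoc.

```python
# Toy sketch of a 5-output head: all weights below are made up for
# illustration; the real model learns them end-to-end from features.
MOS_COLS = ['Traditional_MOS', 'Alignment_MOS', 'Aesthetic_MOS',
            'Temporal_MOS', 'Overall_MOS']

def predict_mos(features, weights, bias):
    """Linear head: one weight row and one bias per MOS column."""
    return {col: sum(w * f for w, f in zip(row, features)) + b
            for col, row, b in zip(MOS_COLS, weights, bias)}
```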

MOS Score Scale

All scores follow the Mean Opinion Score (MOS) scale:
| Score | Quality Level | Description |
|-------|---------------|-------------|
| 5.0 | Excellent | Professional quality, no noticeable issues |
| 4.0 | Good | High quality with minor imperfections |
| 3.0 | Fair | Acceptable quality, some noticeable issues |
| 2.0 | Poor | Significant quality problems |
| 1.0 | Bad | Severe quality degradation |
From README.md:56:

```markdown
- **Quality Normalization**: MOS scores standardized to 1-5 scale
```
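The README does not spell out the normalization scheme; a common choice, shown here purely as an assumption, is a min-max rescale onto the 1-5 MOS range:

```python
def normalize_to_mos(scores, lo=None, hi=None):
    """Min-max rescale raw scores onto the 1-5 MOS scale.

    Assumed scheme for illustration; the repo's actual normalization
    may differ (e.g. fixed bounds or z-score based).
    """
    lo = min(scores) if lo is None else lo
    hi = max(scores) if hi is None else hi
    return [1.0 + 4.0 * (s - lo) / (hi - lo) for s in scores]
```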

Quality-Aware Weighting

DOVER++ implements dynamic quality weighting based on text prompts. From dover_model.py:211-218:

```python
# Quality aspect classifier
self.quality_classifier = nn.Sequential(
    nn.Linear(text_dim, hidden_dim),
    nn.ReLU(inplace=True),
    nn.Dropout(0.1),
    nn.Linear(hidden_dim, 4),  # 4 quality aspects
    nn.Softmax(dim=-1)
)
```
How It Works:
  1. Text Analysis: Prompt is encoded and classified into quality aspects
  2. Weight Generation: 4 weights (sum to 1) for each dimension
  3. Adaptive Fusion: Model emphasizes relevant quality aspects
Example:
Prompt: “A dancer performing smooth, flowing movements”

Quality Weights (example):
  • Traditional: 0.15
  • Alignment: 0.20
  • Aesthetic: 0.15
  • Temporal: 0.50 ← Highest weight
Model focuses on temporal consistency for motion-heavy content
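Numerically, the fusion step amounts to a softmax-normalized weighted sum of per-dimension scores. This sketch reuses the example weights above; in the real model the weights come from the text classifier, not hand-picked values:

```python
import math

def softmax(logits):
    """Convert raw logits into weights that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(scores, weights):
    """Weighted sum of per-dimension scores; weights sum to 1."""
    return sum(s * w for s, w in zip(scores, weights))
```

With the example weights [0.15, 0.20, 0.15, 0.50] and per-dimension scores [4.0, 4.0, 4.0, 2.0], the fused score is 3.0: the weak temporal score dominates for a motion-heavy prompt.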

Dataset Annotations

From the TaobaoVD-GC dataset (README.md:46-50):

```csv
video_name,Prompt,Traditional_MOS,Alignment_MOS,Aesthetic_MOS,Temporal_MOS,Overall_MOS
video001.mp4,"A cat playing piano",3.2,4.1,3.8,3.5,3.65
video002.mp4,"Sunset over mountains",4.5,4.2,4.8,4.1,4.4
```
Annotation Process:
  • Human raters watch videos
  • Score each dimension independently (1-5)
  • Multiple raters per video (averaged)
  • Overall MOS typically correlates with the lowest-scoring sub-dimension
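The averaging step of the annotation process is straightforward (a sketch; the rater count and column layout are assumed):

```python
def aggregate_ratings(per_rater):
    """Average multiple raters' scores per dimension.

    per_rater: one list of dimension scores (1-5) per rater.
    Returns the per-dimension mean, as used for the final MOS labels.
    """
    n = len(per_rater)
    return [sum(r[i] for r in per_rater) / n
            for i in range(len(per_rater[0]))]
```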

How Each Model Addresses Quality Dimensions

| Dimension | Mechanism |
|-----------|-----------|
| Temporal Consistency | 3D ConvNeXt temporal convolutions (kernel size 7) |
| Image Fidelity | Dedicated technical quality head on backbone features |
| Aesthetic Appeal | Dedicated aesthetic quality head on backbone features |
| Text-Video Alignment | Cross-modal attention between text and video features |

Key Innovation: Explicit disentanglement of aesthetic vs. technical quality with separate prediction heads
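For intuition on the alignment mechanism, here is a single-head cross-attention sketch in pure Python (not the repo's implementation, which operates on batched tensors): text queries attend over video features, pooling the frames most relevant to each token.

```python
import math

def cross_attention(text_q, video_kv):
    """Scaled dot-product cross-attention (single head, toy version).

    text_q: list of query vectors (from the text encoder).
    video_kv: list of vectors used as both keys and values
              (from the video backbone). All vectors share one dim.
    """
    d = len(text_q[0])
    out = []
    for q in text_q:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in video_kv]
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        s = sum(exps)
        weights = [e / s for e in exps]
        out.append([sum(w * v[i] for w, v in zip(weights, video_kv))
                    for i in range(d)])
    return out
```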

Evaluation Metrics

Model performance is measured per dimension. From config.py:109-111:

```python
EVAL_CONFIG = {
    "metrics": ["spearman", "pearson"],
}
```
Metrics:
  • Spearman Correlation (SROCC): Rank-order correlation (preferred for MOS)
  • Pearson Correlation (PLCC): Linear correlation
Per-Dimension Evaluation:

```python
from scipy.stats import spearmanr, pearsonr

for dimension in ['Traditional_MOS', 'Alignment_MOS', 'Aesthetic_MOS', 'Temporal_MOS', 'Overall_MOS']:
    # Both functions return (statistic, p-value); keep the statistic
    srocc, _ = spearmanr(predictions[dimension], ground_truth[dimension])
    plcc, _ = pearsonr(predictions[dimension], ground_truth[dimension])
```

DOVER++ Model

Quality-aware architecture details

V-JEPA2 Model

ViT-based quality assessment

Data Preprocessing

How quality labels are processed
