QualiVision evaluates AI-generated videos across four critical quality dimensions, each capturing a distinct aspect of video quality. These dimensions are designed specifically for AI-generated content, where traditional quality metrics may fall short.
- **Temporal Consistency**: Coherence and smoothness across video frames
- **Image Fidelity**: Visual quality, sharpness, and technical excellence
- **Aesthetic Appeal**: Artistic quality and visual attractiveness
- **Text-Video Alignment**: Correspondence between the prompt and the generated content
From the README (README.md:11-15):
```markdown
## 🎯 Overview

Our approach addresses four critical quality dimensions for AI-generated videos:

- **Temporal Consistency**: Coherence across frames
- **Image Fidelity**: Visual quality and sharpness
- **Aesthetic Appeal**: Artistic and visual attractiveness
- **Text-Video Alignment**: Correspondence between prompt and content
```
Temporal Consistency measures how smoothly and coherently a video transitions across frames. This is especially critical for AI-generated videos, which often suffer from temporal artifacts.

Key problems in temporal consistency for AI-generated videos:
Flickering: Colors or brightness jumping between frames
Morphing: Objects changing shape unexpectedly
Discontinuities: Sudden jumps or cuts in motion
Drift: Gradual changes in style or appearance
Jitter: Unstable camera or object movements
Temporal inconsistency is the most common failure mode in AI video generation. Even state-of-the-art models struggle with maintaining coherence over long sequences.
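For intuition, a crude temporal-consistency heuristic (not part of QualiVision) can be computed from mean frame-to-frame luminance differences; the frame data and thresholds below are entirely illustrative:

```python
import numpy as np

def flicker_score(frames: np.ndarray) -> float:
    """Mean absolute luminance change between consecutive frames.

    frames: (T, H, W) grayscale values in [0, 1]; higher = more flicker.
    """
    return float(np.abs(np.diff(frames, axis=0)).mean())

# Smooth clip: brightness drifts gradually; flickering clip: brightness alternates
T, H, W = 16, 8, 8
ramp = np.linspace(0.4, 0.5, T)[:, None, None]                 # slow drift
smooth = np.broadcast_to(ramp, (T, H, W))
alt = np.where(np.arange(T) % 2 == 0, 0.2, 0.8)[:, None, None]  # hard flicker
flicker = np.broadcast_to(alt, (T, H, W))

assert flicker_score(flicker) > flicker_score(smooth)
```

Real models learn far richer temporal features, but this captures the basic signal: the flickering clip's score is dominated by the large brightness jumps between adjacent frames.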
DOVER++: The ConvNeXt 3D backbone with temporal convolutions:
```python
# 3D convolutions capture temporal patterns
nn.Conv3d(dim, dim, kernel_size=7, padding=3, groups=dim)
# kernel_size=7 includes the temporal dimension
```
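To see what the depthwise (`groups=dim`) structure does along time, here is a minimal NumPy sketch of a per-channel convolution restricted to the temporal axis; the shapes and the averaging kernel are illustrative stand-ins, not DOVER++'s learned weights:

```python
import numpy as np

def depthwise_temporal_conv(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Convolve each channel independently along time ('same' padding).

    x: (C, T) per-channel frame features; kernel: (K,) 1D temporal kernel.
    groups=dim means no mixing across channels, only across neighboring frames.
    """
    return np.stack([np.convolve(row, kernel, mode="same") for row in x])

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))   # 4 channels, 16 frames
kernel = np.ones(7) / 7            # size-7 averaging kernel (stand-in for learned weights)
y = depthwise_temporal_conv(x, kernel)
assert y.shape == x.shape          # depthwise conv preserves (C, T)
```

Each output frame feature blends a 7-frame temporal neighborhood of the same channel, which is how such kernels pick up flicker and jitter patterns.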
Aesthetic Appeal measures the artistic and visual attractiveness of the video, beyond pure technical quality.

Key Aspects:
Composition and framing
Color harmony and palette
Visual creativity
Artistic style
Emotional impact
Overall visual appeal
Aesthetic appeal is the most subjective dimension. What looks beautiful varies across cultures, contexts, and individual preferences. AI models learn aesthetic preferences from training data annotations.
Disentangled from Technical Quality: DOVER++ explicitly separates aesthetic from technical assessment, recognizing that a video can be technically perfect but aesthetically boring, or vice versa.
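A minimal sketch of that separation (hypothetical dimensions; not DOVER++'s actual heads): a shared feature vector feeds two independent linear heads, so the aesthetic and technical scores can move independently:

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.standard_normal((2, 512))       # shared backbone features (B, D)

W_aes = rng.standard_normal((512, 1)) * 0.01   # aesthetic head (hypothetical)
W_tech = rng.standard_normal((512, 1)) * 0.01  # technical head (hypothetical)

aesthetic = features @ W_aes   # (B, 1) -- can be high while...
technical = features @ W_tech  # (B, 1) -- ...this is low, and vice versa
assert aesthetic.shape == technical.shape == (2, 1)
```

Because the heads share features but not weights, a technically perfect yet aesthetically boring video can receive divergent scores from the two branches.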
Prompt: “A cat playing piano in a sunlit room”

✅ Good: Shows a cat, near a piano, with paws on keys, in a bright room
✅ Perfect: Cat actively pressing piano keys, visible sunlight streaming through window
❌ Poor: Cat sitting near piano but not interacting
❌ Bad: Cat in dark room, no piano visible
Challenging Alignment Cases

Prompt: “A robot juggling three red balls in a garden”

Challenges:

- Object count: exactly three balls
- Attribute binding: the balls must be red
- Complex motion: plausible juggling dynamics
- Scene setting: a recognizable garden background
Cross-Modal Understanding is key. Both models encode text and video, then compare them.

DOVER++: Quality-aware fusion with cross-modal attention. From dover_model.py:265-277:
```python
# Text queries attend to video features
attended_dover, _ = self.cross_attention(
    query=text_proj_seq,   # Text guides attention
    key=dover_proj_seq,    # Video features
    value=dover_proj_seq,  # Video features
)
# Combine attended video + text
combined_features = torch.cat([attended_dover, text_proj], dim=-1)
fused_features = self.fusion_layer(combined_features)
```
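For intuition, the cross-attention step itself can be sketched in a few lines of NumPy as scaled dot-product attention where a text query attends over video tokens; the dimensions are illustrative, not the model's actual sizes:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 64
text_q = rng.standard_normal((1, d))    # 1 text token acting as the query
video_kv = rng.standard_normal((8, d))  # 8 video tokens as keys/values

attn = softmax(text_q @ video_kv.T / np.sqrt(d))  # (1, 8) weights over video tokens
attended = attn @ video_kv                        # (1, d) text-guided video summary
assert np.isclose(attn.sum(), 1.0)
```

The text query decides which video tokens matter for the quality judgment; the weighted sum is what gets concatenated with the text features downstream.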
```python
def forward(self, v: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    # v: Video features from ViT (B, 1408)
    # t: Text features from BGE (B, 768)
    return self.net(torch.cat([v, t], dim=-1))  # Joint prediction
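Assuming `self.net` maps the concatenated 2176-dim vector (1408 video + 768 text) to the five MOS outputs, the shape arithmetic looks like this NumPy sketch (a single linear layer stands in for the real head):

```python
import numpy as np

rng = np.random.default_rng(0)
B = 2
v = rng.standard_normal((B, 1408))  # video features (ViT)
t = rng.standard_normal((B, 768))   # text features (BGE)

# Hypothetical single linear layer standing in for self.net
W = rng.standard_normal((1408 + 768, 5)) * 0.01  # 5 MOS outputs

x = np.concatenate([v, t], axis=-1)  # (B, 2176) joint representation
scores = x @ W                       # (B, 5) all five MOS scores at once
assert scores.shape == (2, 5)
```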
The Overall MOS is a learned combination of the four sub-dimensions:
```python
# From dataset.py:37
MOS_COLS = ['Traditional_MOS', 'Alignment_MOS', 'Aesthetic_MOS', 'Temporal_MOS', 'Overall_MOS']
```
All 5 scores are predicted simultaneously by the model. The Overall MOS is not computed as a manual average, but rather learned end-to-end during training. This allows the model to learn adaptive weighting based on video characteristics.
```csv
video_name,Prompt,Traditional_MOS,Alignment_MOS,Aesthetic_MOS,Temporal_MOS,Overall_MOS
video001.mp4,"A cat playing piano",3.2,4.1,3.8,3.5,3.65
video002.mp4,"Sunset over mountains",4.5,4.2,4.8,4.1,4.4
```
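A stdlib sketch of parsing a row of this CSV format and pulling out the five MOS columns (the inline sample mirrors the layout above; real code would read the file from disk):

```python
import csv
import io

SAMPLE = """video_name,Prompt,Traditional_MOS,Alignment_MOS,Aesthetic_MOS,Temporal_MOS,Overall_MOS
video001.mp4,"A cat playing piano",3.2,4.1,3.8,3.5,3.65
"""

MOS_COLS = ['Traditional_MOS', 'Alignment_MOS', 'Aesthetic_MOS', 'Temporal_MOS', 'Overall_MOS']

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
scores = {col: float(rows[0][col]) for col in MOS_COLS}
assert scores['Overall_MOS'] == 3.65
```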
Annotation Process:
Human raters watch videos
Score each dimension independently (1-5)
Multiple raters per video (averaged)
Overall MOS typically correlates most strongly with the lowest sub-dimension score
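The averaging step above can be sketched with made-up rater data (all scores here are hypothetical):

```python
import numpy as np

# Rows = raters, columns = the four sub-dimensions, each scored 1-5
ratings = np.array([
    [3, 4, 4, 3],
    [4, 4, 3, 3],
    [3, 5, 4, 4],
], dtype=float)

mos = ratings.mean(axis=0)        # per-dimension MOS across raters
assert mos.shape == (4,)
weakest = float(mos.min())        # the lowest sub-dimension often tracks Overall MOS
```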