QualiVision evaluates AI-generated videos across four critical quality dimensions, each capturing a distinct aspect of video quality. These dimensions are designed specifically for AI-generated content, where traditional quality metrics may fall short.
- **Temporal Consistency**: Coherence and smoothness across video frames
- **Image Fidelity**: Visual quality, sharpness, and technical excellence
- **Aesthetic Appeal**: Artistic quality and visual attractiveness
- **Text-Video Alignment**: Correspondence between the prompt and the generated content
From the README (README.md:11-15):
```markdown
## 🎯 Overview

Our approach addresses four critical quality dimensions for AI-generated videos:

- **Temporal Consistency**: Coherence across frames
- **Image Fidelity**: Visual quality and sharpness
- **Aesthetic Appeal**: Artistic and visual attractiveness
- **Text-Video Alignment**: Correspondence between prompt and content
```
Temporal Consistency measures how smoothly and coherently a video transitions across frames. This is especially critical for AI-generated videos, which often suffer from temporal artifacts.

Key problems in temporal consistency for AI-generated videos:
Flickering: Colors or brightness jumping between frames
Morphing: Objects changing shape unexpectedly
Discontinuities: Sudden jumps or cuts in motion
Drift: Gradual changes in style or appearance
Jitter: Unstable camera or object movements
Temporal inconsistency is the most common failure mode in AI video generation. Even state-of-the-art models struggle with maintaining coherence over long sequences.
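For intuition, a crude temporal-consistency heuristic (not part of QualiVision) can be computed from mean frame-to-frame luminance differences; the frame data and thresholds below are entirely illustrative:

```python
import numpy as np

def flicker_score(frames: np.ndarray) -> float:
    """Mean absolute luminance change between consecutive frames.

    frames: (T, H, W) grayscale values in [0, 1]; higher = more flicker.
    """
    return float(np.abs(np.diff(frames, axis=0)).mean())

# Smooth clip: brightness drifts gradually; flickering clip: brightness alternates
T, H, W = 16, 8, 8
ramp = np.linspace(0.4, 0.5, T)[:, None, None]                 # slow drift
smooth = np.broadcast_to(ramp, (T, H, W))
alt = np.where(np.arange(T) % 2 == 0, 0.2, 0.8)[:, None, None]  # hard flicker
flicker = np.broadcast_to(alt, (T, H, W))

assert flicker_score(flicker) > flicker_score(smooth)
```

Real models learn far richer temporal features, but this captures the basic signal: the flickering clip's score is dominated by the large brightness jumps between adjacent frames.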
DOVER++: The ConvNeXt 3D backbone with temporal convolutions:
```python
# 3D convolutions capture temporal patterns
nn.Conv3d(dim, dim, kernel_size=7, padding=3, groups=dim)
# kernel_size=7 includes the temporal dimension
```
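To see what the depthwise (`groups=dim`) structure does along time, here is a minimal NumPy sketch of a per-channel convolution restricted to the temporal axis; the shapes and the averaging kernel are illustrative stand-ins, not DOVER++'s learned weights:

```python
import numpy as np

def depthwise_temporal_conv(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Convolve each channel independently along time ('same' padding).

    x: (C, T) per-channel frame features; kernel: (K,) 1D temporal kernel.
    groups=dim means no mixing across channels, only across neighboring frames.
    """
    return np.stack([np.convolve(row, kernel, mode="same") for row in x])

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))   # 4 channels, 16 frames
kernel = np.ones(7) / 7            # size-7 averaging kernel (stand-in for learned weights)
y = depthwise_temporal_conv(x, kernel)
assert y.shape == x.shape          # depthwise conv preserves (C, T)
```

Each output frame feature blends a 7-frame temporal neighborhood of the same channel, which is how such kernels pick up flicker and jitter patterns.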
Aesthetic Appeal measures the artistic and visual attractiveness of the video, beyond pure technical quality.

Key Aspects:
Composition and framing
Color harmony and palette
Visual creativity
Artistic style
Emotional impact
Overall visual appeal
Aesthetic appeal is the most subjective dimension. What looks beautiful varies across cultures, contexts, and individual preferences. AI models learn aesthetic preferences from training data annotations.
Disentangled from Technical Quality: DOVER++ explicitly separates aesthetic from technical assessment, recognizing that a video can be technically perfect but aesthetically boring, or vice versa.
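A minimal sketch of that separation (hypothetical dimensions; not DOVER++'s actual heads): a shared feature vector feeds two independent linear heads, so the aesthetic and technical scores can move independently:

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.standard_normal((2, 512))       # shared backbone features (B, D)

W_aes = rng.standard_normal((512, 1)) * 0.01   # aesthetic head (hypothetical)
W_tech = rng.standard_normal((512, 1)) * 0.01  # technical head (hypothetical)

aesthetic = features @ W_aes   # (B, 1) -- can be high while...
technical = features @ W_tech  # (B, 1) -- ...this is low, and vice versa
assert aesthetic.shape == technical.shape == (2, 1)
```

Because the heads share features but not weights, a technically perfect yet aesthetically boring video can receive divergent scores from the two branches.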
Prompt: “A cat playing piano in a sunlit room”

✅ Good: Shows a cat, near a piano, with paws on keys, in a bright room
✅ Perfect: Cat actively pressing piano keys, visible sunlight streaming through window
❌ Poor: Cat sitting near piano but not interacting
❌ Bad: Cat in dark room, no piano visible
Challenging Alignment Cases

Prompt: “A robot juggling three red balls in a garden”

Challenges:

- Object count: exactly three balls
- Attribute binding: the balls must be red
- Complex motion: plausible juggling dynamics
- Scene setting: a recognizable garden background
Cross-Modal Understanding is key. Both models encode text and video, then compare them.

DOVER++: Quality-aware fusion with cross-modal attention. From dover_model.py:265-277:
```python
# Text queries attend to video features
attended_dover, _ = self.cross_attention(
    query=text_proj_seq,   # Text guides attention
    key=dover_proj_seq,    # Video features
    value=dover_proj_seq,  # Video features
)
# Combine attended video + text
combined_features = torch.cat([attended_dover, text_proj], dim=-1)
fused_features = self.fusion_layer(combined_features)
```
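For intuition, the cross-attention step itself can be sketched in a few lines of NumPy as scaled dot-product attention where a text query attends over video tokens; the dimensions are illustrative, not the model's actual sizes:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 64
text_q = rng.standard_normal((1, d))    # 1 text token acting as the query
video_kv = rng.standard_normal((8, d))  # 8 video tokens as keys/values

attn = softmax(text_q @ video_kv.T / np.sqrt(d))  # (1, 8) weights over video tokens
attended = attn @ video_kv                        # (1, d) text-guided video summary
assert np.isclose(attn.sum(), 1.0)
```

The text query decides which video tokens matter for the quality judgment; the weighted sum is what gets concatenated with the text features downstream.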
```python
def forward(self, v: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    # v: Video features from ViT (B, 1408)
    # t: Text features from BGE (B, 768)
    return self.net(torch.cat([v, t], dim=-1))  # Joint prediction
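Assuming `self.net` maps the concatenated 2176-dim vector (1408 video + 768 text) to the five MOS outputs, the shape arithmetic looks like this NumPy sketch (a single linear layer stands in for the real head):

```python
import numpy as np

rng = np.random.default_rng(0)
B = 2
v = rng.standard_normal((B, 1408))  # video features (ViT)
t = rng.standard_normal((B, 768))   # text features (BGE)

# Hypothetical single linear layer standing in for self.net
W = rng.standard_normal((1408 + 768, 5)) * 0.01  # 5 MOS outputs

x = np.concatenate([v, t], axis=-1)  # (B, 2176) joint representation
scores = x @ W                       # (B, 5) all five MOS scores at once
assert scores.shape == (2, 5)
```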
The Overall MOS is a learned combination of the four sub-dimensions:
```python
# From dataset.py:37
MOS_COLS = ['Traditional_MOS', 'Alignment_MOS', 'Aesthetic_MOS', 'Temporal_MOS', 'Overall_MOS']
```
All 5 scores are predicted simultaneously by the model. The Overall MOS is not computed as a manual average, but rather learned end-to-end during training. This allows the model to learn adaptive weighting based on video characteristics.
```csv
video_name,Prompt,Traditional_MOS,Alignment_MOS,Aesthetic_MOS,Temporal_MOS,Overall_MOS
video001.mp4,"A cat playing piano",3.2,4.1,3.8,3.5,3.65
video002.mp4,"Sunset over mountains",4.5,4.2,4.8,4.1,4.4
```
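A stdlib sketch of parsing a row of this CSV format and pulling out the five MOS columns (the inline sample mirrors the layout above; real code would read the file from disk):

```python
import csv
import io

SAMPLE = """video_name,Prompt,Traditional_MOS,Alignment_MOS,Aesthetic_MOS,Temporal_MOS,Overall_MOS
video001.mp4,"A cat playing piano",3.2,4.1,3.8,3.5,3.65
"""

MOS_COLS = ['Traditional_MOS', 'Alignment_MOS', 'Aesthetic_MOS', 'Temporal_MOS', 'Overall_MOS']

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
scores = {col: float(rows[0][col]) for col in MOS_COLS}
assert scores['Overall_MOS'] == 3.65
```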
Annotation Process:
Human raters watch videos
Score each dimension independently (1-5)
Multiple raters per video (averaged)
Overall MOS typically correlates most strongly with the lowest sub-dimension score
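The averaging step above can be sketched with made-up rater data (all scores here are hypothetical):

```python
import numpy as np

# Rows = raters, columns = the four sub-dimensions, each scored 1-5
ratings = np.array([
    [3, 4, 4, 3],
    [4, 4, 3, 3],
    [3, 5, 4, 4],
], dtype=float)

mos = ratings.mean(axis=0)        # per-dimension MOS across raters
assert mos.shape == (4,)
weakest = float(mos.min())        # the lowest sub-dimension often tracks Overall MOS
```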