Overview
QualiVision implements a dual-model architecture for comprehensive video quality assessment of AI-generated content. The system combines two state-of-the-art models (DOVER++ and V-JEPA2) with multi-modal fusion to assess video quality across four critical dimensions.
High-Level Architecture
The system follows a modular pipeline architecture.
Component Breakdown
1. Data Pipeline
The data pipeline handles video ingestion, frame sampling, and preprocessing:
- Frame Sampling (see the sketch below)
  - Uniform temporal sampling: 64 frames extracted uniformly across the video duration
  - Adaptive indexing: Handles videos shorter or longer than 64 frames
  - Decord integration: GPU-accelerated video reading
- Resolution Adaptation
- Text Processing
Sources: dataset.py:82-96
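The sampling logic cited above can be pictured with the following minimal sketch. It is an approximation rather than the project's actual code: the function name and the short-clip fallback are assumptions, and it keeps Decord's CPU context for simplicity.

```python
import numpy as np
from decord import VideoReader, cpu

def sample_frames_uniform(video_path: str, num_frames: int = 64) -> np.ndarray:
    """Illustrative sketch: return `num_frames` frames sampled uniformly across
    the clip, shaped (num_frames, H, W, 3)."""
    vr = VideoReader(video_path, ctx=cpu(0))   # decord.gpu(0) where GPU decoding is available
    total = len(vr)
    if total >= num_frames:
        # Uniform temporal sampling across the full duration.
        indices = np.linspace(0, total - 1, num_frames).astype(int)
    else:
        # Adaptive indexing for clips shorter than num_frames: repeat indices in order.
        indices = np.sort(np.resize(np.arange(total), num_frames))
    return vr.get_batch(indices).asnumpy()
```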
2. Model Components
DOVER++ Model
- Backbone: ConvNeXt 3D (768 channels)
- Parameters: ~120M
- Input: 640×640×64 frames
- Features: Separate aesthetic/technical heads
- Fusion: Quality-aware cross-modal attention
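To make the head structure concrete, here is a condensed sketch of quality-aware cross-modal attention feeding separate aesthetic and technical heads. Only the 768-channel backbone width comes from the specs above; the class name, head widths, and the 1024-dimensional text embedding are assumptions, not the actual DOVER++ code.

```python
import torch
import torch.nn as nn

class QualityAwareFusionHeads(nn.Module):
    """Illustrative sketch only: cross-modal attention over text features,
    followed by separate aesthetic and technical regression heads."""

    def __init__(self, fusion_dim: int = 768, text_dim: int = 1024):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, fusion_dim)   # text embedding -> fusion space
        self.cross_attn = nn.MultiheadAttention(fusion_dim, num_heads=8, batch_first=True)
        self.aesthetic_head = nn.Sequential(nn.Linear(fusion_dim, 256), nn.GELU(), nn.Linear(256, 1))
        self.technical_head = nn.Sequential(nn.Linear(fusion_dim, 256), nn.GELU(), nn.Linear(256, 1))

    def forward(self, video_tokens: torch.Tensor, text_embedding: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, N, 768) ConvNeXt 3D features; text_embedding: (B, 1024)
        text_tokens = self.text_proj(text_embedding).unsqueeze(1)               # (B, 1, fusion_dim)
        attended, _ = self.cross_attn(query=video_tokens, key=text_tokens, value=text_tokens)
        pooled = attended.mean(dim=1)                                           # (B, fusion_dim)
        aesthetic = self.aesthetic_head(pooled)                                 # (B, 1)
        technical = self.technical_head(pooled)                                 # (B, 1)
        return torch.cat([aesthetic, technical], dim=-1)                        # (B, 2)
```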
V-JEPA2 Model
- Backbone: Vision-JEPA2 ViT-Giant
- Parameters: ~1.1B (85% frozen)
- Input: 384×384×64 frames
- Features: Strategic layer freezing
- Fusion: Concatenation with discriminative learning
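Strategic layer freezing keeps most of the ~1.1B-parameter backbone fixed. A minimal sketch of the idea follows, assuming the encoder exposes an ordered sequence of transformer blocks; the `blocks` attribute and the exact 15% trainable split are illustrative.

```python
import torch.nn as nn

def freeze_backbone_fraction(encoder: nn.Module, trainable_fraction: float = 0.15) -> None:
    """Illustrative sketch: freeze ~85% of the transformer blocks (the early ones)
    and leave only the final ~15% trainable. `encoder.blocks` is a hypothetical
    attribute holding the ordered ViT blocks."""
    blocks = list(encoder.blocks)
    n_trainable = max(1, int(round(len(blocks) * trainable_fraction)))
    for block in blocks[: len(blocks) - n_trainable]:
        for p in block.parameters():
            p.requires_grad = False    # frozen: excluded from gradient updates
    for block in blocks[len(blocks) - n_trainable:]:
        for p in block.parameters():
            p.requires_grad = True     # fine-tuned tail of the encoder
```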
3. Training Pipeline
The training system implements advanced techniques for stable convergence.
Key Training Features
- Mixed precision: FP16/FP32 for memory efficiency
- Gradient accumulation: Effective batch sizes of 32-192
- Discriminative learning rates: Different rates for encoder/fusion/head (see the sketch below)
- Adaptive loss weighting: Dynamic adjustment during training
Sources: config.py:59-76
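The first three training features above can be combined in a standard PyTorch loop, sketched below with tiny stand-in modules so it runs on its own. The learning rates, the accumulation factor of 8 (per-step batch of 4 → effective batch of 32), and the encoder/fusion/head stage names are assumptions, not the values in config.py; adaptive loss weighting is omitted because its schedule isn't specified here.

```python
import torch
import torch.nn as nn

# Dummy stand-ins so the snippet runs on its own; the real model uses far
# larger dimensions, and the stage names are assumed.
device = "cuda"
model = nn.ModuleDict({
    "encoder": nn.Linear(128, 64),
    "fusion":  nn.Linear(64, 64),
    "head":    nn.Linear(64, 5),
}).to(device)

# Discriminative learning rates: one parameter group per stage.
optimizer = torch.optim.AdamW([
    {"params": model["encoder"].parameters(), "lr": 1e-5},
    {"params": model["fusion"].parameters(),  "lr": 5e-5},
    {"params": model["head"].parameters(),    "lr": 1e-4},
])

scaler = torch.cuda.amp.GradScaler()   # FP16/FP32 mixed precision
criterion = nn.SmoothL1Loss()
accum_steps = 8                        # per-step batch of 4 -> effective batch of 32

for step in range(64):                 # synthetic steps standing in for a DataLoader
    video_feat = torch.randn(4, 128, device=device)
    mos = torch.rand(4, 5, device=device) * 4 + 1        # MOS targets in the 1-5 range
    with torch.cuda.amp.autocast():
        pred = model["head"](model["fusion"](model["encoder"](video_feat)))
        loss = criterion(pred, mos) / accum_steps         # scale loss for accumulation
    scaler.scale(loss).backward()
    if (step + 1) % accum_steps == 0:                     # accumulation boundary
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```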
4. Evaluation System
Multi-metric evaluation for comprehensive assessment:
- Spearman Correlation (SROCC): Rank-order correlation
- Pearson Correlation (PLCC): Linear correlation
- Per-dimension metrics: Separate evaluation for each quality aspect
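Both correlations can be computed per dimension with SciPy. The following is a generic sketch rather than the project's evaluation code, assuming predictions and labels arrive as (num_videos, 5) arrays with one column per quality dimension.

```python
import numpy as np
from scipy.stats import spearmanr, pearsonr

DIMENSIONS = ["Traditional", "Alignment", "Aesthetic", "Temporal", "Overall"]

def per_dimension_metrics(pred: np.ndarray, target: np.ndarray) -> dict:
    """pred, target: arrays of shape (num_videos, 5), one column per dimension."""
    metrics = {}
    for i, name in enumerate(DIMENSIONS):
        srocc, _ = spearmanr(pred[:, i], target[:, i])   # rank-order correlation
        plcc, _ = pearsonr(pred[:, i], target[:, i])     # linear correlation
        metrics[name] = {"SROCC": srocc, "PLCC": plcc}
    return metrics
```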
Data Flow Explanation
Training Flow
- Video Loading: Videos are loaded from disk using Decord VideoReader
- Frame Sampling: 64 frames uniformly sampled from each video
- Preprocessing: Frames resized to target resolution (640×640 or 384×384)
- Text Encoding: Prompts encoded to dense embeddings using BGE-Large
- Model Forward Pass:
  - Video features extracted by ConvNeXt 3D or ViT-Giant
  - Text features projected to a common dimension
  - Cross-modal fusion combines video and text
- MOS Prediction: Fusion features predict 5 MOS scores
- Loss Computation: Hybrid loss (smooth L1 + ranking + scale-aware); see the loss sketch after this list
- Backpropagation: Discriminative learning rates update parameters
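As referenced in the loss step above, a hybrid objective of this kind can be sketched as a smooth L1 term plus a pairwise ranking term. The weight, margin, and the omission of the scale-aware component (whose exact form isn't documented here) are assumptions.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(pred: torch.Tensor, target: torch.Tensor,
                w_rank: float = 0.3, margin: float = 0.1) -> torch.Tensor:
    """pred, target: (B, 5) predicted and ground-truth MOS. Weights are illustrative."""
    # Regression term: smooth L1 between predicted and true MOS.
    regression = F.smooth_l1_loss(pred, target)
    # Ranking term: for every pair of samples, penalise predictions whose
    # ordering disagrees with the ground-truth ordering (per dimension).
    diff_pred = pred.unsqueeze(0) - pred.unsqueeze(1)       # (B, B, 5)
    diff_true = target.unsqueeze(0) - target.unsqueeze(1)   # (B, B, 5)
    ranking = F.relu(margin - diff_pred * torch.sign(diff_true)).mean()
    return regression + w_rank * ranking
```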
Inference Flow
- Input: Single video + text prompt
- Preprocessing: Frame sampling and resolution adaptation
- Feature Extraction: Parallel video and text encoding
- Fusion: Quality-aware feature combination
- Prediction: 5 MOS scores (Traditional, Alignment, Aesthetic, Temporal, Overall)
- Output: Normalized scores in 1-5 range
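Tying the inference steps together, a hypothetical end-to-end call might look like the sketch below. It reuses `sample_frames_uniform` from the data-pipeline sketch above, assumes BAAI/bge-large-en-v1.5 as the BGE-Large checkpoint, and treats the model's call signature as `(video, text_embedding)`; none of these names are the project's actual entry points.

```python
import torch
from sentence_transformers import SentenceTransformer

# Assumed BGE-Large checkpoint; the overview only names "BGE-Large".
text_encoder = SentenceTransformer("BAAI/bge-large-en-v1.5")

MOS_DIMENSIONS = ["Traditional", "Alignment", "Aesthetic", "Temporal", "Overall"]

@torch.no_grad()
def predict_mos(model: torch.nn.Module, video_path: str, prompt: str) -> dict:
    """Hypothetical inference helper; `model` and its (video, text) call
    signature are placeholders, not the project's actual API."""
    frames = sample_frames_uniform(video_path, num_frames=64)             # (64, H, W, 3)
    video = torch.from_numpy(frames).permute(3, 0, 1, 2).float() / 255.0  # (3, 64, H, W)
    text = torch.from_numpy(text_encoder.encode([prompt]))                # (1, 1024) prompt embedding
    scores = model(video.unsqueeze(0), text).squeeze(0)                   # (5,) raw predictions
    scores = scores.clamp(1.0, 5.0)                                       # keep scores in the 1-5 MOS range
    return dict(zip(MOS_DIMENSIONS, scores.tolist()))
```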
How the Pieces Fit Together
Parallel Processing
The video encoder (DOVER++ or V-JEPA2) and the text encoder (BGE-Large) process their inputs simultaneously.
Multi-Modal Fusion
Cross-modal attention (DOVER++) or concatenation (V-JEPA2) combines video and text features based on the quality aspects being assessed.
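The concatenation path (the V-JEPA2 variant) is the simpler of the two and can be sketched as a pair of projections followed by a small MLP head; all dimensions and names below are assumptions, not the project's implementation.

```python
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Illustrative concatenation fusion: project each modality, concatenate,
    and regress the five MOS dimensions with a small MLP (dimensions assumed)."""

    def __init__(self, video_dim: int = 1408, text_dim: int = 1024, hidden: int = 512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, hidden)
        self.text_proj = nn.Linear(text_dim, hidden)
        self.head = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.GELU(), nn.Linear(hidden, 5))

    def forward(self, video_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        # video_feat: (B, video_dim) pooled video features; text_feat: (B, text_dim)
        fused = torch.cat([self.video_proj(video_feat), self.text_proj(text_feat)], dim=-1)
        return self.head(fused)     # (B, 5) MOS predictions
```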
Memory and Performance
Training configuration (DOVER++ and V-JEPA2):
- GPU Memory: ~12GB
- Batch Size: 4 samples
- Effective Batch: 32 (with gradient accumulation)
- Training Time: ~5 epochs
Design Philosophy: QualiVision prioritizes modularity and extensibility. Each component (data loading, model architecture, fusion mechanism, prediction head) can be independently modified or replaced without affecting the rest of the system.
Related Topics
DOVER++ Model
Deep dive into the DOVER++ architecture
V-JEPA2 Model
Explore the V-JEPA2 implementation
Quality Dimensions
Understanding the 4 quality metrics
Data Preprocessing
Details on data preparation pipeline