Overview
DOVER++ is a state-of-the-art video quality assessment model that combines a ConvNeXt 3D backbone with quality-aware fusion for multi-modal assessment. The implementation includes separate heads for aesthetic and technical quality, with cross-modal attention for combining video and text features.DOVERModel
The main DOVER++ model class that integrates video encoding, text understanding, quality-aware fusion, and MOS prediction.Constructor
Path to DOVER++ pretrained weights file. If the file doesn’t exist, it will be automatically downloaded from HuggingFace.
HuggingFace model ID for the text encoder. Uses BGE-Large by default, with fallback to all-MiniLM-L6-v2 if loading fails.
Device to place the model on (‘cuda’ or ‘cpu’).
Methods
forward
Forward pass through the complete model with video frames and text prompts.Video frames tensor with shape (B, C, T, H, W) where:
- B = batch size
- C = channels (3 for RGB)
- T = number of frames
- H = height
- W = width
List of text prompts corresponding to each video in the batch.
MOS predictions with shape (B, 5) containing [Traditional, Alignment, Aesthetic, Temporal, Overall] quality scores.
get_quality_weights
Extract quality aspect weights for given text prompts.List of text prompts to analyze for quality aspects.
Quality weights tensor with shape (B, 4) representing [Traditional, Alignment, Aesthetic, Temporal] aspect weights.
extract_features
Extract intermediate features from video and text without MOS prediction.Video frames tensor with shape (B, C, T, H, W).
List of text prompts corresponding to each video.
Dictionary containing:
dover_features: DOVER feature vectors (B, 1024)text_features: Text embeddings (B, text_dim)fused_features: Quality-aware fused features (B, 256)quality_weights: Quality aspect weights (B, 4)aesthetic_score: Aesthetic quality scores (B, 1)technical_score: Technical quality scores (B, 1)
Usage Example
DOVERModelSimple
Simplified DOVER++ model with ConvNeXt 3D backbone, compatible with pretrained weights.Constructor
Device to place the model on (‘cuda’ or ‘cpu’).
Methods
forward
Forward pass through DOVER backbone and quality heads.Video tensor with shape (B, C, T, H, W).
Dictionary containing:
features: Feature vectors for fusion (B, 1024)aesthetic_score: Aesthetic quality scores (B, 1)technical_score: Technical quality scores (B, 1)backbone_features: Raw backbone features (B, 768, T’, H’, W’)
Usage Example
QualityAwareFusion
Quality-aware fusion module for combining DOVER++ and text features using cross-modal attention.Constructor
Dimension of DOVER feature vectors.
Dimension of text feature vectors.
Hidden dimension for cross-modal attention and fusion layers.
Methods
forward
Fuse DOVER++ and text features with quality-aware attention.DOVER features with shape (B, dover_dim).
Text features with shape (B, text_dim).
Tuple of (fused_features, quality_weights) where:
- fused_features: Shape (B, hidden_dim // 2)
- quality_weights: Shape (B, 4) - weights for 4 quality aspects
Usage Example
MOSPredictor
MOS prediction head that predicts 4 quality aspect scores plus an overall quality score.Constructor
Input feature dimension.
Hidden layer dimension.
Methods
forward
Predict MOS scores from fused features.Input features with shape (B, input_dim).
MOS predictions with shape (B, 5) containing [Traditional, Alignment, Aesthetic, Temporal, Overall] scores.
Usage Example
Architecture Details
ConvNeXt 3D Backbone
The DOVER++ model uses a ConvNeXt 3D backbone with the following architecture:- Stem: Conv3d(3→96, kernel=1×4×4, stride=1×4×4)
- Stage 1: 3 ConvNeXt blocks (96 channels) → Downsample to 192
- Stage 2: 3 ConvNeXt blocks (192 channels) → Downsample to 384
- Stage 3: 9 ConvNeXt blocks (384 channels) → Downsample to 768
- Stage 4: 3 ConvNeXt blocks (768 channels)
Quality Aspects
The model predicts scores for 4 quality aspects plus overall quality:- Traditional Quality: Standard video quality metrics (blur, noise, compression artifacts)
- Alignment: Semantic alignment between video content and text prompt
- Aesthetic Quality: Visual appeal and artistic quality
- Temporal Consistency: Smoothness and coherence across frames
- Overall Quality: Weighted combination of all aspects
Quality-Aware Fusion
The fusion module uses:- Quality Classifier: Analyzes text to determine focus areas (4 quality aspects)
- Cross-Modal Attention: 8-head attention for video-text alignment
- Feature Projection: Projects features to common hidden dimension
- Fusion Layers: LayerNorm + MLP for final feature combination