Overview
DOVER++ (Disentangled Objective Video Quality Evaluator) is a state-of-the-art video quality assessment model that separates aesthetic and technical quality dimensions. QualiVision extends DOVER++ with quality-aware fusion mechanisms for multi-modal understanding.
DOVER++ Key Stats
- Parameters: ~120 million
- Input Resolution: 640×640
- Frames: 64 per video
- Memory: ~12GB GPU
- Pretrained: HuggingFace weights available
Architecture Components
1. ConvNeXt 3D Backbone
The backbone uses a modern ConvNeXt architecture adapted for 3D video processing:
Overall Structure
From dover_model.py:128-152, the backbone consists of 4 stages:

```python
def _build_convnext_backbone(self) -> nn.Module:
    return nn.Sequential(
        # Stem: 3 -> 96 channels
        nn.Conv3d(3, 96, kernel_size=(1, 4, 4), stride=(1, 4, 4)),
        nn.GroupNorm(1, 96),
        # Stage 1: 96 channels, 3 blocks
        *[self._make_convnext_block(96) for _ in range(3)],
        nn.Conv3d(96, 192, kernel_size=(1, 2, 2), stride=(1, 2, 2)),
        # Stage 2: 192 channels, 3 blocks
        *[self._make_convnext_block(192) for _ in range(3)],
        nn.Conv3d(192, 384, kernel_size=(1, 2, 2), stride=(1, 2, 2)),
        # Stage 3: 384 channels, 9 blocks
        *[self._make_convnext_block(384) for _ in range(9)],
        nn.Conv3d(384, 768, kernel_size=(1, 2, 2), stride=(1, 2, 2)),
        # Stage 4: 768 channels, 3 blocks
        *[self._make_convnext_block(768) for _ in range(3)],
    )
```
Design Highlights:
- Progressive channel expansion: 96 → 192 → 384 → 768
- Spatial downsampling via (1, 2, 2) strided convolutions (temporal resolution preserved)
- 18 ConvNeXt blocks in total, with varying depth per stage
ConvNeXt Block
Each ConvNeXt block implements the modern architecture pattern. From dover_model.py:154-162:

```python
def _make_convnext_block(self, dim: int) -> nn.Module:
    return nn.Sequential(
        nn.Conv3d(dim, dim, kernel_size=7, padding=3, groups=dim),  # Depthwise
        nn.GroupNorm(1, dim),
        nn.Conv3d(dim, dim * 4, kernel_size=1),  # Expand
        nn.GELU(),
        nn.Conv3d(dim * 4, dim, kernel_size=1),  # Contract
    )
```
Key Features:
- Depthwise convolution: 7×7×7 kernel for spatio-temporal patterns
- Inverted bottleneck: expand to 4× channels, then contract
- GELU activation: smooth, modern activation function
- GroupNorm: stable normalization for small batch sizes
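To see where the parameters sit inside one such block, a back-of-the-envelope count can help (a sketch assuming biased convolutions and affine GroupNorm, the PyTorch defaults for the layers listed above):

```python
def convnext_block_params(dim: int) -> int:
    """Parameter count for one ConvNeXt 3D block at the given channel width."""
    k = 7 ** 3                            # depthwise kernel volume (7x7x7)
    depthwise = dim * k + dim             # one 7^3 filter per channel, plus bias
    norm = 2 * dim                        # GroupNorm scale + shift
    expand = dim * (4 * dim) + 4 * dim    # 1x1x1 conv to 4x channels
    contract = (4 * dim) * dim + dim      # 1x1x1 conv back to dim
    return depthwise + norm + expand + contract

print(convnext_block_params(768))  # 4988160, i.e. ~5M parameters per stage-4 block
```

The two 1×1×1 convolutions of the inverted bottleneck dominate; the 7³ depthwise kernel stays cheap because it has only one filter per channel.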
Feature Dimensions
Input and output dimensions through the backbone:

| Stage | Input Dim | Output Dim | Spatial Size |
|-------|-----------|------------|--------------|
| Input | (B, 3, 64, 640, 640) | - | - |
| Stem | (B, 3, 64, 640, 640) | (B, 96, 64, 160, 160) | ÷4 |
| Stage 1 | (B, 96, 64, 160, 160) | (B, 192, 64, 80, 80) | ÷2 |
| Stage 2 | (B, 192, 64, 80, 80) | (B, 384, 64, 40, 40) | ÷2 |
| Stage 3 | (B, 384, 64, 40, 40) | (B, 768, 64, 20, 20) | ÷2 |
| Stage 4 | (B, 768, 64, 20, 20) | (B, 768, 64, 20, 20) | - |
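As a sanity check, the downsampling schedule can be traced with plain shape arithmetic (a sketch; channel counts and strides taken from the backbone listing):

```python
def stage_out(shape, out_ch, stride):
    """Apply a (st, sh, sw) strided conv to a (C, T, H, W) shape tuple."""
    c, t, h, w = shape
    st, sh, sw = stride
    return (out_ch, t // st, h // sh, w // sw)

shape = (3, 64, 640, 640)                    # input video clip
shape = stage_out(shape, 96, (1, 4, 4))      # stem: /4 spatial
print(shape)                                  # (96, 64, 160, 160)
for ch in (192, 384, 768):                   # downsampling convs between stages
    shape = stage_out(shape, ch, (1, 2, 2))
print(shape)                                  # (768, 64, 20, 20)
```

Note that the temporal stride is always 1, so all 64 frames survive to the final feature map.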
2. Disentangled Quality Heads
DOVER++ separates quality into aesthetic and technical dimensions:
Aesthetic Head
Evaluates artistic and visual appeal aspects:
- Color harmony
- Composition
- Visual creativity
- Artistic style
Technical Head
Assesses technical quality factors:
- Sharpness
- Artifacts
- Noise levels
- Compression quality
From dover_model.py:99-116:
```python
# Separate heads for aesthetic and technical quality
self.aesthetic_head = nn.Sequential(
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
    nn.Linear(768, 256),
    nn.ReLU(inplace=True),
    nn.Dropout(0.1),
    nn.Linear(256, 1)
)
self.technical_head = nn.Sequential(
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
    nn.Linear(768, 256),
    nn.ReLU(inplace=True),
    nn.Dropout(0.1),
    nn.Linear(256, 1)
)
```
3. Quality-Aware Fusion Mechanism
The fusion module combines DOVER++ video features with text embeddings using cross-modal attention:
Quality Aspect Classification
Analyzes the text prompt to determine which quality aspects are emphasized. From dover_model.py:211-218:

```python
self.quality_classifier = nn.Sequential(
    nn.Linear(text_dim, hidden_dim),
    nn.ReLU(inplace=True),
    nn.Dropout(0.1),
    nn.Linear(hidden_dim, 4),  # 4 quality aspects
    nn.Softmax(dim=-1)
)
```
Outputs 4 weights for: Traditional, Alignment, Aesthetic, Temporal
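The final Softmax turns the classifier's logits into a distribution over the four aspects; a minimal pure-Python sketch with made-up logits:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical logits for [Traditional, Alignment, Aesthetic, Temporal];
# e.g. a prompt stressing text-video match pushes the Alignment logit up.
weights = softmax([0.2, 1.5, 0.1, -0.3])
print(weights)  # four positive weights summing to 1, largest on Alignment
```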
Feature Projection
Project both modalities to a common dimension:

```python
self.dover_proj = nn.Linear(dover_dim, hidden_dim)  # 1024 -> 512
self.text_proj = nn.Linear(text_dim, hidden_dim)    # 1024 -> 512
```
Cross-Modal Attention
From dover_model.py:220-226:

```python
self.cross_attention = nn.MultiheadAttention(
    embed_dim=hidden_dim,
    num_heads=8,
    dropout=0.1,
    batch_first=True
)
```
Text queries attend to video features to extract relevant quality information.
Feature Fusion
From dover_model.py:265-277:

```python
# Cross-modal attention
attended_dover, _ = self.cross_attention(
    query=text_proj_seq,
    key=dover_proj_seq,
    value=dover_proj_seq
)
# Concatenate and fuse
combined_features = torch.cat([attended_dover, text_proj], dim=-1)
fused_features = self.fusion_layer(combined_features)
```
Why Cross-Modal Attention? Text prompts like “smooth camera motion” or “vibrant colors” guide the model to focus on specific quality aspects. The attention mechanism allows the model to selectively emphasize relevant video features based on the text description.
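The mechanism can be illustrated with a single-query, pure-Python sketch of scaled dot-product attention (toy vectors, not real model features):

```python
import math

def cross_attend(query, keys, values):
    """One text query attending over video feature 'tokens'
    via scaled dot-product attention."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    w = [e / z for e in exps]                     # attention weights
    # Weighted sum of value vectors
    return [sum(wi * v[j] for wi, v in zip(w, values))
            for j in range(len(values[0]))]

# Toy example: a text query aligned with the first video token
# pulls the output toward that token's value vector.
text_q = [1.0, 0.0]
video_k = [[1.0, 0.0], [0.0, 1.0]]
video_v = [[10.0, 0.0], [0.0, 10.0]]
out = cross_attend(text_q, video_k, video_v)
print(out)  # first component dominates
```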
4. MOS Prediction Head
The final prediction head generates 5 MOS scores from fused features:
From dover_model.py:286-303:
```python
class MOSPredictor(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.predictor = nn.Sequential(
            nn.LayerNorm(input_dim),
            nn.Linear(input_dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(0.15),
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.GELU(),
            nn.Dropout(0.15),
            nn.Linear(hidden_dim // 2, 5)  # 4 sub-MOS + Overall
        )
```
Output Scores:
- Traditional MOS: overall technical quality
- Alignment MOS: text-video correspondence
- Aesthetic MOS: visual appeal
- Temporal MOS: motion smoothness
- Overall MOS: weighted combination
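The model predicts the Overall score directly as the fifth output; purely for intuition, a hypothetical weighted combination of the four sub-scores using aspect weights could look like this (illustrative numbers, not model outputs):

```python
def overall_mos(sub_scores, aspect_weights):
    """Illustrative weighted combination of the 4 sub-MOS scores.
    Hypothetical: the real model learns its Overall head end to end."""
    return sum(s * w for s, w in zip(sub_scores, aspect_weights))

# Made-up sub-scores for [Traditional, Alignment, Aesthetic, Temporal]
score = overall_mos([3.8, 4.2, 3.5, 4.0], [0.25, 0.25, 0.25, 0.25])
print(score)  # roughly 3.875 with uniform weights
```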
Forward Pass
The complete forward pass integrates all components:
From dover_model.py:372-402:

```python
def forward(self, frames: torch.Tensor, prompts: List[str]) -> torch.Tensor:
    """
    Args:
        frames: Video frames tensor (B, C, T, H, W)
        prompts: List of text prompts

    Returns:
        MOS predictions (B, 5) - [Traditional, Alignment, Aesthetic, Temporal, Overall]
    """
    # Extract DOVER++ features
    dover_output = self.dover_model(frames)
    dover_features = dover_output['features']  # (B, 1024)

    # Extract text features
    with torch.no_grad():
        text_features = self.text_encoder.encode(
            prompts,
            convert_to_tensor=True,
            normalize_embeddings=True,
            device=self.device
        )  # (B, 1024)

    # Quality-aware fusion
    fused_features, quality_weights = self.fusion(dover_features, text_features)
    # fused_features: (B, 256)
    # quality_weights: (B, 4)

    # Predict MOS scores
    mos_predictions = self.mos_predictor(fused_features)  # (B, 5)

    return mos_predictions
```
Technical Specifications
Model Specs

| Parameter | Value |
|-----------|-------|
| Total Parameters | ~120M |
| Trainable Parameters | ~120M |
| Input Resolution | 640×640 |
| Input Frames | 64 |
| Backbone Channels | 768 |
| Feature Dimension | 1024 |
| Hidden Dimension | 512 |
| Output Dimension | 5 (MOS scores) |
Training Config

From config.py:21-35:

```python
DOVER_CONFIG = {
    "model_name": "DOVER++",
    "video_resolution": (640, 640),
    "num_frames": 64,
    "text_encoder": "BAAI/bge-large-en-v1.5",
    "dover_dim": 1024,
    "text_dim": 1024,
    "hidden_dim": 512,
    "batch_size": 4,
    "learning_rate": 1e-4,
    "epochs": 5,
    "gradient_accumulation_steps": 8,
    "effective_batch_size": 32
}
```
Memory Usage

| Component | Memory |
|-----------|--------|
| Model Parameters | ~480 MB |
| Single Batch (B=4) | ~8 GB |
| Gradients | ~480 MB |
| Optimizer States (Adam) | ~960 MB |
| Activations | ~2 GB |
| Total | ~12 GB |
With gradient accumulation (8 steps), effective batch size of 32 can be achieved on a single 12GB GPU.
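Both the memory budget and the effective batch size follow from simple arithmetic (a sketch; fp32 parameters assumed, decimal MB):

```python
def fp32_mb(num_params: int) -> float:
    """Memory for num_params float32 values, in decimal MB."""
    return num_params * 4 / 1e6

PARAMS = 120_000_000
weights_mb = fp32_mb(PARAMS)        # 480.0 MB of weights
grads_mb = fp32_mb(PARAMS)          # gradients mirror the weights: 480.0 MB
adam_mb = 2 * fp32_mb(PARAMS)       # Adam keeps two moment buffers: 960.0 MB

# Effective batch size: gradients are summed over the micro-batches
# before a single optimizer step.
batch_size, accum_steps = 4, 8
effective = batch_size * accum_steps

print(weights_mb, grads_mb, adam_mb, effective)  # 480.0 480.0 960.0 32
```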
Pretrained Weights
DOVER++ uses pretrained weights from the original DOVER project:
From dover_model.py:36-43:
```python
# Download weights if they don't exist
if not os.path.exists(weights_path):
    os.makedirs(os.path.dirname(weights_path), exist_ok=True)
    print(f"Downloading DOVER++ weights to {weights_path}")
    urllib.request.urlretrieve(
        "https://huggingface.co/teowu/DOVER/resolve/main/DOVER_plus_plus.pth",
        weights_path
    )
```
The pretrained weights are loaded with `strict=False` to allow for architecture modifications in the fusion and prediction heads. Only the ConvNeXt backbone weights are transferred.
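The effect of `strict=False` can be mimicked with plain dictionaries (a sketch with illustrative key names, not the real checkpoint keys):

```python
def load_partial(model_keys, checkpoint):
    """Mimic torch's load_state_dict(strict=False): copy only the keys
    both sides share, and report what was skipped on each side."""
    loaded = {k: v for k, v in checkpoint.items() if k in model_keys}
    missing = [k for k in model_keys if k not in checkpoint]
    unexpected = [k for k in checkpoint if k not in model_keys]
    return loaded, missing, unexpected

# Hypothetical names: backbone key matches, fusion head is new,
# and the checkpoint carries an old head the new model dropped.
model_keys = {"backbone.stem.weight", "fusion.proj.weight"}
ckpt = {"backbone.stem.weight": "tensor...", "old_head.weight": "tensor..."}
loaded, missing, unexpected = load_partial(model_keys, ckpt)
print(sorted(loaded), missing, unexpected)
```

Missing keys (the new fusion and prediction heads) keep their random initialization; unexpected keys in the checkpoint are simply ignored.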
Feature Extraction
Extract intermediate features for analysis:
```python
features = model.extract_features(frames, prompts)
# Returns:
# {
#     'dover_features': (B, 1024),
#     'text_features': (B, 1024),
#     'fused_features': (B, 256),
#     'quality_weights': (B, 4),
#     'aesthetic_score': (B, 1),
#     'technical_score': (B, 1)
# }
```
Advantages
- Disentangled Quality: separates aesthetic and technical aspects for interpretable assessment
- Cross-Modal Fusion: attention mechanism aligns video features with text guidance
- Pretrained Backbone: leverages high-quality pretrained weights from the DOVER project
- Quality-Aware: dynamically weights quality aspects based on the text prompt
See Also
- V-JEPA2 Model: alternative architecture with ViT backbone
- Quality Dimensions: understanding the 4 quality metrics
- Architecture: overall system architecture