Overview

V-JEPA2 (Video Joint Embedding Predictive Architecture v2) is a Vision Transformer-based model specifically designed for video understanding. QualiVision adapts V-JEPA2 with strategic layer freezing (85%) and discriminative learning rates for memory-efficient video quality assessment.
V-JEPA2 Key Stats
  • Parameters: ~1.1 billion (85% frozen)
  • Trainable: ~165 million parameters
  • Input Resolution: 384×384
  • Frames: 64 per video
  • Memory: ~16GB GPU
  • Architecture: ViT-Giant (Vision Transformer)

Architecture Components

1. V-JEPA2 ViT-Giant Backbone

The backbone is a massive Vision Transformer pretrained on video data:
From vjepa_model.py:93-102:
# Video encoder - Use FP32 for stable gradients
self.venc = AutoModel.from_pretrained(
    vjepa_model_id,  # "facebook/vjepa2-vitg-fpc64-384-ssv2"
    torch_dtype=torch.float32,
    output_hidden_states=True,
    attn_implementation="sdpa",  # Scaled dot-product attention
)
Key Configuration:
  • Model: facebook/vjepa2-vitg-fpc64-384-ssv2
  • Precision: FP32 for stable gradients (not FP16)
  • Attention: SDPA (PyTorch’s optimized attention)
  • Hidden States: Enabled for feature extraction
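
As a quick sanity check, the encoder's feature width (the 1408-dimensional video feature used throughout this page) can be read off the loaded config; a minimal sketch, assuming the config exposes the standard hidden_size attribute:
# Hedged sanity check: the video feature dimension (dv) consumed by the MOS head below
# (assumes the VJEPA2 config exposes the standard hidden_size attribute)
print(self.venc.config.hidden_size)  # expected: 1408 for ViT-Giant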

2. Strategic Layer Freezing

The most innovative aspect is freezing 85% of the transformer layers to reduce memory and improve training efficiency:
Step 1: Count Total Layers

From vjepa_model.py:140-144:
total_layers = 0  # initialized before this excerpt
for name, p in self.venc.named_parameters():
    if "encoder.layer." in name:
        layer_match = name.split("encoder.layer.")[1].split(".")[0]
        if layer_match.isdigit():
            total_layers = max(total_layers, int(layer_match) + 1)
Detects 40 transformer layers in ViT-Giant
Step 2: Determine Freeze Boundary

freeze_until_layer = int(total_layers * self.freeze_ratio)  # 0.85 * 40 = 34
print(f"Freezing layers 0-{freeze_until_layer-1}, training layers {freeze_until_layer}-{total_layers-1}")
# Output: "Freezing layers 0-33, training layers 34-39"
Only the top 6 layers remain trainable
Step 3: Apply Freezing

From vjepa_model.py:152-173:
for name, p in self.venc.named_parameters():
    should_freeze = False
    
    # Always freeze embeddings and pooler
    if "embeddings" in name or "pooler" in name:
        should_freeze = True
    
    # Freeze bottom layers
    elif "encoder.layer." in name:
        layer_match = name.split("encoder.layer.")[1].split(".")[0]
        if layer_match.isdigit():
            layer_num = int(layer_match)
            if layer_num < freeze_until_layer:
                should_freeze = True
    
    # Apply freezing
    if should_freeze:
        p.requires_grad = False
Step 4: Memory Savings

From vjepa_model.py:175-179:
print(f"Layer freezing applied:")
print(f"  Frozen parameters: {frozen_count:,}")
print(f"  Trainable parameters: {trainable_count:,}")
print(f"  Memory savings: ~{(frozen_count/(frozen_count+trainable_count))*100:.0f}% reduction in gradient computation")
Result: ~85% reduction in gradient memory
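
The print statements above reference frozen_count and trainable_count, which the excerpt does not compute; a minimal sketch of one way to tally them, assuming they are summed over the video encoder's parameters:
# Hypothetical tally for the counts printed above (not shown in the excerpt)
frozen_count = sum(p.numel() for p in self.venc.parameters() if not p.requires_grad)
trainable_count = sum(p.numel() for p in self.venc.parameters() if p.requires_grad)
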
Why 85% Freezing? The bottom layers learn general visual features (edges, textures, basic shapes) that transfer well across tasks. Only the top layers need fine-tuning for task-specific quality assessment. This dramatically reduces memory while maintaining performance.

3. Text Encoder Integration

BGE-Large encodes text prompts. From vjepa_model.py:112:
self.tenc = SentenceTransformer(text_model_id, device=device)
# "BAAI/bge-large-en-v1.5"
Text Encoding Process:
def encode_text(self, prompts: List[str]) -> torch.Tensor:
    with torch.no_grad():
        text_emb = self.tenc.encode(
            prompts,
            convert_to_tensor=True,
            normalize_embeddings=True,
            device=self.device
        )
    return text_emb  # (B, 768)
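
For illustration, a hypothetical call (the prompt strings are made up; the shape follows the comment above):
prompts = ["a dog surfing a wave at sunset", "timelapse of city traffic at night"]
text_emb = model.encode_text(prompts)  # (2, text_dim), L2-normalized embeddings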

4. Optimized MOS Prediction Head

Simple yet effective concatenation-based fusion:
From vjepa_model.py:22-42:
class OptimizedMOSHead(nn.Module):
    def __init__(self, dv: int, dt: int, h: int = 512):
        super().__init__()
        
        self.net = nn.Sequential(
            nn.LayerNorm(dv + dt),  # 1408 + 768 = 2176
            nn.Linear(dv + dt, h),   # -> 512
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(h, h // 2),    # -> 256
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(h // 2, 5)     # -> 5 MOS scores
        )
Key Design Choices:
  • LayerNorm first: Normalizes concatenated features
  • Two-layer MLP: 512 → 256 → 5 dimensions
  • GELU activation: Smooth, modern activation
  • Dropout 0.1: Regularization for generalization
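
The excerpt omits OptimizedMOSHead.forward. Because the model's forward pass (shown below) calls self.head(cls_token, text_emb), a minimal sketch of the concatenation-based forward, assuming the fusion is a plain torch.cat:
def forward(self, v: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    # Concatenate video (B, dv) and text (B, dt) features, then regress 5 MOS scores
    return self.net(torch.cat([v, t], dim=-1))  # (B, 5)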

Discriminative Learning Rates

Different parameter groups get different learning rates:

Text Encoder

Lowest LR: 0.1× base rate. Text embeddings are already well trained and need only minimal adjustment.

Video Encoder

Medium LR: 0.5× base rate. The top 6 unfrozen layers are fine-tuned moderately.

Prediction Head

Highest LR: 2.0× base rate. The head is trained from scratch and needs aggressive updates.
From vjepa_model.py:221-243:
def get_discriminative_params(self) -> List[Dict[str, Any]]:
    # Text encoder parameters (lowest lr)
    text_params = list(self.tenc.parameters())
    
    # Video encoder parameters (medium lr)
    video_params = [p for p in self.venc.parameters() if p.requires_grad]
    
    # Prediction head parameters (highest lr)
    head_params = list(self.head.parameters())
    
    param_groups = [
        {'params': text_params, 'name': 'text_encoder'},
        {'params': video_params, 'name': 'video_encoder'},
        {'params': head_params, 'name': 'prediction_head'}
    ]
    
    return param_groups
Usage in Optimizer:
param_groups = model.get_discriminative_params()
optimizer = torch.optim.AdamW([
    {'params': param_groups[0]['params'], 'lr': base_lr * 0.1},  # text
    {'params': param_groups[1]['params'], 'lr': base_lr * 0.5},  # video
    {'params': param_groups[2]['params'], 'lr': base_lr * 2.0},  # head
])
From config.py:52-56:
"discriminative_lr": {
    "text": 0.1,
    "video": 0.5,
    "head": 2.0
}
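
Tying the two together, a minimal sketch that derives each group's learning rate from the config multipliers (the base_lr value and the group-name-to-multiplier mapping are assumptions):
base_lr = 1e-4  # hypothetical base learning rate
lr_mult = {"text_encoder": 0.1, "video_encoder": 0.5, "prediction_head": 2.0}  # mirrors "discriminative_lr"

optimizer = torch.optim.AdamW([
    {"params": g["params"], "lr": base_lr * lr_mult[g["name"]]}
    for g in model.get_discriminative_params()
])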

Forward Pass

Complete inference pipeline:
From vjepa_model.py:193-219:
def forward(self, pixel_values_videos: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """
    Args:
        pixel_values_videos: Video tensor (B, C, T, H, W)
        text_emb: Text embeddings (B, text_dim)
        
    Returns:
        MOS predictions (B, 5) - [Traditional, Alignment, Aesthetic, Temporal, Overall]
    """
    # Ensure FP32 for stable gradients
    pixel_values_videos = pixel_values_videos.to(self.venc.device, dtype=torch.float32)
    
    # Forward pass through video encoder
    outputs = self.venc(pixel_values_videos=pixel_values_videos, output_hidden_states=True)
    
    # Get CLS token (first token)
    cls_token = outputs.last_hidden_state[:, 0]  # (B, 1408)
    
    # Ensure text embeddings match video features
    text_emb = text_emb.to(cls_token.device, dtype=cls_token.dtype)
    
    # MOS prediction
    mos_scores = self.head(cls_token, text_emb)
    
    return mos_scores
CLS Token: Following ViT tradition, the first token (CLS token) serves as the video-level representation. This single 1408-dimensional vector encodes the entire video’s semantic content.
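
Putting it all together, an end-to-end inference sketch with dummy inputs (shapes follow the docstring above; constructing and loading the model is assumed):
import torch

video = torch.randn(2, 3, 64, 384, 384)  # (B, C, T, H, W): 2 dummy videos, 64 frames at 384x384
text_emb = model.encode_text(["a red car drifting on a wet road", "rain falling on a window"])
with torch.no_grad():
    mos = model(video, text_emb)  # (2, 5): [Traditional, Alignment, Aesthetic, Temporal, Overall]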

Technical Specifications

  • Total Parameters: ~1.1B
  • Frozen Parameters: ~935M (85%)
  • Trainable Parameters: ~165M (15%)
  • Input Resolution: 384×384
  • Input Frames: 64
  • Video Feature Dim: 1408
  • Text Feature Dim: 768
  • Hidden Dimension: 512
  • Output Dimension: 5

Gradient Checkpointing

Note that gradient checkpointing is disabled for V-JEPA2. From vjepa_model.py:107-109:
# Disable gradient checkpointing for better gradient flow
if hasattr(self.venc, 'gradient_checkpointing_enable'):
    self.venc.gradient_checkpointing_disable()
Why Disabled? While gradient checkpointing saves memory, it can hurt gradient flow and training stability in large models. Since we already achieve excellent memory efficiency through layer freezing, we prioritize training quality over further memory savings.

Feature Extraction API

Extract video and text features separately:
# Extract features
features = model.extract_features(pixel_values_videos, prompts)
# Returns:
# {
#     'video_features': (B, 1408),
#     'text_features': (B, 768)
# }

# Or predict from pre-extracted features
mos_scores = model.predict_with_features(
    video_features=features['video_features'],
    text_features=features['text_features']
)

Encoder Control

Dynamic freezing/unfreezing:
# Freeze/unfreeze video encoder
model.freeze_video_encoder()    # Freeze all video encoder params
model.unfreeze_video_encoder()  # Unfreeze with 85% strategic freezing

# Freeze/unfreeze text encoder
model.freeze_text_encoder()     # Freeze all text encoder params
model.unfreeze_text_encoder()   # Unfreeze all text encoder params

Advantages

Memory Efficient

85% parameter freezing reduces memory by 47% while maintaining performance

Discriminative Learning

Different learning rates optimize each component independently

Strong Pretraining

ViT-Giant pretrained on massive video datasets provides excellent features

Simple Fusion

Concatenation-based fusion is simple yet effective for video-text alignment

Comparison with DOVER++

Each point lists V-JEPA2 first, then DOVER++:
  • Architecture: Vision Transformer vs. ConvNeXt 3D CNN
  • Parameters: 1.1B (165M trainable) vs. 120M (all trainable)
  • Memory: 16GB vs. 12GB
  • Resolution: 384×384 vs. 640×640
  • Fusion: Concatenation vs. cross-modal attention
  • Pretraining: Video-specific vs. image quality
  • Strengths: Semantic understanding vs. fine-grained quality

Related Pages

  • DOVER++ Model: Alternative CNN-based architecture
  • Quality Dimensions: Understanding the 4 quality metrics
  • Architecture: Overall system architecture
