Overview

V-JEPA2 (Video Joint Embedding Predictive Architecture v2) is a Vision Transformer-based model specifically designed for video understanding. QualiVision adapts V-JEPA2 with strategic layer freezing (85%) and discriminative learning rates for memory-efficient video quality assessment.
V-JEPA2 Key Stats
  • Parameters: ~1.1 billion (85% frozen)
  • Trainable: ~165 million parameters
  • Input Resolution: 384×384
  • Frames: 64 per video
  • Memory: ~16GB GPU
  • Architecture: ViT-Giant (Vision Transformer)

Architecture Components

1. V-JEPA2 ViT-Giant Backbone

The backbone is a massive Vision Transformer pretrained on video data:
From vjepa_model.py:93-102:
# Video encoder - Use FP32 for stable gradients
self.venc = AutoModel.from_pretrained(
    vjepa_model_id,  # "facebook/vjepa2-vitg-fpc64-384-ssv2"
    torch_dtype=torch.float32,
    output_hidden_states=True,
    attn_implementation="sdpa",  # Scaled dot-product attention
)
Key Configuration:
  • Model: facebook/vjepa2-vitg-fpc64-384-ssv2
  • Precision: FP32 for stable gradients (not FP16)
  • Attention: SDPA (PyTorch’s optimized attention)
  • Hidden States: Enabled for feature extraction
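
As a quick sanity check, the encoder's feature width (the 1408-dimensional video feature used throughout this page) can be read off the loaded config; a minimal sketch, assuming the config exposes the standard hidden_size attribute:
# Hedged sanity check: the video feature dimension (dv) consumed by the MOS head below
# (assumes the VJEPA2 config exposes the standard hidden_size attribute)
print(self.venc.config.hidden_size)  # expected: 1408 for ViT-Giant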

2. Strategic Layer Freezing

The most innovative aspect is freezing 85% of the transformer layers to reduce memory and improve training efficiency:
Step 1: Count Total Layers

From vjepa_model.py:140-144:
total_layers = 0  # initialized before this excerpt
for name, p in self.venc.named_parameters():
    if "encoder.layer." in name:
        layer_match = name.split("encoder.layer.")[1].split(".")[0]
        if layer_match.isdigit():
            total_layers = max(total_layers, int(layer_match) + 1)
Detects 40 transformer layers in ViT-Giant
Step 2: Determine Freeze Boundary

freeze_until_layer = int(total_layers * self.freeze_ratio)  # 0.85 * 40 = 34
print(f"Freezing layers 0-{freeze_until_layer-1}, training layers {freeze_until_layer}-{total_layers-1}")
# Output: "Freezing layers 0-33, training layers 34-39"
Only the top 6 layers remain trainable
Step 3: Apply Freezing

From vjepa_model.py:152-173:
for name, p in self.venc.named_parameters():
    should_freeze = False
    
    # Always freeze embeddings and pooler
    if "embeddings" in name or "pooler" in name:
        should_freeze = True
    
    # Freeze bottom layers
    elif "encoder.layer." in name:
        layer_match = name.split("encoder.layer.")[1].split(".")[0]
        if layer_match.isdigit():
            layer_num = int(layer_match)
            if layer_num < freeze_until_layer:
                should_freeze = True
    
    # Apply freezing
    if should_freeze:
        p.requires_grad = False
Step 4: Memory Savings

From vjepa_model.py:175-179:
print(f"Layer freezing applied:")
print(f"  Frozen parameters: {frozen_count:,}")
print(f"  Trainable parameters: {trainable_count:,}")
print(f"  Memory savings: ~{(frozen_count/(frozen_count+trainable_count))*100:.0f}% reduction in gradient computation")
Result: ~85% reduction in gradient memory
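
The print statements above reference frozen_count and trainable_count, which the excerpt does not compute; a minimal sketch of one way to tally them, assuming they are summed over the video encoder's parameters:
# Hypothetical tally for the counts printed above (not shown in the excerpt)
frozen_count = sum(p.numel() for p in self.venc.parameters() if not p.requires_grad)
trainable_count = sum(p.numel() for p in self.venc.parameters() if p.requires_grad)
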
Why 85% Freezing? The bottom layers learn general visual features (edges, textures, basic shapes) that transfer well across tasks. Only the top layers need fine-tuning for task-specific quality assessment. This dramatically reduces memory while maintaining performance.

3. Text Encoder Integration

BGE-Large encodes text prompts. From vjepa_model.py:112:
self.tenc = SentenceTransformer(text_model_id, device=device)
# "BAAI/bge-large-en-v1.5"
Text Encoding Process:
def encode_text(self, prompts: List[str]) -> torch.Tensor:
    with torch.no_grad():
        text_emb = self.tenc.encode(
            prompts,
            convert_to_tensor=True,
            normalize_embeddings=True,
            device=self.device
        )
    return text_emb  # (B, 768)
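
For illustration, a hypothetical call (the prompt strings are made up; the shape follows the comment above):
prompts = ["a dog surfing a wave at sunset", "timelapse of city traffic at night"]
text_emb = model.encode_text(prompts)  # (2, text_dim), L2-normalized embeddings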

4. Optimized MOS Prediction Head

Simple yet effective concatenation-based fusion:
From vjepa_model.py:22-42:
class OptimizedMOSHead(nn.Module):
    def __init__(self, dv: int, dt: int, h: int = 512):
        super().__init__()
        
        self.net = nn.Sequential(
            nn.LayerNorm(dv + dt),  # 1408 + 768 = 2176
            nn.Linear(dv + dt, h),   # -> 512
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(h, h // 2),    # -> 256
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(h // 2, 5)     # -> 5 MOS scores
        )
Key Design Choices:
  • LayerNorm first: Normalizes concatenated features
  • Two-layer MLP: 512 → 256 → 5 dimensions
  • GELU activation: Smooth, modern activation
  • Dropout 0.1: Regularization for generalization
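
The excerpt omits OptimizedMOSHead.forward. Because the model's forward pass (shown below) calls self.head(cls_token, text_emb), a minimal sketch of the concatenation-based forward, assuming the fusion is a plain torch.cat:
def forward(self, v: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    # Concatenate video (B, dv) and text (B, dt) features, then regress 5 MOS scores
    return self.net(torch.cat([v, t], dim=-1))  # (B, 5)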

Discriminative Learning Rates

Different parameter groups get different learning rates:

Text Encoder

Lowest LR: 0.1× base rate. Text embeddings are already well trained and need only minimal adjustment.

Video Encoder

Medium LR: 0.5× base rate. The top 6 unfrozen layers are fine-tuned moderately.

Prediction Head

Highest LR: 2.0× base rate. The head is trained from scratch and needs aggressive updates.
From vjepa_model.py:221-243:
def get_discriminative_params(self) -> List[Dict[str, Any]]:
    # Text encoder parameters (lowest lr)
    text_params = list(self.tenc.parameters())
    
    # Video encoder parameters (medium lr)
    video_params = [p for p in self.venc.parameters() if p.requires_grad]
    
    # Prediction head parameters (highest lr)
    head_params = list(self.head.parameters())
    
    param_groups = [
        {'params': text_params, 'name': 'text_encoder'},
        {'params': video_params, 'name': 'video_encoder'},
        {'params': head_params, 'name': 'prediction_head'}
    ]
    
    return param_groups
Usage in Optimizer:
param_groups = model.get_discriminative_params()
optimizer = torch.optim.AdamW([
    {'params': param_groups[0]['params'], 'lr': base_lr * 0.1},  # text
    {'params': param_groups[1]['params'], 'lr': base_lr * 0.5},  # video
    {'params': param_groups[2]['params'], 'lr': base_lr * 2.0},  # head
])
From config.py:52-56:
"discriminative_lr": {
    "text": 0.1,
    "video": 0.5,
    "head": 2.0
}
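
Tying the two together, a minimal sketch that derives each group's learning rate from the config multipliers (the base_lr value and the group-name-to-multiplier mapping are assumptions):
base_lr = 1e-4  # hypothetical base learning rate
lr_mult = {"text_encoder": 0.1, "video_encoder": 0.5, "prediction_head": 2.0}  # mirrors "discriminative_lr"

optimizer = torch.optim.AdamW([
    {"params": g["params"], "lr": base_lr * lr_mult[g["name"]]}
    for g in model.get_discriminative_params()
])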

Forward Pass

Complete inference pipeline:
From vjepa_model.py:193-219:
def forward(self, pixel_values_videos: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """
    Args:
        pixel_values_videos: Video tensor (B, C, T, H, W)
        text_emb: Text embeddings (B, text_dim)
        
    Returns:
        MOS predictions (B, 5) - [Traditional, Alignment, Aesthetic, Temporal, Overall]
    """
    # Ensure FP32 for stable gradients
    pixel_values_videos = pixel_values_videos.to(self.venc.device, dtype=torch.float32)
    
    # Forward pass through video encoder
    outputs = self.venc(pixel_values_videos=pixel_values_videos, output_hidden_states=True)
    
    # Get CLS token (first token)
    cls_token = outputs.last_hidden_state[:, 0]  # (B, 1408)
    
    # Ensure text embeddings match video features
    text_emb = text_emb.to(cls_token.device, dtype=cls_token.dtype)
    
    # MOS prediction
    mos_scores = self.head(cls_token, text_emb)
    
    return mos_scores
CLS Token: Following ViT tradition, the first token (CLS token) serves as the video-level representation. This single 1408-dimensional vector encodes the entire video’s semantic content.
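
Putting it all together, an end-to-end inference sketch with dummy inputs (shapes follow the docstring above; constructing and loading the model is assumed):
import torch

video = torch.randn(2, 3, 64, 384, 384)  # (B, C, T, H, W): 2 dummy videos, 64 frames at 384x384
text_emb = model.encode_text(["a red car drifting on a wet road", "rain falling on a window"])
with torch.no_grad():
    mos = model(video, text_emb)  # (2, 5): [Traditional, Alignment, Aesthetic, Temporal, Overall]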

Technical Specifications

  • Total Parameters: ~1.1B
  • Frozen Parameters: ~935M (85%)
  • Trainable Parameters: ~165M (15%)
  • Input Resolution: 384×384
  • Input Frames: 64
  • Video Feature Dim: 1408
  • Text Feature Dim: 768
  • Hidden Dimension: 512
  • Output Dimension: 5

Gradient Checkpointing

Note that gradient checkpointing is disabled for V-JEPA2. From vjepa_model.py:107-109:
# Disable gradient checkpointing for better gradient flow
if hasattr(self.venc, 'gradient_checkpointing_enable'):
    self.venc.gradient_checkpointing_disable()
Why Disabled? While gradient checkpointing saves memory, it can hurt gradient flow and training stability in large models. Since we already achieve excellent memory efficiency through layer freezing, we prioritize training quality over further memory savings.

Feature Extraction API

Extract video and text features separately:
# Extract features
features = model.extract_features(pixel_values_videos, prompts)
# Returns:
# {
#     'video_features': (B, 1408),
#     'text_features': (B, 768)
# }

# Or predict from pre-extracted features
mos_scores = model.predict_with_features(
    video_features=features['video_features'],
    text_features=features['text_features']
)

Encoder Control

Dynamic freezing/unfreezing:
# Freeze/unfreeze video encoder
model.freeze_video_encoder()    # Freeze all video encoder params
model.unfreeze_video_encoder()  # Unfreeze with 85% strategic freezing

# Freeze/unfreeze text encoder
model.freeze_text_encoder()     # Freeze all text encoder params
model.unfreeze_text_encoder()   # Unfreeze all text encoder params

Advantages

Memory Efficient

85% parameter freezing reduces memory by 47% while maintaining performance

Discriminative Learning

Different learning rates optimize each component independently

Strong Pretraining

ViT-Giant pretrained on massive video datasets provides excellent features

Simple Fusion

Concatenation-based fusion is simple yet effective for video-text alignment

Comparison with DOVER++

Each point lists V-JEPA2 first, then DOVER++:
  • Architecture: Vision Transformer vs. ConvNeXt 3D CNN
  • Parameters: 1.1B (165M trainable) vs. 120M (all trainable)
  • Memory: 16GB vs. 12GB
  • Resolution: 384×384 vs. 640×640
  • Fusion: Concatenation vs. cross-modal attention
  • Pretraining: Video-specific vs. image quality
  • Strengths: Semantic understanding vs. fine-grained quality

Related Pages

  • DOVER++ Model: Alternative CNN-based architecture
  • Quality Dimensions: Understanding the 4 quality metrics
  • Architecture: Overall system architecture
