V-JEPA2 (Video Joint Embedding Predictive Architecture v2) is a Vision Transformer-based model specifically designed for video understanding. QualiVision adapts V-JEPA2 with strategic layer freezing (85%) and discriminative learning rates for memory-efficient video quality assessment.
The most innovative aspect is freezing 85% of the transformer layers to reduce memory and improve training efficiency:
1. Count Total Layers
From vjepa_model.py:140-144:
```python
for name, p in self.venc.named_parameters():
    if "encoder.layer." in name:
        layer_match = name.split("encoder.layer.")[1].split(".")[0]
        if layer_match.isdigit():
            total_layers = max(total_layers, int(layer_match) + 1)
```
2. Freeze Embeddings and Bottom Layers

Every parameter belonging to the embeddings, the pooler, or a layer below the cutoff has its gradient disabled:

```python
for name, p in self.venc.named_parameters():
    should_freeze = False

    # Always freeze embeddings and pooler
    if "embeddings" in name or "pooler" in name:
        should_freeze = True

    # Freeze bottom layers
    elif "encoder.layer." in name:
        layer_match = name.split("encoder.layer.")[1].split(".")[0]
        if layer_match.isdigit():
            layer_num = int(layer_match)
            if layer_num < freeze_until_layer:
                should_freeze = True

    # Apply freezing
    if should_freeze:
        p.requires_grad = False
```
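With the layer count in hand, the 85% ratio gives the `freeze_until_layer` cutoff used above. The snippet below is a minimal sketch of that calculation plus a sanity check of how many parameters stay trainable; the ratio variable and print-out are illustrative, not taken from vjepa_model.py:

```python
# Sketch: derive the freeze cutoff from the 85% ratio and report what remains trainable.
freeze_ratio = 0.85
freeze_until_layer = int(total_layers * freeze_ratio)  # e.g. 32 layers -> freeze layers 0..26

trainable = sum(p.numel() for p in self.venc.parameters() if p.requires_grad)
total = sum(p.numel() for p in self.venc.parameters())
print(f"Trainable encoder params: {trainable / 1e6:.1f}M / {total / 1e6:.1f}M "
      f"({100 * trainable / total:.1f}%)")
```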
Why 85% Freezing? The bottom layers learn general visual features (edges, textures, basic shapes) that transfer well across tasks; only the top layers need fine-tuning for task-specific quality assessment. Freezing them removes most trainable parameters, and with them the gradient and optimizer-state memory, while maintaining performance.
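The discriminative learning rates mentioned in the introduction pair naturally with this scheme: the few unfrozen encoder layers get a small learning rate, while the freshly initialized prediction head gets a larger one. A minimal sketch, assuming AdamW and illustrative rates (the values and group split are not taken from the training config):

```python
import torch

# Assumed parameter groups: gentle LR for the unfrozen top encoder layers,
# larger LR for the MOS head. The actual rates are illustrative.
encoder_params = [p for p in model.venc.parameters() if p.requires_grad]
head_params = list(model.head.parameters())

optimizer = torch.optim.AdamW(
    [
        {"params": encoder_params, "lr": 1e-5},  # fine-tune the unfrozen layers slowly
        {"params": head_params, "lr": 1e-4},     # train the new head faster
    ],
    weight_decay=0.01,
)
```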
```python
def forward(self, pixel_values_videos: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """
    Args:
        pixel_values_videos: Video tensor (B, C, T, H, W)
        text_emb: Text embeddings (B, text_dim)

    Returns:
        MOS predictions (B, 5) - [Traditional, Alignment, Aesthetic, Temporal, Overall]
    """
    # Ensure FP32 for stable gradients
    pixel_values_videos = pixel_values_videos.to(self.venc.device, dtype=torch.float32)

    # Forward pass through video encoder
    outputs = self.venc(pixel_values_videos=pixel_values_videos, output_hidden_states=True)

    # Get CLS token (first token)
    cls_token = outputs.last_hidden_state[:, 0]  # (B, 1408)

    # Ensure text embeddings match video features
    text_emb = text_emb.to(cls_token.device, dtype=cls_token.dtype)

    # MOS prediction
    mos_scores = self.head(cls_token, text_emb)
    return mos_scores
```
CLS Token: Following ViT tradition, the first token (CLS token) serves as the video-level representation. This single 1408-dimensional vector encodes the entire video’s semantic content.
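The fusion head itself is not shown above. A plausible minimal version, assuming the CLS token is simply concatenated with the text embedding and passed through a small MLP (the class name MOSHead, the text dimension, and the layer sizes are hypothetical), looks like this:

```python
import torch
import torch.nn as nn

class MOSHead(nn.Module):
    """Hypothetical fusion head: concatenate the video CLS token and the text
    embedding, then regress the five MOS dimensions."""

    def __init__(self, video_dim: int = 1408, text_dim: int = 768, hidden_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(video_dim + text_dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, 5),  # [Traditional, Alignment, Aesthetic, Temporal, Overall]
        )

    def forward(self, cls_token: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([cls_token, text_emb], dim=-1)  # (B, video_dim + text_dim)
        return self.mlp(fused)                            # (B, 5)
```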
Why Disabled? Gradient checkpointing saves activation memory by recomputing activations during the backward pass, which slows training and adds complexity in large models. Since layer freezing already provides the memory savings we need, we keep checkpointing disabled and prioritize training speed and simplicity over further memory reductions.
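For reference, and assuming the encoder is a Hugging Face transformers model, the checkpointing knob we leave off would be toggled like this:

```python
# Illustrative only: Hugging Face models expose these toggles.
# QualiVision keeps checkpointing disabled and relies on layer freezing instead.
self.venc.gradient_checkpointing_disable()    # our default
# self.venc.gradient_checkpointing_enable()   # would trade extra compute for memory
```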
```python
# Freeze/unfreeze video encoder
model.freeze_video_encoder()    # Freeze all video encoder params
model.unfreeze_video_encoder()  # Unfreeze with 85% strategic freezing

# Freeze/unfreeze text encoder
model.freeze_text_encoder()     # Freeze all text encoder params
model.unfreeze_text_encoder()   # Unfreeze all text encoder params
```
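A typical fine-tuning setup might call these helpers in sequence before building the optimizer; the ordering below is an assumed usage pattern, not a documented recipe:

```python
import torch

# Assumed staged setup: keep text features fixed, fine-tune the top 15% of the
# video encoder, then optimize only the parameters that remain trainable.
model.freeze_text_encoder()
model.unfreeze_video_encoder()  # re-applies the 85% strategic freezing
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-5
)
```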