Overview
V-JEPA2 is a state-of-the-art video encoder based on Vision Transformers (ViT-Giant) with strategic layer freezing for efficient fine-tuning. The implementation includes discriminative learning rates, optimized MOS prediction, and BGE-Large text encoding.
VJEPAModel
The main V-JEPA2 model class with strategic layer freezing and discriminative learning rates.
Constructor
- HuggingFace model ID for the V-JEPA2 video encoder. Uses ViT-Giant with 64 frames per clip at 384×384 resolution.
- HuggingFace model ID for the text encoder. Uses BGE-Large for high-quality text embeddings.
- Ratio of transformer layers to freeze (0.85 = freeze the bottom 85%). Reduces memory usage and improves training stability.
- Device to place the model on ('cuda' or 'cpu').
Methods
forward
Forward pass through the V-JEPA2 model with video and text inputs.
Video tensor with shape (B, C, T, H, W) where:
- B = batch size
- C = channels (3 for RGB)
- T = number of frames (typically 64)
- H = height (384)
- W = width (384)
Pre-computed text embeddings with shape (B, text_dim).
MOS predictions with shape (B, 5) containing [Traditional, Alignment, Aesthetic, Temporal, Overall] quality scores.
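A minimal sketch of a forward pass, assuming a VJEPAModel instance named model is already available (see the Usage Example below) and that the positional argument order is video followed by text embeddings:

```python
import torch

# Dummy clip batch: B=2, 3 RGB channels, 64 frames, 384x384 resolution.
video = torch.randn(2, 3, 64, 384, 384, device="cuda")

# Text embeddings are pre-computed with encode_text (documented below).
text_emb = model.encode_text(["a cat playing piano", "a drone flyover of a city at night"])

mos = model(video, text_emb)  # (2, 5): [Traditional, Alignment, Aesthetic, Temporal, Overall]
```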
encode_text
Encode text prompts using the BGE-Large text encoder.
List of text prompts to encode.
Text embeddings with shape (B, text_dim). Embeddings are normalized.
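A short sketch, again assuming an existing model instance:

```python
prompts = ["a red car driving through rain", "timelapse of clouds over mountains"]
text_emb = model.encode_text(prompts)
print(text_emb.shape)         # (2, text_dim)
print(text_emb.norm(dim=-1))  # close to 1.0 per row, since embeddings are normalized
```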
encode_video
Encode video using the V-JEPA2 ViT-Giant encoder.
Video tensor with shape (B, C, T, H, W).
Video features (CLS token) with shape (B, hidden_size).
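A sketch of encoding a single clip, with illustrative shapes:

```python
import torch

video = torch.randn(1, 3, 64, 384, 384, device="cuda")  # (B, C, T, H, W)
feats = model.encode_video(video)
print(feats.shape)  # (1, hidden_size), e.g. (1, 1408) for ViT-Giant
```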
extract_features
Extract both video and text features without MOS prediction.
Video tensor with shape (B, C, T, H, W).
List of text prompts corresponding to each video.
Dictionary containing:
- video_features: Video embeddings with shape (B, hidden_size)
- text_features: Text embeddings with shape (B, text_dim)
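A sketch of extracting features for caching or analysis; the prompt list length matches the batch size:

```python
import torch

video = torch.randn(2, 3, 64, 384, 384, device="cuda")
prompts = ["a dog running on a beach", "a slow pan across a forest"]

feats = model.extract_features(video, prompts)
video_feats = feats["video_features"]  # (2, hidden_size)
text_feats = feats["text_features"]    # (2, text_dim)
```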
get_discriminative_params
Get parameter groups for discriminative learning rates.
List of parameter groups:
- text_encoder: Text encoder parameters (lowest LR)
- video_encoder: Video encoder parameters (medium LR)
- prediction_head: MOS prediction head parameters (highest LR)
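A sketch of wiring the groups into an optimizer, assuming they are returned in the documented order and each is a torch.optim-style dict with a 'params' entry; the learning rate values below are illustrative only:

```python
import torch

param_groups = model.get_discriminative_params()

# Documented order: text_encoder (lowest LR), video_encoder (medium), prediction_head (highest).
for group, lr in zip(param_groups, (1e-6, 1e-5, 1e-4)):  # illustrative values
    group["lr"] = lr

optimizer = torch.optim.AdamW(param_groups, weight_decay=0.01)
```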
freeze_video_encoder
Freeze the entire video encoder.
unfreeze_video_encoder
Unfreeze the video encoder while respecting strategic freezing.
freeze_text_encoder
Freeze the text encoder.
unfreeze_text_encoder
Unfreeze the text encoder.
Usage Example
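A minimal end-to-end sketch using the documented factory function and config keys; the import path is an assumption:

```python
import torch
from model import create_vjepa_model  # assumed module path

model = create_vjepa_model({
    "video_encoder": "facebook/vjepa2-vitg-fpc64-384-ssv2",
    "text_encoder": "BAAI/bge-large-en-v1.5",
    "freeze_ratio": 0.85,
    "device": "cuda",
})
model.eval()

video = torch.randn(1, 3, 64, 384, 384, device="cuda")  # (B, C, T, H, W)
text_emb = model.encode_text(["a timelapse of a blooming flower"])

with torch.no_grad():
    mos = model(video, text_emb)  # (1, 5)
print(mos)
```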
Training with Discriminative Learning Rates
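A sketch of a training loop built on the discriminative parameter groups; the module path, learning rate values, MSE objective, and dataloader are assumptions:

```python
import torch
import torch.nn.functional as F
from model import create_vjepa_model  # assumed module path

model = create_vjepa_model({"freeze_ratio": 0.85, "device": "cuda"})
model.train()

# Lowest LR for the text encoder, medium for the video encoder, highest for the head
# (values are illustrative only).
param_groups = model.get_discriminative_params()
for group, lr in zip(param_groups, (1e-6, 1e-5, 1e-4)):
    group["lr"] = lr
optimizer = torch.optim.AdamW(param_groups, weight_decay=0.01)

for video, prompts, target_mos in train_loader:  # placeholder dataloader
    video = video.to("cuda")            # (B, 3, 64, 384, 384)
    target_mos = target_mos.to("cuda")  # (B, 5)

    text_emb = model.encode_text(list(prompts))
    pred_mos = model(video, text_emb)

    loss = F.mse_loss(pred_mos, target_mos)  # assumed regression objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```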
OptimizedMOSHead
Optimized MOS prediction head that combines video and text features.
Constructor
- Video feature dimension (hidden_size from V-JEPA2).
- Text feature dimension (embedding dimension from the text encoder).
- Hidden dimension for the MLP layers.
Methods
forward
Combine video and text features to predict MOS scores.
Video features with shape (B, dv).
Text features with shape (B, dt).
MOS predictions with shape (B, 5) containing [Traditional, Alignment, Aesthetic, Temporal, Overall] scores.
Usage Example
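A standalone sketch of the head; the constructor argument order follows the parameter listing above and the hidden dimension is an assumed value (1024 is the BGE-Large embedding size, 1408 the ViT-Giant hidden size):

```python
import torch
from model import OptimizedMOSHead  # assumed module path

head = OptimizedMOSHead(1408, 1024, 512)  # video dim, text dim, hidden dim (order/value assumed)

video_feats = torch.randn(4, 1408)  # (B, dv)
text_feats = torch.randn(4, 1024)   # (B, dt)

mos = head(video_feats, text_feats)
print(mos.shape)  # (4, 5)
```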
Helper Functions
create_vjepa_model
Factory function to create a V-JEPA2 model from a configuration dictionary.
Configuration dictionary with keys:
- video_encoder: V-JEPA2 model ID (default: "facebook/vjepa2-vitg-fpc64-384-ssv2")
- text_encoder: Text encoder model ID (default: "BAAI/bge-large-en-v1.5")
- freeze_ratio: Layer freeze ratio (default: 0.85)
- device: Device to use (default: "cuda")
Initialized V-JEPA2 model.
Usage Example
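A sketch relying on the documented defaults; only overrides are passed, and the import path is an assumption:

```python
from model import create_vjepa_model  # assumed module path

# All keys have documented defaults, so only the values to override need to be supplied.
model = create_vjepa_model({"freeze_ratio": 0.70, "device": "cuda"})
```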
Architecture Details
V-JEPA2 ViT-Giant
The model uses a Vision Transformer Giant (ViT-G) architecture with:
- Patch Size: 16×16
- Frame Configuration: 64 frames per clip
- Resolution: 384×384
- Hidden Size: 1408
- Attention: Scaled dot-product attention (SDPA)
- Precision: FP32 for stable gradients
Strategic Layer Freezing
The model implements strategic freezing to reduce memory usage and improve training (a sketch of the freezing logic follows this list):
- Always Frozen:
  - Embedding layers
  - Pooler layers
- Conditionally Frozen (based on freeze_ratio):
  - Bottom 85% of transformer layers (default)
  - Top 15% remain trainable
- Benefits:
  - ~85% reduction in gradient computation
  - Lower memory usage
  - Faster training
  - Better generalization
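A simplified sketch of how a freeze ratio could be applied to an HF-style encoder; this is illustrative only, and the attribute names (embeddings, pooler, encoder.layer) are assumptions rather than the project's actual implementation:

```python
import torch.nn as nn

def apply_strategic_freezing(encoder: nn.Module, freeze_ratio: float = 0.85) -> None:
    """Freeze embeddings/pooler plus the bottom freeze_ratio fraction of transformer layers."""
    # Always-frozen components (attribute names are assumptions).
    for name in ("embeddings", "pooler"):
        module = getattr(encoder, name, None)
        if module is not None:
            for p in module.parameters():
                p.requires_grad = False

    # Conditionally frozen transformer layers.
    layers = list(encoder.encoder.layer)        # assumed HF-style layer list
    n_frozen = int(len(layers) * freeze_ratio)  # e.g. 0.85 * 40 layers -> 34 frozen
    for layer in layers[:n_frozen]:
        for p in layer.parameters():
            p.requires_grad = False
```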
Discriminative Learning Rates
Recommended learning rate hierarchy (lowest for the text encoder, medium for the video encoder, highest for the prediction head; see get_discriminative_params above):
- Minimal adaptation of the pretrained encoders
- Faster learning for the task-specific head
- Better stability during training
Model Statistics
For the default configuration:
- Total Parameters: ~1.9B
- Trainable Parameters: ~300M (with 85% freezing)
- Model Size: ~7.6 GB
- Trainable Size: ~1.2 GB
- Memory Savings: ~85% reduction in gradient computation
Output Format
All MOS predictions follow the format (B, 5):
- Index 0: Traditional Quality (blur, noise, compression)
- Index 1: Alignment (video-text semantic alignment)
- Index 2: Aesthetic Quality (visual appeal)
- Index 3: Temporal Consistency (smoothness across frames)
- Index 4: Overall Quality (weighted combination)
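Continuing the forward-pass example above, a small sketch of unpacking the (B, 5) output into named scores:

```python
mos = model(video, text_emb)  # (B, 5)

names = ["traditional", "alignment", "aesthetic", "temporal", "overall"]
scores = dict(zip(names, mos[0].tolist()))  # scores for the first clip in the batch
print(scores["overall"])
```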