
Overview

V-JEPA2 is a state-of-the-art video encoder built on a Vision Transformer Giant (ViT-G) backbone. This implementation adds strategic layer freezing for efficient fine-tuning, discriminative learning rates, an optimized MOS prediction head, and BGE-Large text encoding.

VJEPAModel

The main V-JEPA2 model class, with strategic layer freezing and support for discriminative learning rates.

Constructor

VJEPAModel(
    vjepa_model_id: str = "facebook/vjepa2-vitg-fpc64-384-ssv2",
    text_model_id: str = "BAAI/bge-large-en-v1.5",
    freeze_ratio: float = 0.85,
    device: str = 'cuda'
)
vjepa_model_id (str, default: "facebook/vjepa2-vitg-fpc64-384-ssv2")
HuggingFace model ID for the V-JEPA2 video encoder. Uses ViT-Giant with 64 frames per clip at 384×384 resolution.

text_model_id (str, default: "BAAI/bge-large-en-v1.5")
HuggingFace model ID for the text encoder. Uses BGE-Large for high-quality text embeddings.

freeze_ratio (float, default: 0.85)
Fraction of transformer layers to freeze (0.85 = freeze the bottom 85%). Reduces memory usage and improves training stability.

device (str, default: "cuda")
Device to place the model on ('cuda' or 'cpu').

Methods

forward

Forward pass through the V-JEPA2 model with video and text inputs.
forward(
    pixel_values_videos: torch.Tensor,
    text_emb: torch.Tensor
) -> torch.Tensor
pixel_values_videos (torch.Tensor, required)
Video tensor with shape (B, C, T, H, W) where:
  • B = batch size
  • C = channels (3 for RGB)
  • T = number of frames (typically 64)
  • H = height (384)
  • W = width (384)

text_emb (torch.Tensor, required)
Pre-computed text embeddings with shape (B, text_dim).

Returns (torch.Tensor)
MOS predictions with shape (B, 5) containing [Traditional, Alignment, Aesthetic, Temporal, Overall] quality scores.
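
A minimal sketch of a forward call, assuming the documented input shapes; the prompt text is illustrative only:

import torch
from src.models.vjepa_model import VJEPAModel

model = VJEPAModel(device='cuda')
model.eval()

video = torch.randn(1, 3, 64, 384, 384).cuda()  # (B, C, T, H, W)
text_emb = model.encode_text(["A cat chasing a laser pointer"])

with torch.no_grad():
    scores = model(video, text_emb)

assert scores.shape == (1, 5)  # [Traditional, Alignment, Aesthetic, Temporal, Overall]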

encode_text

Encode text prompts using the BGE-Large text encoder.
encode_text(
    prompts: List[str]
) -> torch.Tensor
prompts (List[str], required)
List of text prompts to encode.

Returns (torch.Tensor)
Text embeddings with shape (B, text_dim). Embeddings are normalized.
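
A quick sketch of encoding a batch of prompts; the printed norms should be close to 1.0 if "normalized" means L2-normalized (an assumption about the implementation):

from src.models.vjepa_model import VJEPAModel

model = VJEPAModel(device='cuda')
emb = model.encode_text([
    "A drone shot of a coastline",
    "Close-up of rain on a window",
])

print(emb.shape)         # (2, text_dim); 1024 for BGE-Large
print(emb.norm(dim=-1))  # ~1.0 per row if embeddings are L2-normalized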

encode_video

Encode video using the V-JEPA2 ViT-Giant encoder.
encode_video(
    pixel_values_videos: torch.Tensor
) -> torch.Tensor
pixel_values_videos (torch.Tensor, required)
Video tensor with shape (B, C, T, H, W).

Returns (torch.Tensor)
Video features (CLS token) with shape (B, hidden_size).
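
A minimal sketch of encoding a batch of clips; the 1408 feature size assumes the default ViT-Giant backbone:

import torch
from src.models.vjepa_model import VJEPAModel

model = VJEPAModel(device='cuda')
video = torch.randn(2, 3, 64, 384, 384).cuda()

with torch.no_grad():
    feats = model.encode_video(video)

print(feats.shape)  # (2, hidden_size) -> (2, 1408) for ViT-Giant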

extract_features

Extract both video and text features without MOS prediction.
extract_features(
    pixel_values_videos: torch.Tensor,
    prompts: List[str]
) -> Dict[str, torch.Tensor]
pixel_values_videos (torch.Tensor, required)
Video tensor with shape (B, C, T, H, W).

prompts (List[str], required)
List of text prompts corresponding to each video.

Returns (Dict[str, torch.Tensor])
Dictionary containing:
  • video_features: Video embeddings (B, hidden_size)
  • text_features: Text embeddings (B, text_dim)
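
A sketch of pulling out both modalities at once. Note that video and text features live in different spaces (hidden_size vs. text_dim), so they are not directly comparable without a projection; within a modality, cosine similarity works as usual:

import torch
import torch.nn.functional as F
from src.models.vjepa_model import VJEPAModel

model = VJEPAModel(device='cuda')
video = torch.randn(2, 3, 64, 384, 384).cuda()
prompts = ["A time-lapse of clouds", "A dog catching a frisbee"]

with torch.no_grad():
    feats = model.extract_features(video, prompts)

print(feats['video_features'].shape)  # (2, hidden_size)
print(feats['text_features'].shape)   # (2, text_dim)

# Text-to-text similarity between the two prompts (same space, so valid).
sim = F.cosine_similarity(feats['text_features'][0], feats['text_features'][1], dim=0)
print(f"Prompt similarity: {sim.item():.3f}")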

get_discriminative_params

Get parameter groups for discriminative learning rates.
get_discriminative_params() -> List[Dict[str, Any]]
Returns (List[Dict[str, Any]])
List of parameter groups:
  • text_encoder: Text encoder parameters (lowest LR)
  • video_encoder: Video encoder parameters (medium LR)
  • prediction_head: MOS prediction head parameters (highest LR)

freeze_video_encoder

Freeze the entire video encoder.
freeze_video_encoder() -> None

unfreeze_video_encoder

Unfreeze the video encoder while respecting strategic freezing: layers below the freeze_ratio cutoff remain frozen.
unfreeze_video_encoder() -> None

freeze_text_encoder

Freeze the text encoder.
freeze_text_encoder() -> None

unfreeze_text_encoder

Unfreeze the text encoder.
unfreeze_text_encoder() -> None
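
A sketch of a staged fine-tuning schedule built from these toggles; the staging itself is illustrative, not part of the API:

from src.models.vjepa_model import VJEPAModel

model = VJEPAModel(freeze_ratio=0.85, device='cuda')

# Stage 1: warm up the MOS head with both encoders frozen.
model.freeze_video_encoder()
model.freeze_text_encoder()
# ... train the head for a few epochs ...

# Stage 2: unfreeze the encoders; strategic freezing still keeps the
# bottom 85% of video transformer layers frozen.
model.unfreeze_video_encoder()
model.unfreeze_text_encoder()
# ... continue training with discriminative learning rates ...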

Usage Example

import torch
from src.models.vjepa_model import VJEPAModel

# Initialize model with 85% layer freezing
model = VJEPAModel(
    vjepa_model_id="facebook/vjepa2-vitg-fpc64-384-ssv2",
    text_model_id="BAAI/bge-large-en-v1.5",
    freeze_ratio=0.85,
    device='cuda'
)
model.eval()

# Prepare inputs
video = torch.randn(2, 3, 64, 384, 384).cuda()
prompts = [
    "A person dancing in a studio",
    "Birds flying in the sky"
]

# Encode text
text_emb = model.encode_text(prompts)

# Get MOS predictions
with torch.no_grad():
    mos_scores = model(video, text_emb)
    print(f"MOS scores: {mos_scores}")  # Shape: (2, 5)

# Extract features separately
features = model.extract_features(video, prompts)
print(f"Video features: {features['video_features'].shape}")
print(f"Text features: {features['text_features'].shape}")

# Get discriminative parameter groups for optimizer
param_groups = model.get_discriminative_params()
for group in param_groups:
    print(f"{group['name']}: {len(list(group['params']))} params")

Training with Discriminative Learning Rates

import torch.optim as optim
from src.models.vjepa_model import VJEPAModel

# Initialize model
model = VJEPAModel(freeze_ratio=0.85, device='cuda')

# Get parameter groups
param_groups = model.get_discriminative_params()

# Set different learning rates for each group
optimizer = optim.AdamW([
    {'params': param_groups[0]['params'], 'lr': 1e-5},  # text_encoder
    {'params': param_groups[1]['params'], 'lr': 5e-5},  # video_encoder
    {'params': param_groups[2]['params'], 'lr': 1e-4},  # prediction_head
], weight_decay=0.01)

print("Optimizer configured with discriminative learning rates")

OptimizedMOSHead

Optimized MOS prediction head that combines video and text features.

Constructor

OptimizedMOSHead(
    dv: int,
    dt: int,
    h: int = 512
)
dv (int, required)
Video feature dimension (hidden_size from V-JEPA2).

dt (int, required)
Text feature dimension (embedding dimension from the text encoder).

h (int, default: 512)
Hidden dimension for the MLP layers.

Methods

forward

Combine video and text features to predict MOS scores.
forward(
    v: torch.Tensor,
    t: torch.Tensor
) -> torch.Tensor
v (torch.Tensor, required)
Video features with shape (B, dv).

t (torch.Tensor, required)
Text features with shape (B, dt).

Returns (torch.Tensor)
MOS predictions with shape (B, 5) containing [Traditional, Alignment, Aesthetic, Temporal, Overall] scores.

Usage Example

import torch
from src.models.vjepa_model import OptimizedMOSHead

# Initialize MOS head
mos_head = OptimizedMOSHead(
    dv=1408,  # V-JEPA2 ViT-Giant hidden size
    dt=1024,  # BGE-Large embedding size
    h=512
).cuda()

# Create sample features
video_features = torch.randn(4, 1408).cuda()
text_features = torch.randn(4, 1024).cuda()

# Predict MOS scores
mos_scores = mos_head(video_features, text_features)
print(f"MOS scores: {mos_scores}")  # Shape: (4, 5)

Helper Functions

create_vjepa_model

Factory function that creates a V-JEPA2 model from a configuration dictionary.
create_vjepa_model(
    config: Dict[str, Any]
) -> VJEPAModel
config (Dict[str, Any], required)
Configuration dictionary with keys:
  • video_encoder: V-JEPA2 model ID (default: "facebook/vjepa2-vitg-fpc64-384-ssv2")
  • text_encoder: Text encoder model ID (default: "BAAI/bge-large-en-v1.5")
  • freeze_ratio: Layer freeze ratio (default: 0.85)
  • device: Device to use (default: "cuda")

Returns (VJEPAModel)
Initialized V-JEPA2 model.

Usage Example

from src.models.vjepa_model import create_vjepa_model

config = {
    'video_encoder': 'facebook/vjepa2-vitg-fpc64-384-ssv2',
    'text_encoder': 'BAAI/bge-large-en-v1.5',
    'freeze_ratio': 0.85,
    'device': 'cuda'
}

model = create_vjepa_model(config)

Architecture Details

V-JEPA2 ViT-Giant

The model uses a Vision Transformer Giant (ViT-G) architecture with the following configuration (a token-count sketch follows the list):
  • Patch Size: 16×16
  • Frame Configuration: 64 frames per clip
  • Resolution: 384×384
  • Hidden Size: 1408
  • Attention: Scaled dot-product attention (SDPA)
  • Precision: FP32 for stable gradients
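
Given the configuration above, the token count per clip can be worked out directly. The temporal patch (tubelet) size of 2 is an assumption based on the V-JEPA family, not stated in this spec:

# Tokens per clip, assuming a tubelet (temporal patch) size of 2.
frames, tubelet = 64, 2
height = width = 384
patch = 16

tokens = (frames // tubelet) * (height // patch) * (width // patch)
print(tokens)  # 32 * 24 * 24 = 18432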

Strategic Layer Freezing

The model implements strategic freezing to reduce memory usage and improve training stability (a sketch of the policy follows the list):
  1. Always Frozen:
    • Embedding layers
    • Pooler layers
  2. Conditionally Frozen (based on freeze_ratio):
    • Bottom 85% of transformer layers (default)
    • Top 15% remain trainable
  3. Benefits:
    • ~85% reduction in gradient computation
    • Lower memory usage
    • Faster training
    • Better generalization
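
A minimal sketch of how such a freezing policy could be applied; the attribute names (embeddings, pooler, encoder.layer) are illustrative and depend on the actual encoder class:

def apply_strategic_freezing(encoder, freeze_ratio: float = 0.85):
    # Always frozen: embedding and pooler parameters (if present).
    for name in ("embeddings", "pooler"):
        module = getattr(encoder, name, None)
        if module is not None:
            for p in module.parameters():
                p.requires_grad = False

    # Conditionally frozen: the bottom `freeze_ratio` of transformer layers.
    layers = encoder.encoder.layer  # illustrative attribute path
    n_frozen = int(len(layers) * freeze_ratio)
    for layer in layers[:n_frozen]:
        for p in layer.parameters():
            p.requires_grad = False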

Discriminative Learning Rates

Recommended learning rate hierarchy:
{
    'text_encoder': 1e-5,      # Lowest LR
    'video_encoder': 5e-5,     # Medium LR
    'prediction_head': 1e-4    # Highest LR
}
This approach allows:
  • Minimal adaptation of pretrained encoders
  • Faster learning for task-specific head
  • Better stability during training

Model Statistics

For the default configuration (a quick sanity check follows the list):
  • Total Parameters: ~1.9B
  • Trainable Parameters: ~300M (with 85% freezing)
  • Model Size: ~7.6 GB
  • Trainable Size: ~1.2 GB
  • Memory Savings: ~85% reduction in gradient computation
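
These figures can be verified on an instantiated model:

from src.models.vjepa_model import VJEPAModel

model = VJEPAModel(freeze_ratio=0.85, device='cuda')

total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total: {total / 1e9:.2f}B, trainable: {trainable / 1e6:.0f}M "
      f"({100 * trainable / total:.1f}%)")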

Output Format

All MOS predictions have shape (B, 5), with indices ordered as follows (an unpacking helper follows the list):
  1. Index 0: Traditional Quality (blur, noise, compression)
  2. Index 1: Alignment (video-text semantic alignment)
  3. Index 2: Aesthetic Quality (visual appeal)
  4. Index 3: Temporal Consistency (smoothness across frames)
  5. Index 4: Overall Quality (weighted combination)
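
A small helper for turning prediction rows into named scores, assuming the index order above; the helper name and dimension labels are for readability only:

import torch

MOS_DIMENSIONS = ["Traditional", "Alignment", "Aesthetic", "Temporal", "Overall"]

def unpack_mos(scores: torch.Tensor) -> list:
    # Map each (B, 5) prediction row to a dict of named quality scores.
    return [dict(zip(MOS_DIMENSIONS, row.tolist())) for row in scores]

# Example with dummy predictions for two videos.
print(unpack_mos(torch.rand(2, 5)))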
