Overview

QualiVision includes multiple memory optimization strategies to enable training on consumer GPUs. This guide covers batch size tuning, gradient accumulation, layer freezing, and memory cleanup utilities.

Quick Memory Settings

GPU Memory Requirements

| GPU | VRAM | DOVER++ Batch Size | V-JEPA2 Batch Size |
|---|---|---|---|
| RTX 3090 | 24GB | 2-3 | 2 |
| RTX 4090 | 24GB | 3-4 | 2-3 |
| V100 | 32GB | 4 | 4 |
| A100 (40GB) | 40GB | 4-6 | 6 |
| A100 (80GB) | 80GB | 8+ | 8+ |
These are conservative estimates. Your actual capacity may vary based on system overhead and other processes.
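You can check what your GPU actually reports before choosing a batch size; a quick query using the standard torch.cuda API:
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1e9:.1f}GB total VRAM")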

Batch Size Adjustment

Command-Line Configuration

The simplest optimization is adjusting batch size:
python scripts/train.py \
    --model dover \
    --data data/ \
    --batch-size 2 \
    --epochs 5

Default Batch Sizes

Configured in src/config/config.py:21:
DOVER_CONFIG = {
    "batch_size": 4,
    "gradient_accumulation_steps": 8,
    "effective_batch_size": 32  # 4 × 8
}

Gradient Accumulation

How It Works

Gradient accumulation simulates a larger batch size without increasing memory usage:
# From src/utils/training.py:210
accumulation_steps = 8  # DOVER++ default

for i, batch in enumerate(train_loader):
    # Forward pass
    outputs = model(batch['pixel_values_videos'], batch['prompts'])
    loss = loss_fn(outputs, batch['labels'])

    # Scale the loss so accumulated gradients average correctly
    loss = loss / accumulation_steps
    scaler.scale(loss).backward()

    # Update weights every N micro-batches
    if (i + 1) % accumulation_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
Benefits:
  • Maintains the effective batch size for stable training
  • Reduces peak memory usage
  • Same statistical properties as larger batches, so no change to convergence behavior
Trade-offs:
  • Slower training (multiple micro-batch iterations per optimizer update)

Configuration

Gradient accumulation is automatic but can be configured in src/config/config.py:33:
DOVER_CONFIG = {
    "gradient_accumulation_steps": 8,  # Adjust this
    # ...
}
Recommended values:
  • Batch size 4: 8 steps (effective: 32)
  • Batch size 2: 16 steps (effective: 32)
  • Batch size 1: 32 steps (effective: 32)
Keep effective_batch_size = batch_size × gradient_accumulation_steps constant for consistent training dynamics.
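The rule above is plain integer division; a tiny sketch (the helper name is illustrative, not part of QualiVision):
def accumulation_steps_for(batch_size: int, effective_batch_size: int = 32) -> int:
    """Gradient accumulation steps that preserve the effective batch size."""
    assert effective_batch_size % batch_size == 0
    return effective_batch_size // batch_size

assert accumulation_steps_for(4) == 8   # batch size 4
assert accumulation_steps_for(2) == 16  # batch size 2
assert accumulation_steps_for(1) == 32  # batch size 1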

Strategic Layer Freezing

V-JEPA2 Freezing Strategy

V-JEPA2 uses aggressive layer freezing to reduce memory and enable efficient fine-tuning (src/config/config.py:43):
VJEPA_CONFIG = {
    "freeze_ratio": 0.85,  # Freeze bottom 85% of layers
    # ...
}
Implementation (from model architecture):
# Freeze bottom 85% of vision transformer layers
num_layers = len(model.encoder.layers)
freeze_until = int(num_layers * 0.85)

for i, layer in enumerate(model.encoder.layers):
    if i < freeze_until:
        for param in layer.parameters():
            param.requires_grad = False
Memory Savings:
  • Full model: ~1.1B parameters, ~16GB
  • With freezing: ~165M trainable, ~10GB
  • Reduction: ~37% memory savings
Performance Impact:
  • Minimal degradation on fine-tuning tasks
  • Faster training (fewer gradients to compute)
  • Better generalization (prevents overfitting)
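To verify the trainable-parameter count on your setup, a quick sanity check after freezing (plain PyTorch, not QualiVision code; assumes model is the frozen V-JEPA2 model from the snippet above):
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable: {trainable / 1e6:.0f}M of {total / 1e6:.0f}M "
      f"({100 * trainable / total:.1f}%)")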

Custom Freezing Ratios

Adjust the freeze ratio for your use case:
| Freeze Ratio | Trainable Params | Memory Usage | Use Case |
|---|---|---|---|
| 0.95 | ~55M | ~8GB | Very limited GPU |
| 0.90 | ~110M | ~9GB | Standard fine-tuning |
| 0.85 | ~165M | ~10GB | Default (recommended) |
| 0.75 | ~275M | ~12GB | More adaptation needed |
| 0.50 | ~550M | ~14GB | Significant domain shift |
Freeze ratios below 0.75 may increase overfitting risk on small datasets.
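For ratios other than the defaults, the freezing loop from above generalizes directly; a minimal sketch, assuming the model.encoder.layers layout shown earlier (the helper name is hypothetical):
def freeze_encoder_layers(model, freeze_ratio: float) -> None:
    """Hypothetical helper: freeze the bottom freeze_ratio of encoder layers."""
    layers = model.encoder.layers
    freeze_until = int(len(layers) * freeze_ratio)
    for i, layer in enumerate(layers):
        trainable = i >= freeze_until  # only the top layers stay trainable
        for param in layer.parameters():
            param.requires_grad = trainable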

Mixed Precision Training

Automatic Mixed Precision (AMP)

Enabled by default (src/config/config.py:62):
TRAINING_CONFIG = {
    "mixed_precision": True,
    # ...
}
Implementation (from src/utils/training.py:255):
import torch
from torch.cuda.amp import GradScaler

scaler = GradScaler()

# Training loop
with torch.cuda.amp.autocast():
    outputs = model(batch['pixel_values_videos'], batch['prompts'])
    loss = loss_fn(outputs, batch['labels'])

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()  # clear gradients for the next step
Benefits:
  • ~40% memory reduction from FP16 tensors
  • ~2x faster computation on Tensor Core GPUs (V100, A100, RTX series)
  • Minimal accuracy impact with gradient scaling
Memory Breakdown:
| Component | FP32 | FP16 (AMP) | Savings |
|---|---|---|---|
| Model weights | 8GB | 4GB | 50% |
| Activations | 6GB | 3GB | 50% |
| Gradients | 8GB | 4GB | 50% |
| Optimizer | 16GB | 12GB | 25% |
| Total | 38GB | 23GB | ~40% |
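To measure these numbers on your own hardware rather than trusting the table, the standard torch.cuda peak-memory counters can bracket a training step (illustrative sketch):
import torch

torch.cuda.reset_peak_memory_stats()
# ... run one full training step (forward, backward, optimizer step) ...
print(f"Peak allocated: {torch.cuda.max_memory_allocated() / 1e9:.1f}GB")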

Memory Cleanup Utilities

Ultra Memory Cleanup

Aggressive memory cleanup function (src/utils/memory.py:12):
import gc

import torch

def ultra_memory_cleanup():
    """Aggressive GPU memory cleanup."""
    gc.collect()                  # free unreferenced Python objects first
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # return cached blocks to the driver
        torch.cuda.synchronize()  # wait for pending kernels to finish
Usage in training loop (src/utils/training.py:302):
for i, batch in enumerate(train_loader):
    # Training step...
    
    # Cleanup every 10 batches
    if i % 10 == 0:
        ultra_memory_cleanup()
When to use:
  • After large tensor operations
  • Before evaluation/inference
  • When approaching memory limits
  • After checkpoint saving

Memory Monitoring

Track memory usage during training (src/utils/memory.py:20):
import torch

def get_gpu_memory_info() -> dict:
    """Get current GPU memory usage information."""
    if not torch.cuda.is_available():
        return {'error': 'CUDA not available'}
    
    allocated = torch.cuda.memory_allocated()
    reserved = torch.cuda.memory_reserved()
    max_allocated = torch.cuda.max_memory_allocated()
    max_reserved = torch.cuda.max_memory_reserved()
    
    return {
        'allocated_gb': allocated / 1e9,
        'reserved_gb': reserved / 1e9,
        'max_allocated_gb': max_allocated / 1e9,
        'max_reserved_gb': max_reserved / 1e9,
        'free_gb': (torch.cuda.get_device_properties(0).total_memory - allocated) / 1e9
    }
Print memory usage (src/utils/memory.py:39):
def print_gpu_memory():
    """Print current GPU memory usage."""
    info = get_gpu_memory_info()
    print(f"GPU Memory - Allocated: {info['allocated_gb']:.1f}GB, "
          f"Free: {info['free_gb']:.1f}GB, "
          f"Max Used: {info['max_allocated_gb']:.1f}GB")
Example output:
GPU Memory - Allocated: 8.2GB, Free: 15.8GB, Max Used: 11.4GB

Memory Monitoring Context Manager

Track memory changes for operations (src/utils/memory.py:51):
import torch

class MemoryMonitor:
    """Context manager for monitoring memory usage."""
    
    def __init__(self, name: str = "Operation"):
        self.name = name
    
    def __enter__(self):
        self.start_allocated = torch.cuda.memory_allocated()
        return self
    
    def __exit__(self, exc_type, exc_val, exc_tb):
        end_allocated = torch.cuda.memory_allocated()
        diff = (end_allocated - self.start_allocated) / 1e9
        print(f"{self.name} - Memory change: {diff:+.2f}GB")
Usage:
from src.utils.memory import MemoryMonitor

# Track model loading
with MemoryMonitor("Model Loading"):
    model = DOVERModel(...)

# Track forward pass
with MemoryMonitor("Forward Pass"):
    outputs = model(videos, prompts)

Advanced Optimization Strategies

1. Reduce Video Resolution

For extremely limited memory, reduce input resolution:
# In src/config/config.py
DOVER_CONFIG = {
    "video_resolution": (512, 512),  # Instead of (640, 640)
    # ...
}

VJEPA_CONFIG = {
    "video_resolution": (256, 256),  # Instead of (384, 384)
    # ...
}
Trade-offs:
  • ✅ Significant memory savings (~30%)
  • ❌ Reduced model accuracy (~2-5%)
  • ✅ Faster training
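The savings figure follows from pixel count, assuming activation memory scales roughly linearly with resolution:
# Rough estimate only; actual savings depend on the architecture
ratio = (512 * 512) / (640 * 640)
print(f"Activation memory ratio: {ratio:.2f}")  # 0.64, i.e. ~36% fewer pixels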

2. Reduce Frame Count

Use fewer frames per video:
# In src/config/config.py
DOVER_CONFIG = {
    "num_frames": 32,  # Instead of 64
    # ...
}
Trade-offs:
  • ✅ 50% memory reduction for temporal features
  • ❌ May miss temporal artifacts
  • ❌ Lower temporal consistency scores

3. Gradient Checkpointing

Gradient checkpointing is disabled by default for V-JEPA2 as it conflicts with layer freezing (src/config/config.py:143).
For DOVER++, enable if needed:
# In model initialization
model.encoder.gradient_checkpointing_enable()
Trade-offs:
  • ✅ ~30% memory savings
  • ❌ ~20% slower training (recomputation overhead)
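For context, a minimal sketch of what checkpointing does under the hood, using torch.utils.checkpoint (not QualiVision code):
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedEncoder(nn.Module):
    """Recomputes each block's activations in backward instead of storing them."""

    def __init__(self, blocks: nn.ModuleList):
        super().__init__()
        self.blocks = blocks

    def forward(self, x):
        for block in self.blocks:
            # use_reentrant=False is the recommended variant in recent PyTorch
            x = checkpoint(block, x, use_reentrant=False)
        return x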

4. Data Loading Optimization

Optimize data pipeline (src/config/config.py:73):
TRAINING_CONFIG = {
    "num_workers": 4,           # Parallel data loading
    "pin_memory": True,         # Faster GPU transfer
    "persistent_workers": True, # Reuse worker processes
}
Benefits:
  • Reduces training time bottlenecks
  • Keeps GPU utilized
  • No memory overhead
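Wired into a standard PyTorch DataLoader, those settings look like this (train_dataset is a placeholder for your Dataset instance):
from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_dataset,
    batch_size=4,
    shuffle=True,
    num_workers=4,            # parallel data loading
    pin_memory=True,          # page-locked host memory for faster GPU transfer
    persistent_workers=True,  # reuse worker processes across epochs
)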

Out-of-Memory Handling

Automatic OOM Recovery

The training loop includes OOM handling (src/utils/training.py:314):
try:
    loss = loss_fn(outputs, labels)
    loss.backward()
except RuntimeError as e:
    if "out of memory" in str(e).lower():
        print(f"⚠ OOM at batch {i}, skipping...")
        optimizer.zero_grad(set_to_none=True)
        ultra_memory_cleanup()
        continue
    else:
        raise
Behavior:
  1. Detects OOM errors
  2. Clears gradients
  3. Runs aggressive cleanup
  4. Skips batch and continues
  5. Logs warning

Manual Intervention

If training frequently hits OOM:

1. Reduce Batch Size

python scripts/train.py --model dover --batch-size 1 --data data/

2. Increase Gradient Accumulation

Edit src/config/config.py:33 to increase gradient_accumulation_steps.

3. Monitor Peak Memory

from src.utils.memory import print_gpu_memory

# After each epoch
print_gpu_memory()
torch.cuda.reset_peak_memory_stats()

4. Consider Model Freezing

For V-JEPA2, increase the freeze ratio to 0.90 or 0.95.

Memory Usage Patterns

Training vs Evaluation

| Phase | Memory Usage | Why |
|---|---|---|
| Training | 100% | Gradients + optimizer states |
| Validation | ~60% | No gradients, smaller batches |
| Inference | ~40% | No gradients, batch size 1 |
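The gradient-free phases get their savings from disabling autograd; a minimal evaluation sketch, reusing the model call from the AMP example:
import torch

model.eval()
with torch.inference_mode():  # no autograd graph, so no gradient memory
    outputs = model(batch['pixel_values_videos'], batch['prompts'])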

Memory Timeline

Model Loading:  ████░░░░░░ 40% (weights only)
First Forward:  ████████░░ 80% (activations allocated)
First Backward: ██████████ 100% (gradients computed)
Optimizer Step: ██████████ 100% (optimizer states)
Cleanup:        ████░░░░░░ 40% (cache cleared)

Monitoring Commands

Monitor GPU usage during training:
watch -n 1 nvidia-smi

Best Practices Summary

1. Start with Default Settings

Use model defaults and see if they fit in memory.

2. Reduce Batch Size First

Halve batch size before other optimizations.

3. Maintain Effective Batch Size

Adjust gradient accumulation to compensate.

4. Monitor Memory Usage

Use print_gpu_memory() to track allocation.

5. Enable All Optimizations

  • Mixed precision (enabled by default)
  • Layer freezing (V-JEPA2)
  • Memory cleanup (automatic)

Memory Optimization Checklist

  • Set appropriate batch size for your GPU
  • Verify gradient accumulation maintains effective batch size
  • Enable mixed precision training (default: enabled)
  • Use layer freezing for V-JEPA2 (default: 0.85)
  • Monitor memory with nvidia-smi
  • Run regular cleanup with ultra_memory_cleanup()
  • Close other GPU processes
  • Use persistent workers for data loading

Next Steps

  • Training Guide: apply these optimizations to training
  • Custom Datasets: prepare data for efficient loading
