Overview
This guide covers training both DOVER++ and V-JEPA2 models for video quality assessment. Both models support fine-tuning on custom datasets with comprehensive configuration options.
Data Preparation
Before training, ensure your data follows the required structure.
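The exact layout is dictated by the dataset loader; as a minimal illustration (file and directory names here are hypothetical), a directory passed via --data might look like:

```
data/
├── train.csv      # training split with MOS labels
├── val.csv        # validation split with MOS labels
└── videos/        # video files referenced by the CSVs
    ├── clip_0001.mp4
    └── ...
```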
Prepare CSV Files
Ensure your CSV files contain the required columns. All MOS scores should be on a 1-5 scale.
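As an illustration, assuming one column naming the video file and one holding its MOS label (the actual column names come from the dataset loader), a valid CSV could look like:

```
video_name,mos
clip_0001.mp4,3.42
clip_0002.mp4,4.87
```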
Training Commands
DOVER++ Model
- Resolution: 640×640
- Frames: 64 per video
- Default batch size: 4
- Default learning rate: 1e-4
- Gradient accumulation: 8 steps (effective batch size: 32)
- Text encoder: BAAI/bge-large-en-v1.5
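A typical launch, assuming the scripts/train.py entry point referenced throughout this guide and the flags documented under Configuration Options below (the data path is illustrative):

```bash
python scripts/train.py --model dover --data data/ --epochs 5
```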
V-JEPA2 Model
- Resolution: 384×384
- Frames: 64 per video
- Default batch size: 6
- Default learning rate: 2e-4
- Gradient accumulation: 32 steps (effective batch size: 192)
- Freeze ratio: 0.85 (85% of layers frozen)
- Video encoder: facebook/vjepa-vit-giant-p16
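The corresponding launch for V-JEPA2, with the epoch count matching the timing tables below (again, the data path is illustrative):

```bash
python scripts/train.py --model vjepa --data data/ --epochs 10
```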
Configuration Options
Command-Line Arguments
| Argument | Description | Default | Required |
|---|---|---|---|
| --model | Model type: dover or vjepa | - | Yes |
| --data | Path to data directory | - | Yes |
| --epochs | Number of training epochs | 5 | No |
| --batch-size | Batch size per GPU | Model default | No |
| --lr | Learning rate | Model default | No |
| --output | Output directory for checkpoints | models | No |
| --resume | Resume from checkpoint path | - | No |
| --wandb | Enable Weights & Biases logging | False | No |
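For example, resuming an interrupted V-JEPA2 run with W&B logging enabled (the checkpoint path follows the naming pattern described under Checkpoint Management below):

```bash
python scripts/train.py --model vjepa --data data/ \
    --resume models/vjepa_best.pt --wandb
```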
Model-Specific Configuration
Configuration is defined in src/config/config.py:21.
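The file itself is not reproduced here; a minimal sketch of what the per-model defaults listed above could look like as dataclasses (class and field names are assumptions):

```python
from dataclasses import dataclass

@dataclass
class DoverConfig:
    resolution: int = 640                  # 640x640 input
    num_frames: int = 64                   # frames sampled per video
    batch_size: int = 4
    learning_rate: float = 1e-4
    grad_accum_steps: int = 8              # effective batch size 32
    text_encoder: str = "BAAI/bge-large-en-v1.5"

@dataclass
class VJepaConfig:
    resolution: int = 384                  # 384x384 input
    num_frames: int = 64
    batch_size: int = 6
    learning_rate: float = 2e-4
    grad_accum_steps: int = 32             # effective batch size 192
    freeze_ratio: float = 0.85             # 85% of layers frozen
    video_encoder: str = "facebook/vjepa-vit-giant-p16"
```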
Monitoring Training
Weights & Biases Integration
Enable W&B logging with the --wandb flag (scripts/train.py:62). The following metrics are logged:
- train_loss: Training loss per epoch
- val_loss: Validation loss per epoch
- vquala_score: VQualA challenge score, (SROCC + PLCC) / 2
- best_score: Best validation score achieved
- Learning rate schedule
- Gradient norms
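The VQualA score is the mean of the Spearman (SROCC) and Pearson (PLCC) correlations between predicted and ground-truth MOS. A sketch of how the epoch metrics could be computed and logged (function and argument names are assumptions; wandb.init is assumed to have been called when --wandb is set):

```python
import wandb
from scipy.stats import pearsonr, spearmanr

def log_epoch_metrics(preds, targets, train_loss, val_loss, best_score):
    """Compute the VQualA challenge score and log one epoch to W&B."""
    srocc, _ = spearmanr(preds, targets)   # rank-order correlation
    plcc, _ = pearsonr(preds, targets)     # linear correlation
    vquala_score = (srocc + plcc) / 2
    wandb.log({
        "train_loss": train_loss,
        "val_loss": val_loss,
        "vquala_score": vquala_score,
        "best_score": max(best_score, vquala_score),
    })
    return vquala_score
```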
Console Output
The training script prints real-time progress to the console.
Loss Components
The hybrid loss function (src/utils/training.py:90) combines three components:
- Smooth L1 Loss (β=0.1): Basic regression loss
- Ranking Loss (margin=0.2): Preserves relative quality ordering
- Scale-Aware Loss: Emphasizes extreme quality values
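The exact formulation lives in src/utils/training.py:90 and is not reproduced here; the sketch below shows one standard way to combine the three components (the pairwise ranking form, the mid-scale weighting, and the equal 1:1:1 component weights are assumptions):

```python
import torch
import torch.nn.functional as F

def hybrid_loss(pred, target, beta=0.1, margin=0.2):
    """Smooth L1 + pairwise ranking + scale-aware loss over a batch."""
    # 1. Smooth L1 (beta=0.1): basic regression loss.
    l1 = F.smooth_l1_loss(pred, target, beta=beta)

    # 2. Ranking loss (margin=0.2): penalize pairs whose predicted
    #    ordering contradicts (or sits within the margin of) the true one.
    dp = pred.unsqueeze(0) - pred.unsqueeze(1)       # pairwise pred diffs
    dt = target.unsqueeze(0) - target.unsqueeze(1)   # pairwise target diffs
    sign = torch.sign(dt)
    rank = (F.relu(margin - dp * sign) * (sign != 0).float()).mean()

    # 3. Scale-aware loss: upweight samples far from the midpoint of the
    #    1-5 MOS scale, emphasizing extreme quality values.
    weight = 1.0 + (target - 3.0).abs()
    scale = (weight * (pred - target) ** 2).mean()

    return l1 + rank + scale
```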
Expected Training Times
DOVER++ Model
| GPU | Batch Size | Time per Epoch | Total (5 epochs) |
|---|---|---|---|
| A100 (40GB) | 4 | ~45 min | ~3.75 hours |
| V100 (32GB) | 4 | ~60 min | ~5 hours |
| RTX 3090 (24GB) | 2 | ~90 min | ~7.5 hours |
V-JEPA2 Model
| GPU | Batch Size | Time per Epoch | Total (10 epochs) |
|---|---|---|---|
| A100 (40GB) | 6 | ~75 min | ~12.5 hours |
| V100 (32GB) | 4 | ~90 min | ~15 hours |
| RTX 3090 (24GB) | 2 | ~120 min | ~20 hours |
Resource Requirements
GPU Memory
DOVER++:
- Minimum: 12GB VRAM (batch size 2)
- Recommended: 24GB VRAM (batch size 4)
- Parameters: ~120M
V-JEPA2:
- Minimum: 16GB VRAM (batch size 2)
- Recommended: 40GB VRAM (batch size 6)
- Parameters: ~1.1B (only 15% trainable due to freezing)
Storage
- Model checkpoints: ~500MB per checkpoint
- Training logs: ~10MB per run
- Cache files: ~2GB for text embeddings
Checkpoint Management
Checkpoints are automatically saved (scripts/train.py:139) to models/{model}_best.pt.
Contents:
- Model weights
- Optimizer state
- Best validation score
- Model configuration
- Current epoch
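A sketch of how such a checkpoint is typically written and restored with plain torch.save/torch.load (the dictionary keys are assumptions):

```python
import torch

def save_checkpoint(model, optimizer, best_score, config, epoch, path):
    """Bundle everything listed above into a single .pt file."""
    torch.save({
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
        "best_score": best_score,
        "config": config,
        "epoch": epoch,
    }, path)

def load_checkpoint(model, optimizer, path, device="cpu"):
    """Restore a run; returns the saved best score and epoch."""
    ckpt = torch.load(path, map_location=device)
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["best_score"], ckpt["epoch"]
```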
Troubleshooting
Out of Memory
Reduce the batch size:
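For example, dropping to the batch size listed for 24GB cards under GPU Memory above:

```bash
python scripts/train.py --model dover --data data/ --batch-size 2
```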
Slow Training
- Check data loading: Ensure videos are on fast storage (SSD)
- Increase workers: Set num_workers=4 in src/config/config.py:73
- Enable mixed precision: Enabled by default (src/config/config.py:62)
NaN Loss
Reduce the learning rate:
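For example, halving the V-JEPA2 default of 2e-4:

```bash
python scripts/train.py --model vjepa --data data/ --lr 1e-4
```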
Next Steps
- Evaluation: Evaluate your trained models
- Memory Optimization: Optimize GPU memory usage