Overview
QualiVision implements a sophisticated data preprocessing pipeline that handles video loading, frame sampling, resolution adaptation, text encoding, and MOS score normalization. The pipeline is optimized for GPU acceleration and supports both DOVER++ and V-JEPA2 models.Dataset Structure
FromREADME.md:19-43, the expected directory structure:
CSV Label Format
FromREADME.md:46-50 and dataset.py:37:
video_name: Video filenamePrompt: Text description/promptTraditional_MOS: Technical quality score (1-5)Alignment_MOS: Text-video alignment score (1-5)Aesthetic_MOS: Aesthetic appeal score (1-5)Temporal_MOS: Temporal consistency score (1-5)Overall_MOS: Overall quality score (1-5)
Frame Sampling
All videos are uniformly sampled to 64 frames:- Uniform Sampling
- Short Videos
- Long Videos
- Why 64 Frames?
From Process:
dataset.py:82-98:- Count total frames in video
- Generate 64 evenly-spaced indices using
np.linspace - Clip indices to valid range [0, total_frames-1]
- Extract frames using Decord’s batch API
Resolution Adaptation
Different models require different input resolutions:- DOVER++ (640×640)
- V-JEPA2 (384×384)
- Why Different Resolutions?
From Process per frame:
dataset.py:64-69 and config.py:23:- Convert numpy array → PIL Image
- Resize to 640×640 (bicubic interpolation by default)
- Convert to PyTorch tensor (C, H, W)
- Normalize to [0, 1] range
(B, C=3, T=64, H=640, W=640)Text Processing with BGE-Large
Text prompts are encoded to dense embeddings:- Text Encoder
- Encoding Process
- Why BGE-Large?
- Prompt Examples
From BGE-Large (BAAI General Embedding):
config.py:25,41:- Model: Sentence Transformer
- Architecture: BERT-based encoder
- Dimension: 1024 (DOVER++) or 768 (V-JEPA2)
- Training: Contrastive learning on text pairs
MOS Score Normalization
Fromdataset.py:182-189:
- Normalization Process
- Missing Values
- Test Mode
Steps:
- Parse: Convert CSV strings to numeric values
- Handle missing: Fill NaN with 3.0 (neutral score)
- Cast: Convert to float32 for GPU efficiency
- Tensorize: Create PyTorch tensor
README.md:56:Data Loading Pipeline
TaobaoVDDataset Class
Fromdataset.py:29-190:
View Dataset Initialization Code
View Dataset Initialization Code
- Supports both manual transforms (DOVER++) and video processor (V-JEPA2)
- Automatic label detection
- Flexible resolution
- Mode-aware (train/val/test)
GPU-Optimized Collation
Fromdataset.py:193-288:
- OptimizedGPUCollate
- Batching Process
- Memory Optimization
- Batches video frames
- Encodes text prompts on GPU
- Moves data to GPU early
- Cleans up memory
DataLoader Configuration
Fromdataset.py:291-375:
- Training Loader
- Validation Loader
- Test Loader
- Worker Configuration
config.py:31,47:Complete Data Flow
Performance Considerations
- Bottlenecks
- Optimization Tips
- Profiling
Common bottlenecks in video data loading:
- Disk I/O: Reading video files from disk
- Solution: Use SSD storage, prefetch with num_workers
- Video decoding: Decompressing video codecs
- Solution: Decord with GPU acceleration
- Frame transformation: Resizing and converting
- Solution: Batch transforms, use GPU if possible
- Text encoding: BERT forward pass
- Solution: Encode in collate_fn (parallel with training)
- CPU→GPU transfer: Moving data to GPU
- Solution: pin_memory=True, early GPU transfer
Related Topics
Architecture
How preprocessed data flows through models
DOVER++ Model
640×640 resolution requirements
V-JEPA2 Model
384×384 resolution and video processor