Data Flow
Stage 1: Spatial Feature Extraction
CNN Backbone
A pretrained CNN (ResNet-50, ResNet-101, EfficientNet-V2-S, or EfficientNet-V2-M) extracts semantic features from individual frames. The classification head is removed and the model is frozen: all parameters have requires_grad=False.
Feature Dimensions
EfficientNet-V2-S/M backbones produce a 1280-dimensional feature vector per frame. ResNet-50/101 produce 2048-dim. The production ensemble checkpoints use 1280-dim (EfficientNet-V2).
Multi-Scale Extraction
Each video is processed at three temporal scales (1.0×, 0.85×, 1.15×) to capture motion at different speeds. Features from all three scales are averaged together.
HDF5 Pre-Extraction
Features are extracted once and saved to compressed .h5 files (train_features_multiscale.h5, val_features_multiscale.h5, test_features_multiscale.h5). Training then reads from these files rather than running the CNN each epoch.
Pre-extracting features is a critical performance optimization. The CNN backbone never runs during training; only the lightweight temporal model is trained. This reduces Stage 2 training memory to ~6–7 GB on the NVIDIA A100 MIG partition.
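A sketch of the round trip through HDF5 using h5py. The dataset keys ("features", "labels", "frame_counts") are assumptions for illustration, not necessarily the names the pipeline uses.

```python
import os
import tempfile
import numpy as np
import h5py

# Assumed dataset keys; the real files may use different names.
N, max_frames, dim = 3, 64, 1280
features = np.zeros((N, max_frames, dim), dtype=np.float32)  # zero-padded
labels = np.array([0, 2, 3], dtype=np.int64)
frame_counts = np.array([64, 48, 64], dtype=np.int64)

path = os.path.join(tempfile.mkdtemp(), "train_features_multiscale.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("features", data=features, compression="gzip")
    f.create_dataset("labels", data=labels)
    f.create_dataset("frame_counts", data=frame_counts)

# During training, feature sequences are read straight from disk;
# the CNN backbone never runs.
with h5py.File(path, "r") as f:
    x = f["features"][1][: f["frame_counts"][1]]  # trim padding -> [48, 1280]
    y = int(f["labels"][1])
```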
Stage 2: Temporal Modeling
The SuperEnhancedTemporalModel reads pre-extracted feature sequences and learns which frames matter and how they relate across time.
| Component | Configuration |
|---|---|
| Input projection | Linear(1280 → 768) + LayerNorm + ReLU + Dropout(0.2) |
| BiLSTM layers | 4 layers, hidden_dim=768, bidirectional → output dim 1536 |
| Self-attention | 12 heads, embed_dim=1536, residual connection + LayerNorm |
| Attention pooling | Linear scoring → softmax weights → weighted sum |
| Classifier MLP | 1536 → 768 → 512 → 256 → 4 with LayerNorm and Dropout(0.4) |
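The table above can be sketched as a simplified PyTorch module. This is not the project's actual class, just an illustration of how the listed components compose; layer placement of ReLU/Dropout inside the classifier MLP is an assumption.

```python
import torch
import torch.nn as nn

class TemporalModelSketch(nn.Module):
    """Simplified sketch of the architecture table; not the real class."""
    def __init__(self, in_dim=1280, hidden=768, classes=4):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.LayerNorm(hidden),
            nn.ReLU(), nn.Dropout(0.2))
        self.lstm = nn.LSTM(hidden, hidden, num_layers=4,
                            batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=12,
                                          batch_first=True)
        self.norm = nn.LayerNorm(2 * hidden)
        self.score = nn.Linear(2 * hidden, 1)  # attention-pooling scorer
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, 768), nn.LayerNorm(768),
            nn.ReLU(), nn.Dropout(0.4),
            nn.Linear(768, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, classes))

    def forward(self, x):                        # x: [B, T, 1280]
        h = self.proj(x)
        h, _ = self.lstm(h)                      # [B, T, 1536]
        a, _ = self.attn(h, h, h)
        h = self.norm(h + a)                     # residual + LayerNorm
        w = torch.softmax(self.score(h), dim=1)  # per-frame weights [B, T, 1]
        pooled = (w * h).sum(dim=1)              # weighted sum over time
        return self.head(pooled)                 # [B, 4] logits

logits = TemporalModelSketch()(torch.randn(2, 64, 1280))
```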
Model Configuration
The exact hyperparameters used for each ensemble checkpoint are recorded in configuration_analysis.json.
Pre-Extraction Strategy
Extract frames
Each video is decoded to 64 uniformly sampled frames, resized to 256×256 pixels on GPU.
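Uniform sampling can be sketched as picking 64 evenly spaced frame indices. The helper name is hypothetical; the real decoder may round indices differently.

```python
import numpy as np

def uniform_frame_indices(total_frames: int, num_samples: int = 64) -> np.ndarray:
    """Pick num_samples evenly spaced frame indices from a decoded video.

    Hypothetical helper for illustration only.
    """
    return np.linspace(0, total_frames - 1, num_samples).astype(np.int64)

idx = uniform_frame_indices(900)  # e.g. a 30 s clip at 30 fps
```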
Run CNN backbone
Frames are passed through the frozen backbone in batches of 24. The final pooling output (1280-d) is collected for every frame.
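Batched extraction might look like the following sketch, where `backbone` stands in for the frozen CNN and is assumed to map [B, 3, H, W] to [B, 1280]; the stand-in module at the bottom exists only so the example runs.

```python
import torch

BATCH = 24  # frames per backbone forward pass

def extract_features(frames, backbone):
    """Run a frozen backbone over all frames in chunks of BATCH frames."""
    outs = []
    with torch.no_grad():
        for chunk in torch.split(frames, BATCH):
            outs.append(backbone(chunk))
    return torch.cat(outs)  # [T, 1280]

# Stand-in backbone for the sketch: global-average-pool + linear projection.
fake_backbone = torch.nn.Sequential(
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(), torch.nn.Linear(3, 1280))
feats = extract_features(torch.randn(64, 3, 256, 256), fake_backbone)
```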
Apply multi-scale averaging
The extraction repeats at 0.85× and 1.15× temporal scales. All three scale outputs are averaged into a single [T, 1280] tensor.
Save to HDF5
Features, labels, and per-video frame counts are written to a compressed .h5 file. The dataset stores features in a zero-padded array of shape [N, max_frames, 1280].
Classification Targets
The model classifies videos into four content categories sourced from YouTube-8M:
| Index | Class | Description |
|---|---|---|
| 0 | Animation | Animated / CGI video content |
| 1 | Flat Content | Screen recordings, slides, static scenes |
| 2 | Gaming | Video game footage |
| 3 | Natural Content | Real-world video shot in natural environments |
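For decoding predictions, the index-to-class mapping above reduces to a simple lookup; the list and helper name here are illustrative, not the project's actual code.

```python
# Index-to-class mapping from the table above.
CLASS_NAMES = ["Animation", "Flat Content", "Gaming", "Natural Content"]

def decode_prediction(class_index: int) -> str:
    """Map a predicted class index (e.g. argmax of the logits) to its name."""
    return CLASS_NAMES[class_index]
```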
Performance Summary
Standard Accuracy
~93% on the test set using a single model without augmentation.
With TTA
~95% by averaging predictions across 4 temporal augmentation modes.
Ensemble + TTA
95% F1 with 4 independently trained models averaged together.
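Ensemble + TTA averaging can be sketched as a double loop over checkpoints and augmentation modes, averaging softmax probabilities. The toy models and the specific TTA callables below (identity, temporal flip, temporal shifts) are assumptions for illustration; the source does not specify the exact four modes here.

```python
import torch

def ensemble_tta_predict(models, tta_variants, features):
    """Average softmax probabilities over models and TTA variants.

    Sketch only: `models` map [B, T, 1280] -> [B, 4] logits, and
    `tta_variants` produce augmented copies of the feature sequence.
    """
    probs = []
    with torch.no_grad():
        for model in models:
            for tta in tta_variants:
                probs.append(torch.softmax(model(tta(features)), dim=-1))
    return torch.stack(probs).mean(dim=0)  # [B, 4] averaged probabilities

# Toy demo: 4 "models" and 4 hypothetical temporal TTA modes.
models = [torch.nn.Sequential(torch.nn.Flatten(),
                              torch.nn.Linear(64 * 1280, 4)) for _ in range(4)]
tta_variants = [lambda x: x,
                lambda x: torch.flip(x, dims=[1]),
                lambda x: torch.roll(x, 3, dims=1),
                lambda x: torch.roll(x, -3, dims=1)]
avg = ensemble_tta_predict(models, tta_variants, torch.randn(2, 64, 1280))
```

Averaging probabilities (rather than logits) keeps each model's contribution on a comparable scale, and the result still sums to 1 per video.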
Further Reading
- Spatial Feature Extraction – EnhancedFeatureExtractor class, backbone selection, multi-scale logic, HDF5 format
- Temporal Modeling – SuperEnhancedTemporalModel internals, BiLSTM, attention pooling
- Ensemble & TTA – How 4 checkpoints are averaged and the 4 TTA modes explained