What Is This Project?
The NVIDIA Video Classification Project is a deep learning–based multi-class video classification system built end-to-end to understand both spatial and temporal patterns in video data. It was developed as Project-1 (industry-sponsored by NVIDIA) and trained on NVIDIA GPU servers (A100 MIG partition). Unlike image classifiers that process a single frame, video understanding requires modeling motion, temporal dependencies, and long-range context across many frames. This project addresses that challenge with a two-stage architecture: a pretrained CNN first extracts per-frame spatial features, and a Bidirectional LSTM with Multi-Head Self-Attention then models the sequence of those features. The system achieves ~93% test accuracy (up to ~95% with Test-Time Augmentation) and weighted F1 scores above 95% when combining model ensembling with TTA.
Content Categories
The classifier distinguishes four broad video content types sourced from YouTube-8M:
Animation
Animated video content including cartoons, CGI, and motion graphics. Visually distinct due to artificially generated textures and motion.
Gaming
Gameplay footage and screen recordings. Characterized by HUD elements, synthetic environments, and rapid on-screen motion.
Natural Content
Real-world footage of nature, outdoor scenes, and organic subjects. High variance in lighting, color, and motion dynamics.
Flat Content
Static or near-static visual content such as slides, screen shares, talking-head video, and documents. Minimal temporal change.
Two-Stage Architecture
The system is structured as a sequential two-stage pipeline:
Stage 1 — Spatial Feature Extraction
A pretrained CNN backbone processes each video frame independently and produces a fixed-length embedding vector. Supported backbones:
- ResNet-50 / ResNet-101 — ImageNet pretrained, 2048-dim output
- EfficientNet-V2-S / EfficientNet-V2-M — ImageNet pretrained, 1280-dim output
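As a shape-level sketch of the Stage 1 contract (pure Python; `extract_frame_features` and `video_to_feature_sequence` are hypothetical stubs standing in for the real CNN forward pass, not the project's actual API), each frame is embedded independently and the results stack into a (num_frames, feat_dim) sequence:

```python
from typing import List

FEAT_DIM = 2048  # ResNet-50/101 embedding size; EfficientNet-V2 variants emit 1280

def extract_frame_features(frame) -> List[float]:
    """Stub for a pretrained CNN forward pass on a single frame.
    A real implementation would run the frame through, e.g., ResNet-50
    with its classification head removed."""
    return [0.0] * FEAT_DIM  # placeholder embedding

def video_to_feature_sequence(frames) -> List[List[float]]:
    """Stage 1: frames are embedded independently (no temporal mixing);
    the (num_frames, feat_dim) result is handed to the temporal model."""
    return [extract_frame_features(f) for f in frames]
```

Because frames are processed independently, Stage 1 parallelizes trivially and its outputs can be cached to disk, which is why the dataset stores precomputed features rather than raw frames.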
Stage 2 — Temporal Modeling (SuperEnhancedTemporalModel)
The sequence of frame-level embeddings from Stage 1 is fed into the temporal model:
| Component | Configuration |
|---|---|
| Input Projection | Linear → LayerNorm → ReLU → Dropout |
| BiLSTM | 4 layers, hidden dim 768, bidirectional |
| Multi-Head Self-Attention | 12 heads, embed dim 1536 (768 × 2) |
| Attention Pooling | Weighted sum over timesteps |
| Classifier Head | Linear(1536→768) → LN → ReLU → Linear(768→512) → LN → ReLU → Linear(512→256) → LN → ReLU → Linear(256→4) |
The BiLSTM processes frames in both forward and backward directions simultaneously, allowing the model to capture context from both past and future frames when classifying any given moment in the video.
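The attention-pooling step above can be sketched in plain Python (illustrative names; in the real model the per-timestep scores come from a learned scoring layer over 1536-dim BiLSTM states):

```python
import math
from typing import List

def attention_pool(states: List[List[float]], scores: List[float]) -> List[float]:
    """Collapse a (T, D) sequence of BiLSTM states into a single D-dim vector.
    `scores` are unnormalized per-timestep attention logits."""
    # softmax over timesteps (numerically stabilized)
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # weighted sum of states over timesteps
    dim = len(states[0])
    return [sum(w * h[d] for w, h in zip(weights, states)) for d in range(dim)]
```

With equal scores this reduces to mean pooling; a strongly peaked score effectively selects the single most informative timestep, which is what lets the model down-weight uninformative frames.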
Dataset
| Property | Value |
|---|---|
| Source | YouTube-8M |
| Total videos | ~4,000 |
| Categories | 4 main classes, 46 subcategories |
| Train split | 70% |
| Validation split | 20% |
| Test split | 10% |
| Frames per video | 73 (uniform sampling) |
| Feature file format | HDF5 (.h5) with gzip compression |
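Uniform sampling of 73 frames can be sketched as follows (`uniform_frame_indices` is an illustrative helper, not necessarily the project's exact function):

```python
def uniform_frame_indices(total_frames: int, num_samples: int = 73) -> list:
    """Pick `num_samples` frame indices spread evenly across the video,
    always including the first and last frame."""
    if total_frames <= num_samples:
        return list(range(total_frames))  # short video: keep every frame
    step = (total_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]
```

Fixing the frame count to 73 regardless of video length gives every clip the same temporal resolution, so the BiLSTM always sees sequences of identical length.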
Training Configuration
| Hyperparameter | Value |
|---|---|
| Framework | PyTorch |
| Loss Function | Focal Loss with Label Smoothing (γ=2.0, smoothing=0.1) |
| Optimizer | AdamW (lr=0.001, weight decay applied) |
| LR Scheduler | Cosine Annealing with Warm Restarts |
| Epochs | 150 (early stopping patience=25) |
| Batch Size | 48 |
| Regularization | Dropout (0.4), Gradient Clipping, Weight Decay |
| Sampling | WeightedRandomSampler for class balance |
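One common formulation of focal loss with label smoothing, shown per sample in plain Python for clarity (the project's PyTorch implementation may batch and reduce differently):

```python
import math

def focal_loss_with_smoothing(probs, target, gamma=2.0, smoothing=0.1):
    """Per-sample focal loss over smoothed targets.
    probs: predicted class probabilities (softmax output, sums to 1).
    target: index of the true class.
    The smoothed target puts smoothing/K mass on every class plus
    (1 - smoothing) on the true class; each term is down-weighted by
    (1 - p_k)**gamma so easy, confident predictions contribute little."""
    k = len(probs)
    loss = 0.0
    for i, p in enumerate(probs):
        q = smoothing / k + ((1.0 - smoothing) if i == target else 0.0)
        loss += -q * (1.0 - p) ** gamma * math.log(max(p, 1e-12))
    return loss
```

With gamma=0 and smoothing=0 this reduces to ordinary cross-entropy; gamma=2 focuses training on hard, misclassified examples, which complements the WeightedRandomSampler's class balancing.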
Hardware and Software Requirements
Hardware
GPU: NVIDIA A100 (MIG partition, 9.8 GB VRAM)
CPU: Intel Xeon Gold
RAM: 251 GB system memory
A CUDA-capable GPU is strongly recommended for training. Inference can run on CPU for small batches.
Software
Python: 3.8+
PyTorch + Torchvision: CUDA 11.8 build
cuDNN: Bundled with CUDA 11.8
Key packages:
h5py, numpy, pandas, opencv-python, scikit-learn, flask, tqdm, psutil, matplotlib, seaborn
Performance Summary
| Mode | Accuracy | Weighted F1 |
|---|---|---|
| Standard inference (ensemble of 4) | ~93% | ~92% |
| With Test-Time Augmentation (TTA) | ~95% | >95% |
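At inference time, ensembling with TTA reduces to averaging softmax outputs across (model, augmentation) pairs and taking the argmax; a minimal sketch with a hypothetical helper:

```python
def ensemble_tta_predict(prob_sets):
    """prob_sets: list of probability vectors over the 4 classes, one per
    (model, augmentation) pair — e.g. 4 models x N TTA views of one clip.
    Returns the argmax class index and the averaged distribution."""
    n = len(prob_sets)
    k = len(prob_sets[0])
    avg = [sum(p[i] for p in prob_sets) / n for i in range(k)]
    return max(range(k), key=lambda i: avg[i]), avg
```

Averaging probabilities (rather than hard votes) lets a confident model outvote several uncertain ones, which is typically where the extra ~2% accuracy from TTA comes from.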
Per-class F1 scores for the best single model (best_ensemble_model_2.pt, val acc 92.1%):
| Class | F1 Score |
|---|---|
| Animation | 86.5% |
| Flat Content | 96.7% |
| Gaming | 87.9% |
| Natural Content | 97.5% |
Team
This project was developed as an industry-sponsored academic project in partnership with NVIDIA.
| Role | Name |
|---|---|
| Developer | Manas Kulkarni |
| Developer | Samiksha Nalawade |
| Developer | Rajlakshmi Desai |
| Faculty Guide | Dr. Shripad Bhatlawande |
| Industry Sponsor | NVIDIA |
Next Steps
Quickstart
Set up the environment, run the Flask inference server, and classify your first video.
Architecture
Deep dive into the SuperEnhancedTemporalModel, attention mechanism, and ensemble strategy.
Training Guide
Reproduce the full training pipeline from raw video files to saved checkpoints.