What Is This Project?

The NVIDIA Video Classification Project is a deep learning–based multi-class video classification system built end-to-end to understand both spatial and temporal patterns in video data. It was developed as Project-1 (Industry-Sponsored by NVIDIA) and trained on NVIDIA GPU servers (A100 MIG partition). Unlike image classifiers that process a single frame, video understanding requires modeling motion, temporal dependencies, and long-range context across many frames. This project addresses that challenge through a two-stage architecture that first extracts per-frame spatial features with a pretrained CNN and then models the sequence of those features with a Bidirectional LSTM and Multi-Head Self-Attention.
The system achieves ~93% test accuracy (up to 95% with Test-Time Augmentation) and F1 scores above 95% when using model ensembling with TTA.

Content Categories

The classifier distinguishes four broad video content types sourced from YouTube-8M:

Animation

Animated video content including cartoons, CGI, and motion graphics. Visually distinct due to artificially generated textures and motion.

Gaming

Gameplay footage and screen recordings. Characterized by HUD elements, synthetic environments, and rapid on-screen motion.

Natural Content

Real-world footage of nature, outdoor scenes, and organic subjects. High variance in lighting, color, and motion dynamics.

Flat Content

Static or near-static visual content such as slides, screen shares, talking-head video, and documents. Minimal temporal change.

Two-Stage Architecture

The system is structured as a sequential two-stage pipeline:

Stage 1 — Spatial Feature Extraction

A pretrained CNN backbone processes each video frame independently and produces a fixed-length embedding vector. Supported backbones:
  • ResNet-50 / ResNet-101 — ImageNet pretrained, 2048-dim output
  • EfficientNet-V2-S / EfficientNet-V2-M — ImageNet pretrained, 1280-dim output
Multi-scale temporal sampling (scales: 0.85×, 1.0×, 1.15×) is used during training to improve robustness. The final per-frame feature dimension used in production is 1280.

Stage 2 — Temporal Modeling (SuperEnhancedTemporalModel)

The sequence of frame-level embeddings from Stage 1 is fed into the temporal model:
| Component | Configuration |
| --- | --- |
| Input Projection | Linear → LayerNorm → ReLU → Dropout |
| BiLSTM | 4 layers, hidden dim 768, bidirectional |
| Multi-Head Self-Attention | 12 heads, embed dim 1536 (768 × 2) |
| Attention Pooling | Weighted sum over timesteps |
| Classifier Head | Linear(1536→768) → LN → ReLU → Linear(768→512) → LN → ReLU → Linear(512→256) → LN → ReLU → Linear(256→4) |
The BiLSTM processes frames in both forward and backward directions simultaneously, allowing the model to capture context from both past and future frames when classifying any given moment in the video.
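The table above can be sketched as a PyTorch module. This is an illustrative reconstruction from the documented dimensions, not the project's actual `SuperEnhancedTemporalModel` source; the class name `TemporalModelSketch`, the 768-dim projection width, and the single-linear attention-pooling layer are assumptions.

```python
import torch
import torch.nn as nn

class TemporalModelSketch(nn.Module):
    """Sketch of the Stage-2 temporal model using the dims from the docs."""
    def __init__(self, feat_dim=1280, hidden=768, heads=12, num_classes=4):
        super().__init__()
        d = hidden * 2  # 1536 after the bidirectional LSTM
        self.proj = nn.Sequential(                 # input projection
            nn.Linear(feat_dim, hidden), nn.LayerNorm(hidden),
            nn.ReLU(), nn.Dropout(0.4))
        self.lstm = nn.LSTM(hidden, hidden, num_layers=4,
                            batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.pool = nn.Linear(d, 1)                # attention-pooling scores
        self.head = nn.Sequential(                 # classifier head
            nn.Linear(d, 768), nn.LayerNorm(768), nn.ReLU(),
            nn.Linear(768, 512), nn.LayerNorm(512), nn.ReLU(),
            nn.Linear(512, 256), nn.LayerNorm(256), nn.ReLU(),
            nn.Linear(256, num_classes))

    def forward(self, x):                       # x: (B, T, 1280)
        h, _ = self.lstm(self.proj(x))          # (B, T, 1536)
        h, _ = self.attn(h, h, h)               # self-attention over frames
        w = torch.softmax(self.pool(h), dim=1)  # (B, T, 1) timestep weights
        pooled = (w * h).sum(dim=1)             # weighted sum over timesteps
        return self.head(pooled)                # (B, num_classes)

logits = TemporalModelSketch()(torch.randn(2, 73, 1280))
print(logits.shape)  # torch.Size([2, 4])
```

Note how the bidirectional LSTM doubles the channel width (768 × 2 = 1536), which is why the attention and classifier head operate on 1536-dim features.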

Dataset

| Property | Value |
| --- | --- |
| Source | YouTube-8M |
| Total videos | ~4,000 |
| Categories | 4 main classes, 46 subcategories |
| Train split | 70% |
| Validation split | 20% |
| Test split | 10% |
| Frames per video | 73 (uniform sampling) |
| Feature file format | HDF5 (.h5) with gzip compression |
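Uniformly sampling 73 frames per video can be done with a short index helper. The function name `uniform_frame_indices` is illustrative; the project's actual sampling code is not shown in these docs.

```python
import numpy as np

def uniform_frame_indices(total_frames, num_samples=73):
    """Pick num_samples frame indices spread evenly across a video.

    np.linspace yields evenly spaced positions that include the first
    and last frame; rounding snaps them to integer frame indices.
    """
    idx = np.linspace(0, total_frames - 1, num_samples)
    return idx.round().astype(int)

idx = uniform_frame_indices(900)   # e.g. a 30 s clip at 30 fps
print(len(idx), idx[0], idx[-1])   # 73 0 899
```

Sampling a fixed frame count gives every video the same sequence length, so the extracted features can be stored as fixed-shape arrays in the .h5 files.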

Training Configuration

| Hyperparameter | Value |
| --- | --- |
| Framework | PyTorch |
| Loss Function | Focal Loss with Label Smoothing (γ=2.0, smoothing=0.1) |
| Optimizer | AdamW (lr=0.001, weight decay applied) |
| LR Scheduler | Cosine Annealing with Warm Restarts |
| Epochs | 150 (early stopping patience=25) |
| Batch Size | 48 |
| Regularization | Dropout (0.4), Gradient Clipping, Weight Decay |
| Sampling | WeightedRandomSampler for class balance |
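The loss combines focal weighting with label smoothing. The sketch below shows one common way to combine the two with the documented settings (γ=2.0, smoothing=0.1); the project's exact formulation may differ, and the function name `focal_loss` is illustrative.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, smoothing=0.1, num_classes=4):
    """Focal loss over label-smoothed targets (a sketch, not verbatim code)."""
    log_p = F.log_softmax(logits, dim=-1)
    # Label smoothing: true class gets 1 - smoothing, the rest share smoothing.
    with torch.no_grad():
        t = torch.full_like(log_p, smoothing / (num_classes - 1))
        t.scatter_(1, targets.unsqueeze(1), 1.0 - smoothing)
    # Focal weight (1 - p_true)^gamma down-weights easy, confident examples.
    p_true = log_p.gather(1, targets.unsqueeze(1)).exp().squeeze(1)
    focal_w = (1.0 - p_true) ** gamma
    ce = -(t * log_p).sum(dim=-1)          # smoothed cross-entropy per sample
    return (focal_w * ce).mean()

loss = focal_loss(torch.randn(8, 4), torch.randint(0, 4, (8,)))
```

Focal weighting and the WeightedRandomSampler attack class imbalance from two sides: the sampler balances what the model sees, while the loss down-weights examples it already classifies confidently.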

Hardware and Software Requirements

Hardware

  • GPU: NVIDIA A100 (MIG partition, 9.8 GB VRAM)
  • CPU: Intel Xeon Gold
  • RAM: 251 GB system memory

A CUDA-capable GPU is strongly recommended for training. Inference can run on CPU for small batches.

Software

  • Python: 3.8+
  • PyTorch + Torchvision: CUDA 11.8 build
  • cuDNN: bundled with CUDA 11.8
  • Key packages: h5py, numpy, pandas, opencv-python, scikit-learn, flask, tqdm, psutil, matplotlib, seaborn
Training from scratch requires substantial GPU memory and wall-clock time (30–50 hours on an A100 MIG partition). For evaluation and inference, pre-extracted .h5 feature files and saved model checkpoints are all that is needed.

Performance Summary

| Mode | Accuracy | Weighted F1 |
| --- | --- | --- |
| Standard inference (ensemble of 4) | ~93% | ~92% |
| With Test-Time Augmentation (TTA) | ~95% | >95% |
Per-class F1 scores for the best individual ensemble member (best_ensemble_model_2.pt, val acc 92.1%):
| Class | F1 Score |
| --- | --- |
| Animation | 86.5% |
| Flat Content | 96.7% |
| Gaming | 87.9% |
| Natural Content | 97.5% |
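Ensembling with TTA boils down to averaging softmax probabilities over every (model, augmented view) pair. The sketch below illustrates the idea with toy linear "models" and a temporal-flip view; the function name `ensemble_tta_predict` and the specific augmentations are assumptions, since the docs do not list the project's TTA transforms.

```python
import torch

def ensemble_tta_predict(models, clip, tta_views):
    """Average class probabilities over models x TTA views."""
    probs = []
    with torch.no_grad():
        for m in models:
            m.eval()
            for view in tta_views:
                probs.append(torch.softmax(m(view(clip)), dim=-1))
    return torch.stack(probs).mean(dim=0)

# Toy demo: 4 linear heads stand in for the 4 ensemble checkpoints.
clip = torch.randn(1, 73, 1280)
models = [torch.nn.Sequential(torch.nn.Flatten(1),
                              torch.nn.Linear(73 * 1280, 4))
          for _ in range(4)]
# Identity view plus a reversed-frame-order view as example augmentations.
views = [lambda x: x, lambda x: torch.flip(x, dims=[1])]
p = ensemble_tta_predict(models, clip, views)
print(p.shape)  # torch.Size([1, 4])
```

Averaging probabilities (rather than hard votes) keeps the final prediction smooth: a view or model that is uncertain contributes a flat distribution instead of flipping the decision outright.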

Team

This project was developed as an industry-sponsored academic project in partnership with NVIDIA.
| Role | Name |
| --- | --- |
| Developer | Manas Kulkarni |
| Developer | Samiksha Nalawade |
| Developer | Rajlakshmi Desai |
| Faculty Guide | Dr. Shripad Bhatlawande |
| Industry Sponsor | NVIDIA |

Next Steps

Quickstart

Set up the environment, run the Flask inference server, and classify your first video.

Architecture

Deep dive into the SuperEnhancedTemporalModel, attention mechanism, and ensemble strategy.

Training Guide

Reproduce the full training pipeline from raw video files to saved checkpoints.