What Is This Project?

The NVIDIA Video Classification Project is a deep learning–based multi-class video classification system built end-to-end to understand both spatial and temporal patterns in video data. It was developed as Project-1 (Industry-Sponsored by NVIDIA) and trained on NVIDIA GPU servers (A100 MIG partition). Unlike image classifiers that process a single frame, video understanding requires modeling motion, temporal dependencies, and long-range context across many frames. This project addresses that challenge through a two-stage architecture that first extracts per-frame spatial features with a pretrained CNN and then models the sequence of those features with a Bidirectional LSTM and Multi-Head Self-Attention.
The system achieves ~93% test accuracy (up to 95% with Test-Time Augmentation) and F1 scores above 95% when using model ensembling with TTA.

Content Categories

The classifier distinguishes four broad video content types sourced from YouTube-8M:

Animation

Animated video content including cartoons, CGI, and motion graphics. Visually distinct due to artificially generated textures and motion.

Gaming

Gameplay footage and screen recordings. Characterized by HUD elements, synthetic environments, and rapid on-screen motion.

Natural Content

Real-world footage of nature, outdoor scenes, and organic subjects. High variance in lighting, color, and motion dynamics.

Flat Content

Static or near-static visual content such as slides, screen shares, talking-head video, and documents. Minimal temporal change.

Two-Stage Architecture

The system is structured as a sequential two-stage pipeline:

Stage 1 — Spatial Feature Extraction

A pretrained CNN backbone processes each video frame independently and produces a fixed-length embedding vector. Supported backbones:
  • ResNet-50 / ResNet-101 — ImageNet pretrained, 2048-dim output
  • EfficientNet-V2-S / EfficientNet-V2-M — ImageNet pretrained, 1280-dim output
Multi-scale temporal sampling (scales: 0.85×, 1.0×, 1.15×) is used during training to improve robustness. The final per-frame feature dimension used in production is 1280.

Stage 2 — Temporal Modeling (SuperEnhancedTemporalModel)

The sequence of frame-level embeddings from Stage 1 is fed into the temporal model:
| Component | Configuration |
| --- | --- |
| Input Projection | Linear → LayerNorm → ReLU → Dropout |
| BiLSTM | 4 layers, hidden dim 768, bidirectional |
| Multi-Head Self-Attention | 12 heads, embed dim 1536 (768 × 2) |
| Attention Pooling | Weighted sum over timesteps |
| Classifier Head | Linear(1536→768) → LN → ReLU → Linear(768→512) → LN → ReLU → Linear(512→256) → LN → ReLU → Linear(256→4) |
The BiLSTM processes frames in both forward and backward directions simultaneously, allowing the model to capture context from both past and future frames when classifying any given moment in the video.
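The table above can be sketched as a PyTorch module. This is an illustrative reconstruction from the documented dimensions, not the project's actual `SuperEnhancedTemporalModel` source; the class name `TemporalModelSketch`, the 768-dim projection width, and the single-linear attention-pooling layer are assumptions.

```python
import torch
import torch.nn as nn

class TemporalModelSketch(nn.Module):
    """Sketch of the Stage-2 temporal model using the dims from the docs."""
    def __init__(self, feat_dim=1280, hidden=768, heads=12, num_classes=4):
        super().__init__()
        d = hidden * 2  # 1536 after the bidirectional LSTM
        self.proj = nn.Sequential(                 # input projection
            nn.Linear(feat_dim, hidden), nn.LayerNorm(hidden),
            nn.ReLU(), nn.Dropout(0.4))
        self.lstm = nn.LSTM(hidden, hidden, num_layers=4,
                            batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.pool = nn.Linear(d, 1)                # attention-pooling scores
        self.head = nn.Sequential(                 # classifier head
            nn.Linear(d, 768), nn.LayerNorm(768), nn.ReLU(),
            nn.Linear(768, 512), nn.LayerNorm(512), nn.ReLU(),
            nn.Linear(512, 256), nn.LayerNorm(256), nn.ReLU(),
            nn.Linear(256, num_classes))

    def forward(self, x):                       # x: (B, T, 1280)
        h, _ = self.lstm(self.proj(x))          # (B, T, 1536)
        h, _ = self.attn(h, h, h)               # self-attention over frames
        w = torch.softmax(self.pool(h), dim=1)  # (B, T, 1) timestep weights
        pooled = (w * h).sum(dim=1)             # weighted sum over timesteps
        return self.head(pooled)                # (B, num_classes)

logits = TemporalModelSketch()(torch.randn(2, 73, 1280))
print(logits.shape)  # torch.Size([2, 4])
```

Note how the bidirectional LSTM doubles the channel width (768 × 2 = 1536), which is why the attention and classifier head operate on 1536-dim features.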

Dataset

| Property | Value |
| --- | --- |
| Source | YouTube-8M |
| Total videos | ~4,000 |
| Categories | 4 main classes, 46 subcategories |
| Train split | 70% |
| Validation split | 20% |
| Test split | 10% |
| Frames per video | 73 (uniform sampling) |
| Feature file format | HDF5 (.h5) with gzip compression |
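Uniformly sampling 73 frames per video can be done with a short index helper. The function name `uniform_frame_indices` is illustrative; the project's actual sampling code is not shown in these docs.

```python
import numpy as np

def uniform_frame_indices(total_frames, num_samples=73):
    """Pick num_samples frame indices spread evenly across a video.

    np.linspace yields evenly spaced positions that include the first
    and last frame; rounding snaps them to integer frame indices.
    """
    idx = np.linspace(0, total_frames - 1, num_samples)
    return idx.round().astype(int)

idx = uniform_frame_indices(900)   # e.g. a 30 s clip at 30 fps
print(len(idx), idx[0], idx[-1])   # 73 0 899
```

Sampling a fixed frame count gives every video the same sequence length, so the extracted features can be stored as fixed-shape arrays in the .h5 files.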

Training Configuration

| Hyperparameter | Value |
| --- | --- |
| Framework | PyTorch |
| Loss Function | Focal Loss with Label Smoothing (γ=2.0, smoothing=0.1) |
| Optimizer | AdamW (lr=0.001, weight decay applied) |
| LR Scheduler | Cosine Annealing with Warm Restarts |
| Epochs | 150 (early stopping patience=25) |
| Batch Size | 48 |
| Regularization | Dropout (0.4), Gradient Clipping, Weight Decay |
| Sampling | WeightedRandomSampler for class balance |
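The loss combines focal weighting with label smoothing. The sketch below shows one common way to combine the two with the documented settings (γ=2.0, smoothing=0.1); the project's exact formulation may differ, and the function name `focal_loss` is illustrative.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, smoothing=0.1, num_classes=4):
    """Focal loss over label-smoothed targets (a sketch, not verbatim code)."""
    log_p = F.log_softmax(logits, dim=-1)
    # Label smoothing: true class gets 1 - smoothing, the rest share smoothing.
    with torch.no_grad():
        t = torch.full_like(log_p, smoothing / (num_classes - 1))
        t.scatter_(1, targets.unsqueeze(1), 1.0 - smoothing)
    # Focal weight (1 - p_true)^gamma down-weights easy, confident examples.
    p_true = log_p.gather(1, targets.unsqueeze(1)).exp().squeeze(1)
    focal_w = (1.0 - p_true) ** gamma
    ce = -(t * log_p).sum(dim=-1)          # smoothed cross-entropy per sample
    return (focal_w * ce).mean()

loss = focal_loss(torch.randn(8, 4), torch.randint(0, 4, (8,)))
```

Focal weighting and the WeightedRandomSampler attack class imbalance from two sides: the sampler balances what the model sees, while the loss down-weights examples it already classifies confidently.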

Hardware and Software Requirements

Hardware

  • GPU: NVIDIA A100 (MIG partition, 9.8 GB VRAM)
  • CPU: Intel Xeon Gold
  • RAM: 251 GB system memory

A CUDA-capable GPU is strongly recommended for training. Inference can run on CPU for small batches.

Software

  • Python: 3.8+
  • PyTorch + Torchvision: CUDA 11.8 build
  • cuDNN: bundled with CUDA 11.8
  • Key packages: h5py, numpy, pandas, opencv-python, scikit-learn, flask, tqdm, psutil, matplotlib, seaborn
Training from scratch requires substantial GPU memory and wall-clock time (30–50 hours on an A100 MIG partition). For evaluation and inference, pre-extracted .h5 feature files and saved model checkpoints are all that is needed.

Performance Summary

| Mode | Accuracy | Weighted F1 |
| --- | --- | --- |
| Standard inference (ensemble of 4) | ~93% | ~92% |
| With Test-Time Augmentation (TTA) | ~95% | >95% |
Per-class F1 scores for the best individual ensemble member (best_ensemble_model_2.pt, val acc 92.1%):
| Class | F1 Score |
| --- | --- |
| Animation | 86.5% |
| Flat Content | 96.7% |
| Gaming | 87.9% |
| Natural Content | 97.5% |
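Ensembling with TTA boils down to averaging softmax probabilities over every (model, augmented view) pair. The sketch below illustrates the idea with toy linear "models" and a temporal-flip view; the function name `ensemble_tta_predict` and the specific augmentations are assumptions, since the docs do not list the project's TTA transforms.

```python
import torch

def ensemble_tta_predict(models, clip, tta_views):
    """Average class probabilities over models x TTA views."""
    probs = []
    with torch.no_grad():
        for m in models:
            m.eval()
            for view in tta_views:
                probs.append(torch.softmax(m(view(clip)), dim=-1))
    return torch.stack(probs).mean(dim=0)

# Toy demo: 4 linear heads stand in for the 4 ensemble checkpoints.
clip = torch.randn(1, 73, 1280)
models = [torch.nn.Sequential(torch.nn.Flatten(1),
                              torch.nn.Linear(73 * 1280, 4))
          for _ in range(4)]
# Identity view plus a reversed-frame-order view as example augmentations.
views = [lambda x: x, lambda x: torch.flip(x, dims=[1])]
p = ensemble_tta_predict(models, clip, views)
print(p.shape)  # torch.Size([1, 4])
```

Averaging probabilities (rather than hard votes) keeps the final prediction smooth: a view or model that is uncertain contributes a flat distribution instead of flipping the decision outright.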

Team

This project was developed as an industry-sponsored academic project in partnership with NVIDIA.
| Role | Name |
| --- | --- |
| Developer | Manas Kulkarni |
| Developer | Samiksha Nalawade |
| Developer | Rajlakshmi Desai |
| Faculty Guide | Dr. Shripad Bhatlawande |
| Industry Sponsor | NVIDIA |

Next Steps

Quickstart

Set up the environment, run the Flask inference server, and classify your first video.

Architecture

Deep dive into the SuperEnhancedTemporalModel, attention mechanism, and ensemble strategy.

Training Guide

Reproduce the full training pipeline from raw video files to saved checkpoints.