Quick Start

Get up and running with inference in minutes using pre-extracted features.

Architecture Overview

Understand the two-stage CNN + Bi-LSTM + Attention pipeline.

Training Guide

Learn how to train the model from scratch on your own data.

Model Cards

Explore hyperparameters, metrics, and ensemble checkpoint details.

What this project does

This system classifies videos into four content categories (Animation, Gaming, Natural Content, and Flat Content) by jointly modeling spatial appearance and temporal dynamics. Unlike single-image classifiers, it captures motion, scene transitions, and long-range temporal patterns.
This project was developed as Project-1 (Industry-Sponsored by NVIDIA) and trained on an NVIDIA A100 GPU (MIG partition, 9.8 GB VRAM) with 251 GB of system RAM.

Key highlights

~93–95% accuracy

~93% standard test accuracy, rising to ~95% with Test-Time Augmentation (TTA), across all four categories.

Two-stage architecture

Stage 1: CNN spatial features. Stage 2: Bi-LSTM + Multi-Head Self-Attention temporal modeling.

4-model ensemble

Four independently trained models are averaged at inference for robustness.
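A minimal sketch of what averaging the four checkpoints at inference could look like, assuming each model maps pre-extracted features to per-class softmax probabilities (the model callables and probabilities below are hypothetical stand-ins, not the project's actual checkpoints):

```python
import numpy as np

CLASSES = ["Animation", "Gaming", "Natural Content", "Flat Content"]

def ensemble_predict(models, features):
    """Average per-class probabilities from independently trained models.

    models   -- list of callables mapping features -> softmax probabilities
    features -- pre-extracted feature array, e.g. shape [T_frames, 1280]
    """
    probs = np.mean([m(features) for m in models], axis=0)
    return CLASSES[int(np.argmax(probs))], probs

# Toy stand-ins for four trained checkpoints (hypothetical outputs):
models = [lambda f, p=p: p for p in (
    np.array([0.70, 0.10, 0.10, 0.10]),
    np.array([0.60, 0.20, 0.10, 0.10]),
    np.array([0.50, 0.30, 0.10, 0.10]),
    np.array([0.80, 0.10, 0.05, 0.05]),
)]
label, probs = ensemble_predict(models, features=None)
print(label)  # Animation
```

Averaging probabilities (rather than hard votes) lets a confident model outweigh uncertain ones while still smoothing out individual errors.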

Multi-scale features

Features extracted at three temporal scales (1.0×, 0.85×, 1.15×) and averaged.
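One way the three-scale averaging could be realized is to resample the frame sequence at each temporal scale, extract features per scale, resample back to a common length, and take the mean. This is a sketch under those assumptions; the function names and the identity "extractor" below are illustrative only:

```python
import numpy as np

SCALES = (1.0, 0.85, 1.15)  # temporal scales from the docs

def resample_frames(frames, scale):
    """Resample a [T, ...] sequence to round(T * scale) frames by nearest index."""
    t = len(frames)
    new_t = max(1, round(t * scale))
    idx = np.linspace(0, t - 1, new_t).round().astype(int)
    return frames[idx]

def multiscale_features(frames, extract, target_t):
    """Extract features at each scale, align to a common length, and average."""
    feats = []
    for s in SCALES:
        f = extract(resample_frames(frames, s))                  # [T_s, 1280]
        idx = np.linspace(0, len(f) - 1, target_t).round().astype(int)
        feats.append(f[idx])                                     # [target_t, 1280]
    return np.mean(feats, axis=0)

# Toy usage with an identity "extractor" (hypothetical):
frames = np.arange(32, dtype=float)[:, None] * np.ones((1, 1280))
out = multiscale_features(frames, extract=lambda x: x, target_t=32)
print(out.shape)  # (32, 1280)
```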

Flask web app

A web UI for selecting dataset videos and running classification with optional TTA.

Built from scratch

No plug-and-play repositories: every component was designed and trained end-to-end.

System architecture at a glance

Video Input


┌─────────────────────────────────────────┐
│  Stage 1: Spatial Feature Extraction    │
│  CNN Backbone (EfficientNet-V2 /        │
│  ResNet-50/101) → 1280-dim per frame    │
└─────────────────────────────────────────┘
    │  [N_videos, T_frames, 1280]

┌─────────────────────────────────────────┐
│  Stage 2: Temporal Modeling             │
│  Input Projection → LayerNorm           │
│  4-layer Bidirectional LSTM (hidden=768)│
│  Multi-Head Self-Attention (12 heads)   │
│  Attention Pooling → Classifier MLP     │
└─────────────────────────────────────────┘


4-Class Output: Animation | Gaming |
                Natural Content | Flat Content
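The Stage 2 block in the diagram could be sketched in PyTorch roughly as follows, using the layer sizes shown above (projection to hidden=768, 4-layer Bi-LSTM, 12-head self-attention, attention pooling, MLP head). This is an illustrative reconstruction, not the project's exact implementation:

```python
import torch
import torch.nn as nn

class TemporalClassifier(nn.Module):
    """Projection -> LayerNorm -> 4-layer Bi-LSTM -> multi-head self-attention
    -> attention pooling -> classifier MLP, per the architecture diagram."""

    def __init__(self, feat_dim=1280, hidden=768, heads=12, num_classes=4):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden)
        self.norm = nn.LayerNorm(hidden)
        self.lstm = nn.LSTM(hidden, hidden, num_layers=4,
                            bidirectional=True, batch_first=True)
        self.attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)
        self.pool = nn.Linear(2 * hidden, 1)   # scores for attention pooling
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.GELU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, x):                       # x: [N_videos, T_frames, 1280]
        h = self.norm(self.proj(x))             # [N, T, 768]
        h, _ = self.lstm(h)                     # [N, T, 1536] (bidirectional)
        h, _ = self.attn(h, h, h)               # self-attention over time
        w = torch.softmax(self.pool(h), dim=1)  # [N, T, 1] pooling weights
        pooled = (w * h).sum(dim=1)             # [N, 1536]
        return self.head(pooled)                # [N, 4] class logits

logits = TemporalClassifier()(torch.randn(2, 16, 1280))
print(logits.shape)  # torch.Size([2, 4])
```

Note the bidirectional LSTM doubles the feature width (768 → 1536), so the attention and head layers operate on 2 × hidden dimensions.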

Dataset

Sourced from YouTube-8M with ~4,000 videos across four main categories and 46 subcategories:
| Split      | Proportion | Videos (approx.) |
|------------|------------|------------------|
| Train      | 70%        | ~2,800           |
| Validation | 20%        | ~800             |
| Test       | 10%        | ~400             |
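The 70/20/10 split above can be reproduced with a simple seeded shuffle; this is a generic sketch (the project's actual splitting code and seed are not documented here):

```python
import random

def split_dataset(video_ids, seed=42):
    """Deterministic 70/20/10 train/val/test split over a list of video IDs."""
    ids = list(video_ids)
    random.Random(seed).shuffle(ids)  # seeded for reproducibility
    n = len(ids)
    n_train, n_val = int(0.70 * n), int(0.20 * n)
    return (ids[:n_train],
            ids[n_train:n_train + n_val],
            ids[n_train + n_val:])

train, val, test = split_dataset(range(4000))
print(len(train), len(val), len(test))  # 2800 800 400
```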

Explore the docs

Concepts

Deep-dive into the two-stage architecture, feature extraction backbones, temporal modeling, and ensemble strategy.

Training Guide

Dataset setup, preprocessing pipeline, training configuration, and optimization details.

Inference & Deployment

Run inference from the command line or via the Flask web application.

Evaluation

Per-class F1 scores, accuracy metrics, and ensemble + TTA performance results.
