Quick Start

Get up and running with inference in minutes using pre-extracted features.

Architecture Overview

Understand the two-stage CNN + Bi-LSTM + Attention pipeline.

Training Guide

Learn how to train the model from scratch on your own data.

Model Cards

Explore hyperparameters, metrics, and ensemble checkpoint details.

What this project does

This system classifies videos into four content categories (Animation, Gaming, Natural Content, and Flat Content) by jointly modeling spatial appearance and temporal dynamics. Unlike single-image classifiers, it captures motion, scene transitions, and long-range temporal patterns.
This project was developed as Project-1 (Industry-Sponsored by NVIDIA) and trained on an NVIDIA A100 GPU (MIG partition, 9.8 GB VRAM) with 251 GB of system RAM.

Key highlights

~93–95% accuracy

~93% standard test accuracy, rising to ~95% with Test-Time Augmentation (TTA), across all four categories.

Two-stage architecture

Stage 1: CNN spatial features. Stage 2: Bi-LSTM + Multi-Head Self-Attention temporal modeling.

4-model ensemble

Four independently trained models are averaged at inference for robustness.
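A minimal sketch of what averaging the four checkpoints at inference could look like, assuming each model maps pre-extracted features to per-class softmax probabilities (the model callables and probabilities below are hypothetical stand-ins, not the project's actual checkpoints):

```python
import numpy as np

CLASSES = ["Animation", "Gaming", "Natural Content", "Flat Content"]

def ensemble_predict(models, features):
    """Average per-class probabilities from independently trained models.

    models   -- list of callables mapping features -> softmax probabilities
    features -- pre-extracted feature array, e.g. shape [T_frames, 1280]
    """
    probs = np.mean([m(features) for m in models], axis=0)
    return CLASSES[int(np.argmax(probs))], probs

# Toy stand-ins for four trained checkpoints (hypothetical outputs):
models = [lambda f, p=p: p for p in (
    np.array([0.70, 0.10, 0.10, 0.10]),
    np.array([0.60, 0.20, 0.10, 0.10]),
    np.array([0.50, 0.30, 0.10, 0.10]),
    np.array([0.80, 0.10, 0.05, 0.05]),
)]
label, probs = ensemble_predict(models, features=None)
print(label)  # Animation
```

Averaging probabilities (rather than hard votes) lets a confident model outweigh uncertain ones while still smoothing out individual errors.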

Multi-scale features

Features extracted at three temporal scales (1.0×, 0.85×, 1.15×) and averaged.
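One way the three-scale averaging could be realized is to resample the frame sequence at each temporal scale, extract features per scale, resample back to a common length, and take the mean. This is a sketch under those assumptions; the function names and the identity "extractor" below are illustrative only:

```python
import numpy as np

SCALES = (1.0, 0.85, 1.15)  # temporal scales from the docs

def resample_frames(frames, scale):
    """Resample a [T, ...] sequence to round(T * scale) frames by nearest index."""
    t = len(frames)
    new_t = max(1, round(t * scale))
    idx = np.linspace(0, t - 1, new_t).round().astype(int)
    return frames[idx]

def multiscale_features(frames, extract, target_t):
    """Extract features at each scale, align to a common length, and average."""
    feats = []
    for s in SCALES:
        f = extract(resample_frames(frames, s))                  # [T_s, 1280]
        idx = np.linspace(0, len(f) - 1, target_t).round().astype(int)
        feats.append(f[idx])                                     # [target_t, 1280]
    return np.mean(feats, axis=0)

# Toy usage with an identity "extractor" (hypothetical):
frames = np.arange(32, dtype=float)[:, None] * np.ones((1, 1280))
out = multiscale_features(frames, extract=lambda x: x, target_t=32)
print(out.shape)  # (32, 1280)
```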

Flask web app

A web UI for selecting dataset videos and running classification with optional TTA.

Built from scratch

No plug-and-play repositories: every component was designed and trained end-to-end.

System architecture at a glance

Video Input


┌─────────────────────────────────────────┐
│  Stage 1: Spatial Feature Extraction    │
│  CNN Backbone (EfficientNet-V2 /        │
│  ResNet-50/101) → 1280-dim per frame    │
└─────────────────────────────────────────┘
    │  [N_videos, T_frames, 1280]

┌─────────────────────────────────────────┐
│  Stage 2: Temporal Modeling             │
│  Input Projection → LayerNorm           │
│  4-layer Bidirectional LSTM (hidden=768)│
│  Multi-Head Self-Attention (12 heads)   │
│  Attention Pooling → Classifier MLP     │
└─────────────────────────────────────────┘


4-Class Output: Animation | Gaming |
                Natural Content | Flat Content
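The Stage 2 block in the diagram could be sketched in PyTorch roughly as follows, using the layer sizes shown above (projection to hidden=768, 4-layer Bi-LSTM, 12-head self-attention, attention pooling, MLP head). This is an illustrative reconstruction, not the project's exact implementation:

```python
import torch
import torch.nn as nn

class TemporalClassifier(nn.Module):
    """Projection -> LayerNorm -> 4-layer Bi-LSTM -> multi-head self-attention
    -> attention pooling -> classifier MLP, per the architecture diagram."""

    def __init__(self, feat_dim=1280, hidden=768, heads=12, num_classes=4):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden)
        self.norm = nn.LayerNorm(hidden)
        self.lstm = nn.LSTM(hidden, hidden, num_layers=4,
                            bidirectional=True, batch_first=True)
        self.attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)
        self.pool = nn.Linear(2 * hidden, 1)   # scores for attention pooling
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.GELU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, x):                       # x: [N_videos, T_frames, 1280]
        h = self.norm(self.proj(x))             # [N, T, 768]
        h, _ = self.lstm(h)                     # [N, T, 1536] (bidirectional)
        h, _ = self.attn(h, h, h)               # self-attention over time
        w = torch.softmax(self.pool(h), dim=1)  # [N, T, 1] pooling weights
        pooled = (w * h).sum(dim=1)             # [N, 1536]
        return self.head(pooled)                # [N, 4] class logits

logits = TemporalClassifier()(torch.randn(2, 16, 1280))
print(logits.shape)  # torch.Size([2, 4])
```

Note the bidirectional LSTM doubles the feature width (768 → 1536), so the attention and head layers operate on 2 × hidden dimensions.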

Dataset

Sourced from YouTube-8M with ~4,000 videos across four main categories and 46 subcategories:
| Split      | Proportion | Videos (approx.) |
|------------|------------|------------------|
| Train      | 70%        | ~2,800           |
| Validation | 20%        | ~800             |
| Test       | 10%        | ~400             |
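The 70/20/10 split above can be reproduced with a simple seeded shuffle; this is a generic sketch (the project's actual splitting code and seed are not documented here):

```python
import random

def split_dataset(video_ids, seed=42):
    """Deterministic 70/20/10 train/val/test split over a list of video IDs."""
    ids = list(video_ids)
    random.Random(seed).shuffle(ids)  # seeded for reproducibility
    n = len(ids)
    n_train, n_val = int(0.70 * n), int(0.20 * n)
    return (ids[:n_train],
            ids[n_train:n_train + n_val],
            ids[n_train + n_val:])

train, val, test = split_dataset(range(4000))
print(len(train), len(val), len(test))  # 2800 800 400
```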

Explore the docs

Concepts

Deep-dive into the two-stage architecture, feature extraction backbones, temporal modeling, and ensemble strategy.

Training Guide

Dataset setup, preprocessing pipeline, training configuration, and optimization details.

Inference & Deployment

Run inference from the command line or via the Flask web application.

Evaluation

Per-class F1 scores, accuracy metrics, and ensemble + TTA performance results.
