Quick Start
Get up and running with inference in minutes using pre-extracted features.
Architecture Overview
Understand the two-stage CNN + Bi-LSTM + Attention pipeline.
Training Guide
Learn how to train the model from scratch on your own data.
Model Cards
Explore hyperparameters, metrics, and ensemble checkpoint details.
What this project does
This system classifies videos into four content categories (Animation, Gaming, Natural Content, and Flat Content) by jointly modeling spatial appearance and temporal dynamics. Unlike image classifiers, it captures motion, scene transitions, and long-range temporal patterns.

This project was developed as Project-1 (industry-sponsored by NVIDIA) and trained on an NVIDIA A100 GPU (MIG partition, 9.8 GB VRAM) with 251 GB of RAM.
Key highlights
~93–95% accuracy
Test accuracy across all four categories is ~93% with standard inference and ~95% with Test-Time Augmentation (TTA).
Two-stage architecture
Stage 1: CNN spatial features. Stage 2: Bi-LSTM + Multi-Head Self-Attention temporal modeling.
4-model ensemble
Predictions from four independently trained models are averaged at inference for robustness.
Multi-scale features
Features extracted at three temporal scales (1.0×, 0.85×, 1.15×) and averaged.
Flask web app
A web UI for selecting dataset videos and running classification with optional TTA.
Built from scratch
No plug-and-play repositories: every component was designed and trained end-to-end.
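The ensemble step in the highlights can be illustrated with a minimal sketch. It assumes that each of the four models emits one logit per class and that the ensemble averages softmax probabilities; the project may instead average logits or weight the models. The function names and the logit values are hypothetical and exist only for illustration.

```python
import math

CLASSES = ["Animation", "Gaming", "Natural Content", "Flat Content"]


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(per_model_logits):
    """Average softmax probabilities over models; return (label, probs).

    Assumption: plain unweighted averaging of probabilities.
    """
    probs = [softmax(logits) for logits in per_model_logits]
    n_models = len(probs)
    avg = [sum(p[i] for p in probs) / n_models for i in range(len(CLASSES))]
    best = max(range(len(CLASSES)), key=avg.__getitem__)
    return CLASSES[best], avg


# Hypothetical logits from four checkpoints for a single video.
logits = [
    [2.0, 0.5, 0.1, -1.0],
    [1.5, 1.0, 0.0, -0.5],
    [0.2, 1.1, 0.3, -0.8],   # this model alone would pick "Gaming"
    [2.2, 0.4, -0.2, -1.2],
]
label, avg_probs = ensemble_predict(logits)
print(label, [round(p, 3) for p in avg_probs])
```

In this example one model disagrees with the other three, and averaging lets the majority outweigh it. Under the same assumption, Test-Time Augmentation could be handled by treating each augmented view's logits as an extra row in the list.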
System architecture at a glance
Dataset
Sourced from YouTube-8M, with ~4,000 videos across four main categories and 46 subcategories:

| Split | Proportion | Videos (approx.) |
|---|---|---|
| Train | 70% | ~2,800 |
| Validation | 20% | ~800 |
| Test | 10% | ~400 |
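
The 70/20/10 proportions in the table can be reproduced with a short sketch. It assumes a simple seeded random shuffle over video IDs; whether the project stratifies by category or subcategory is not stated here, so treat `split_dataset` and the placeholder IDs as hypothetical.

```python
import random


def split_dataset(items, train=0.7, val=0.2, seed=0):
    """Shuffle items reproducibly and cut them into train/val/test lists.

    Whatever remains after the train and validation cuts becomes the test set.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train)
    n_val = round(len(items) * val)
    return (
        items[:n_train],
        items[n_train:n_train + n_val],
        items[n_train + n_val:],
    )


# Placeholder IDs standing in for the ~4,000 videos.
video_ids = [f"video_{i:04d}" for i in range(4000)]
train_set, val_set, test_set = split_dataset(video_ids)
print(len(train_set), len(val_set), len(test_set))
```

Fixing the seed keeps the split stable across runs, so validation and test metrics stay comparable between experiments.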
Explore the docs
Concepts
Deep-dive into the two-stage architecture, feature extraction backbones, temporal modeling, and ensemble strategy.
Training Guide
Dataset setup, preprocessing pipeline, training configuration, and optimization details.
Inference & Deployment
Run inference from the command line or via the Flask web application.
Evaluation
Per-class F1 scores, accuracy metrics, and ensemble + TTA performance results.