The NVIDIA Video Classification system is built around a two-stage pipeline that decouples spatial understanding from temporal reasoning. Stage 1 extracts rich per-frame features using a pretrained CNN backbone; Stage 2 models the temporal dynamics of those features using a deep Bidirectional LSTM followed by Multi-Head Self-Attention.

Data Flow

Raw Video
    │
    ▼
┌─────────────────────────────────────┐
│  STAGE 1 – Spatial Feature          │
│  Extraction                         │
│  (EnhancedFeatureExtractor)         │
│                                     │
│  Frames (256×256, 64 per video)     │
│       │                             │
│       ▼                             │
│  Pretrained CNN Backbone            │
│  (ResNet50/101 or EfficientNet-V2)  │
│       │                             │
│       ▼                             │
│  Per-frame feature vectors          │
│  [T × 1280]                         │
│       │                             │
│  Multi-scale averaging (×3 scales)  │
│       │                             │
│       ▼                             │
│  Saved → .h5 file (offline)         │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────┐
│  STAGE 2 – Temporal Modeling        │
│  (SuperEnhancedTemporalModel)       │
│                                     │
│  Input Projection                   │
│  Linear(1280→768) + LayerNorm + ReLU│
│       │                             │
│       ▼                             │
│  4-Layer Bidirectional LSTM         │
│  hidden_dim=768 → output: 1536      │
│       │                             │
│       ▼                             │
│  Multi-Head Self-Attention          │
│  12 heads, residual + LayerNorm     │
│       │                             │
│       ▼                             │
│  Attention Pooling                  │
│  Weighted sum → [1536]              │
│       │                             │
│       ▼                             │
│  Classifier MLP                     │
│  1536→768→512→256→4 classes         │
└──────────────┬──────────────────────┘
               │
               ▼
  Softmax → Animation / Gaming /
            Natural Content / Flat Content

Stage 1: Spatial Feature Extraction

CNN Backbone

A pretrained CNN (ResNet-50, ResNet-101, EfficientNet-V2-S, or EfficientNet-V2-M) extracts semantic features from individual frames. The classification head is removed and the model is frozen: all parameters have requires_grad=False.

Feature Dimensions

EfficientNet-V2-S/M backbones produce a 1280-dimensional feature vector per frame; ResNet-50/101 produce 2048-dimensional vectors. The production ensemble checkpoints use 1280-dim (EfficientNet-V2).

Multi-Scale Extraction

Each video is processed at three temporal scales (1.0×, 0.85×, 1.15×) to capture motion at different speeds. Features from all three scales are averaged together.
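One way the scale averaging could look, sketched in NumPy; the centered-window scheme used here to realize a temporal scale is an assumption, not the repository's exact logic:

```python
import numpy as np

def sample_indices(n_frames: int, n_samples: int, scale: float) -> np.ndarray:
    """Uniformly sample frame indices over a window stretched by `scale`."""
    # scale < 1 samples a shorter central window; scale > 1 is clamped to
    # the clip boundaries. (Windowing choice is an assumption.)
    span = min(n_frames - 1, (n_frames - 1) * scale)
    start = (n_frames - 1 - span) / 2.0
    return np.round(np.linspace(start, start + span, n_samples)).astype(int)

def multiscale_features(frame_feats: np.ndarray, n_samples: int = 64,
                        scales=(1.0, 0.85, 1.15)) -> np.ndarray:
    """Average per-frame features sampled at the three temporal scales."""
    per_scale = [frame_feats[sample_indices(len(frame_feats), n_samples, s)]
                 for s in scales]
    return np.mean(per_scale, axis=0)     # [n_samples, feat_dim]

feats = np.random.rand(300, 1280).astype(np.float32)  # a 300-frame clip
avg = multiscale_features(feats)
print(avg.shape)  # (64, 1280)
```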

HDF5 Pre-Extraction

Features are extracted once and saved to compressed .h5 files (train_features_multiscale.h5, val_features_multiscale.h5, test_features_multiscale.h5). Training then reads from these files rather than running the CNN each epoch.
Pre-extracting features is a critical performance optimization. The CNN backbone never runs during training: only the lightweight temporal model is trained. This reduces Stage 2 training memory to ~6–7 GB on the NVIDIA A100 MIG partition.
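The write/read pattern can be sketched with h5py as below; the dataset names ("features", "labels", "frame_counts") are assumptions about the .h5 layout:

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical dataset names; the actual .h5 layout may differ.
N, MAX_FRAMES, FEAT_DIM = 10, 64, 1280
features = np.zeros((N, MAX_FRAMES, FEAT_DIM), dtype=np.float32)  # zero-padded
labels = np.random.randint(0, 4, size=N)
frame_counts = np.full(N, MAX_FRAMES)

path = os.path.join(tempfile.mkdtemp(), "train_features_multiscale.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("features", data=features, compression="gzip")
    f.create_dataset("labels", data=labels)
    f.create_dataset("frame_counts", data=frame_counts)

# At training time the CNN never runs: samples are sliced straight back out.
with h5py.File(path, "r") as f:
    x = f["features"][0]      # [max_frames, 1280] features for video 0
    y = int(f["labels"][0])
```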

Stage 2: Temporal Modeling

The SuperEnhancedTemporalModel reads pre-extracted feature sequences and learns which frames matter and how they relate across time.
Component           Configuration
Input projection    Linear(1280 → 768) + LayerNorm + ReLU + Dropout(0.2)
BiLSTM layers       4 layers, hidden_dim=768, bidirectional → output dim 1536
Self-attention      12 heads, embed_dim=1536, residual connection + LayerNorm
Attention pooling   Linear scoring → softmax weights → weighted sum
Classifier MLP      1536 → 768 → 512 → 256 → 4 with LayerNorm and Dropout(0.4)
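The components in the table compose into a single forward pass. A minimal sketch in PyTorch, with class and attribute names that are illustrative only (the real SuperEnhancedTemporalModel will differ in details such as per-layer dropout, masking, and initialization):

```python
import torch
import torch.nn as nn

class TemporalModelSketch(nn.Module):
    """Sketch of the Stage 2 architecture: projection -> BiLSTM ->
    self-attention -> attention pooling -> classifier MLP."""

    def __init__(self, feature_dim=1280, hidden_dim=768, num_classes=4,
                 num_lstm_layers=4, num_heads=12, dropout=0.4):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim), nn.LayerNorm(hidden_dim),
            nn.ReLU(), nn.Dropout(0.2))
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, num_lstm_layers,
                            batch_first=True, bidirectional=True)
        d = hidden_dim * 2                    # 1536 out of the BiLSTM
        self.attn = nn.MultiheadAttention(d, num_heads, batch_first=True)
        self.attn_norm = nn.LayerNorm(d)
        self.pool_score = nn.Linear(d, 1)     # attention-pooling scorer
        self.head = nn.Sequential(
            nn.Linear(d, 768), nn.LayerNorm(768), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(768, 512), nn.LayerNorm(512), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, num_classes))

    def forward(self, x):                     # x: [B, T, 1280]
        h, _ = self.lstm(self.proj(x))        # [B, T, 1536]
        a, _ = self.attn(h, h, h)
        h = self.attn_norm(h + a)             # residual + LayerNorm
        w = torch.softmax(self.pool_score(h), dim=1)   # [B, T, 1] weights
        pooled = (w * h).sum(dim=1)           # weighted sum -> [B, 1536]
        return self.head(pooled)              # logits over 4 classes

model = TemporalModelSketch().eval()
with torch.no_grad():
    logits = model(torch.randn(2, 64, 1280))
print(logits.shape)  # torch.Size([2, 4])
```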

Model Configuration

The exact hyperparameters used for each ensemble checkpoint, as recorded in configuration_analysis.json:
{
  "feature_dim": 1280,
  "hidden_dim": 768,
  "num_classes": 4,
  "num_lstm_layers": 4,
  "num_attention_heads": 12,
  "dropout": 0.4,
  "bidirectional": true
}
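A quick sanity check of the widths these hyperparameters imply, as plain arithmetic:

```python
import json

cfg = json.loads("""{
  "feature_dim": 1280, "hidden_dim": 768, "num_classes": 4,
  "num_lstm_layers": 4, "num_attention_heads": 12,
  "dropout": 0.4, "bidirectional": true
}""")

# A bidirectional LSTM concatenates forward and backward hidden states.
lstm_out = cfg["hidden_dim"] * (2 if cfg["bidirectional"] else 1)
# Multi-head attention splits that width evenly across the heads.
head_dim = lstm_out // cfg["num_attention_heads"]
print(lstm_out, head_dim)  # 1536 128
```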

Pre-Extraction Strategy

1. Extract frames: Each video is decoded to 64 uniformly sampled frames, resized to 256×256 pixels on the GPU.

2. Run CNN backbone: Frames are passed through the frozen backbone in batches of 24. The final pooling output (1280-d) is collected for every frame.

3. Apply multi-scale averaging: The extraction repeats at 0.85× and 1.15× temporal scales. All three scale outputs are averaged into a single [T, 1280] tensor.

4. Save to HDF5: Features, labels, and per-video frame counts are written to a compressed .h5 file. The dataset stores features in a zero-padded array of shape [N, max_frames, 1280].

5. Train temporal model: The EnhancedTemporalModelTrainer reads directly from the .h5 file. The CNN never runs again during training.
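Step 1's uniform sampling can be sketched as follows; the exact index arithmetic is an assumption:

```python
def uniform_frame_indices(total_frames: int, num_samples: int = 64) -> list:
    """Pick num_samples frame indices spread evenly across the clip."""
    if total_frames <= num_samples:
        return list(range(total_frames))   # short clip: take every frame
    step = total_frames / num_samples      # fractional stride over the clip
    return [int(i * step) for i in range(num_samples)]

idx = uniform_frame_indices(900)   # e.g. a 30 s clip at 30 fps
print(len(idx), idx[0])            # 64 0
```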

Classification Targets

The model classifies videos into four content categories sourced from YouTube-8M:
Index   Class             Description
0       Animation         Animated / CGI video content
1       Flat Content      Screen recordings, slides, static scenes
2       Gaming            Video game footage
3       Natural Content   Real-world video shot in natural environments

Performance Summary

Standard Accuracy

~93% on the test set using a single model without augmentation.

With TTA

~95% by averaging predictions across 4 temporal augmentation modes.

Ensemble + TTA

95% F1 with 4 independently trained models averaged together.
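A sketch of how ensemble-plus-TTA averaging could combine class probabilities; the shapes and the order of averaging (probabilities, not logits) are assumptions:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
# Hypothetical logits for one video: 4 checkpoints x 4 TTA modes x 4 classes.
logits = rng.normal(size=(4, 4, 4))

# Average class probabilities over TTA modes and over ensemble members.
probs = softmax(logits).mean(axis=(0, 1))    # [4] averaged class probabilities
pred = int(np.argmax(probs))
print(round(probs.sum(), 6), pred)
```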

Further Reading

  • Spatial Feature Extraction – EnhancedFeatureExtractor class, backbone selection, multi-scale logic, HDF5 format
  • Temporal Modeling – SuperEnhancedTemporalModel internals, BiLSTM, attention pooling
  • Ensemble & TTA – How 4 checkpoints are averaged and the 4 TTA modes explained
