The NVIDIA Video Classification system is built around a two-stage pipeline that decouples spatial understanding from temporal reasoning. Stage 1 extracts rich per-frame features using a pretrained CNN backbone; Stage 2 models the temporal dynamics of those features using a deep Bidirectional LSTM followed by Multi-Head Self-Attention.

Data Flow

Raw Video
    │
    ▼
┌─────────────────────────────────────┐
│  STAGE 1 – Spatial Feature          │
│  Extraction                         │
│  (EnhancedFeatureExtractor)         │
│                                     │
│  Frames (256×256, 64 per video)     │
│       │                             │
│       ▼                             │
│  Pretrained CNN Backbone            │
│  (ResNet50/101 or EfficientNet-V2)  │
│       │                             │
│       ▼                             │
│  Per-frame feature vectors          │
│  [T × 1280]                         │
│       │                             │
│  Multi-scale averaging (×3 scales)  │
│       │                             │
│       ▼                             │
│  Saved → .h5 file (offline)         │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────┐
│  STAGE 2 – Temporal Modeling        │
│  (SuperEnhancedTemporalModel)       │
│                                     │
│  Input Projection                   │
│  Linear(1280→768) + LayerNorm + ReLU│
│       │                             │
│       ▼                             │
│  4-Layer Bidirectional LSTM         │
│  hidden_dim=768 → output: 1536      │
│       │                             │
│       ▼                             │
│  Multi-Head Self-Attention          │
│  12 heads, residual + LayerNorm     │
│       │                             │
│       ▼                             │
│  Attention Pooling                  │
│  Weighted sum → [1536]              │
│       │                             │
│       ▼                             │
│  Classifier MLP                     │
│  1536→768→512→256→4 classes         │
└──────────────┬──────────────────────┘
               │
               ▼
  Softmax → Animation / Gaming /
            Natural Content / Flat Content

Stage 1: Spatial Feature Extraction

CNN Backbone

A pretrained CNN (ResNet-50, ResNet-101, EfficientNet-V2-S, or EfficientNet-V2-M) extracts semantic features from individual frames. The classification head is removed and the model is frozen: all parameters have requires_grad=False.

Feature Dimensions

EfficientNet-V2-S/M backbones produce a 1280-dimensional feature vector per frame; ResNet-50/101 produce 2048-dimensional vectors. The production ensemble checkpoints use 1280-dim (EfficientNet-V2).

Multi-Scale Extraction

Each video is processed at three temporal scales (1.0×, 0.85×, 1.15×) to capture motion at different speeds. Features from all three scales are averaged together.
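One way the scale averaging could look, sketched in NumPy; the centered-window scheme used here to realize a temporal scale is an assumption, not the repository's exact logic:

```python
import numpy as np

def sample_indices(n_frames: int, n_samples: int, scale: float) -> np.ndarray:
    """Uniformly sample frame indices over a window stretched by `scale`."""
    # scale < 1 samples a shorter central window; scale > 1 is clamped to
    # the clip boundaries. (Windowing choice is an assumption.)
    span = min(n_frames - 1, (n_frames - 1) * scale)
    start = (n_frames - 1 - span) / 2.0
    return np.round(np.linspace(start, start + span, n_samples)).astype(int)

def multiscale_features(frame_feats: np.ndarray, n_samples: int = 64,
                        scales=(1.0, 0.85, 1.15)) -> np.ndarray:
    """Average per-frame features sampled at the three temporal scales."""
    per_scale = [frame_feats[sample_indices(len(frame_feats), n_samples, s)]
                 for s in scales]
    return np.mean(per_scale, axis=0)     # [n_samples, feat_dim]

feats = np.random.rand(300, 1280).astype(np.float32)  # a 300-frame clip
avg = multiscale_features(feats)
print(avg.shape)  # (64, 1280)
```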

HDF5 Pre-Extraction

Features are extracted once and saved to compressed .h5 files (train_features_multiscale.h5, val_features_multiscale.h5, test_features_multiscale.h5). Training then reads from these files rather than running the CNN each epoch.
Pre-extracting features is a critical performance optimization. The CNN backbone never runs during training: only the lightweight temporal model is trained. This reduces Stage 2 training memory to ~6–7 GB on the NVIDIA A100 MIG partition.
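The write/read pattern can be sketched with h5py as below; the dataset names ("features", "labels", "frame_counts") are assumptions about the .h5 layout:

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical dataset names; the actual .h5 layout may differ.
N, MAX_FRAMES, FEAT_DIM = 10, 64, 1280
features = np.zeros((N, MAX_FRAMES, FEAT_DIM), dtype=np.float32)  # zero-padded
labels = np.random.randint(0, 4, size=N)
frame_counts = np.full(N, MAX_FRAMES)

path = os.path.join(tempfile.mkdtemp(), "train_features_multiscale.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("features", data=features, compression="gzip")
    f.create_dataset("labels", data=labels)
    f.create_dataset("frame_counts", data=frame_counts)

# At training time the CNN never runs: samples are sliced straight back out.
with h5py.File(path, "r") as f:
    x = f["features"][0]      # [max_frames, 1280] features for video 0
    y = int(f["labels"][0])
```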

Stage 2: Temporal Modeling

The SuperEnhancedTemporalModel reads pre-extracted feature sequences and learns which frames matter and how they relate across time.
Component           Configuration
Input projection    Linear(1280 → 768) + LayerNorm + ReLU + Dropout(0.2)
BiLSTM layers       4 layers, hidden_dim=768, bidirectional → output dim 1536
Self-attention      12 heads, embed_dim=1536, residual connection + LayerNorm
Attention pooling   Linear scoring → softmax weights → weighted sum
Classifier MLP      1536 → 768 → 512 → 256 → 4 with LayerNorm and Dropout(0.4)
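The components in the table compose into a single forward pass. A minimal sketch in PyTorch, with class and attribute names that are illustrative only (the real SuperEnhancedTemporalModel will differ in details such as per-layer dropout, masking, and initialization):

```python
import torch
import torch.nn as nn

class TemporalModelSketch(nn.Module):
    """Sketch of the Stage 2 architecture: projection -> BiLSTM ->
    self-attention -> attention pooling -> classifier MLP."""

    def __init__(self, feature_dim=1280, hidden_dim=768, num_classes=4,
                 num_lstm_layers=4, num_heads=12, dropout=0.4):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim), nn.LayerNorm(hidden_dim),
            nn.ReLU(), nn.Dropout(0.2))
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, num_lstm_layers,
                            batch_first=True, bidirectional=True)
        d = hidden_dim * 2                    # 1536 out of the BiLSTM
        self.attn = nn.MultiheadAttention(d, num_heads, batch_first=True)
        self.attn_norm = nn.LayerNorm(d)
        self.pool_score = nn.Linear(d, 1)     # attention-pooling scorer
        self.head = nn.Sequential(
            nn.Linear(d, 768), nn.LayerNorm(768), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(768, 512), nn.LayerNorm(512), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, num_classes))

    def forward(self, x):                     # x: [B, T, 1280]
        h, _ = self.lstm(self.proj(x))        # [B, T, 1536]
        a, _ = self.attn(h, h, h)
        h = self.attn_norm(h + a)             # residual + LayerNorm
        w = torch.softmax(self.pool_score(h), dim=1)   # [B, T, 1] weights
        pooled = (w * h).sum(dim=1)           # weighted sum -> [B, 1536]
        return self.head(pooled)              # logits over 4 classes

model = TemporalModelSketch().eval()
with torch.no_grad():
    logits = model(torch.randn(2, 64, 1280))
print(logits.shape)  # torch.Size([2, 4])
```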

Model Configuration

The exact hyperparameters used for each ensemble checkpoint, as recorded in configuration_analysis.json:
{
  "feature_dim": 1280,
  "hidden_dim": 768,
  "num_classes": 4,
  "num_lstm_layers": 4,
  "num_attention_heads": 12,
  "dropout": 0.4,
  "bidirectional": true
}
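A quick sanity check of the widths these hyperparameters imply, as plain arithmetic:

```python
import json

cfg = json.loads("""{
  "feature_dim": 1280, "hidden_dim": 768, "num_classes": 4,
  "num_lstm_layers": 4, "num_attention_heads": 12,
  "dropout": 0.4, "bidirectional": true
}""")

# A bidirectional LSTM concatenates forward and backward hidden states.
lstm_out = cfg["hidden_dim"] * (2 if cfg["bidirectional"] else 1)
# Multi-head attention splits that width evenly across the heads.
head_dim = lstm_out // cfg["num_attention_heads"]
print(lstm_out, head_dim)  # 1536 128
```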

Pre-Extraction Strategy

1. Extract frames: Each video is decoded to 64 uniformly sampled frames, resized to 256×256 pixels on the GPU.

2. Run CNN backbone: Frames are passed through the frozen backbone in batches of 24. The final pooling output (1280-d) is collected for every frame.

3. Apply multi-scale averaging: The extraction repeats at 0.85× and 1.15× temporal scales. All three scale outputs are averaged into a single [T, 1280] tensor.

4. Save to HDF5: Features, labels, and per-video frame counts are written to a compressed .h5 file. The dataset stores features in a zero-padded array of shape [N, max_frames, 1280].

5. Train temporal model: The EnhancedTemporalModelTrainer reads directly from the .h5 file. The CNN never runs again during training.
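Step 1's uniform sampling can be sketched as follows; the exact index arithmetic is an assumption:

```python
def uniform_frame_indices(total_frames: int, num_samples: int = 64) -> list:
    """Pick num_samples frame indices spread evenly across the clip."""
    if total_frames <= num_samples:
        return list(range(total_frames))   # short clip: take every frame
    step = total_frames / num_samples      # fractional stride over the clip
    return [int(i * step) for i in range(num_samples)]

idx = uniform_frame_indices(900)   # e.g. a 30 s clip at 30 fps
print(len(idx), idx[0])            # 64 0
```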

Classification Targets

The model classifies videos into four content categories sourced from YouTube-8M:
Index   Class             Description
0       Animation         Animated / CGI video content
1       Flat Content      Screen recordings, slides, static scenes
2       Gaming            Video game footage
3       Natural Content   Real-world video shot in natural environments

Performance Summary

Standard Accuracy

~93% on the test set using a single model without augmentation.

With TTA

~95% by averaging predictions across 4 temporal augmentation modes.

Ensemble + TTA

95% F1 with 4 independently trained models averaged together.
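A sketch of how ensemble-plus-TTA averaging could combine class probabilities; the shapes and the order of averaging (probabilities, not logits) are assumptions:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
# Hypothetical logits for one video: 4 checkpoints x 4 TTA modes x 4 classes.
logits = rng.normal(size=(4, 4, 4))

# Average class probabilities over TTA modes and over ensemble members.
probs = softmax(logits).mean(axis=(0, 1))    # [4] averaged class probabilities
pred = int(np.argmax(probs))
print(round(probs.sum(), 6), pred)
```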

Further Reading

  • Spatial Feature Extraction – EnhancedFeatureExtractor class, backbone selection, multi-scale logic, HDF5 format
  • Temporal Modeling – SuperEnhancedTemporalModel internals, BiLSTM, attention pooling
  • Ensemble & TTA – How 4 checkpoints are averaged and the 4 TTA modes explained
