This model is the core temporal classifier in the NVIDIA Video Classification Project. It operates on pre-extracted frame-level CNN features and produces one of four content-category predictions: Animation, Flat_Content, Gaming, or Natural_Content.
## Overview
SuperEnhancedTemporalModel is a sequence-to-label neural network designed for video content classification. Given a sequence of per-frame feature vectors (extracted by a pretrained CNN backbone such as EfficientNet-V2-S), it outputs a probability distribution over four content categories.
The architecture combines:
- A linear input projection to normalize feature space dimensionality
- A deep bidirectional LSTM to model temporal dynamics in both directions
- Multi-head self-attention with a residual connection to focus on the most discriminative video segments
- Attention pooling to aggregate the sequence into a single vector
- A deep MLP classifier with progressive width reduction and regularization
- **Target task:** four-class video content classification: Animation, Flat_Content, Gaming, Natural_Content
- **Input:** variable-length sequence of CNN frame features, shape `[T, 1280]`
- **Output:** class logits of shape `[batch, 4]`; apply softmax for probabilities
- **Performance target:** >95% validation accuracy with ensemble + TTA
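The logits-to-probabilities step can be sketched as follows (the logits here are dummy values standing in for real model output):

```python
import torch
import torch.nn.functional as F

CLASS_NAMES = ["Animation", "Flat_Content", "Gaming", "Natural_Content"]

# Dummy logits with the documented output shape [batch, 4].
logits = torch.tensor([[0.2, 3.1, -0.5, 0.8]])

probs = F.softmax(logits, dim=-1)              # each row sums to 1
pred = CLASS_NAMES[probs.argmax(dim=-1).item()]
print(pred)  # -> Flat_Content
```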
## Architecture Details
The full forward pass proceeds through five sequential stages.

### Input Projection

A 3-layer MLP normalizes the raw CNN features from `feature_dim=1280` into the model's internal `hidden_dim=768` space, applying regularization before any sequence modeling. Dropout here is `dropout * 0.5 = 0.2` (half the main dropout rate), keeping the projection lighter.
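A minimal sketch of such a projection; the source only fixes the 1280 → 768 endpoints and the 0.2 dropout, so the intermediate widths and activation are assumptions:

```python
import torch
import torch.nn as nn

feature_dim, hidden_dim, dropout = 1280, 768, 0.4

# Hypothetical 3-layer projection MLP; internal layout is an assumption.
input_projection = nn.Sequential(
    nn.Linear(feature_dim, hidden_dim),
    nn.GELU(),
    nn.Dropout(dropout * 0.5),          # 0.2, half the base rate
    nn.Linear(hidden_dim, hidden_dim),
    nn.GELU(),
    nn.Dropout(dropout * 0.5),
    nn.Linear(hidden_dim, hidden_dim),
)

x = torch.randn(2, 73, feature_dim)     # [B, T, 1280]
print(input_projection(x).shape)        # torch.Size([2, 73, 768])
```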
### Bidirectional LSTM

A 4-layer stacked BiLSTM processes the projected sequence in both temporal directions, capturing forward context (motion build-up) and backward context (motion resolution). The concatenation of forward and backward hidden states produces an output dimensionality of 1536 at each timestep.
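The dimensionality doubling can be verified directly with a standard `nn.LSTM`:

```python
import torch
import torch.nn as nn

# 4-layer BiLSTM; concatenating both directions doubles the hidden size.
lstm = nn.LSTM(
    input_size=768, hidden_size=768,
    num_layers=4, bidirectional=True,
    batch_first=True, dropout=0.4,      # dropout applies between stacked layers
)

x = torch.randn(2, 73, 768)             # projected sequence [B, T, 768]
out, _ = lstm(x)
print(out.shape)                        # torch.Size([2, 73, 1536])
```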
### Multi-Head Self-Attention + Residual
Self-attention lets every timestep attend to every other timestep, enabling the model to discover long-range temporal dependencies. A residual connection and LayerNorm stabilize training. With `embed_dim=1536` and `num_heads=12`, each attention head operates on a 128-dimensional subspace (1536 / 12 = 128).
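This stage can be sketched with PyTorch's built-in multi-head attention (whether the project uses `nn.MultiheadAttention` or a custom layer is an assumption):

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=1536, num_heads=12, batch_first=True)
norm = nn.LayerNorm(1536)

x = torch.randn(2, 73, 1536)            # BiLSTM output [B, T, 1536]
attn_out, _ = attn(x, x, x)             # self-attention: Q = K = V = x
y = norm(x + attn_out)                  # residual connection + LayerNorm
print(y.shape)                          # torch.Size([2, 73, 1536])
print(attn.embed_dim // attn.num_heads) # 128-dim subspace per head
```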
### Attention Pooling

Instead of naive mean- or max-pooling, a learned scoring network assigns an importance weight to each timestep and computes a weighted sum, collapsing `[B, T, 1536]` → `[B, 1536]`.

### Classifier

A deep MLP with progressive width reduction maps the pooled 1536-dimensional vector to the four class logits, applying dropout throughout (halved in the final block).

## Hyperparameters
### Model Configuration
| Parameter | Default | Description |
|---|---|---|
| `feature_dim` | 2048 (train) / 1280 (ensemble) | Input CNN feature dimension |
| `hidden_dim` | 768 | LSTM hidden size and internal projection dim |
| `num_classes` | 4 | Number of output classes |
| `num_lstm_layers` | 4 | Number of stacked BiLSTM layers |
| `num_attention_heads` | 12 | Number of self-attention heads |
| `dropout` | 0.4 | Base dropout rate (halved in the projection and final MLP block) |
| `bidirectional` | True | Whether the LSTM is bidirectional |
The ensemble checkpoints (`best_ensemble_model_*.pt`) were saved with `feature_dim=1280`, matching the EfficientNet-V2-S backbone used during feature extraction. The default value of 2048 in the class signature reflects a ResNet backbone default and is overridden at instantiation time.

### Training Configuration
| Parameter | Value | Description |
|---|---|---|
| `learning_rate` | 0.001 | Initial AdamW learning rate |
| `batch_size` | 48 | Videos per batch |
| `num_epochs` | 150 | Maximum training epochs |
| `patience` | 25 | Early-stopping patience (epochs without validation improvement) |
| `weight_decay` | 5e-4 | AdamW L2 regularization |
| `gradient_clip` | 1.0 | Maximum gradient norm |
| `scheduler` | CosineAnnealingWarmRestarts | `T_0=20`, `T_mult=2`, `eta_min=1e-6` |
| `loss_fn` | FocalLoss | `gamma=2.0`, label smoothing 0.1, class-weighted alpha |
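The training setup in the table can be sketched as follows. The `FocalLoss` shown is a generic focal-loss implementation consistent with the listed parameters, not necessarily the project's exact class, and the `nn.Linear` stands in for the real model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    """Generic focal loss with label smoothing and optional per-class alpha."""
    def __init__(self, gamma=2.0, smoothing=0.1, alpha=None):
        super().__init__()
        self.gamma, self.smoothing, self.alpha = gamma, smoothing, alpha

    def forward(self, logits, target):
        ce = F.cross_entropy(logits, target, weight=self.alpha,
                             label_smoothing=self.smoothing, reduction="none")
        pt = torch.exp(-ce)                    # probability of the target class
        return ((1 - pt) ** self.gamma * ce).mean()

model = nn.Linear(1536, 4)                     # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=20, T_mult=2, eta_min=1e-6)
criterion = FocalLoss(gamma=2.0, smoothing=0.1)

logits = model(torch.randn(48, 1536))          # batch_size = 48
loss = criterion(logits, torch.randint(0, 4, (48,)))
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
scheduler.step()
```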
### Weight Initialization

All weights are initialized in `_initialize_weights()`, called at the end of `__init__`:
| Module type | Weight init | Bias init |
|---|---|---|
| `nn.Linear` | Xavier uniform (`xavier_uniform_`) | Constant 0 |
| `nn.LayerNorm` | Constant 1 | Constant 0 |
(LSTM parameters keep PyTorch's default uniform initialization in `[-1/√H, 1/√H]`, where H is the hidden size.)
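A sketch of the initialization rules in the table; in the actual model this logic lives in the `_initialize_weights()` method, but it is shown here as a standalone function for clarity:

```python
import torch.nn as nn

def _initialize_weights(module: nn.Module) -> None:
    """Apply the table above: Xavier-uniform Linear weights, unit LayerNorm."""
    for m in module.modules():
        if isinstance(m, nn.Linear):
            nn.init.xavier_uniform_(m.weight)
            if m.bias is not None:
                nn.init.constant_(m.bias, 0)
        elif isinstance(m, nn.LayerNorm):
            nn.init.constant_(m.weight, 1)
            nn.init.constant_(m.bias, 0)
        # nn.LSTM keeps PyTorch's default uniform initialization

block = nn.Sequential(nn.Linear(8, 8), nn.LayerNorm(8))
_initialize_weights(block)
print(block[0].bias.abs().max().item())   # 0.0
```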
## Full Model Code
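The original code block did not survive extraction. Below is a sketch consistent with the architecture described above; internal details (activation choice, classifier widths, the scoring network inside the pooling) are assumptions, not the project's verbatim source:

```python
import torch
import torch.nn as nn

class SuperEnhancedTemporalModel(nn.Module):
    """Sketch of the documented architecture; internals are assumptions."""
    def __init__(self, feature_dim=1280, hidden_dim=768, num_classes=4,
                 num_lstm_layers=4, num_attention_heads=12, dropout=0.4):
        super().__init__()
        lstm_out = hidden_dim * 2                      # bidirectional concat
        self.input_projection = nn.Sequential(         # 1280 -> 768
            nn.Linear(feature_dim, hidden_dim), nn.GELU(),
            nn.Dropout(dropout * 0.5),
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
            nn.Dropout(dropout * 0.5),
            nn.Linear(hidden_dim, hidden_dim),
        )
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, num_layers=num_lstm_layers,
                            bidirectional=True, batch_first=True, dropout=dropout)
        self.attention = nn.MultiheadAttention(lstm_out, num_attention_heads,
                                               batch_first=True)
        self.norm = nn.LayerNorm(lstm_out)
        self.pool_score = nn.Sequential(               # attention-pooling scores
            nn.Linear(lstm_out, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )
        self.classifier = nn.Sequential(               # progressive width reduction
            nn.Linear(lstm_out, 512), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(512, 128), nn.GELU(), nn.Dropout(dropout * 0.5),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):                              # x: [B, T, feature_dim]
        h = self.input_projection(x)                   # [B, T, 768]
        h, _ = self.lstm(h)                            # [B, T, 1536]
        a, _ = self.attention(h, h, h)                 # self-attention
        h = self.norm(h + a)                           # residual + LayerNorm
        w = torch.softmax(self.pool_score(h), dim=1)   # [B, T, 1] weights
        pooled = (w * h).sum(dim=1)                    # [B, 1536]
        return self.classifier(pooled)                 # [B, num_classes]

model = SuperEnhancedTemporalModel()
print(model(torch.randn(2, 73, 1280)).shape)           # torch.Size([2, 4])
```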
## Data Flow Summary

`[B, T, 1280]` → input projection → `[B, T, 768]` → BiLSTM → `[B, T, 1536]` → self-attention + residual → `[B, T, 1536]` → attention pooling → `[B, 1536]` → MLP classifier → `[B, 4]` logits
## Limitations and Intended Use
### Intended use
- Classifying short video clips (up to ~73 frames) into one of four content categories
- Backend inference in the Flask deployment, either as a standalone model or as part of the 4-model ensemble
- Research into temporal sequence modeling for video understanding
### Known limitations
- Class imbalance sensitivity: Animation F1 is substantially lower than other classes (see ensemble checkpoints page). The focal loss and weighted sampler partially mitigate this but do not eliminate it.
- Fixed backbone coupling: The model assumes 1280-dimensional input features from EfficientNet-V2-S (or a compatible backbone). Using a different backbone without retraining will degrade performance.
- Short-clip assumption: The attention pooling is most effective within the ~73-frame window used during training. Very long videos should be segmented before inference.
- No temporal localization: The model outputs a single label per clip; it cannot identify where in the clip the classifying content appears.
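One way to work within the short-clip limitation is to split a long feature sequence into ≤73-frame windows and average the per-window probabilities. A sketch, where the window size comes from the note above but the averaging scheme and the stand-in model are assumptions:

```python
import torch

MAX_FRAMES = 73  # training-time clip length noted above

def chunk_predict(features: torch.Tensor, model) -> torch.Tensor:
    """Average softmax probabilities over <=73-frame windows of a long clip."""
    probs = []
    for start in range(0, features.shape[0], MAX_FRAMES):
        window = features[start:start + MAX_FRAMES].unsqueeze(0)  # [1, t, 1280]
        probs.append(torch.softmax(model(window), dim=-1))
    return torch.cat(probs).mean(dim=0)                           # [4]

# Stand-in "model": mean-pool frames then a linear head (illustration only).
head = torch.nn.Linear(1280, 4)
fake_model = lambda x: head(x.mean(dim=1))
out = chunk_predict(torch.randn(300, 1280), fake_model)           # 5 windows
print(out.shape)   # torch.Size([4])
```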
### Out-of-scope uses
- Fine-grained action recognition (the four classes are broad content categories)
- Audio-based classification (the model operates on visual features only)
- Real-time streaming inference without pre-extraction of CNN features