This model is the core temporal classifier in the NVIDIA Video Classification Project. It operates on pre-extracted frame-level CNN features and produces one of four content-category predictions: Animation, Flat_Content, Gaming, or Natural_Content.
## Overview
SuperEnhancedTemporalModel is a sequence-to-label neural network designed for video content classification. Given a sequence of per-frame feature vectors (extracted by a pretrained CNN backbone such as EfficientNet-V2-S), it outputs a probability distribution over four content categories.
The architecture combines:
- A linear input projection to normalize feature space dimensionality
- A deep bidirectional LSTM to model temporal dynamics in both directions
- Multi-head self-attention with a residual connection to focus on the most discriminative video segments
- Attention pooling to aggregate the sequence into a single vector
- A deep MLP classifier with progressive width reduction and regularization
- **Target task:** four-class video content classification: Animation, Flat_Content, Gaming, Natural_Content
- **Input:** variable-length sequence of CNN frame features, shape `[T, 1280]`
- **Output:** class logits of shape `[batch, 4]`; apply softmax for probabilities
- **Performance target:** >95% validation accuracy with ensemble + TTA
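The logits-to-probabilities step can be sketched as follows (the logits here are dummy values standing in for real model output):

```python
import torch
import torch.nn.functional as F

CLASS_NAMES = ["Animation", "Flat_Content", "Gaming", "Natural_Content"]

# Dummy logits with the documented output shape [batch, 4].
logits = torch.tensor([[0.2, 3.1, -0.5, 0.8]])

probs = F.softmax(logits, dim=-1)              # each row sums to 1
pred = CLASS_NAMES[probs.argmax(dim=-1).item()]
print(pred)  # -> Flat_Content
```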
## Architecture Details
The full forward pass proceeds through five sequential stages.

### Input Projection

A 3-layer MLP normalizes the raw CNN features from `feature_dim=1280` into the model's internal `hidden_dim=768` space, applying regularization before any sequence modeling. Dropout here is `dropout * 0.5 = 0.2` (half the main dropout rate), keeping the projection lighter.
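A minimal sketch of such a projection; the source only fixes the 1280 → 768 endpoints and the 0.2 dropout, so the intermediate widths and activation are assumptions:

```python
import torch
import torch.nn as nn

feature_dim, hidden_dim, dropout = 1280, 768, 0.4

# Hypothetical 3-layer projection MLP; internal layout is an assumption.
input_projection = nn.Sequential(
    nn.Linear(feature_dim, hidden_dim),
    nn.GELU(),
    nn.Dropout(dropout * 0.5),          # 0.2, half the base rate
    nn.Linear(hidden_dim, hidden_dim),
    nn.GELU(),
    nn.Dropout(dropout * 0.5),
    nn.Linear(hidden_dim, hidden_dim),
)

x = torch.randn(2, 73, feature_dim)     # [B, T, 1280]
print(input_projection(x).shape)        # torch.Size([2, 73, 768])
```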
### Bidirectional LSTM

A 4-layer stacked BiLSTM processes the projected sequence in both temporal directions, capturing forward context (motion build-up) and backward context (motion resolution). The concatenation of forward and backward hidden states produces an output dimensionality of 1536 at each timestep.
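The dimensionality doubling can be verified directly with a standard `nn.LSTM`:

```python
import torch
import torch.nn as nn

# 4-layer BiLSTM; concatenating both directions doubles the hidden size.
lstm = nn.LSTM(
    input_size=768, hidden_size=768,
    num_layers=4, bidirectional=True,
    batch_first=True, dropout=0.4,      # dropout applies between stacked layers
)

x = torch.randn(2, 73, 768)             # projected sequence [B, T, 768]
out, _ = lstm(x)
print(out.shape)                        # torch.Size([2, 73, 1536])
```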
### Multi-Head Self-Attention + Residual
Self-attention lets every timestep attend to every other timestep, enabling the model to discover long-range temporal dependencies. A residual connection and LayerNorm stabilize training. With `embed_dim=1536` and `num_heads=12`, each attention head operates on a 128-dimensional subspace (1536 / 12 = 128).
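This stage can be sketched with PyTorch's built-in multi-head attention (whether the project uses `nn.MultiheadAttention` or a custom layer is an assumption):

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=1536, num_heads=12, batch_first=True)
norm = nn.LayerNorm(1536)

x = torch.randn(2, 73, 1536)            # BiLSTM output [B, T, 1536]
attn_out, _ = attn(x, x, x)             # self-attention: Q = K = V = x
y = norm(x + attn_out)                  # residual connection + LayerNorm
print(y.shape)                          # torch.Size([2, 73, 1536])
print(attn.embed_dim // attn.num_heads) # 128-dim subspace per head
```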
### Attention Pooling

Instead of naive mean- or max-pooling, a learned scoring network assigns an importance weight to each timestep and computes a weighted sum, collapsing `[B, T, 1536]` → `[B, 1536]`.

### Classifier

A deep MLP with progressive width reduction maps the pooled 1536-dimensional vector to the four class logits, applying dropout throughout (halved in the final block).

## Hyperparameters
### Model Configuration
| Parameter | Default | Description |
|---|---|---|
| `feature_dim` | 2048 (train) / 1280 (ensemble) | Input CNN feature dimension |
| `hidden_dim` | 768 | LSTM hidden size and internal projection dim |
| `num_classes` | 4 | Number of output classes |
| `num_lstm_layers` | 4 | Number of stacked BiLSTM layers |
| `num_attention_heads` | 12 | Number of self-attention heads |
| `dropout` | 0.4 | Base dropout rate (halved in the projection and final MLP block) |
| `bidirectional` | True | Whether the LSTM is bidirectional |
The ensemble checkpoints (`best_ensemble_model_*.pt`) were saved with `feature_dim=1280`, matching the EfficientNet-V2-S backbone used during feature extraction. The default value of 2048 in the class signature reflects a ResNet backbone default and is overridden at instantiation time.

### Training Configuration
| Parameter | Value | Description |
|---|---|---|
| `learning_rate` | 0.001 | Initial AdamW learning rate |
| `batch_size` | 48 | Videos per batch |
| `num_epochs` | 150 | Maximum training epochs |
| `patience` | 25 | Early-stopping patience (epochs without validation improvement) |
| `weight_decay` | 5e-4 | AdamW L2 regularization |
| `gradient_clip` | 1.0 | Maximum gradient norm |
| `scheduler` | CosineAnnealingWarmRestarts | `T_0=20`, `T_mult=2`, `eta_min=1e-6` |
| `loss_fn` | FocalLoss | `gamma=2.0`, label smoothing 0.1, class-weighted alpha |
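The training setup in the table can be sketched as follows. The `FocalLoss` shown is a generic focal-loss implementation consistent with the listed parameters, not necessarily the project's exact class, and the `nn.Linear` stands in for the real model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    """Generic focal loss with label smoothing and optional per-class alpha."""
    def __init__(self, gamma=2.0, smoothing=0.1, alpha=None):
        super().__init__()
        self.gamma, self.smoothing, self.alpha = gamma, smoothing, alpha

    def forward(self, logits, target):
        ce = F.cross_entropy(logits, target, weight=self.alpha,
                             label_smoothing=self.smoothing, reduction="none")
        pt = torch.exp(-ce)                    # probability of the target class
        return ((1 - pt) ** self.gamma * ce).mean()

model = nn.Linear(1536, 4)                     # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=20, T_mult=2, eta_min=1e-6)
criterion = FocalLoss(gamma=2.0, smoothing=0.1)

logits = model(torch.randn(48, 1536))          # batch_size = 48
loss = criterion(logits, torch.randint(0, 4, (48,)))
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
scheduler.step()
```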
### Weight Initialization

All weights are initialized in `_initialize_weights()`, called at the end of `__init__`:
| Module type | Weight init | Bias init |
|---|---|---|
| `nn.Linear` | Xavier uniform (`xavier_uniform_`) | Constant 0 |
| `nn.LayerNorm` | Constant 1 | Constant 0 |
(LSTM parameters keep PyTorch's default uniform initialization in `[-1/√H, 1/√H]`, where H is the hidden size.)
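A sketch of the initialization rules in the table; in the actual model this logic lives in the `_initialize_weights()` method, but it is shown here as a standalone function for clarity:

```python
import torch.nn as nn

def _initialize_weights(module: nn.Module) -> None:
    """Apply the table above: Xavier-uniform Linear weights, unit LayerNorm."""
    for m in module.modules():
        if isinstance(m, nn.Linear):
            nn.init.xavier_uniform_(m.weight)
            if m.bias is not None:
                nn.init.constant_(m.bias, 0)
        elif isinstance(m, nn.LayerNorm):
            nn.init.constant_(m.weight, 1)
            nn.init.constant_(m.bias, 0)
        # nn.LSTM keeps PyTorch's default uniform initialization

block = nn.Sequential(nn.Linear(8, 8), nn.LayerNorm(8))
_initialize_weights(block)
print(block[0].bias.abs().max().item())   # 0.0
```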
## Full Model Code
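The original code block did not survive extraction. Below is a sketch consistent with the architecture described above; internal details (activation choice, classifier widths, the scoring network inside the pooling) are assumptions, not the project's verbatim source:

```python
import torch
import torch.nn as nn

class SuperEnhancedTemporalModel(nn.Module):
    """Sketch of the documented architecture; internals are assumptions."""
    def __init__(self, feature_dim=1280, hidden_dim=768, num_classes=4,
                 num_lstm_layers=4, num_attention_heads=12, dropout=0.4):
        super().__init__()
        lstm_out = hidden_dim * 2                      # bidirectional concat
        self.input_projection = nn.Sequential(         # 1280 -> 768
            nn.Linear(feature_dim, hidden_dim), nn.GELU(),
            nn.Dropout(dropout * 0.5),
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
            nn.Dropout(dropout * 0.5),
            nn.Linear(hidden_dim, hidden_dim),
        )
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, num_layers=num_lstm_layers,
                            bidirectional=True, batch_first=True, dropout=dropout)
        self.attention = nn.MultiheadAttention(lstm_out, num_attention_heads,
                                               batch_first=True)
        self.norm = nn.LayerNorm(lstm_out)
        self.pool_score = nn.Sequential(               # attention-pooling scores
            nn.Linear(lstm_out, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )
        self.classifier = nn.Sequential(               # progressive width reduction
            nn.Linear(lstm_out, 512), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(512, 128), nn.GELU(), nn.Dropout(dropout * 0.5),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):                              # x: [B, T, feature_dim]
        h = self.input_projection(x)                   # [B, T, 768]
        h, _ = self.lstm(h)                            # [B, T, 1536]
        a, _ = self.attention(h, h, h)                 # self-attention
        h = self.norm(h + a)                           # residual + LayerNorm
        w = torch.softmax(self.pool_score(h), dim=1)   # [B, T, 1] weights
        pooled = (w * h).sum(dim=1)                    # [B, 1536]
        return self.classifier(pooled)                 # [B, num_classes]

model = SuperEnhancedTemporalModel()
print(model(torch.randn(2, 73, 1280)).shape)           # torch.Size([2, 4])
```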
## Data Flow Summary

`[B, T, 1280]` → input projection → `[B, T, 768]` → BiLSTM → `[B, T, 1536]` → self-attention + residual → `[B, T, 1536]` → attention pooling → `[B, 1536]` → MLP classifier → `[B, 4]` logits
## Limitations and Intended Use
### Intended use
- Classifying short video clips (up to ~73 frames) into one of four content categories
- Backend inference in the Flask deployment, either as a standalone model or as part of the 4-model ensemble
- Research into temporal sequence modeling for video understanding
### Known limitations
- Class imbalance sensitivity: Animation F1 is substantially lower than other classes (see ensemble checkpoints page). The focal loss and weighted sampler partially mitigate this but do not eliminate it.
- Fixed backbone coupling: The model assumes 1280-dimensional input features from EfficientNet-V2-S (or a compatible backbone). Using a different backbone without retraining will degrade performance.
- Short-clip assumption: The attention pooling is most effective within the ~73-frame window used during training. Very long videos should be segmented before inference.
- No temporal localization: The model outputs a single label per clip; it cannot identify where in the clip the classifying content appears.
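One way to work within the short-clip limitation is to split a long feature sequence into ≤73-frame windows and average the per-window probabilities. A sketch, where the window size comes from the note above but the averaging scheme and the stand-in model are assumptions:

```python
import torch

MAX_FRAMES = 73  # training-time clip length noted above

def chunk_predict(features: torch.Tensor, model) -> torch.Tensor:
    """Average softmax probabilities over <=73-frame windows of a long clip."""
    probs = []
    for start in range(0, features.shape[0], MAX_FRAMES):
        window = features[start:start + MAX_FRAMES].unsqueeze(0)  # [1, t, 1280]
        probs.append(torch.softmax(model(window), dim=-1))
    return torch.cat(probs).mean(dim=0)                           # [4]

# Stand-in "model": mean-pool frames then a linear head (illustration only).
head = torch.nn.Linear(1280, 4)
fake_model = lambda x: head(x.mean(dim=1))
out = chunk_predict(torch.randn(300, 1280), fake_model)           # 5 windows
print(out.shape)   # torch.Size([4])
```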
### Out-of-scope uses
- Fine-grained action recognition (the four classes are broad content categories)
- Audio-based classification (the model operates on visual features only)
- Real-time streaming inference without pre-extraction of CNN features