This model is the core temporal classifier in the NVIDIA Video Classification Project. It operates on pre-extracted frame-level CNN features and produces one of four content-category predictions: Animation, Flat_Content, Gaming, or Natural_Content.

Overview

SuperEnhancedTemporalModel is a sequence-to-label neural network designed for video content classification. Given a sequence of per-frame feature vectors (extracted by a pretrained CNN backbone such as EfficientNet-V2-S), it outputs a probability distribution over four content categories. The architecture combines:
  • A linear input projection to normalize feature space dimensionality
  • A deep bidirectional LSTM to model temporal dynamics in both directions
  • Multi-head self-attention with a residual connection to focus on the most discriminative video segments
  • Attention pooling to aggregate the sequence into a single vector
  • A deep MLP classifier with progressive width reduction and regularization

Target Task

Four-class video content classification: Animation, Flat_Content, Gaming, Natural_Content

Input

Variable-length sequence of CNN frame features — shape [T, 1280]

Output

Class logits of shape [batch, 4]; apply softmax for probabilities
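For example, converting the raw logits to probabilities and a predicted class (a minimal sketch with made-up logit values):

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for a batch of 2 clips over the 4 classes
logits = torch.tensor([[2.0, 0.1, -1.0, 0.3],
                       [0.0, 3.0, 0.5, -0.5]])
probs = F.softmax(logits, dim=-1)  # class probabilities, rows sum to 1
pred = probs.argmax(dim=-1)        # predicted class index per clip
```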

Performance Target

>95% validation accuracy with ensemble + TTA

Architecture Details

The full forward pass proceeds through five sequential stages.
1. Input Projection

A projection block (a single linear layer followed by LayerNorm, ReLU, and Dropout) maps the raw CNN features from feature_dim=1280 into the model’s internal hidden_dim=768 space, applying regularization before any sequence modeling.
Linear(1280 → 768) → LayerNorm(768) → ReLU → Dropout(0.2)
Dropout here is dropout * 0.5 = 0.2 (half the main dropout rate), keeping the projection lighter.
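The projection stage can be reproduced in isolation to verify shapes (a sketch using a dummy batch of 2 clips × 16 frames):

```python
import torch
import torch.nn as nn

dropout = 0.4
projection = nn.Sequential(
    nn.Linear(1280, 768),
    nn.LayerNorm(768),
    nn.ReLU(),
    nn.Dropout(dropout * 0.5),  # 0.2, half the base rate
)

frames = torch.randn(2, 16, 1280)  # [B=2, T=16, feature_dim=1280]
projected = projection(frames)     # [2, 16, 768]
```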
2. Bidirectional LSTM

A 4-layer stacked BiLSTM processes the projected sequence in both temporal directions, capturing forward context (motion build-up) and backward context (motion resolution).
LSTM(
    input_size  = 768,
    hidden_size = 768,
    num_layers  = 4,
    dropout     = 0.4,      # applied between LSTM layers
    bidirectional = True
)
# Output dim: 768 * 2 = 1536
The concatenation of forward and backward hidden states produces an output dimensionality of 1536 at each timestep.
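A quick shape check of the BiLSTM stage confirms the doubled output dimensionality (sketch with dummy inputs):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=768, hidden_size=768, num_layers=4,
               dropout=0.4, bidirectional=True, batch_first=True)

x = torch.randn(2, 16, 768)    # projected sequence [B, T, 768]
out, (h, c) = lstm(x)
# out concatenates forward and backward states: 768 * 2 = 1536 per timestep
# h stacks final states: num_layers * 2 directions = 8 entries
```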
3. Multi-Head Self-Attention + Residual

Self-attention lets every timestep attend to every other timestep, enabling the model to discover long-range temporal dependencies. A residual connection and LayerNorm stabilize training.
# Attention
MultiheadAttention(embed_dim=1536, num_heads=12, dropout=0.4, batch_first=True)

# Residual + Norm
attended  = Dropout(attended_output)
attended  = LayerNorm(lstm_out + attended)   # residual connection
With embed_dim=1536 and num_heads=12, each attention head operates on a 128-dimensional subspace.
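The attention stage with its residual connection can be sketched as follows (dummy inputs standing in for the LSTM output):

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=1536, num_heads=12, dropout=0.4,
                            batch_first=True)
norm = nn.LayerNorm(1536)

lstm_out = torch.randn(2, 16, 1536)                    # [B, T, 1536]
attended, weights = mha(lstm_out, lstm_out, lstm_out)  # self-attention (q = k = v)
attended = norm(lstm_out + attended)                   # residual + LayerNorm
```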
4. Attention Pooling

Instead of naive mean- or max-pooling, a learned scoring network assigns importance weights to each timestep and computes a weighted sum, collapsing [B, T, 1536] → [B, 1536].
# Scoring network
Linear(1536 → 768) → LayerNorm(768) → Tanh → Dropout(0.2) → Linear(768 → 1)

# Weighted aggregation
weights = softmax(scores, dim=1)          # [B, T, 1]
pooled  = (attended * weights).sum(dim=1) # [B, 1536]
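Putting the scoring network and the weighted aggregation together (sketch with dummy inputs):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

scorer = nn.Sequential(
    nn.Linear(1536, 768), nn.LayerNorm(768), nn.Tanh(),
    nn.Dropout(0.2), nn.Linear(768, 1),
)

attended = torch.randn(2, 16, 1536)        # [B, T, 1536]
scores = scorer(attended)                  # one score per timestep: [B, T, 1]
weights = F.softmax(scores, dim=1)         # normalize over the time axis
pooled = (attended * weights).sum(dim=1)   # weighted sum: [B, 1536]
```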
5. Classifier MLP

A four-layer MLP (three hidden blocks plus a final linear layer) progressively compresses the 1536-dim pooled representation to num_classes=4 logits. Each hidden block applies LayerNorm, ReLU, and Dropout.
Linear(1536 → 768) → LayerNorm(768) → ReLU → Dropout(0.4)
Linear(768  → 512) → LayerNorm(512) → ReLU → Dropout(0.4)
Linear(512  → 256) → LayerNorm(256) → ReLU → Dropout(0.2)
Linear(256  → 4)                                            # logits
The final dropout before the last linear is dropout * 0.5 = 0.2.
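The classifier stage as a standalone shape check (dummy pooled input):

```python
import torch
import torch.nn as nn

dropout = 0.4
classifier = nn.Sequential(
    nn.Linear(1536, 768), nn.LayerNorm(768), nn.ReLU(), nn.Dropout(dropout),
    nn.Linear(768, 512), nn.LayerNorm(512), nn.ReLU(), nn.Dropout(dropout),
    nn.Linear(512, 256), nn.LayerNorm(256), nn.ReLU(), nn.Dropout(dropout * 0.5),
    nn.Linear(256, 4),  # logits, no softmax
)

pooled = torch.randn(2, 1536)
logits = classifier(pooled)  # [B, 4]
```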

Hyperparameters

Model Configuration

Parameter            Default                          Description
feature_dim          2048 (train) / 1280 (ensemble)   Input CNN feature dimension
hidden_dim           768                              LSTM hidden size and internal projection dim
num_classes          4                                Number of output classes
num_lstm_layers      4                                Number of stacked BiLSTM layers
num_attention_heads  12                               Number of self-attention heads
dropout              0.4                              Base dropout rate (halved in projection and final MLP block)
bidirectional        True                             Whether the LSTM is bidirectional
The ensemble checkpoints (best_ensemble_model_*.pt) were saved with feature_dim=1280, matching the EfficientNet-V2-S backbone used during feature extraction. The default value of 2048 in the class signature reflects a ResNet backbone default and is overridden at instantiation time.

Training Configuration

Parameter      Value                        Description
learning_rate  0.001                        Initial AdamW learning rate
batch_size     48                           Videos per batch
num_epochs     150                          Maximum training epochs
patience       25                           Early stopping patience (epochs without val improvement)
weight_decay   5e-4                         AdamW L2 regularization
gradient_clip  1.0                          Max gradient norm
scheduler      CosineAnnealingWarmRestarts  T_0=20, T_mult=2, eta_min=1e-6
loss_fn        FocalLoss                    gamma=2.0, label smoothing 0.1, class-weighted alpha
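The optimizer, scheduler, and gradient clipping above can be sketched as a single training step. FocalLoss is project-specific and not reproduced here; cross-entropy with label smoothing stands in, and a one-layer model stands in for the full network:

```python
import torch
import torch.nn as nn

model = nn.Linear(1280, 4)  # stand-in for the full model
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=20, T_mult=2, eta_min=1e-6)

# Stand-in loss: the project uses a custom FocalLoss (gamma=2.0, weighted alpha)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

x, y = torch.randn(48, 1280), torch.randint(0, 4, (48,))  # one dummy batch
loss = criterion(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient_clip
optimizer.step()
scheduler.step()  # typically called once per epoch
```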

Weight Initialization

All weights are initialized in _initialize_weights(), called at the end of __init__:
Module type   Weight init                       Bias init
nn.Linear     Xavier uniform (xavier_uniform_)  Constant 0
nn.LayerNorm  Constant 1                        Constant 0
LSTM weights use PyTorch defaults (uniform in [-1/√H, 1/√H]).
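The initialization scheme for a single linear layer, checked against the Xavier-uniform bound sqrt(6 / (fan_in + fan_out)):

```python
import torch
import torch.nn as nn

layer = nn.Linear(1536, 768)
nn.init.xavier_uniform_(layer.weight)  # samples U(-a, a)
nn.init.constant_(layer.bias, 0)

# Xavier uniform bound with gain=1: a = sqrt(6 / (fan_in + fan_out))
bound = (6 / (1536 + 768)) ** 0.5
```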

Full Model Code

import time

import torch
import torch.nn as nn
import torch.nn.functional as F


class SuperEnhancedTemporalModel(nn.Module):
    """Enhanced model with increased capacity for >95% accuracy"""
    
    def __init__(self, feature_dim=2048, hidden_dim=768, num_classes=4,
                 num_lstm_layers=4, num_attention_heads=12, dropout=0.4,
                 bidirectional=True):
        super().__init__()
        
        self.feature_dim = feature_dim
        self.hidden_dim = hidden_dim
        self.num_classes = num_classes
        
        # Enhanced input projection
        self.input_projection = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout * 0.5)
        )
        
        # Deeper BiLSTM
        self.lstm = nn.LSTM(
            input_size=hidden_dim,
            hidden_size=hidden_dim,
            num_layers=num_lstm_layers,
            batch_first=True,
            dropout=dropout if num_lstm_layers > 1 else 0,
            bidirectional=bidirectional
        )
        
        lstm_output_dim = hidden_dim * 2 if bidirectional else hidden_dim
        
        # Multi-head self-attention
        self.attention = nn.MultiheadAttention(
            embed_dim=lstm_output_dim,
            num_heads=num_attention_heads,
            dropout=dropout,
            batch_first=True
        )
        
        self.attention_norm = nn.LayerNorm(lstm_output_dim)
        self.attention_dropout = nn.Dropout(dropout)
        
        # Enhanced attention pooling
        self.attention_pooling = nn.Sequential(
            nn.Linear(lstm_output_dim, lstm_output_dim // 2),
            nn.LayerNorm(lstm_output_dim // 2),
            nn.Tanh(),
            nn.Dropout(dropout * 0.5),
            nn.Linear(lstm_output_dim // 2, 1)
        )
        
        # Deeper classifier
        self.classifier = nn.Sequential(
            nn.Linear(lstm_output_dim, 768),
            nn.LayerNorm(768),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(768, 512),
            nn.LayerNorm(512),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(512, 256),
            nn.LayerNorm(256),
            nn.ReLU(),
            nn.Dropout(dropout * 0.5),
            nn.Linear(256, num_classes)
        )
        
        self._initialize_weights()
    
    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.xavier_uniform_(m.weight)
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.LayerNorm):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)
    
    def forward(self, x, lengths=None): 
        # Project features
        x = self.input_projection(x)
        
        # Pack sequences
        if lengths is not None:
            x = nn.utils.rnn.pack_padded_sequence(
                x, lengths.cpu(), batch_first=True, enforce_sorted=False
            )
        
        # LSTM with error recovery
        try:
            lstm_out, _ = self.lstm(x)
        except RuntimeError as e:
            if "NVML_SUCCESS" in str(e) or "CUDA" in str(e):
                torch.cuda.empty_cache()
                torch.cuda.synchronize()
                time.sleep(1)
                lstm_out, _ = self.lstm(x)
            else:
                raise e

        # Unpack
        if lengths is not None:
            lstm_out, _ = nn.utils.rnn.pad_packed_sequence(
                lstm_out, batch_first=True
            )
        
        # Self-attention with residual
        attended, _ = self.attention(lstm_out, lstm_out, lstm_out)
        attended = self.attention_dropout(attended)
        attended = self.attention_norm(lstm_out + attended)
        
        # Attention pooling
        attention_weights = self.attention_pooling(attended)
        attention_weights = F.softmax(attention_weights, dim=1)
        pooled = (attended * attention_weights).sum(dim=1)
        
        # Classification
        output = self.classifier(pooled)
        
        return output

Data Flow Summary

Input:  [B, T, 1280]  — padded CNN frame features
   ↓ input_projection
        [B, T, 768]
   ↓ BiLSTM (4 layers)
        [B, T, 1536]
   ↓ MultiheadAttention + Residual + LayerNorm
        [B, T, 1536]
   ↓ Attention Pooling (learned weights → weighted sum)
        [B, 1536]
   ↓ Classifier MLP
Output: [B, 4]        — class logits
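The same data flow can be traced end to end with thin stand-in modules (a single-layer LSTM and plain linear heads instead of the full blocks, so only the shapes match the real model):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, T = 2, 16
proj = nn.Linear(1280, 768)                                 # input projection
lstm = nn.LSTM(768, 768, bidirectional=True, batch_first=True)
attn = nn.MultiheadAttention(1536, 12, batch_first=True)
score = nn.Linear(1536, 1)                                  # pooling scorer
head = nn.Linear(1536, 4)                                   # classifier

x = torch.randn(B, T, 1280)        # padded CNN frame features
h = proj(x)                        # [B, T, 768]
h, _ = lstm(h)                     # [B, T, 1536]
a, _ = attn(h, h, h)               # self-attention
h = h + a                          # residual
w = F.softmax(score(h), dim=1)     # pooling weights [B, T, 1]
pooled = (h * w).sum(dim=1)        # [B, 1536]
logits = head(pooled)              # [B, 4]
```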

Limitations and Intended Use

This model was trained exclusively on YouTube-8M–derived clips resized to 256×256. Performance on video with significantly different visual statistics (e.g., vertical phone footage, ultra-low-resolution streams, highly compressed artifacts) has not been evaluated.
Intended Use

  • Classifying short video clips (up to ~73 frames) into one of four content categories
  • Backend inference in the Flask deployment, either as a standalone model or as part of the 4-model ensemble
  • Research into temporal sequence modeling for video understanding

Limitations

  • Class imbalance sensitivity: Animation F1 is substantially lower than other classes (see ensemble checkpoints page). The focal loss and weighted sampler partially mitigate this but do not eliminate it.
  • Fixed backbone coupling: The model assumes 1280-dimensional input features from EfficientNet-V2-S (or a compatible backbone). Using a different backbone without retraining will degrade performance.
  • Short-clip assumption: The attention pooling is most effective within the ~73-frame window used during training. Very long videos should be segmented before inference.
  • No temporal localization: The model outputs a single label per clip; it cannot identify where in the clip the classifying content appears.

Out of Scope

  • Fine-grained action recognition (the four classes are broad content categories)
  • Audio-based classification (the model operates on visual features only)
  • Real-time streaming inference without pre-extraction of CNN features
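For the short-clip assumption, long feature sequences can be split into training-sized windows before inference. segment_features is a hypothetical helper, not part of the project:

```python
import torch

def segment_features(features: torch.Tensor, window: int = 73) -> list:
    """Split a long [T, D] feature sequence into windows of at most `window` frames."""
    return [features[i:i + window] for i in range(0, features.shape[0], window)]

long_clip = torch.randn(200, 1280)    # a clip longer than the training window
chunks = segment_features(long_clip)  # windows of 73 + 73 + 54 frames
# Per-chunk probabilities could then be averaged for a clip-level prediction.
```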
