Once per-frame CNN features are extracted and saved, Stage 2 trains a dedicated temporal model on top of them. SuperEnhancedTemporalModel (defined in model_train_new.py, line 614) is designed to learn which frames are informative and how they relate across time—tasks that simple mean/max pooling cannot handle.

Why BiLSTM + Attention?

Bidirectional LSTM

A standard LSTM only sees past context. A BiLSTM processes the sequence in both directions, allowing each time step to be informed by future frames as well. For video classification, this is critical: the end of a clip often disambiguates content type.
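A minimal PyTorch sketch (toy sizes, not the model's actual dimensions) shows how the bidirectional output concatenates the forward and backward passes at every time step:

```python
import torch
import torch.nn as nn

# Toy BiLSTM: hidden_size=8, so the bidirectional output is 16-d per step.
lstm = nn.LSTM(input_size=4, hidden_size=8, num_layers=1,
               batch_first=True, bidirectional=True)
x = torch.randn(2, 5, 4)          # [batch, time, features]
out, (h_n, c_n) = lstm(x)
print(out.shape)                  # torch.Size([2, 5, 16])
# out[:, t, :8] is the forward pass up to step t; out[:, t, 8:] is the
# backward pass from the end of the clip down to t, so every step sees
# both past and future context.
```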

Multi-Head Self-Attention

LSTMs struggle with very long-range dependencies. Self-attention lets every frame directly attend to every other frame regardless of distance. The 12-head configuration allows the model to simultaneously track multiple types of temporal patterns.
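This can be checked with a standalone nn.MultiheadAttention module at the model's actual embedding width (untrained weights and random input, purely a shape check):

```python
import torch
import torch.nn as nn

# embed_dim must divide evenly by num_heads: each of the 12 heads works in
# a 1536 / 12 = 128-dimensional subspace.
attn = nn.MultiheadAttention(embed_dim=1536, num_heads=12, batch_first=True)
seq = torch.randn(2, 10, 1536)      # [batch, frames, features]
out, weights = attn(seq, seq, seq)  # query = key = value -> self-attention
print(out.shape)      # torch.Size([2, 10, 1536])
print(weights.shape)  # torch.Size([2, 10, 10]): one score per frame pair
```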

Residual + LayerNorm

The attention output is added back to the LSTM output (residual connection) and normalized. This stabilizes gradients during training across 4 LSTM layers and prevents representation collapse.

Attention Pooling

Rather than naively averaging all frames, a learned scoring function assigns higher weight to the most discriminative moments before collapsing the sequence to a single vector.

Full Class Definition

class SuperEnhancedTemporalModel(nn.Module):
    """Enhanced model with increased capacity for >95% accuracy"""
    
    def __init__(self, feature_dim=2048, hidden_dim=768, num_classes=4,
                 num_lstm_layers=4, num_attention_heads=12, dropout=0.4,
                 bidirectional=True):
        super().__init__()
        
        self.feature_dim = feature_dim
        self.hidden_dim = hidden_dim
        self.num_classes = num_classes
        
        # Enhanced input projection
        self.input_projection = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout * 0.5)
        )
        
        # Deeper BiLSTM
        self.lstm = nn.LSTM(
            input_size=hidden_dim,
            hidden_size=hidden_dim,
            num_layers=num_lstm_layers,
            batch_first=True,
            dropout=dropout if num_lstm_layers > 1 else 0,
            bidirectional=bidirectional
        )
        
        lstm_output_dim = hidden_dim * 2 if bidirectional else hidden_dim
        
        # Multi-head self-attention
        self.attention = nn.MultiheadAttention(
            embed_dim=lstm_output_dim,
            num_heads=num_attention_heads,
            dropout=dropout,
            batch_first=True
        )
        
        self.attention_norm = nn.LayerNorm(lstm_output_dim)
        self.attention_dropout = nn.Dropout(dropout)
        
        # Enhanced attention pooling
        self.attention_pooling = nn.Sequential(
            nn.Linear(lstm_output_dim, lstm_output_dim // 2),
            nn.LayerNorm(lstm_output_dim // 2),
            nn.Tanh(),
            nn.Dropout(dropout * 0.5),
            nn.Linear(lstm_output_dim // 2, 1)
        )
        
        # Deeper classifier
        self.classifier = nn.Sequential(
            nn.Linear(lstm_output_dim, 768),
            nn.LayerNorm(768),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(768, 512),
            nn.LayerNorm(512),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(512, 256),
            nn.LayerNorm(256),
            nn.ReLU(),
            nn.Dropout(dropout * 0.5),
            nn.Linear(256, num_classes)
        )

Layer-by-Layer Breakdown

1. Input Projection

The raw CNN feature vectors (1280-d from EfficientNet-V2) are projected to the model's internal dimension. Note that the class default is feature_dim=2048, so the model must be instantiated with feature_dim=1280 to match the Stage 1 features:
Linear(1280 → 768) → LayerNorm(768) → ReLU → Dropout(0.2)
LayerNorm is applied before the activation rather than after it, which improves training stability for deep recurrent networks. Dropout at half the base rate (0.4 × 0.5 = 0.2) is kept light here to preserve input information.
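Rebuilt as a standalone module (illustrative only; the real block is self.input_projection in the class above), the projection behaves as follows:

```python
import torch
import torch.nn as nn

# Input projection block: 1280-d CNN features -> 768-d model dimension.
proj = nn.Sequential(
    nn.Linear(1280, 768),
    nn.LayerNorm(768),
    nn.ReLU(),
    nn.Dropout(0.2),              # dropout * 0.5 with dropout=0.4
)
feats = torch.randn(2, 10, 1280)  # [B, T, 1280] per-frame CNN features
projected = proj(feats)
print(projected.shape)            # torch.Size([2, 10, 768])
```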
2. 4-Layer Bidirectional LSTM

The projected sequence is fed to a stacked BiLSTM:
LSTM(
    input_size  = 768,
    hidden_size = 768,
    num_layers  = 4,
    bidirectional = True,
    dropout = 0.4     # applied between LSTM layers
)
→ output shape: [B, T, 1536]   (768 forward + 768 backward)
Variable-length sequences are packed with nn.utils.rnn.pack_padded_sequence so the LSTM never computes over padded frames, then unpacked after the LSTM. A GPU error-recovery block retries the forward pass once on CUDA errors before propagating the exception.
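The packing round-trip can be demonstrated in isolation (toy dimensions):

```python
import torch
import torch.nn as nn

# Two sequences with true lengths 5 and 3, padded to a common length of 5.
x = torch.randn(2, 5, 4)
lengths = torch.tensor([5, 3])
lstm = nn.LSTM(input_size=4, hidden_size=8, batch_first=True,
               bidirectional=True)

packed = nn.utils.rnn.pack_padded_sequence(
    x, lengths.cpu(), batch_first=True, enforce_sorted=False)
out, _ = lstm(packed)   # padded steps are never computed
out, out_lengths = nn.utils.rnn.pad_packed_sequence(out, batch_first=True)
print(out.shape)        # torch.Size([2, 5, 16])
print(out_lengths)      # tensor([5, 3])
# Steps beyond each true length are zero-filled after unpacking.
print(out[1, 3:].abs().sum())   # tensor(0.)
```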
3. Multi-Head Self-Attention

The full LSTM output sequence is passed through self-attention with a residual connection:
attended, _ = self.attention(lstm_out, lstm_out, lstm_out)
attended = self.attention_dropout(attended)
attended = self.attention_norm(lstm_out + attended)  # residual
With 12 attention heads and embed_dim=1536, each head operates on a 128-dimensional subspace (1536 / 12 = 128). The residual connection ensures gradients flow back through the LSTM even when the attention weights are near-uniform early in training.
4. Attention Pooling

A learned scoring MLP collapses the sequence into a single fixed-size vector:
attention_weights = self.attention_pooling(attended)  # [B, T, 1]
attention_weights = F.softmax(attention_weights, dim=1)
pooled = (attended * attention_weights).sum(dim=1)    # [B, 1536]
The pooling network (1536 → 768 → 1) uses a Tanh activation in its hidden layer, which bounds intermediate activations and discourages any single frame from receiving an extreme score before the softmax.
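The same score → softmax → weighted-sum pattern in a toy setting (D=16 instead of 1536, with a simplified scorer that omits the LayerNorm and Dropout):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, D = 2, 5, 16
attended = torch.randn(B, T, D)
scorer = nn.Sequential(
    nn.Linear(D, D // 2), nn.Tanh(), nn.Linear(D // 2, 1))

weights = F.softmax(scorer(attended), dim=1)  # [B, T, 1], sums to 1 over T
pooled = (attended * weights).sum(dim=1)      # [B, D]: sequence collapsed
print(weights.sum(dim=1).squeeze())           # tensor([1.0000, 1.0000])
print(pooled.shape)                           # torch.Size([2, 16])
```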
5. Classifier MLP

The pooled vector is passed through a four-layer MLP with decreasing width:
Linear(1536 → 768) → LayerNorm → ReLU → Dropout(0.4)
Linear(768  → 512) → LayerNorm → ReLU → Dropout(0.4)
Linear(512  → 256) → LayerNorm → ReLU → Dropout(0.2)
Linear(256  → 4)                             # logits
LayerNorm after each linear layer normalizes activations independently of batch statistics, which is important for variable-length padded sequences. The final layer produces raw logits; softmax is applied at inference time.

The forward() Method

def forward(self, x, lengths=None):
    # 1. Project features to hidden_dim
    x = self.input_projection(x)
    
    # 2. Pack sequences (skip padding tokens)
    if lengths is not None:
        x = nn.utils.rnn.pack_padded_sequence(
            x, lengths.cpu(), batch_first=True, enforce_sorted=False
        )
    
    # 3. LSTM with GPU error recovery
    try:
        lstm_out, _ = self.lstm(x)
    except RuntimeError as e:
        if "NVML_SUCCESS" in str(e) or "CUDA" in str(e):
            torch.cuda.empty_cache()
            torch.cuda.synchronize()
            time.sleep(1)
            lstm_out, _ = self.lstm(x)  # retry once
        else:
            raise

    # 4. Unpack
    if lengths is not None:
        lstm_out, _ = nn.utils.rnn.pad_packed_sequence(
            lstm_out, batch_first=True
        )
    
    # 5. Self-attention with residual
    attended, _ = self.attention(lstm_out, lstm_out, lstm_out)
    attended = self.attention_dropout(attended)
    attended = self.attention_norm(lstm_out + attended)
    
    # 6. Attention pooling
    attention_weights = self.attention_pooling(attended)
    attention_weights = F.softmax(attention_weights, dim=1)
    pooled = (attended * attention_weights).sum(dim=1)
    
    # 7. Classification
    output = self.classifier(pooled)
    
    return output

Dimension Trace

Stage                   | Tensor Shape  | Notes
Input features          | [B, T, 1280]  | From HDF5, variable T
After input projection  | [B, T, 768]   | Linear + norm
After BiLSTM            | [B, T, 1536]  | 768 × 2 directions
After self-attention    | [B, T, 1536]  | Same shape, residual
After attention pooling | [B, 1536]     | Sequence collapsed
After classifier        | [B, 4]        | Raw logits
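The trace can be reproduced with a compact stand-in pipeline (single-layer, untrained modules in place of the full stacks; assumes the model is built with feature_dim=1280):

```python
import torch
import torch.nn as nn

B, T = 2, 10
feats = torch.randn(B, T, 1280)                   # [B, T, 1280]
proj = nn.Linear(1280, 768)
lstm = nn.LSTM(768, 768, batch_first=True, bidirectional=True)
attn = nn.MultiheadAttention(1536, 12, batch_first=True)
score = nn.Linear(1536, 1)
head = nn.Linear(1536, 4)

proj_out = proj(feats)                            # [2, 10, 768]
lstm_out, _ = lstm(proj_out)                      # [2, 10, 1536]
attn_out, _ = attn(lstm_out, lstm_out, lstm_out)  # [2, 10, 1536]
w = torch.softmax(score(attn_out), dim=1)         # [2, 10, 1]
pooled = (attn_out * w).sum(dim=1)                # [2, 1536]
logits = head(pooled)                             # [2, 4]
print(logits.shape)                               # torch.Size([2, 4])
```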

Weight Initialization

All Linear layers use Xavier uniform initialization; LayerNorm layers initialize weight=1 and bias=0:
def _initialize_weights(self):
    for m in self.modules():
        if isinstance(m, nn.Linear):
            nn.init.xavier_uniform_(m.weight)
            if m.bias is not None:
                nn.init.constant_(m.bias, 0)
        elif isinstance(m, nn.LayerNorm):
            nn.init.constant_(m.weight, 1)
            nn.init.constant_(m.bias, 0)
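Applying the same loop to a small module tree confirms the effect (a tiny stand-in model, not the real class):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 8), nn.LayerNorm(8))
for m in model.modules():
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)   # scaled uniform fan-in/fan-out
        if m.bias is not None:
            nn.init.constant_(m.bias, 0)
    elif isinstance(m, nn.LayerNorm):
        nn.init.constant_(m.weight, 1)      # identity scale
        nn.init.constant_(m.bias, 0)        # zero shift

print(model[0].bias.abs().sum())  # tensor(0., grad_fn=...)
print(model[1].weight)            # all ones
```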

Training Configuration

Hyperparameter          | Value
Loss function           | Focal Loss with label smoothing (γ=2.0, smoothing=0.1)
Optimizer               | AdamW (lr=1e-3, weight_decay=5e-4, betas=(0.9, 0.999))
Scheduler               | CosineAnnealingWarmRestarts (T_0=20, T_mult=2, η_min=1e-6)
Gradient clipping       | max norm = 1.0
Batch size              | 48
Max epochs              | 150
Early stopping patience | 25 epochs
Dropout                 | 0.4 (inter-LSTM layers, MLP), 0.2 (input projection, pooling)
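A sketch of how these optimizer and scheduler settings fit together in PyTorch (a tiny stand-in model and a single step, not the project's actual training loop):

```python
import torch

model = torch.nn.Linear(8, 4)   # stands in for SuperEnhancedTemporalModel
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3,
                              weight_decay=5e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=20, T_mult=2, eta_min=1e-6)

# One training step with gradient clipping at max-norm 1.0:
loss = model(torch.randn(2, 8)).sum()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
scheduler.step()                # called once per epoch in the real loop
```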
Focal Loss with label smoothing is particularly effective here because the four video categories have unequal difficulty—Animation and Natural Content are easier to separate than Gaming vs. Flat Content. Focal Loss down-weights easy examples dynamically, forcing the model to focus on hard cases.
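The exact loss implementation in model_train_new.py is not reproduced here; a minimal sketch of focal loss combined with label smoothing, using the γ=2.0 and smoothing=0.1 values from the table (the class name FocalLossLS is hypothetical), might look like:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLossLS(nn.Module):
    """Illustrative focal loss with label smoothing (not the project's
    exact implementation)."""
    def __init__(self, gamma=2.0, smoothing=0.1, num_classes=4):
        super().__init__()
        self.gamma = gamma
        self.smoothing = smoothing
        self.num_classes = num_classes

    def forward(self, logits, targets):
        log_probs = F.log_softmax(logits, dim=-1)
        # Smoothed one-hot targets: 1 - ε on the true class, ε/(C-1) elsewhere.
        with torch.no_grad():
            true_dist = torch.full_like(
                log_probs, self.smoothing / (self.num_classes - 1))
            true_dist.scatter_(1, targets.unsqueeze(1), 1.0 - self.smoothing)
        ce = -(true_dist * log_probs).sum(dim=-1)     # smoothed CE per sample
        pt = log_probs.exp().gather(1, targets.unsqueeze(1)).squeeze(1)
        # (1 - pt)^γ shrinks the loss on examples the model already gets right.
        return ((1.0 - pt) ** self.gamma * ce).mean()

criterion = FocalLossLS()
loss = criterion(torch.randn(4, 4), torch.tensor([0, 1, 2, 3]))
print(loss)
```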
