SuperEnhancedTemporalModel (defined in model_train_new.py, line 614) is designed to learn which frames are informative and how they relate across time—tasks that simple mean/max pooling cannot handle.
Why BiLSTM + Attention?
Bidirectional LSTM
A standard LSTM only sees past context. A BiLSTM processes the sequence in both directions, allowing each time step to be informed by future frames as well. For video classification, this is critical: the end of a clip often disambiguates content type.
Multi-Head Self-Attention
LSTMs struggle with very long-range dependencies. Self-attention lets every frame directly attend to every other frame regardless of distance. The 12-head configuration allows the model to simultaneously track multiple types of temporal patterns.
Residual + LayerNorm
The attention output is added back to the LSTM output (residual connection) and normalized. This stabilizes gradients during training across 4 LSTM layers and prevents representation collapse.
Attention Pooling
Rather than naively averaging all frames, a learned scoring function assigns higher weight to the most discriminative moments before collapsing the sequence to a single vector.
Full Class Definition
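The full class definition referenced here can be condensed into the following sketch. It wires together the components described in the breakdown below; sequence packing, the GPU-retry logic, and weight initialization are omitted for brevity, and the GELU activation and intermediate MLP widths are assumptions rather than confirmed details from `model_train_new.py`.

```python
import torch
import torch.nn as nn

class SuperEnhancedTemporalModel(nn.Module):
    """Condensed sketch: input projection -> 4-layer BiLSTM ->
    12-head self-attention (residual + LayerNorm) -> attention
    pooling -> classifier MLP. Illustrative, not the exact source."""

    def __init__(self, feat_dim=1280, hidden=768, heads=12,
                 num_classes=4, dropout=0.4):
        super().__init__()
        self.input_proj = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.LayerNorm(hidden),
            nn.GELU(), nn.Dropout(dropout * 0.5))
        self.lstm = nn.LSTM(hidden, hidden, num_layers=4, batch_first=True,
                            bidirectional=True, dropout=dropout)
        self.attn = nn.MultiheadAttention(hidden * 2, heads, batch_first=True)
        self.attn_norm = nn.LayerNorm(hidden * 2)
        self.pool_score = nn.Sequential(
            nn.Linear(hidden * 2, hidden), nn.Tanh(), nn.Linear(hidden, 1))
        self.classifier = nn.Sequential(
            nn.Linear(hidden * 2, hidden), nn.LayerNorm(hidden),
            nn.GELU(), nn.Dropout(dropout),
            nn.Linear(hidden, hidden // 2), nn.LayerNorm(hidden // 2),
            nn.GELU(), nn.Dropout(dropout),
            nn.Linear(hidden // 2, hidden // 4), nn.LayerNorm(hidden // 4),
            nn.GELU(), nn.Dropout(dropout),
            nn.Linear(hidden // 4, num_classes))

    def forward(self, x):                        # x: [B, T, 1280]
        h = self.input_proj(x)                   # [B, T, 768]
        h, _ = self.lstm(h)                      # [B, T, 1536]
        a, _ = self.attn(h, h, h)                # self-attention
        h = self.attn_norm(h + a)                # residual + LayerNorm
        w = torch.softmax(self.pool_score(h), dim=1)
        pooled = (w * h).sum(dim=1)              # [B, 1536]
        return self.classifier(pooled)           # [B, num_classes] logits
```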
Layer-by-Layer Breakdown
Input Projection
The raw CNN feature vectors (1280-d from EfficientNet-V2) are projected to the model’s internal dimension:
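A minimal sketch of this projection; the dimensions, pre-activation LayerNorm, and 0.2 dropout come from the text, while the GELU activation is an assumption:

```python
import torch
import torch.nn as nn

# Input projection: 1280-d EfficientNet-V2 features -> 768-d internal dim.
input_proj = nn.Sequential(
    nn.Linear(1280, 768),
    nn.LayerNorm(768),        # normalize before the activation (pre-norm)
    nn.GELU(),                # activation choice is an assumption
    nn.Dropout(0.2),          # 0.4 * 0.5: light, to preserve input information
)

x = torch.randn(2, 16, 1280)  # [B, T, 1280]
h = input_proj(x)             # [B, T, 768]
```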
LayerNorm is applied before activation rather than after (pre-norm style), which improves training stability for deep recurrent networks. Dropout at half the standard rate (0.4 × 0.5 = 0.2) is light here to preserve input information.

4-Layer Bidirectional LSTM
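A sketch of this stage, including the sequence packing; the GPU error-recovery wrapper is omitted, and the example lengths are illustrative:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# 4-layer BiLSTM over packed variable-length sequences.
lstm = nn.LSTM(input_size=768, hidden_size=768, num_layers=4,
               batch_first=True, bidirectional=True, dropout=0.4)

h = torch.randn(2, 16, 768)          # projected features [B, T, 768]
lengths = torch.tensor([16, 10])     # true lengths before padding

packed = pack_padded_sequence(h, lengths, batch_first=True,
                              enforce_sorted=False)  # skip padding steps
out, _ = lstm(packed)
out, _ = pad_packed_sequence(out, batch_first=True)  # back to [B, T, 1536]
```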
The projected sequence is fed to the stacked BiLSTM. Variable-length sequences are packed with nn.utils.rnn.pack_padded_sequence to avoid computing over padding tokens, then unpacked after the LSTM. A GPU error-recovery block retries the forward pass once on CUDA errors before propagating the exception.

Multi-Head Self-Attention
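A sketch of this stage, assuming the LayerNorm is applied after the residual addition (post-norm):

```python
import torch
import torch.nn as nn

# 12-head self-attention over the BiLSTM output, with residual + LayerNorm.
attn = nn.MultiheadAttention(embed_dim=1536, num_heads=12, batch_first=True)
norm = nn.LayerNorm(1536)

lstm_out = torch.randn(2, 16, 1536)            # [B, T, 1536]
attn_out, _ = attn(lstm_out, lstm_out, lstm_out)
h = norm(lstm_out + attn_out)                  # residual + norm, [B, T, 1536]
```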
The full LSTM output sequence is passed through self-attention with a residual connection. With 12 attention heads and embed_dim=1536, each head operates on a 128-dimensional subspace (1536 / 12 = 128). The residual connection ensures gradients flow back through the LSTM even when the attention weights are near-uniform early in training.

Attention Pooling
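A sketch of the pooling head; the 1536 → 768 → 1 widths and Tanh come from the text:

```python
import torch
import torch.nn as nn

# Learned scoring MLP: one scalar score per frame, softmaxed over time.
score = nn.Sequential(
    nn.Linear(1536, 768),
    nn.Tanh(),                          # bounds intermediate activations
    nn.Linear(768, 1),
)

h = torch.randn(2, 16, 1536)            # [B, T, 1536]
w = torch.softmax(score(h), dim=1)      # per-frame weights, sum to 1 over T
pooled = (w * h).sum(dim=1)             # weighted sum -> [B, 1536]
```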
A learned scoring MLP collapses the sequence into a single fixed-size vector. The pooling network (1536 → 768 → 1) uses a Tanh activation, which naturally bounds scores before the softmax. This prevents any single frame from dominating with an excessively large score.

Classifier MLP
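A sketch of the head; the four linear layers with per-layer LayerNorm are from the text, while the intermediate widths (768, 384, 192) and GELU are assumptions:

```python
import torch
import torch.nn as nn

# Four-layer classifier MLP with decreasing width; raw logits out.
classifier = nn.Sequential(
    nn.Linear(1536, 768), nn.LayerNorm(768), nn.GELU(), nn.Dropout(0.4),
    nn.Linear(768, 384),  nn.LayerNorm(384), nn.GELU(), nn.Dropout(0.4),
    nn.Linear(384, 192),  nn.LayerNorm(192), nn.GELU(), nn.Dropout(0.4),
    nn.Linear(192, 4),    # raw logits; softmax only at inference
)

pooled = torch.randn(2, 1536)
logits = classifier(pooled)   # [B, 4]
```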
The pooled vector is passed through a four-layer MLP with decreasing width. LayerNorm after each linear layer normalizes activations independently of batch statistics, which is important for variable-length padded sequences. The final layer produces raw logits; softmax is applied at inference time.

The forward() Method
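A sketch of the forward flow with packing and the retry-once CUDA error handling mentioned earlier; `self` stands in for a module holding the layers described above, and the demo uses shape-only stand-in modules:

```python
import torch
import torch.nn as nn
from types import SimpleNamespace

def forward(self, x, lengths):
    h = self.input_proj(x)                                   # [B, T, 768]
    packed = nn.utils.rnn.pack_padded_sequence(
        h, lengths.cpu(), batch_first=True, enforce_sorted=False)
    try:
        out, _ = self.lstm(packed)
    except RuntimeError:                 # retry once on transient CUDA errors
        torch.cuda.empty_cache()
        out, _ = self.lstm(packed)
    h, _ = nn.utils.rnn.pad_packed_sequence(out, batch_first=True)
    a, _ = self.attn(h, h, h)
    h = self.attn_norm(h + a)                                # residual + norm
    w = torch.softmax(self.pool_score(h), dim=1)             # attention pooling
    return self.classifier((w * h).sum(dim=1))               # [B, 4] logits

# Demo with stand-in modules (shapes only, names illustrative):
modules = SimpleNamespace(
    input_proj=nn.Linear(1280, 768),
    lstm=nn.LSTM(768, 768, 4, batch_first=True, bidirectional=True),
    attn=nn.MultiheadAttention(1536, 12, batch_first=True),
    attn_norm=nn.LayerNorm(1536),
    pool_score=nn.Linear(1536, 1),
    classifier=nn.Linear(1536, 4),
)
logits = forward(modules, torch.randn(2, 16, 1280), torch.tensor([16, 10]))
```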
Dimension Trace
| Stage | Tensor Shape | Notes |
|---|---|---|
| Input features | [B, T, 1280] | From HDF5, variable T |
| After input projection | [B, T, 768] | Linear + norm |
| After BiLSTM | [B, T, 1536] | 768 × 2 directions |
| After self-attention | [B, T, 1536] | Same shape, residual |
| After attention pooling | [B, 1536] | Sequence collapsed |
| After classifier | [B, 4] | Raw logits |
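The trace above can be checked mechanically with stand-in layers (shapes only; these are not the trained modules):

```python
import torch
import torch.nn as nn

B, T = 2, 16
x = torch.randn(B, T, 1280)                                   # input features
h = nn.Linear(1280, 768)(x)                                   # projection
h, _ = nn.LSTM(768, 768, 4, batch_first=True, bidirectional=True)(h)
assert h.shape == (B, T, 1536)                                # 768 x 2 directions
a, _ = nn.MultiheadAttention(1536, 12, batch_first=True)(h, h, h)
h = h + a                                                     # residual, same shape
w = torch.softmax(nn.Linear(1536, 1)(h), dim=1)               # pooling weights
pooled = (w * h).sum(dim=1)                                   # [B, 1536]
logits = nn.Linear(1536, 4)(pooled)                           # [B, 4]
```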
Weight Initialization
All Linear layers use Xavier uniform initialization; LayerNorm layers initialize weight=1 and bias=0:
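A sketch of this scheme as an `apply`-style initializer (zeroing Linear biases is an assumption; the text only specifies the weights):

```python
import torch
import torch.nn as nn

def init_weights(module: nn.Module) -> None:
    """Xavier-uniform for Linear weights; unit weight / zero bias for LayerNorm."""
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)     # assumption: biases start at zero
    elif isinstance(module, nn.LayerNorm):
        nn.init.ones_(module.weight)
        nn.init.zeros_(module.bias)

# Applied recursively over the whole model, e.g. model.apply(init_weights)
```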
Training Configuration
| Hyperparameter | Value |
|---|---|
| Loss function | Focal Loss with label smoothing (γ=2.0, smoothing=0.1) |
| Optimizer | AdamW (lr=1e-3, weight_decay=5e-4, betas=(0.9, 0.999)) |
| Scheduler | CosineAnnealingWarmRestarts (T_0=20, T_mult=2, η_min=1e-6) |
| Gradient clipping | max norm = 1.0 |
| Batch size | 48 |
| Max epochs | 150 |
| Early stopping patience | 25 epochs |
| Dropout | 0.4 (inter-LSTM layers, MLP), 0.2 (input projection, pooling) |
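The tabulated optimizer, scheduler, and clipping settings can be wired up as below; the model is a placeholder, and the Focal Loss is not shown since PyTorch has no built-in implementation:

```python
import torch

model = torch.nn.Linear(8, 4)   # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3,
                              weight_decay=5e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=20, T_mult=2, eta_min=1e-6)

# One illustrative training step: backward, clip to max norm 1.0, step.
loss = model(torch.randn(2, 8)).sum()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
scheduler.step()
```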