Why Ensemble?
Each model in the ensemble is trained from a different random seed (42 + i), producing a different local minimum in weight space. When predictions from multiple diverse models are averaged, individual errors tend to cancel out—especially on ambiguous samples near decision boundaries.
Reduced Variance
A single model’s confidence on a hard sample may fluctuate. Averaging 4 models produces a smoother, more reliable probability estimate.
Error Cancellation
If model A misclassifies a Gaming clip as Animation but models B, C, and D are correct, the average still predicts Gaming.
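A tiny numeric sketch of this cancellation, using made-up softmax outputs and an illustrative label order of [Animation, Gaming, Music, Sports] (only Animation and Gaming are named in the text; the other two classes are assumed for the example):

```python
import torch

# Four models' softmax outputs over [Animation, Gaming, Music, Sports]
# (illustrative numbers only, not from the real checkpoints).
probs = torch.tensor([
    [0.60, 0.30, 0.05, 0.05],   # model A: wrong (picks Animation)
    [0.20, 0.70, 0.05, 0.05],   # model B: correct (Gaming)
    [0.25, 0.65, 0.05, 0.05],   # model C: correct
    [0.15, 0.75, 0.05, 0.05],   # model D: correct
])
avg = probs.mean(dim=0)          # [0.30, 0.60, 0.05, 0.05]
prediction = avg.argmax().item() # index 1: the ensemble still picks Gaming
```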
No Extra Training Cost
Once trained, ensemble inference requires only one forward pass per model. With pre-extracted features, this is negligible overhead.
Different Seeds
Seeds 42, 43, 44, 45 initialize weights differently and produce different dropout masks, leading to genuinely diverse feature representations.
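A minimal sketch of the seed-per-member scheme, using a small `torch.nn.Linear` as a stand-in for the real model:

```python
import torch

# Each ensemble member is initialized from seed 42 + i, so weight
# initialization (and, during training, dropout masks) differ per member.
seeds = [42 + i for i in range(4)]
initial_weights = []
for s in seeds:
    torch.manual_seed(s)
    layer = torch.nn.Linear(8, 4)   # stand-in for the real architecture
    initial_weights.append(layer.weight.detach().clone())
```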
The 4 Checkpoints
From configuration_analysis.json, the four production checkpoints share identical architecture but differ in their training trajectories:
| Checkpoint | Best Epoch | Best Val Acc | Weighted F1 |
|---|---|---|---|
| best_ensemble_model_1.pt | 52 | 72.73% | 64.76% |
| best_ensemble_model_2.pt | 43 | 92.13% | 92.15% |
| best_ensemble_model_3.pt | 42 | 91.86% | 91.87% |
| best_ensemble_model_4.pt | 40 | 91.86% | 91.88% |
Model 1’s lower individual accuracy is not necessarily a weakness in an ensemble. It may have learned a different decision boundary than models 2–4, contributing complementary information. The ensemble’s combined accuracy exceeds any individual model.
Model Ensemble: Averaging Softmax Probabilities
Each model produces a softmax probability vector of shape [4]. These are stacked and averaged:
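A sketch of the stacking-and-averaging step, with toy linear models standing in for the four loaded checkpoints:

```python
import torch

def ensemble_predict(models, features):
    """Average each model's softmax vector (shape [4]) into one estimate."""
    probs = []
    with torch.no_grad():
        for model in models:
            logits = model(features)                 # shape [4] class logits
            probs.append(torch.softmax(logits, dim=-1))
    stacked = torch.stack(probs)                     # shape [num_models, 4]
    return stacked.mean(dim=0)                       # shape [4]

# Toy stand-ins for the four trained checkpoints.
torch.manual_seed(0)
models = [torch.nn.Linear(16, 4) for _ in range(4)]
avg_probs = ensemble_predict(models, torch.randn(16))
```

Because each softmax vector sums to 1, the averaged vector is still a valid probability distribution.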
Test-Time Augmentation (TTA)
TTA generates multiple views of the same video at inference time by applying temporal transformations to the pre-extracted feature sequence. The model never sees pixels during TTA—only the feature vectors are manipulated, making it computationally cheap.
The 4 TTA Modes
Mode 1: Original
The unmodified feature sequence is passed through the ensemble as-is. This is the baseline prediction.
Mode 2: Reverse (flip temporal order)
The frame sequence is reversed end-to-end, simulating a video played backwards. Content type (Animation, Gaming, etc.) is invariant to temporal direction, so this is a valid augmentation.
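The reversal itself is a one-line tensor operation on the feature sequence (shapes here are assumed for illustration):

```python
import torch

T, D = 60, 512                   # frames × feature dim (assumed shapes)
features = torch.randn(T, D)     # pre-extracted per-frame features

# Mode 2: flip the temporal axis; the class label is direction-invariant.
reversed_features = torch.flip(features, dims=[0])
```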
Mode 3: Speed Up (subsample half the frames)
Every other frame is dropped by sampling T/2 indices uniformly across the sequence. This simulates a 2× speed-up. A minimum length guard (> 10 frames) prevents degenerate sequences that are too short for the LSTM.
Mode 4: Speed Down (interpolate to 1.5× frames)
The sequence is expanded to 1.5 × T frames by repeating existing frame features at uniformly spaced positions. This simulates a 0.67× slowdown. A .clamp() call ensures no index exceeds the original sequence length due to floating-point rounding.
The predict_with_tta() Method
The full TTA function from test_already_extracted.py:
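The actual code from test_already_extracted.py is not reproduced here; the following is a minimal sketch of a predict_with_tta-style routine covering all four modes, assuming a `predict_standard` callable that takes one `[T, D]` feature sequence and returns ensemble-averaged probabilities of shape `[4]`:

```python
import torch

def predict_with_tta(predict_standard, features, use_tta=True):
    # Sketch only: `predict_standard` is assumed to already average
    # the 4 model checkpoints for a single feature sequence [T, D].
    views = [features]                                    # Mode 1: original
    if use_tta:
        T = features.shape[0]
        views.append(torch.flip(features, dims=[0]))      # Mode 2: reverse
        if T // 2 > 10:                                   # length guard
            idx = torch.linspace(0, T - 1, T // 2).long()
            views.append(features[idx])                   # Mode 3: speed up
        idx = torch.linspace(0, T - 1, int(T * 1.5)).long()
        views.append(features[idx.clamp(0, T - 1)])       # Mode 4: slow down
    return torch.stack([predict_standard(v) for v in views]).mean(dim=0)

# Usage with a dummy predictor that always returns uniform probabilities.
uniform = lambda v: torch.full((4,), 0.25)
out = predict_with_tta(uniform, torch.randn(60, 512))
```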
predict_standard already averages all 4 model checkpoints, so the final prediction combines up to 4 TTA modes × 4 models = 16 forward passes.
TTA During Training Evaluation
The same four modes are used in the test_time_augmentation method of EnhancedTemporalModelTrainer during the training pipeline:
EnhancedPreExtractedFeaturesDataset.__getitem__ applies the transformation inline:
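The real EnhancedPreExtractedFeaturesDataset code is not shown here; the sketch below illustrates the inline-transform pattern, picking one of the four modes at random per sample (a common convention, assumed rather than confirmed by the source):

```python
import torch
from torch.utils.data import Dataset

class AugmentedFeaturesDataset(Dataset):
    """Sketch of an EnhancedPreExtractedFeaturesDataset-style class that
    applies one temporal transform inline in __getitem__."""

    def __init__(self, features, labels, augment=True):
        self.features, self.labels, self.augment = features, labels, augment

    def __len__(self):
        return len(self.features)

    def __getitem__(self, i):
        x = self.features[i]                      # pre-extracted [T, D]
        if self.augment:
            T = x.shape[0]
            mode = torch.randint(0, 4, (1,)).item()
            if mode == 1:                         # reverse
                x = torch.flip(x, dims=[0])
            elif mode == 2 and T // 2 > 10:       # speed up, length-guarded
                x = x[torch.linspace(0, T - 1, T // 2).long()]
            elif mode == 3:                       # slow down
                idx = torch.linspace(0, T - 1, int(T * 1.5)).long()
                x = x[idx.clamp(0, T - 1)]
        return x, self.labels[i]

# Usage: with augment=False the original sequence passes through unchanged.
ds = AugmentedFeaturesDataset([torch.randn(60, 8)], [0], augment=False)
x, y = ds[0]
```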
Performance Impact
| Configuration | Accuracy |
|---|---|
| Single model, no TTA | ~93% |
| Single model + TTA | ~94–95% |
| 4-model ensemble + TTA | ~95%+ |
Test Accuracy: ~93% (95% with TTA)
Balanced class-wise F1 scores (>95% with ensemble + TTA)