The production system loads all four checkpoints simultaneously and averages their softmax probability outputs before taking the argmax. This simple averaging ensemble consistently outperforms any single model on the 412-video test set.
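The averaging step itself is simple. A minimal sketch in plain Python (the logits and the four-model setup below are illustrative toy values, not outputs from the real checkpoints):

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_predict(per_model_logits):
    """Average each model's softmax output, then take the argmax."""
    probs = [softmax(l) for l in per_model_logits]
    n_classes = len(probs[0])
    avg = [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: avg[c]), avg

# Hypothetical logits from four models for one clip
# (class order: Animation, Flat_Content, Gaming, Natural_Content)
logits = [
    [2.0, 0.5, 0.1, 0.2],
    [1.8, 0.7, 0.3, 0.1],
    [0.2, 1.9, 0.4, 0.3],  # one model disagrees
    [2.2, 0.4, 0.2, 0.1],
]
pred, avg_probs = ensemble_predict(logits)  # majority view wins: class 0
```

Note that averaging probabilities (rather than voting on argmaxes) lets a confident minority model still influence borderline decisions.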
Why Four Models?
Each checkpoint is the best validation-accuracy snapshot from an independently seeded training run. Because the runs use different random seeds (42, 43, 44, 45), the models converge to different local minima in weight space and make partially uncorrelated errors. Averaging their probability outputs reduces variance without increasing bias — the standard rationale for ensemble methods.
Key observations from the four checkpoints:
- Models 2, 3, and 4 converged to ~91–92% validation accuracy in epochs 40–43.
- Model 1's best checkpoint was saved at epoch 52, but at a much lower accuracy (~72.7%) after patience was exhausted; it still contributes complementary signal to the ensemble because its error distribution differs from the other three.
Per-Checkpoint Metrics
All values are taken directly from `configuration_analysis.json`, measured on the 412-video test split.
Validation Accuracy and Weighted F1
| Checkpoint | Epoch | Val Accuracy | Weighted F1 |
|---|---|---|---|
| best_ensemble_model_1.pt | 52 | 72.73% | 64.76% |
| best_ensemble_model_2.pt | 43 | 92.13% | 92.15% |
| best_ensemble_model_3.pt | 42 | 91.86% | 91.87% |
| best_ensemble_model_4.pt | 40 | 91.86% | 91.88% |
Per-Class F1 Scores (%)
| Checkpoint | Animation | Flat_Content | Gaming | Natural_Content |
|---|---|---|---|---|
| best_ensemble_model_1.pt | 1.10 | 93.47 | 66.06 | 96.69 |
| best_ensemble_model_2.pt | 86.46 | 96.74 | 87.88 | 97.52 |
| best_ensemble_model_3.pt | 87.67 | 95.47 | 88.00 | 96.38 |
| best_ensemble_model_4.pt | 88.02 | 95.24 | 88.48 | 95.77 |
Animation is the hardest class across all four checkpoints. Model 1’s near-zero Animation F1 (1.10%) indicates it almost always misclassifies Animation clips at the epoch it was saved — likely because it stopped before the model had learned to distinguish Animation’s visual features from Flat_Content.
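Per-class F1 scores like those in the table can be reproduced from raw predictions. A self-contained sketch of the one-vs-rest computation (the labels below are toy data, not the real 412-video test split):

```python
def per_class_f1(y_true, y_pred, classes):
    """Compute one-vs-rest F1 for each class from paired label lists."""
    scores = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[c] = (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)
    return scores

classes = ["Animation", "Flat_Content", "Gaming", "Natural_Content"]
# Toy example: one Animation clip misclassified as Flat_Content,
# mirroring the confusion pattern described for Model 1
y_true = ["Animation", "Animation", "Gaming", "Natural_Content"]
y_pred = ["Flat_Content", "Animation", "Gaming", "Natural_Content"]
f1 = per_class_f1(y_true, y_pred, classes)
```

In practice a library routine such as scikit-learn's `f1_score(average=None)` would do the same job; the explicit version above just makes the per-class definition visible.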
Why Model 1 Performs Differently
Model 1's best checkpoint, at epoch 52, reached only 72.73% validation accuracy, far below the ~91–92% achieved by models 2–4. Several factors explain this:
Early patience exhaustion
Each training run uses `patience=25`: if validation accuracy does not improve for 25 consecutive epochs, training stops. Model 1's training stagnated early, likely because the random seed (`torch.manual_seed(42)`) produced an initialization that needed more epochs to escape a poor loss basin, but patience ran out first.
Impact on ensemble
Because Model 1’s probability outputs are very confident for Natural_Content and Flat_Content but near-random for Animation, the ensemble effectively down-weights its vote on ambiguous Animation clips through averaging; the three stronger models’ high Animation F1 (~87–88%) compensates. Model 1 still adds value on Natural_Content predictions (96.69% F1), where its high confidence aligns with the other models and reinforces the ensemble’s correct decisions.
Checkpoint selection
The checkpoint is saved as the best model seen during training — i.e., the epoch with the highest validation accuracy observed. For Model 1, that was epoch 52 at 72.73%. Even though training could have been restarted with a different seed, the checkpoint is included as-is to capture the diversity it provides.
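The best-checkpoint rule described above can be sketched as a simple tracker. The `save_fn` callback here stands in for whatever the training script actually does at save time (e.g. a `torch.save` call), which is an assumption, not the script's real code:

```python
def track_best(val_accuracies, save_fn):
    """Save whenever validation accuracy sets a new best; return best epoch/acc."""
    best_acc, best_epoch = float("-inf"), -1
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
            save_fn(epoch, acc)  # e.g. torch.save({...}, "best_ensemble_model_1.pt")
    return best_epoch, best_acc

saved = []
# Toy accuracy curve: improves, dips, peaks at epoch 5, then declines
curve = [0.50, 0.62, 0.70, 0.68, 0.7273, 0.71]
epoch, acc = track_best(curve, lambda e, a: saved.append(e))
```

Because only strict improvements trigger a save, the checkpoint on disk always corresponds to the single highest validation accuracy seen so far.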
Shared Model Configuration
All four checkpoints store their configuration under the `model_config` key in the `.pt` file. The values are identical across all four:
| Parameter | Value |
|---|---|
| feature_dim | 1280 (EfficientNet-V2-S output) |
| hidden_dim | 768 |
| num_classes | 4 |
| num_lstm_layers | 4 |
| num_attention_heads | 12 |
| dropout | 0.4 |
| bidirectional | true |
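A sketch of how the stored configuration might be read back. The checkpoint dict below is mocked from the table; in the real pipeline it would come from `torch.load` on the `.pt` file, and the state dict would not be empty:

```python
# Mock of the dict a .pt checkpoint would contain; real code would use
# checkpoint = torch.load("best_ensemble_model_2.pt", map_location="cpu")
checkpoint = {
    "model_config": {
        "feature_dim": 1280,        # EfficientNet-V2-S output size
        "hidden_dim": 768,
        "num_classes": 4,
        "num_lstm_layers": 4,
        "num_attention_heads": 12,
        "dropout": 0.4,
        "bidirectional": True,
    },
    "state_dict": {},  # weights omitted in this mock
}
cfg = checkpoint["model_config"]

# Sanity check (an assumption about the architecture): multi-head attention
# requires the embedding dimension to divide evenly across the heads.
assert cfg["hidden_dim"] % cfg["num_attention_heads"] == 0  # 768 / 12 = 64
```

Storing the config inside the checkpoint means the architecture can be rebuilt at inference time without access to the original training script's hyperparameters.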
Shared Training Configuration
| Parameter | Value |
|---|---|
| learning_rate | 0.001 |
| batch_size | 48 |
| num_epochs | 150 (max; early stopping may apply) |
| patience | 25 |
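The patience mechanism from the table can be sketched as a simple training loop over per-epoch validation accuracies (the accuracy curve below is toy data, not a real training log):

```python
def train_with_early_stopping(val_accuracies, patience=25, max_epochs=150):
    """Stop once val accuracy has not improved for `patience` epochs."""
    best_acc, best_epoch, stale = float("-inf"), 0, 0
    epoch = 0
    for epoch, acc in enumerate(val_accuracies[:max_epochs], start=1):
        if acc > best_acc:
            best_acc, best_epoch, stale = acc, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break  # patience exhausted
    return best_epoch, best_acc, epoch

# Toy run: improves through epoch 5, then flat-lines below the best
curve = [0.4, 0.5, 0.6, 0.65, 0.7] + [0.69] * 40
best_epoch, best_acc, stopped_at = train_with_early_stopping(curve, patience=25)
```

With `patience=25`, a run that peaks at epoch 5 keeps training for 25 more non-improving epochs before stopping at epoch 30, which is the dynamic Model 1's run exhibits.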
Inference Code
- Loading Checkpoints
- Ensemble Averaging
- Flask App Configuration
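A minimal sketch of the load-and-eval pattern, using a stand-in architecture. `TinyNet` is purely illustrative; the real pipeline reconstructs its own model class from the stored `model_config`:

```python
import os
import tempfile

import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Stand-in for the real LSTM/attention model; illustrative only."""
    def __init__(self, num_classes=4):
        super().__init__()
        self.fc = nn.Linear(8, num_classes)
        self.dropout = nn.Dropout(0.4)

    def forward(self, x):
        return self.fc(self.dropout(x))

def load_checkpoint(path, model_cls):
    """Load a checkpoint dict, rebuild the model, and switch to eval mode."""
    checkpoint = torch.load(path, map_location="cpu")
    model = model_cls(**checkpoint["model_config"])
    model.load_state_dict(checkpoint["state_dict"])
    model.eval()  # disables dropout for deterministic inference
    return model

# Round-trip demo with a temporary file
src = TinyNet()
path = os.path.join(tempfile.mkdtemp(), "demo.pt")
torch.save({"model_config": {"num_classes": 4},
            "state_dict": src.state_dict()}, path)
model = load_checkpoint(path, TinyNet)
```

The `eval()` call matters here: with `dropout=0.4` left in training mode, repeated inference on the same clip would give different probabilities each time.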
The `SingleVideoClassifier.load_models()` method in `test_already_extracted.py` reads each checkpoint, reconstructs the model architecture from the stored `model_config`, loads the state dict, and switches the model to eval mode.

Raw Metrics (JSON)

The full structured metrics for all four checkpoints, as they appear in `configuration_analysis.json`: