The production system loads all four checkpoints simultaneously and averages their softmax probability outputs before taking the argmax. This simple averaging ensemble consistently outperforms any single model on the 412-video test set.

Why Four Models?

Each checkpoint is the best validation-accuracy snapshot from an independently seeded training run. Because the runs use different random seeds (42, 43, 44, 45), the models converge to different local minima in weight space and make partially uncorrelated errors. Averaging their probability outputs reduces variance without increasing bias — the standard rationale for ensemble methods. Key observations from the four checkpoints:
  • Models 2, 3, and 4 converged to ~91–92% validation accuracy in epochs 40–43.
  • Model 1 plateaued at a much lower accuracy (~72.7%, best checkpoint at epoch 52) before patience was exhausted, yet it still contributes complementary signal to the ensemble because its error distribution differs from the other three.
Even a weaker model can improve ensemble performance if its errors are not perfectly correlated with the stronger models’ errors. Model 1’s low Animation F1 (1.1%) is offset by strong Gaming (66.1%) and Natural_Content (96.7%) scores, which can reinforce confident correct predictions from the other models.
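As a toy illustration of why averaging helps (the probability vectors below are made up for demonstration, not real model outputs): a weak model that spreads its probability mass on an ambiguous clip does not override three stronger models that agree.

```python
import numpy as np

# Hypothetical softmax outputs from four models for one ambiguous Animation clip.
# Class order: [Animation, Flat_Content, Gaming, Natural_Content]
probs = np.array([
    [0.05, 0.60, 0.15, 0.20],  # model 1: misreads Animation as Flat_Content
    [0.55, 0.25, 0.10, 0.10],  # model 2
    [0.50, 0.30, 0.10, 0.10],  # model 3
    [0.60, 0.20, 0.10, 0.10],  # model 4
])

avg = probs.mean(axis=0)    # simple averaging ensemble
pred = int(np.argmax(avg))  # 0 -> Animation: the three strong models win the vote
```

Here model 1 alone would have predicted Flat_Content, but after averaging the ensemble still picks Animation (avg ≈ [0.43, 0.34, 0.11, 0.13]).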

Per-Checkpoint Metrics

All values are taken directly from configuration_analysis.json, measured on the 412-video test split.

Validation Accuracy and Weighted F1

Checkpoint                 Epoch   Val Accuracy   Weighted F1
best_ensemble_model_1.pt   52      72.73%         64.76%
best_ensemble_model_2.pt   43      92.13%         92.15%
best_ensemble_model_3.pt   42      91.86%         91.87%
best_ensemble_model_4.pt   40      91.86%         91.88%

Per-Class F1 Scores (%)

Checkpoint                 Animation   Flat_Content   Gaming   Natural_Content
best_ensemble_model_1.pt   1.10        93.47          66.06    96.69
best_ensemble_model_2.pt   86.46       96.74          87.88    97.52
best_ensemble_model_3.pt   87.67       95.47          88.00    96.38
best_ensemble_model_4.pt   88.02       95.24          88.48    95.77
Animation is the hardest class across all four checkpoints. Model 1’s near-zero Animation F1 (1.10%) indicates it almost always misclassifies Animation clips at the epoch it was saved, likely because training stagnated before the model learned to distinguish Animation’s visual features from Flat_Content.

Why Model 1 Performs Differently

Model 1 stopped at epoch 52 with a validation accuracy of 72.73% — far below the ~91–92% achieved by models 2–4. Several factors explain this:
Each training run uses patience=25. If validation accuracy does not improve for 25 consecutive epochs, training stops. Model 1’s training stagnated early, likely due to the random seed (torch.manual_seed(42)) producing an initialization that required more epochs to escape a poor loss basin — but patience ran out first.
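The patience mechanism described above can be sketched as follows (a minimal illustration of the early-stopping rule, not the project's actual training loop):

```python
def should_stop(val_accs, patience=25):
    """Return True once validation accuracy has gone `patience`
    consecutive epochs without setting a new best."""
    best, epochs_since_best = float("-inf"), 0
    for acc in val_accs:
        if acc > best:
            best, epochs_since_best = acc, 0  # new best: reset the counter
        else:
            epochs_since_best += 1
        if epochs_since_best >= patience:
            return True  # patience exhausted
    return False

# A run that plateaus (no improvement for 25+ epochs) triggers the stop,
# which is what happened to Model 1 after its epoch-52 best.
```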
Because Model 1’s probability outputs are very confident for Natural_Content and Flat_Content but near-random for Animation, averaging effectively down-weights its vote on ambiguous Animation clips; the three stronger models’ high Animation F1 (~87–88%) compensates. Model 1 still adds value on Natural_Content predictions (96.69% F1), where its high confidence aligns with the other models and reinforces the ensemble’s correct decisions.
The checkpoint is saved as the best model seen during training — i.e., the epoch with the highest validation accuracy observed. For Model 1, that was epoch 52 at 72.73%. Even though training could have been restarted with a different seed, the checkpoint is included as-is to capture the diversity it provides.

Shared Model Configuration

All four checkpoints store their configuration under the model_config key in the .pt file. The values are identical across all four:
{
  "feature_dim": 1280,
  "hidden_dim": 768,
  "num_classes": 4,
  "num_lstm_layers": 4,
  "num_attention_heads": 12,
  "dropout": 0.4,
  "bidirectional": true
}
Parameter             Value
feature_dim           1280 (EfficientNet-V2-S output)
hidden_dim            768
num_classes           4
num_lstm_layers       4
num_attention_heads   12
dropout               0.4
bidirectional         true

Shared Training Configuration

{
  "learning_rate": 0.001,
  "batch_size": 48,
  "num_epochs": 150,
  "patience": 25
}
Parameter       Value
learning_rate   0.001
batch_size      48
num_epochs      150 (max; early stopping may apply)
patience        25

Inference Code

The SingleVideoClassifier.load_models() method in test_already_extracted.py reads each checkpoint, reconstructs the model architecture from the stored model_config, loads the state dict, and switches the model to eval mode.
def load_models(self):
    """Load all trained models"""
    print(f"Loading {len(self.checkpoint_paths)} model(s)...\n")
    
    for i, checkpoint_path in enumerate(self.checkpoint_paths, 1):
        print(f"   Loading model {i}/{len(self.checkpoint_paths)}: {checkpoint_path.name}")
        
        checkpoint = torch.load(checkpoint_path, map_location=self.device, weights_only=False)
        
        # Get model configuration (same as testing script)
        config = checkpoint.get('model_config', checkpoint.get('config', {}))
        
        feature_dim          = config.get('feature_dim', 1280)
        hidden_dim           = config.get('hidden_dim', 768)
        num_classes          = config.get('num_classes', 4)
        num_lstm_layers      = config.get('num_lstm_layers', 4)
        num_attention_heads  = config.get('num_attention_heads', 12)
        dropout              = config.get('dropout', 0.4)
        bidirectional        = config.get('bidirectional', True)
        
        # Store for first model
        if i == 1:
            self.feature_dim = feature_dim
            self.num_classes = num_classes
            self.class_names = ['Animation', 'Flat_Content', 'Gaming', 'Natural_Content']

        # Initialize model
        model = SuperEnhancedTemporalModel(
            feature_dim=feature_dim,
            hidden_dim=hidden_dim,
            num_classes=num_classes,
            num_lstm_layers=num_lstm_layers,
            num_attention_heads=num_attention_heads,
            dropout=dropout,
            bidirectional=bidirectional
        ).to(self.device)
        
        model.load_state_dict(checkpoint['model_state_dict'])
        model.eval()
        
        self.models.append(model)
        
        # Display info
        if 'best_val_acc' in checkpoint:
            print(f"      Val accuracy: {checkpoint['best_val_acc']:.2f}%")
    
    print(f"\n✓ All {len(self.models)} model(s) loaded successfully")
    print(f"   Feature dim: {self.feature_dim}")
    print(f"   Classes: {self.class_names}\n")
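The averaging step itself is not shown in load_models(). A minimal sketch of how the ensemble prediction might look, reusing the same attributes (self.models, self.device, self.class_names); the method name `predict` and its signature are assumptions, and the actual code in test_already_extracted.py may differ:

```python
import torch
import torch.nn.functional as F

def predict(self, features):
    """Average softmax probabilities across all loaded models,
    then take the argmax of the averaged distribution."""
    features = features.to(self.device)
    with torch.no_grad():
        # Stack each model's softmax output -> (num_models, batch, num_classes),
        # then average over the model dimension
        probs = torch.stack(
            [F.softmax(model(features), dim=-1) for model in self.models]
        ).mean(dim=0)
    pred_idx = int(probs.argmax(dim=-1).item())  # assumes batch size 1
    return self.class_names[pred_idx], probs
```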

Raw Metrics (JSON)

The full structured metrics for all four checkpoints as they appear in configuration_analysis.json:
[
  {
    "checkpoint_name": "best_ensemble_model_1.pt",
    "epoch": 52,
    "best_val_acc": 72.72727272727273,
    "metrics": {
      "accuracy": 72.72727272727273,
      "f1_weighted": 64.76018767151395,
      "f1_per_class": [1.1049723756906076, 93.47258485639686, 66.05839416058394, 96.68508287292819]
    }
  },
  {
    "checkpoint_name": "best_ensemble_model_2.pt",
    "epoch": 43,
    "best_val_acc": 92.13025780189959,
    "metrics": {
      "accuracy": 92.13025780189959,
      "f1_weighted": 92.14916021772333,
      "f1_per_class": [86.45533141210376, 96.73913043478261, 87.87878787878788, 97.52066115702479]
    }
  },
  {
    "checkpoint_name": "best_ensemble_model_3.pt",
    "epoch": 42,
    "best_val_acc": 91.85888738127544,
    "metrics": {
      "accuracy": 91.85888738127544,
      "f1_weighted": 91.87243635919067,
      "f1_per_class": [87.67123287671232, 95.46666666666667, 88.0, 96.37883008356548]
    }
  },
  {
    "checkpoint_name": "best_ensemble_model_4.pt",
    "epoch": 40,
    "best_val_acc": 91.85888738127544,
    "metrics": {
      "accuracy": 91.85888738127544,
      "f1_weighted": 91.87549831706308,
      "f1_per_class": [88.02228412256267, 95.23809523809523, 88.48167539267016, 95.77464788732395]
    }
  }
]
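Assuming configuration_analysis.json holds the list above, the per-checkpoint summary can be reproduced with a few lines of stdlib Python (a sketch; the file path is an assumption):

```python
import json

def summarize(records):
    """One summary line per checkpoint record from configuration_analysis.json."""
    return [
        f"{r['checkpoint_name']}: epoch {r['epoch']}, "
        f"acc {r['metrics']['accuracy']:.2f}%, "
        f"weighted F1 {r['metrics']['f1_weighted']:.2f}%"
        for r in records
    ]

# Usage, assuming the JSON file sits next to the script:
# with open("configuration_analysis.json") as f:
#     print("\n".join(summarize(json.load(f))))
```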
