The production system loads all four checkpoints simultaneously and averages their softmax probability outputs before taking the argmax. This simple averaging ensemble consistently outperforms any single model on the 412-video test set.
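The averaging step itself is simple. A minimal sketch in plain Python (the logits and the four-model setup below are illustrative toy values, not outputs from the real checkpoints):

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_predict(per_model_logits):
    """Average each model's softmax output, then take the argmax."""
    probs = [softmax(l) for l in per_model_logits]
    n_classes = len(probs[0])
    avg = [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: avg[c]), avg

# Hypothetical logits from four models for one clip
# (class order: Animation, Flat_Content, Gaming, Natural_Content)
logits = [
    [2.0, 0.5, 0.1, 0.2],
    [1.8, 0.7, 0.3, 0.1],
    [0.2, 1.9, 0.4, 0.3],  # one model disagrees
    [2.2, 0.4, 0.2, 0.1],
]
pred, avg_probs = ensemble_predict(logits)  # majority view wins: class 0
```

Note that averaging probabilities (rather than voting on argmaxes) lets a confident minority model still influence borderline decisions.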
Why Four Models?
Each checkpoint is the best validation-accuracy snapshot from an independently seeded training run. Because the runs use different random seeds (42, 43, 44, 45), the models converge to different local minima in weight space and make partially uncorrelated errors. Averaging their probability outputs reduces variance without increasing bias — the standard rationale for ensemble methods.
Key observations from the four checkpoints:
- Models 2, 3, and 4 converged to ~91–92% validation accuracy in epochs 40–43.
- Model 1's best checkpoint was saved at epoch 52, but at a much lower accuracy (~72.7%) after patience was exhausted; it still contributes complementary signal to the ensemble because its error distribution differs from the other three.
Per-Checkpoint Metrics
All values are taken directly from `configuration_analysis.json`, measured on the 412-video test split.
Validation Accuracy and Weighted F1
| Checkpoint | Epoch | Val Accuracy | Weighted F1 |
|---|---|---|---|
| best_ensemble_model_1.pt | 52 | 72.73% | 64.76% |
| best_ensemble_model_2.pt | 43 | 92.13% | 92.15% |
| best_ensemble_model_3.pt | 42 | 91.86% | 91.87% |
| best_ensemble_model_4.pt | 40 | 91.86% | 91.88% |
Per-Class F1 Scores (%)
| Checkpoint | Animation | Flat_Content | Gaming | Natural_Content |
|---|---|---|---|---|
| best_ensemble_model_1.pt | 1.10 | 93.47 | 66.06 | 96.69 |
| best_ensemble_model_2.pt | 86.46 | 96.74 | 87.88 | 97.52 |
| best_ensemble_model_3.pt | 87.67 | 95.47 | 88.00 | 96.38 |
| best_ensemble_model_4.pt | 88.02 | 95.24 | 88.48 | 95.77 |
Animation is the hardest class across all four checkpoints. Model 1’s near-zero Animation F1 (1.10%) indicates it almost always misclassifies Animation clips at the epoch it was saved — likely because it stopped before the model had learned to distinguish Animation’s visual features from Flat_Content.
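Per-class F1 scores like those in the table can be reproduced from raw predictions. A self-contained sketch of the one-vs-rest computation (the labels below are toy data, not the real 412-video test split):

```python
def per_class_f1(y_true, y_pred, classes):
    """Compute one-vs-rest F1 for each class from paired label lists."""
    scores = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[c] = (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)
    return scores

classes = ["Animation", "Flat_Content", "Gaming", "Natural_Content"]
# Toy example: one Animation clip misclassified as Flat_Content,
# mirroring the confusion pattern described for Model 1
y_true = ["Animation", "Animation", "Gaming", "Natural_Content"]
y_pred = ["Flat_Content", "Animation", "Gaming", "Natural_Content"]
f1 = per_class_f1(y_true, y_pred, classes)
```

In practice a library routine such as scikit-learn's `f1_score(average=None)` would do the same job; the explicit version above just makes the per-class definition visible.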
Why Model 1 Performs Differently
Model 1's best checkpoint, at epoch 52, reached only 72.73% validation accuracy, far below the ~91–92% achieved by models 2–4. Several factors explain this:
Early patience exhaustion
Each training run uses `patience=25`: if validation accuracy does not improve for 25 consecutive epochs, training stops. Model 1's training stagnated early, likely because the random seed (`torch.manual_seed(42)`) produced an initialization that needed more epochs to escape a poor loss basin, but patience ran out first.
Impact on ensemble
Because Model 1’s probability outputs are very confident for Natural_Content and Flat_Content but near-random for Animation, the ensemble effectively down-weights its vote on ambiguous Animation clips through averaging; the three stronger models’ high Animation F1 (~87–88%) compensates. Model 1 still adds value on Natural_Content predictions (96.69% F1), where its high confidence aligns with the other models and reinforces the ensemble’s correct decisions.
Checkpoint selection
The checkpoint is saved as the best model seen during training — i.e., the epoch with the highest validation accuracy observed. For Model 1, that was epoch 52 at 72.73%. Even though training could have been restarted with a different seed, the checkpoint is included as-is to capture the diversity it provides.
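The best-checkpoint rule described above can be sketched as a simple tracker. The `save_fn` callback here stands in for whatever the training script actually does at save time (e.g. a `torch.save` call), which is an assumption, not the script's real code:

```python
def track_best(val_accuracies, save_fn):
    """Save whenever validation accuracy sets a new best; return best epoch/acc."""
    best_acc, best_epoch = float("-inf"), -1
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
            save_fn(epoch, acc)  # e.g. torch.save({...}, "best_ensemble_model_1.pt")
    return best_epoch, best_acc

saved = []
# Toy accuracy curve: improves, dips, peaks at epoch 5, then declines
curve = [0.50, 0.62, 0.70, 0.68, 0.7273, 0.71]
epoch, acc = track_best(curve, lambda e, a: saved.append(e))
```

Because only strict improvements trigger a save, the checkpoint on disk always corresponds to the single highest validation accuracy seen so far.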
Shared Model Configuration
All four checkpoints store their configuration under the `model_config` key in the `.pt` file. The values are identical across all four:
| Parameter | Value |
|---|---|
| feature_dim | 1280 (EfficientNet-V2-S output) |
| hidden_dim | 768 |
| num_classes | 4 |
| num_lstm_layers | 4 |
| num_attention_heads | 12 |
| dropout | 0.4 |
| bidirectional | true |
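A sketch of how the stored configuration might be read back. The checkpoint dict below is mocked from the table; in the real pipeline it would come from `torch.load` on the `.pt` file, and the state dict would not be empty:

```python
# Mock of the dict a .pt checkpoint would contain; real code would use
# checkpoint = torch.load("best_ensemble_model_2.pt", map_location="cpu")
checkpoint = {
    "model_config": {
        "feature_dim": 1280,        # EfficientNet-V2-S output size
        "hidden_dim": 768,
        "num_classes": 4,
        "num_lstm_layers": 4,
        "num_attention_heads": 12,
        "dropout": 0.4,
        "bidirectional": True,
    },
    "state_dict": {},  # weights omitted in this mock
}
cfg = checkpoint["model_config"]

# Sanity check (an assumption about the architecture): multi-head attention
# requires the embedding dimension to divide evenly across the heads.
assert cfg["hidden_dim"] % cfg["num_attention_heads"] == 0  # 768 / 12 = 64
```

Storing the config inside the checkpoint means the architecture can be rebuilt at inference time without access to the original training script's hyperparameters.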
Shared Training Configuration
| Parameter | Value |
|---|---|
| learning_rate | 0.001 |
| batch_size | 48 |
| num_epochs | 150 (max; early stopping may apply) |
| patience | 25 |
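The patience mechanism from the table can be sketched as a simple training loop over per-epoch validation accuracies (the accuracy curve below is toy data, not a real training log):

```python
def train_with_early_stopping(val_accuracies, patience=25, max_epochs=150):
    """Stop once val accuracy has not improved for `patience` epochs."""
    best_acc, best_epoch, stale = float("-inf"), 0, 0
    epoch = 0
    for epoch, acc in enumerate(val_accuracies[:max_epochs], start=1):
        if acc > best_acc:
            best_acc, best_epoch, stale = acc, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break  # patience exhausted
    return best_epoch, best_acc, epoch

# Toy run: improves through epoch 5, then flat-lines below the best
curve = [0.4, 0.5, 0.6, 0.65, 0.7] + [0.69] * 40
best_epoch, best_acc, stopped_at = train_with_early_stopping(curve, patience=25)
```

With `patience=25`, a run that peaks at epoch 5 keeps training for 25 more non-improving epochs before stopping at epoch 30, which is the dynamic Model 1's run exhibits.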
Inference Code
- Loading Checkpoints
- Ensemble Averaging
- Flask App Configuration
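A minimal sketch of the load-and-eval pattern, using a stand-in architecture. `TinyNet` is purely illustrative; the real pipeline reconstructs its own model class from the stored `model_config`:

```python
import os
import tempfile

import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Stand-in for the real LSTM/attention model; illustrative only."""
    def __init__(self, num_classes=4):
        super().__init__()
        self.fc = nn.Linear(8, num_classes)
        self.dropout = nn.Dropout(0.4)

    def forward(self, x):
        return self.fc(self.dropout(x))

def load_checkpoint(path, model_cls):
    """Load a checkpoint dict, rebuild the model, and switch to eval mode."""
    checkpoint = torch.load(path, map_location="cpu")
    model = model_cls(**checkpoint["model_config"])
    model.load_state_dict(checkpoint["state_dict"])
    model.eval()  # disables dropout for deterministic inference
    return model

# Round-trip demo with a temporary file
src = TinyNet()
path = os.path.join(tempfile.mkdtemp(), "demo.pt")
torch.save({"model_config": {"num_classes": 4},
            "state_dict": src.state_dict()}, path)
model = load_checkpoint(path, TinyNet)
```

The `eval()` call matters here: with `dropout=0.4` left in training mode, repeated inference on the same clip would give different probabilities each time.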
The `SingleVideoClassifier.load_models()` method in `test_already_extracted.py` reads each checkpoint, reconstructs the model architecture from the stored `model_config`, loads the state dict, and switches the model to eval mode.

Raw Metrics (JSON)

The full structured metrics for all four checkpoints, as they appear in `configuration_analysis.json`: