
Category Overview

The classifier distinguishes four content categories sourced from YouTube-8M. Each category has distinct visual and temporal characteristics that affect classification difficulty.

Animation

Animated content spans a wide range of visual styles — from cel animation to 3D CGI — making it the hardest class to classify consistently.

Flat_Content

Screen recordings, slideshows, and presentation-style videos. Highly consistent low-motion patterns make this one of the easier classes.

Gaming

Gameplay footage with HUD elements and fast motion. Intermediate difficulty; visual patterns are consistent within genres but vary across games.

Natural_Content

Outdoor and real-world footage. Strong texture and motion cues make this the most reliably classified category.
Category-to-index mapping (as stored in test_features_multiscale.h5):
{"Animation": 0, "Flat_Content": 1, "Gaming": 2, "Natural_Content": 3}
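The mapping can be inverted in code to decode stored integer labels back into category names; a minimal sketch (the decode_labels helper is illustrative, not part of the project code):

```python
# Category-to-index mapping as stored in test_features_multiscale.h5
CATEGORY_TO_INDEX = {"Animation": 0, "Flat_Content": 1, "Gaming": 2, "Natural_Content": 3}
INDEX_TO_CATEGORY = {v: k for k, v in CATEGORY_TO_INDEX.items()}

def decode_labels(indices):
    """Map integer class indices (as stored in the HDF5 file) to names."""
    return [INDEX_TO_CATEGORY[int(i)] for i in indices]
```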

Per-Class F1 Scores

Per-class F1 scores are extracted from precision_recall_fscore_support with average=None and scaled to percentage. The table below presents values from all four checkpoint files as recorded in configuration_analysis.json.
Class            Model 1 F1  Model 2 F1  Model 3 F1  Model 4 F1  Best
Animation             1.10%      86.46%      87.67%      88.02%  88.02%
Flat_Content         93.47%      96.74%      95.47%      95.24%  96.74%
Gaming               66.06%      87.88%      88.00%      88.48%  88.48%
Natural_Content      96.69%      97.52%      96.38%      95.77%  97.52%
These per-class F1 scores reflect validation set performance at the best checkpoint epoch for each model, not final test-set performance. The ensemble + TTA figures reported on the Performance Metrics page are higher because they combine all four models at inference time.
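The extraction described above can be sketched as follows, using toy labels and predictions rather than the real checkpoint outputs:

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Toy labels/predictions for illustration (not the real checkpoint outputs)
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])
y_pred = np.array([0, 1, 1, 1, 2, 3, 3, 3])

# average=None returns one score per class, ordered by class index
_, _, f1_per_class, _ = precision_recall_fscore_support(
    y_true, y_pred, average=None, labels=[0, 1, 2, 3], zero_division=0
)
f1_pct = f1_per_class * 100  # scaled to percentage, as in the table above
```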

Model 1 Animation Anomaly

Model 1 produced an Animation F1 of 1.10% — effectively a complete failure on that class — while achieving reasonable F1 scores for the other three categories (Flat_Content 93.47%, Gaming 66.06%, Natural_Content 96.69%).
An F1 of 1.10% on Animation means Model 1 predicted almost every Animation sample as a different class. This is a training-run collapse, not a data issue. Models 2, 3, and 4 all scored 86–88% on Animation, confirming the class is learnable.
This behavior illustrates why ensemble diversity matters:
Even though Model 1 collapsed on Animation, its predictions for Flat_Content (93.47%) and Natural_Content (96.69%) are competitive with the other models. When the four models’ softmax probabilities are averaged, Models 2–4 dominate for Animation samples (driving the Animation probability high), while Model 1 still contributes accurate signal for the classes it handles well.

Removing Model 1 entirely would slightly reduce confidence on Flat_Content and Natural_Content predictions. The ensemble approach tolerates individual model weaknesses as long as at least one model produces a confident correct prediction per class.
The train_ensemble function in model_train_new.py trains each model with a different random seed:
for i in range(num_models):
    # Set different random seed for each model
    torch.manual_seed(42 + i)
    np.random.seed(42 + i)
    random.seed(42 + i)

    model, results = self.train_single_model(
        num_epochs=num_epochs,
        batch_size=batch_size,
        learning_rate=learning_rate,
        model_name=f'ensemble_model_{i}',
        ...
    )
Different seeds produce different weight initializations and data-augmentation sequences, which is the primary source of ensemble diversity. Occasionally, one seed leads a model into a suboptimal basin (as seen with Model 1 and Animation).

Class Difficulty Analysis

Class            Difficulty  Reason
Natural_Content  Easiest     Strong texture/motion cues; best F1 across all models (95–97%)
Flat_Content     Easy        Low motion, consistent color palettes; F1 93–97% across models
Gaming           Moderate    Fast motion + HUD overlaps with other categories in some games
Animation        Hardest     Highly varied styles from hand-drawn to 3D CGI; prone to collapse

Why the Ensemble Helps

The key mechanism is probability averaging. For each test video, each model outputs a softmax probability vector over the four classes. The ensemble averages these four vectors before taking the argmax.
import torch
import torch.nn.functional as F

# Collect predictions from all models
all_predictions = []
for model in ensemble_models:
    model.eval()
    model_predictions = []
    with torch.no_grad():
        for features, labels, lengths in test_loader:
            outputs = model(features, lengths)
            probs = F.softmax(outputs, dim=1)
            model_predictions.append(probs.cpu())
    all_predictions.append(torch.cat(model_predictions))

# Average predictions
ensemble_predictions = torch.stack(all_predictions).mean(dim=0)
For an Animation video:
  • Model 1 assigns near-zero probability to Animation (≈0.01)
  • Models 2, 3, 4 each assign ~0.87 probability to Animation
  • Averaged: (0.01 + 0.87 + 0.87 + 0.87) / 4 ≈ 0.655 — still the highest class
This demonstrates that even a severely degraded model does not destroy the ensemble, provided the remaining models are confident and correct.
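The arithmetic above can be checked with a few lines of plain Python (the probability vectors are hypothetical, chosen to mirror the example):

```python
# Hypothetical softmax outputs for one Animation clip (class index 0).
# Model 1 has collapsed on Animation; Models 2-4 are confident.
model_probs = [
    [0.01, 0.40, 0.30, 0.29],  # Model 1 (collapsed)
    [0.87, 0.05, 0.04, 0.04],  # Model 2
    [0.87, 0.05, 0.04, 0.04],  # Model 3
    [0.87, 0.05, 0.04, 0.04],  # Model 4
]

# Average the four probability vectors element-wise, then take the argmax
ensemble = [sum(col) / len(model_probs) for col in zip(*model_probs)]
pred = max(range(len(ensemble)), key=lambda c: ensemble[c])
```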

Training Visualization of Per-Class Metrics

The RealTimeTrainingVisualizer tracks per-class F1 throughout training and renders it in two ways per epoch:
  1. F1 progression plot — time-series of weighted F1 for train and validation sets
  2. Per-class F1 bar chart — current epoch’s per-class scores with mean line
import numpy as np
import matplotlib.pyplot as plt

def _plot_current_per_class_f1(self, ax, history, current_epoch):
    per_class_f1 = history['val_per_class_f1'][current_epoch]
    class_names = [f'Class {i}' for i in range(len(per_class_f1))]

    colors = plt.cm.tab10(np.linspace(0, 1, len(per_class_f1)))
    bars = ax.bar(class_names, per_class_f1, color=colors,
                  alpha=0.7, edgecolor='black', linewidth=1.5)

    # Add value labels on bars
    for bar in bars:
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height,
                f'{height:.1f}%', ha='center', va='bottom',
                fontsize=10, fontweight='bold')

    # Add mean line
    mean_f1 = np.mean(per_class_f1)
    ax.axhline(y=mean_f1, color='red', linestyle='--', linewidth=2,
               alpha=0.7, label=f'Mean: {mean_f1:.1f}%')
Per-class F1 history is stored in history['val_per_class_f1'] as a list of lists (one inner list per epoch), and exported to training_metrics.csv with columns class_0_f1 through class_3_f1.
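A minimal sketch of that export format, assuming only the epoch index and the four per-class columns (the real exporter lives in the training code and may include additional columns such as loss and accuracy):

```python
import csv
import io

# Stand-in history: one inner list of per-class F1 scores per epoch
history = {"val_per_class_f1": [[1.1, 93.5, 66.1, 96.7],
                                [45.0, 94.0, 80.0, 96.9]]}

# Write the class_0_f1 .. class_3_f1 columns, one row per epoch
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["epoch"] + [f"class_{i}_f1" for i in range(4)])
for epoch, scores in enumerate(history["val_per_class_f1"]):
    writer.writerow([epoch] + scores)

csv_text = buf.getvalue()
```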

Recommendations for Improvement

Expand the Animation training data. The Animation class is the primary bottleneck. Expanding the Animation subset of the YouTube-8M training split — particularly with diverse styles (anime, 3D CGI, stop-motion, motion graphics) — would provide the model with more discriminative features and reduce sensitivity to training-run variance. Consider subcategory-stratified sampling to ensure all animation styles are represented proportionally across train/val splits.
Class-conditional TTA. The current TTA applies the same four modes (original, reverse, speed_up, speed_down) uniformly across all classes. For Animation specifically, additional temporal augmentations — such as random subsequence sampling or pitch-shifted frame rates — might be more informative than simple speed changes. A class-conditional TTA could apply stronger augmentation to classes with historically lower confidence scores, while keeping inference fast for easy classes like Natural_Content.
Animation-focused fine-tuning. After initial ensemble training, an additional fine-tuning stage focused on Animation (e.g., a higher class weight in Focal Loss, or a second training phase with Animation-oversampled batches) could close the remaining gap without retraining all models from scratch. The existing WeightedRandomSampler infrastructure already supports this: increasing the class weight for Animation in _compute_class_weights() would bias sampling toward harder Animation samples.
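A hypothetical stand-alone version of that weighting logic (compute_sample_weights is illustrative; the project’s _compute_class_weights() may differ in signature and details):

```python
from collections import Counter

def compute_sample_weights(labels, boost=None):
    """Inverse-frequency per-sample weights with an optional per-class boost.

    `boost` maps class index -> multiplier; e.g. {0: 2.0} doubles the
    sampling weight of Animation (index 0), biasing a weighted sampler
    toward drawing Animation examples more often.
    """
    boost = boost or {}
    counts = Counter(labels)
    total = len(labels)
    class_w = {c: total / n for c, n in counts.items()}
    for c, multiplier in boost.items():
        if c in class_w:
            class_w[c] *= multiplier
    # One weight per sample, suitable for torch's WeightedRandomSampler
    return [class_w[y] for y in labels]
```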
Stacked meta-learner. The current four-model ensemble averages softmax probabilities (soft voting). Training a lightweight meta-learner (stacking) on the four models’ validation outputs could learn to weight Model 1’s contributions by class — giving it near-zero weight for Animation and higher weight for Natural_Content and Flat_Content.
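A minimal stacking sketch using scikit-learn’s LogisticRegression, with random stand-in probabilities in place of the four models’ real validation softmax outputs:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in data: 4 models x 200 validation samples x 4 classes.
# In practice these would be the saved validation softmax outputs.
val_probs = rng.dirichlet(np.ones(4), size=(4, 200))
val_labels = rng.integers(0, 4, size=200)

# Concatenate the four probability vectors into one 16-dim feature per sample
X = np.concatenate([val_probs[m] for m in range(4)], axis=1)

# The meta-learner can assign per-class weights to each base model,
# e.g. learning to discount Model 1's Animation probabilities
meta = LogisticRegression(max_iter=1000)
meta.fit(X, val_labels)
```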
