
Category Overview

The classifier distinguishes four content categories sourced from YouTube-8M. Each category has distinct visual and temporal characteristics that affect classification difficulty.

Animation

Animated content spans a wide range of visual styles — from cel animation to 3D CGI — making it the hardest class to classify consistently.

Flat_Content

Screen recordings, slideshows, and presentation-style videos. Highly consistent low-motion patterns make this one of the easier classes.

Gaming

Gameplay footage with HUD elements and fast motion. Intermediate difficulty; visual patterns are consistent within genres but vary across games.

Natural_Content

Outdoor and real-world footage. Strong texture and motion cues make this the most reliably classified category.
Category-to-index mapping (as stored in test_features_multiscale.h5):
{"Animation": 0, "Flat_Content": 1, "Gaming": 2, "Natural_Content": 3}
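The mapping can be inverted in code to decode stored integer labels back into category names; a minimal sketch (the decode_labels helper is illustrative, not part of the project code):

```python
# Category-to-index mapping as stored in test_features_multiscale.h5
CATEGORY_TO_INDEX = {"Animation": 0, "Flat_Content": 1, "Gaming": 2, "Natural_Content": 3}
INDEX_TO_CATEGORY = {v: k for k, v in CATEGORY_TO_INDEX.items()}

def decode_labels(indices):
    """Map integer class indices (as stored in the HDF5 file) to names."""
    return [INDEX_TO_CATEGORY[int(i)] for i in indices]
```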

Per-Class F1 Scores

Per-class F1 scores are extracted from precision_recall_fscore_support with average=None and scaled to percentage. The table below presents values from all four checkpoint files as recorded in configuration_analysis.json.
Class            Model 1 F1  Model 2 F1  Model 3 F1  Model 4 F1  Best
Animation             1.10%      86.46%      87.67%      88.02%  88.02%
Flat_Content         93.47%      96.74%      95.47%      95.24%  96.74%
Gaming               66.06%      87.88%      88.00%      88.48%  88.48%
Natural_Content      96.69%      97.52%      96.38%      95.77%  97.52%
These per-class F1 scores reflect validation set performance at the best checkpoint epoch for each model, not final test-set performance. The ensemble + TTA figures reported on the Performance Metrics page are higher because they combine all four models at inference time.
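The extraction described above can be sketched as follows, using toy labels and predictions rather than the real checkpoint outputs:

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Toy labels/predictions for illustration (not the real checkpoint outputs)
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])
y_pred = np.array([0, 1, 1, 1, 2, 3, 3, 3])

# average=None returns one score per class, ordered by class index
_, _, f1_per_class, _ = precision_recall_fscore_support(
    y_true, y_pred, average=None, labels=[0, 1, 2, 3], zero_division=0
)
f1_pct = f1_per_class * 100  # scaled to percentage, as in the table above
```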

Model 1 Animation Anomaly

Model 1 produced an Animation F1 of 1.10% — effectively a complete failure on that class — while achieving reasonable F1 scores for the other three categories (Flat_Content 93.47%, Gaming 66.06%, Natural_Content 96.69%).
An F1 of 1.10% on Animation means Model 1 predicted almost every Animation sample as a different class. This is a training-run collapse, not a data issue. Models 2, 3, and 4 all scored 86–88% on Animation, confirming the class is learnable.
This behavior illustrates why ensemble diversity matters:
Even though Model 1 collapsed on Animation, its predictions for Flat_Content (93.47%) and Natural_Content (96.69%) are competitive with the other models. When the four models’ softmax probabilities are averaged, Models 2–4 dominate for Animation samples (driving the Animation probability high), while Model 1 still contributes accurate signal for the classes it handles well.

Removing Model 1 entirely would slightly reduce confidence on Flat_Content and Natural_Content predictions. The ensemble approach tolerates individual model weaknesses as long as at least one model produces a confident correct prediction per class.
The train_ensemble function in model_train_new.py trains each model with a different random seed:
for i in range(num_models):
    # Set different random seed for each model
    torch.manual_seed(42 + i)
    np.random.seed(42 + i)
    random.seed(42 + i)

    model, results = self.train_single_model(
        num_epochs=num_epochs,
        batch_size=batch_size,
        learning_rate=learning_rate,
        model_name=f'ensemble_model_{i}',
        ...
    )
Different seeds produce different weight initializations and data-augmentation sequences, which is the primary source of ensemble diversity. Occasionally, one seed leads a model into a suboptimal basin (as seen with Model 1 and Animation).

Class Difficulty Analysis

Class            Difficulty  Reason
Natural_Content  Easiest     Strong texture/motion cues; best F1 across all models (95–97%)
Flat_Content     Easy        Low motion, consistent color palettes; F1 93–97% across models
Gaming           Moderate    Fast motion + HUD overlaps with other categories in some games
Animation        Hardest     Highly varied styles from hand-drawn to 3D CGI; prone to collapse

Why the Ensemble Helps

The key mechanism is probability averaging. For each test video, each model outputs a softmax probability vector over the four classes. The ensemble averages these four vectors before taking the argmax.
import torch
import torch.nn.functional as F

# Collect predictions from all models
all_predictions = []
for model in ensemble_models:
    model.eval()
    model_predictions = []
    with torch.no_grad():
        for features, labels, lengths in test_loader:
            outputs = model(features, lengths)
            probs = F.softmax(outputs, dim=1)
            model_predictions.append(probs.cpu())
    all_predictions.append(torch.cat(model_predictions))

# Average predictions
ensemble_predictions = torch.stack(all_predictions).mean(dim=0)
For an Animation video:
  • Model 1 assigns near-zero probability to Animation (≈0.01)
  • Models 2, 3, 4 each assign ~0.87 probability to Animation
  • Averaged: (0.01 + 0.87 + 0.87 + 0.87) / 4 ≈ 0.655 — still the highest class
This demonstrates that even a severely degraded model does not destroy the ensemble, provided the remaining models are confident and correct.
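The arithmetic above can be checked with a few lines of plain Python (the probability vectors are hypothetical, chosen to mirror the example):

```python
# Hypothetical softmax outputs for one Animation clip (class index 0).
# Model 1 has collapsed on Animation; Models 2-4 are confident.
model_probs = [
    [0.01, 0.40, 0.30, 0.29],  # Model 1 (collapsed)
    [0.87, 0.05, 0.04, 0.04],  # Model 2
    [0.87, 0.05, 0.04, 0.04],  # Model 3
    [0.87, 0.05, 0.04, 0.04],  # Model 4
]

# Average the four probability vectors element-wise, then take the argmax
ensemble = [sum(col) / len(model_probs) for col in zip(*model_probs)]
pred = max(range(len(ensemble)), key=lambda c: ensemble[c])
```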

Training Visualization of Per-Class Metrics

The RealTimeTrainingVisualizer tracks per-class F1 throughout training and renders it in two ways per epoch:
  1. F1 progression plot — time-series of weighted F1 for train and validation sets
  2. Per-class F1 bar chart — current epoch’s per-class scores with mean line
import numpy as np
import matplotlib.pyplot as plt

def _plot_current_per_class_f1(self, ax, history, current_epoch):
    per_class_f1 = history['val_per_class_f1'][current_epoch]
    class_names = [f'Class {i}' for i in range(len(per_class_f1))]

    colors = plt.cm.tab10(np.linspace(0, 1, len(per_class_f1)))
    bars = ax.bar(class_names, per_class_f1, color=colors,
                  alpha=0.7, edgecolor='black', linewidth=1.5)

    # Add value labels on bars
    for bar in bars:
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height,
                f'{height:.1f}%', ha='center', va='bottom',
                fontsize=10, fontweight='bold')

    # Add mean line
    mean_f1 = np.mean(per_class_f1)
    ax.axhline(y=mean_f1, color='red', linestyle='--', linewidth=2,
               alpha=0.7, label=f'Mean: {mean_f1:.1f}%')
Per-class F1 history is stored in history['val_per_class_f1'] as a list of lists (one inner list per epoch), and exported to training_metrics.csv with columns class_0_f1 through class_3_f1.
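A minimal sketch of that export format, assuming only the epoch index and the four per-class columns (the real exporter lives in the training code and may include additional columns such as loss and accuracy):

```python
import csv
import io

# Stand-in history: one inner list of per-class F1 scores per epoch
history = {"val_per_class_f1": [[1.1, 93.5, 66.1, 96.7],
                                [45.0, 94.0, 80.0, 96.9]]}

# Write the class_0_f1 .. class_3_f1 columns, one row per epoch
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["epoch"] + [f"class_{i}_f1" for i in range(4)])
for epoch, scores in enumerate(history["val_per_class_f1"]):
    writer.writerow([epoch] + scores)

csv_text = buf.getvalue()
```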

Recommendations for Improvement

Expand the Animation training data. The Animation class is the primary bottleneck. Expanding the Animation subset of the YouTube-8M training split — particularly with diverse styles (anime, 3D CGI, stop-motion, motion graphics) — would provide the model with more discriminative features and reduce sensitivity to training-run variance. Consider subcategory-stratified sampling to ensure all animation styles are represented proportionally across train/val splits.
Class-conditional TTA. The current TTA applies the same four modes (original, reverse, speed_up, speed_down) uniformly across all classes. For Animation specifically, additional temporal augmentations — such as random subsequence sampling or pitch-shifted frame rates — might be more informative than simple speed changes. A class-conditional TTA could apply stronger augmentation to classes with historically lower confidence scores, while keeping inference fast for easy classes like Natural_Content.
Animation-focused fine-tuning. After initial ensemble training, an additional fine-tuning stage focused on Animation (e.g., a higher class weight in Focal Loss, or a second training phase with Animation-oversampled batches) could close the remaining gap without retraining all models from scratch. The existing WeightedRandomSampler infrastructure already supports this: increasing the class weight for Animation in _compute_class_weights() would bias sampling toward harder Animation samples.
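A hypothetical stand-alone version of that weighting logic (compute_sample_weights is illustrative; the project’s _compute_class_weights() may differ in signature and details):

```python
from collections import Counter

def compute_sample_weights(labels, boost=None):
    """Inverse-frequency per-sample weights with an optional per-class boost.

    `boost` maps class index -> multiplier; e.g. {0: 2.0} doubles the
    sampling weight of Animation (index 0), biasing a weighted sampler
    toward drawing Animation examples more often.
    """
    boost = boost or {}
    counts = Counter(labels)
    total = len(labels)
    class_w = {c: total / n for c, n in counts.items()}
    for c, multiplier in boost.items():
        if c in class_w:
            class_w[c] *= multiplier
    # One weight per sample, suitable for torch's WeightedRandomSampler
    return [class_w[y] for y in labels]
```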
Stacked meta-learner. The current four-model ensemble averages softmax probabilities (soft voting). Training a lightweight meta-learner (stacking) on the four models’ validation outputs could learn to weight Model 1’s contributions by class — giving it near-zero weight for Animation and higher weight for Natural_Content and Flat_Content.
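A minimal stacking sketch using scikit-learn’s LogisticRegression, with random stand-in probabilities in place of the four models’ real validation softmax outputs:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in data: 4 models x 200 validation samples x 4 classes.
# In practice these would be the saved validation softmax outputs.
val_probs = rng.dirichlet(np.ones(4), size=(4, 200))
val_labels = rng.integers(0, 4, size=200)

# Concatenate the four probability vectors into one 16-dim feature per sample
X = np.concatenate([val_probs[m] for m in range(4)], axis=1)

# The meta-learner can assign per-class weights to each base model,
# e.g. learning to discount Model 1's Animation probabilities
meta = LogisticRegression(max_iter=1000)
meta.fit(X, val_labels)
```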
