Category Overview
The classifier distinguishes four content categories sourced from YouTube-8M. Each category has distinct visual and temporal characteristics that affect classification difficulty.

Animation
Animated content spans a wide range of visual styles — from cel animation to 3D CGI — making it the hardest class to classify consistently.
Flat_Content
Screen recordings, slideshows, and presentation-style videos. Highly consistent low-motion patterns make this one of the easier classes.
Gaming
Gameplay footage with HUD elements and fast motion. Intermediate difficulty; visual patterns are consistent within genres but vary across games.
Natural_Content
Outdoor and real-world footage. Strong texture and motion cues make this the most reliably classified category.
Per-Class F1 Scores
Per-class F1 scores are extracted from precision_recall_fscore_support with average=None and scaled to percentages. The table below presents values from all four checkpoint files as recorded in configuration_analysis.json.
| Class | Model 1 F1 | Model 2 F1 | Model 3 F1 | Model 4 F1 | Best |
|---|---|---|---|---|---|
| Animation | 1.10% | 86.46% | 87.67% | 88.02% | 88.02% |
| Flat_Content | 93.47% | 96.74% | 95.47% | 95.24% | 96.74% |
| Gaming | 66.06% | 87.88% | 88.00% | 88.48% | 88.48% |
| Natural_Content | 96.69% | 97.52% | 96.38% | 95.77% | 97.52% |
These per-class F1 scores reflect validation set performance at the best checkpoint epoch for each model, not final test-set performance. The ensemble + TTA figures reported on the Performance Metrics page are higher because they combine all four models at inference time.
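The extraction step described above can be sketched as follows. The labels and predictions here are illustrative stand-ins, not project data; class indices follow the table order (0 = Animation, 1 = Flat_Content, 2 = Gaming, 3 = Natural_Content):

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Illustrative labels and predictions, not real evaluation data.
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])
y_pred = np.array([0, 1, 1, 1, 2, 3, 3, 3])

# average=None returns one score per class instead of a single aggregate.
_, _, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1, 2, 3], average=None, zero_division=0
)
f1_pct = f1 * 100  # scaled to percentages, as in the table

for name, score in zip(
    ["Animation", "Flat_Content", "Gaming", "Natural_Content"], f1_pct
):
    print(f"{name}: {score:.2f}%")
```

With average=None the function returns per-class arrays, which is what allows a per-class breakdown like the table above rather than a single weighted score.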
Model 1 Animation Anomaly
Model 1 produced an Animation F1 of 1.10% — effectively a complete failure on that class — while achieving reasonable F1 scores for the other three categories (Flat_Content 93.47%, Gaming 66.06%, Natural_Content 96.69%). This behavior illustrates why ensemble diversity matters.
Why Model 1 still adds value to the ensemble
Even though Model 1 collapsed on Animation, its predictions for Flat_Content (93.47%) and Natural_Content (96.69%) are competitive with the other models. When the four models’ softmax probabilities are averaged, Models 2–4 dominate for Animation samples (driving the Animation probability high), while Model 1 still contributes accurate signal for the classes it handles well.

Removing Model 1 entirely would slightly reduce confidence on Flat_Content and Natural_Content predictions. The ensemble approach tolerates individual model weaknesses as long as at least one model produces a confident correct prediction per class.
The train_ensemble function in model_train_new.py trains each model with a different random seed.
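A minimal sketch of that seed-varied loop is shown below. The real train_ensemble trains full networks; train_single_model here is a hypothetical stand-in that just returns seed-dependent values:

```python
import random

import numpy as np

def train_single_model(seed):
    # Placeholder for the real per-model training; a real implementation
    # would build and fit a network. Here we only return seed-keyed state.
    rng = np.random.default_rng(seed)
    return {"seed": seed, "weights": rng.normal(size=4)}

def train_ensemble(num_models=4, base_seed=42):
    models = []
    for i in range(num_models):
        seed = base_seed + i  # each ensemble member gets a distinct seed
        # Different seeds give each member different initialization and
        # batch order, which is the source of ensemble diversity.
        random.seed(seed)
        np.random.seed(seed)
        models.append(train_single_model(seed))
    return models

ensemble = train_ensemble()
print([m["seed"] for m in ensemble])  # [42, 43, 44, 45]
```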
Class Difficulty Analysis
| Class | Difficulty | Reason |
|---|---|---|
| Natural_Content | Easiest | Strong texture/motion cues; best F1 across all models (95–97%) |
| Flat_Content | Easy | Low motion, consistent color palettes; F1 93–97% across models |
| Gaming | Moderate | Fast motion + HUD overlaps with other categories in some games |
| Animation | Hardest | Highly varied styles from hand-drawn to 3D CGI; prone to collapse |
Why the Ensemble Helps
The key mechanism is probability averaging. For each test video, each model outputs a softmax probability vector over the four classes. The ensemble averages these four vectors before taking the argmax. For an Animation sample:

- Model 1 assigns near-zero probability to Animation (≈0.01)
- Models 2, 3, 4 each assign ~0.87 probability to Animation
- Averaged: (0.01 + 0.87 + 0.87 + 0.87) / 4 ≈ 0.655 — still the highest class
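The averaging step above can be reproduced in a few lines. The probability vectors are illustrative values chosen to match the example, not actual model outputs:

```python
import numpy as np

# Soft voting: average four models' softmax vectors, then take the argmax.
CLASSES = ["Animation", "Flat_Content", "Gaming", "Natural_Content"]

probs = np.array([
    [0.01, 0.40, 0.30, 0.29],  # Model 1: collapsed on Animation
    [0.87, 0.05, 0.05, 0.03],  # Model 2
    [0.87, 0.06, 0.04, 0.03],  # Model 3
    [0.87, 0.04, 0.06, 0.03],  # Model 4
])

avg = probs.mean(axis=0)
pred = CLASSES[int(avg.argmax())]
print(round(avg[0], 3), pred)  # 0.655 Animation
```

Even though Model 1 spreads its probability over the wrong classes, the three agreeing models keep the averaged Animation probability well above the alternatives.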
Training Visualization of Per-Class Metrics
The RealTimeTrainingVisualizer tracks per-class F1 throughout training and renders it in two ways per epoch:
- F1 progression plot — time-series of weighted F1 for train and validation sets
- Per-class F1 bar chart — current epoch’s per-class scores with mean line
Per-class F1 values are stored in history['val_per_class_f1'] as a list of lists (one inner list per epoch) and exported to training_metrics.csv with columns class_0_f1 through class_3_f1.
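A sketch of that export, assuming the history layout described above (the epoch values are illustrative, and the exact CSV schema of the real exporter may differ):

```python
import csv

# history['val_per_class_f1']: one inner list of four F1 values per epoch.
history = {"val_per_class_f1": [
    [0.011, 0.935, 0.661, 0.967],  # epoch 1 (illustrative values)
    [0.015, 0.940, 0.670, 0.969],  # epoch 2
]}

with open("training_metrics.csv", "w", newline="") as f:
    writer = csv.writer(f)
    # Columns class_0_f1 through class_3_f1, as described in the text.
    writer.writerow(["epoch"] + [f"class_{i}_f1" for i in range(4)])
    for epoch, scores in enumerate(history["val_per_class_f1"], start=1):
        writer.writerow([epoch] + scores)
```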
Recommendations for Improvement
More Animation training data
The Animation class is the primary bottleneck. Expanding the Animation subset of the YouTube-8M training split — particularly with diverse styles (anime, 3D CGI, stop-motion, motion graphics) — would provide the model with more discriminative features and reduce sensitivity to training-run variance.

Consider subcategory-stratified sampling to ensure all animation styles are represented proportionally across the train/val splits.
Class-specific TTA strategies
The current TTA applies the same four modes (original, reverse, speed_up, speed_down) uniformly across all classes. For Animation specifically, additional temporal augmentations — such as random subsequence sampling or resampled frame rates — might be more informative than simple speed changes.

A class-conditional TTA could apply stronger augmentation to classes with historically lower confidence scores, while keeping inference fast for easy classes like Natural_Content.
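One way this could look is sketched below. The mode names match the four listed in the text; the model is stubbed as a function returning a softmax vector, and the class-conditional dispatch is a hypothetical design, not the project's current implementation:

```python
import numpy as np

CLASSES = ["Animation", "Flat_Content", "Gaming", "Natural_Content"]
EASY = {"Flat_Content", "Natural_Content"}  # historically high-confidence

def apply_mode(frames, mode):
    if mode == "reverse":
        return frames[::-1]
    if mode == "speed_up":
        return frames[::2]                    # drop every other frame
    if mode == "speed_down":
        return np.repeat(frames, 2, axis=0)   # duplicate frames
    return frames                             # "original"

def class_conditional_tta(model, frames):
    base = model(frames)
    if CLASSES[int(base.argmax())] in EASY:
        return base                           # fast path: single forward pass
    # Hard classes get the full set of temporal augmentations, averaged.
    modes = ["original", "reverse", "speed_up", "speed_down"]
    return np.mean([model(apply_mode(frames, m)) for m in modes], axis=0)

# Usage with a stub model that ignores its input:
stub = lambda frames: np.array([0.85, 0.05, 0.05, 0.05])
print(class_conditional_tta(stub, np.zeros((8, 3))))
```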
Targeted fine-tuning on Animation
After initial ensemble training, an additional fine-tuning stage focused on Animation (e.g., a higher class weight in Focal Loss, or a second training phase with Animation-oversampled batches) could close the remaining gap without retraining all models from scratch.

The existing WeightedRandomSampler infrastructure already supports this — increasing the class weight for Animation in _compute_class_weights() would bias sampling toward Animation samples.
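The effect of boosting a class weight can be sketched in plain NumPy (this mirrors the role WeightedRandomSampler and _compute_class_weights play, but is not the project's code; the counts and boost factor are illustrative):

```python
import numpy as np

# Illustrative label distribution; class 0 = Animation is the rarest.
labels = np.array([0] * 10 + [1] * 40 + [2] * 25 + [3] * 25)

def compute_class_weights(labels, animation_cls=0, boost=2.0):
    counts = np.bincount(labels)
    weights = 1.0 / counts            # inverse-frequency base weights
    weights[animation_cls] *= boost   # extra boost for Animation
    return weights

w = compute_class_weights(labels)[labels]   # per-sample weights
p = w / w.sum()

rng = np.random.default_rng(0)
drawn = rng.choice(len(labels), size=10_000, p=p)
frac = np.bincount(labels[drawn]) / 10_000
print(frac)  # Animation drawn far more often than its 10% share
```

With inverse-frequency weights each class would be drawn equally; the extra boost factor then tilts sampling further toward Animation.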
Larger ensemble or stacking
The current four-model ensemble averages softmax probabilities (soft voting). Training a lightweight meta-learner (stacking) on the four models’ validation outputs could learn to weight Model 1’s contributions by class — giving it near-zero weight for Animation and higher weight for Natural_Content and Flat_Content.
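A stacking meta-learner of this kind could be as simple as a logistic regression fit on the concatenated validation softmax outputs. The sketch below uses random placeholder probabilities and labels purely to show the shapes involved; it is not trained on real project outputs:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_val, n_models, n_classes = 200, 4, 4

# Placeholder: each model's softmax vector per validation sample.
val_probs = rng.dirichlet(np.ones(n_classes), size=(n_val, n_models))
X = val_probs.reshape(n_val, n_models * n_classes)  # (200, 16) meta-features
y = rng.integers(0, n_classes, size=n_val)          # placeholder labels

# The meta-learner can assign per-class weights to each base model's
# outputs, e.g. down-weighting Model 1's Animation column.
meta = LogisticRegression(max_iter=1000)
meta.fit(X, y)
print(meta.predict(X[:5]))
```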