Overview
The NVIDIA Video Classification Project achieves strong results on a four-class video classification task. Metrics are computed using scikit-learn on held-out test data after training on YouTube-8M–derived clips across Animation, Flat_Content, Gaming, and Natural_Content.

Standard Ensemble
~93% test accuracy. Four-model ensemble inference without augmentation. Weighted F1 score of approximately 92%.
Ensemble + TTA
~95% test accuracy. Four TTA augmentation modes applied on top of ensemble averaging. Meets the project's target accuracy.
Best Individual Model
92.13% val accuracy. Model 2 (checkpoint epoch 43) achieved the highest single-model validation accuracy.
Weighted F1 (Ensemble)
~92% weighted F1. Computed with `sklearn.metrics.f1_score(average='weighted')` at the ensemble level.

Test Set Configuration
All evaluation figures reported on this page use the held-out test split, whose features were pre-extracted to `test_features_multiscale.h5`.
| Property | Value |
|---|---|
| Feature file | test_features_multiscale.h5 |
| Number of videos | 412 |
| Feature dimension | 1280 |
| Max frames per video | 73 |
| Multi-scale extraction | True |
| Features tensor shape | [412, 73, 1280] |
| Category mapping | {"Animation": 0, "Flat_Content": 1, "Gaming": 2, "Natural_Content": 3} |
Multi-scale extraction averages features from three temporal scales (1.0×, 0.85×, 1.15×) per video. This enriches the feature representation and is one reason the 1280-dim EfficientNet-V2-S backbone generalizes well across diverse content types.
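The averaging step described above can be sketched as follows. This is an illustrative reconstruction, not the project's extraction code: the exact resampling method is an assumption (linear interpolation along the time axis), and `multiscale_average` is a hypothetical name.

```python
import torch
import torch.nn.functional as F

def multiscale_average(frames: torch.Tensor,
                       scales=(1.0, 0.85, 1.15)) -> torch.Tensor:
    """Average per-frame features over several temporal scales.

    `frames` is a [T, D] feature sequence. Each scale resamples the
    sequence to round(T * scale) steps, then interpolates back to T
    steps so all variants align before averaging.
    """
    T, D = frames.shape
    variants = []
    for s in scales:
        n = max(1, round(T * s))
        # F.interpolate expects [N, C, L]; treat features as channels.
        x = frames.t().unsqueeze(0)                      # [1, D, T]
        resampled = F.interpolate(x, size=n, mode="linear",
                                  align_corners=False)   # [1, D, n]
        back = F.interpolate(resampled, size=T, mode="linear",
                             align_corners=False)        # [1, D, T]
        variants.append(back.squeeze(0).t())             # [T, D]
    return torch.stack(variants).mean(dim=0)
```

The output keeps the original `[T, D]` shape, so downstream code is unaffected by the multi-scale enrichment.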
Metrics Definitions
Accuracy
Standard top-1 accuracy: the fraction of test videos whose predicted class matches the ground-truth label.

Weighted F1 Score
Scikit-learn's `f1_score` with `average='weighted'` weights each class's F1 by its support (number of true instances). This accounts for class imbalance and is the primary aggregate metric used throughout training.
Per-Class F1
Per-class precision, recall, and F1 are extracted via `precision_recall_fscore_support` with `average=None`, giving one score per category. These values diagnose which content types are hardest to classify.
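A small worked example of both metric calls, using hypothetical labels (the data is invented for illustration; the class indices follow the category mapping above):

```python
from sklearn.metrics import f1_score, precision_recall_fscore_support

# Hypothetical predictions for the four classes:
# 0=Animation, 1=Flat_Content, 2=Gaming, 3=Natural_Content
y_true = [0, 0, 1, 1, 2, 2, 3, 3, 3]
y_pred = [0, 1, 1, 1, 2, 2, 3, 3, 0]

# Aggregate metric: each class's F1 weighted by its support.
weighted_f1 = f1_score(y_true, y_pred, average="weighted")

# Per-class diagnostics: one precision/recall/F1/support per category.
prec, rec, f1, support = precision_recall_fscore_support(
    y_true, y_pred, average=None, labels=[0, 1, 2, 3])

print(f"weighted F1: {weighted_f1:.3f}")        # 7/9 ~ 0.778
print("per-class F1:", [round(v, 3) for v in f1])  # [0.5, 0.8, 1.0, 0.8]
```

The per-class vector immediately shows which category (here, the hypothetical class 0) drags the aggregate score down.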
Metrics Computation Code
The `compute_metrics` method in `EnhancedTemporalModelTrainer` centralizes all metric calculations. It is called identically during training (per epoch) and during final test evaluation.
The method accepts raw logit tensors (`outputs`) and integer label tensors. Taking the argmax over the class dimension (the indices returned by `outputs.max(1)`) produces the predicted class index. Both `f1_weighted` and `f1_per_class` are returned as percentage values (multiplied by 100).

Validation Metrics Table
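The description above can be sketched as follows. This is an illustrative re-creation under the stated behavior, not the project's actual `compute_metrics` source:

```python
import torch
from sklearn.metrics import f1_score

def compute_metrics_sketch(outputs: torch.Tensor,
                           labels: torch.Tensor) -> dict:
    """Metrics from raw logits, as described in the text.

    `outputs` holds logits of shape [N, num_classes]; `labels` holds
    integer class indices of shape [N]. Tensor.max(1) returns a
    (values, indices) pair; the indices are the argmax predictions.
    """
    _, preds = outputs.max(1)               # predicted class per video
    preds_np = preds.cpu().numpy()
    labels_np = labels.cpu().numpy()
    acc = (preds == labels).float().mean().item() * 100
    f1_weighted = f1_score(labels_np, preds_np, average="weighted") * 100
    f1_per_class = f1_score(labels_np, preds_np, average=None) * 100
    return {"accuracy": acc,
            "f1_weighted": f1_weighted,
            "f1_per_class": f1_per_class.tolist()}
```

All three values come back as percentages, matching the convention used in the tables on this page.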
Four independent models were trained with different random seeds (seeds 42–45). Each checkpoint was saved at the epoch where validation accuracy peaked. The table below reflects the saved `best_val_acc` and corresponding metrics stored in each checkpoint.
| Model | Checkpoint | Best Epoch | Val Accuracy | Weighted F1 |
|---|---|---|---|---|
| Model 1 | best_ensemble_model_1.pt | 52 | 72.73% | 64.76% |
| Model 2 | best_ensemble_model_2.pt | 43 | 92.13% | 92.15% |
| Model 3 | best_ensemble_model_3.pt | 42 | 91.86% | 91.87% |
| Model 4 | best_ensemble_model_4.pt | 40 | 91.86% | 91.88% |
Model Hyperparameters

| Hyperparameter | Value |
|---|---|
| Feature dimension | 1280 |
| Hidden dimension | 768 |
| Num classes | 4 |
| LSTM layers | 4 |
| Attention heads | 12 |
| Dropout | 0.4 |
| Bidirectional | True |
| Learning rate | 0.001 |
| Batch size | 48 |
| Max epochs | 150 |
| Early-stop patience | 25 |
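The hyperparameter table maps onto a bidirectional-LSTM-plus-attention architecture. The sketch below wires those values together as one plausible reading; the class name, layer layout, and pooling choice are assumptions, not the project's actual model definition:

```python
import torch
import torch.nn as nn

class TemporalClassifierSketch(nn.Module):
    """Hypothetical model assembled from the hyperparameter table."""

    def __init__(self, feature_dim=1280, hidden_dim=768, num_classes=4,
                 lstm_layers=4, num_heads=12, dropout=0.4):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim,
                            num_layers=lstm_layers, batch_first=True,
                            bidirectional=True, dropout=dropout)
        # Bidirectionality doubles the feature size: 2 * 768 = 1536,
        # which divides evenly across 12 heads (128 dims per head).
        self.attn = nn.MultiheadAttention(2 * hidden_dim, num_heads,
                                          dropout=dropout, batch_first=True)
        self.head = nn.Sequential(nn.Dropout(dropout),
                                  nn.Linear(2 * hidden_dim, num_classes))

    def forward(self, x):                       # x: [B, T, feature_dim]
        seq, _ = self.lstm(x)                   # [B, T, 2*hidden_dim]
        attended, _ = self.attn(seq, seq, seq)  # self-attention over time
        pooled = attended.mean(dim=1)           # temporal average pooling
        return self.head(pooled)                # [B, num_classes] logits
```

With the defaults, a batch of `[B, 73, 1280]` multi-scale features produces `[B, 4]` logits, one per content category.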
Effect of Test-Time Augmentation
TTA runs each test video through the same ensemble four times, each time with a different temporal transformation. The four softmax probability vectors are averaged before taking the argmax.
| Mode | Description | Effect |
|---|---|---|
| None (original) | Sequence used as-is | Baseline prediction |
| reverse | `torch.flip(features, dims=[0])` | Model sees frames in reverse temporal order |
| speed_up | Sample every 2nd frame (`num_frames // 2` indices) | Compressed temporal view |
| speed_down | Upsample to `num_frames × 1.5` via `linspace` | Stretched temporal view |
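Assuming `features` is a `[T, D]` feature tensor, the four modes in the table can be sketched as below; where the table only names an operation, the resampling details are assumptions:

```python
import torch

def tta_variants(features: torch.Tensor) -> list:
    """Build the four TTA views of a [T, D] feature sequence."""
    num_frames = features.shape[0]
    variants = [features]                              # original
    variants.append(torch.flip(features, dims=[0]))    # reverse
    # speed_up: every 2nd frame, truncated to num_frames // 2 indices
    variants.append(features[::2][: num_frames // 2])
    # speed_down: upsample to num_frames * 1.5 via linspace indexing
    idx = torch.linspace(0, num_frames - 1,
                         int(num_frames * 1.5)).long()
    variants.append(features[idx])
    return variants
```

In the full pipeline each variant's softmax output would be averaged with the others (and across the four models) before the final argmax.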
Training Convergence
All four models were trained for up to 150 epochs with early stopping (patience = 25). Models 2, 3, and 4 converged in the early-to-mid 40s, indicating stable training dynamics with the Cosine Annealing with Warm Restarts scheduler.

Training setup details
- Loss function: Focal Loss with label smoothing (`gamma=2.0`, `smoothing=0.1`) plus class-weighted alpha
- Optimizer: AdamW (`weight_decay=5e-4`, `betas=(0.9, 0.999)`)
- Scheduler: `CosineAnnealingWarmRestarts` (`T_0=20`, `T_mult=2`, `eta_min=1e-6`)
- Regularization: Dropout (0.4), gradient clipping (`max_norm=1.0`), weight decay
- Balanced sampling: `WeightedRandomSampler` ensures all classes are seen equally during training
- Target accuracy: 95% (achieved with TTA on the ensemble)
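The list above can be wired together as in the sketch below. The focal-loss body is a common formulation filled in from the listed parameters, not the project's exact code, and the `nn.Linear` model is a stand-in so the snippet runs on its own:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLossWithSmoothing(nn.Module):
    """Focal loss with label smoothing (illustrative formulation)."""

    def __init__(self, gamma=2.0, smoothing=0.1, alpha=None):
        super().__init__()
        self.gamma = gamma
        self.smoothing = smoothing
        self.alpha = alpha  # optional per-class weight tensor

    def forward(self, logits, targets):
        n_classes = logits.shape[1]
        log_probs = F.log_softmax(logits, dim=1)
        # Smoothed one-hot targets: true class gets 1 - smoothing,
        # the other classes share the remaining probability mass.
        smooth = torch.full_like(log_probs,
                                 self.smoothing / (n_classes - 1))
        smooth.scatter_(1, targets.unsqueeze(1), 1.0 - self.smoothing)
        # Focal term down-weights easy (high-confidence) examples.
        pt = log_probs.exp().gather(1, targets.unsqueeze(1)).squeeze(1)
        loss = -(smooth * log_probs).sum(dim=1) * (1.0 - pt) ** self.gamma
        if self.alpha is not None:
            loss = loss * self.alpha[targets]
        return loss.mean()

model = nn.Linear(1280, 4)  # stand-in for the real temporal model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3,
                              weight_decay=5e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=20, T_mult=2, eta_min=1e-6)

# One illustrative step with gradient clipping, as listed above.
criterion = FocalLossWithSmoothing(gamma=2.0, smoothing=0.1)
loss = criterion(model(torch.randn(8, 1280)), torch.randint(0, 4, (8,)))
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
scheduler.step()
```

With `T_0=20` and `T_mult=2`, warm restarts occur at epochs 20, 60, and 140, which fits inside the 150-epoch budget.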
`RealTimeTrainingVisualizer` generates per-epoch PNG dashboards covering loss curves, accuracy curves, F1 progression, per-class F1 bar charts, and the learning rate schedule. All plots are saved under `results/training_progress/epoch_plots/`. Accuracy plots include a 95% target line to make convergence progress visually clear.