
Overview

The NVIDIA Video Classification Project achieves strong results on a four-class video classification task. Metrics are computed using scikit-learn on held-out test data after training on YouTube-8M–derived clips across Animation, Flat_Content, Gaming, and Natural_Content.

Standard Ensemble

~93% test accuracy. Four-model ensemble inference without augmentation; weighted F1 score of approximately 92%.

Ensemble + TTA

~95% test accuracy. Four TTA augmentation modes applied on top of ensemble averaging; meets the project’s target accuracy.

Best Individual Model

92.13% val accuracy. Model 2 (checkpoint epoch 43) achieved the highest single-model validation accuracy.

Weighted F1 (Ensemble)

~92% weighted F1. Computed with sklearn.metrics.f1_score(average='weighted') at the ensemble level.

Test Set Configuration

All evaluation figures reported on this page use the held-out test split whose features were pre-extracted to test_features_multiscale.h5.
Property                 Value
Feature file             test_features_multiscale.h5
Number of videos         412
Feature dimension        1280
Max frames per video     73
Multi-scale extraction   True
Features tensor shape    [412, 73, 1280]
Category mapping         {"Animation": 0, "Flat_Content": 1, "Gaming": 2, "Natural_Content": 3}
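As a sketch, the test features could be loaded with h5py along these lines. The dataset keys "features" and "labels" are assumptions for illustration, not a confirmed schema of test_features_multiscale.h5; the demo writes a tiny stand-in file with the documented per-video shape:

```python
import os
import numpy as np
import h5py

# Hypothetical schema: dataset names "features" and "labels" are assumptions.
# Write a tiny stand-in file (2 videos) with the documented [frames, dim] shape.
with h5py.File("demo_features.h5", "w") as f:
    f.create_dataset("features", data=np.zeros((2, 73, 1280), dtype=np.float32))
    f.create_dataset("labels", data=np.array([0, 2], dtype=np.int64))

with h5py.File("demo_features.h5", "r") as f:
    features = f["features"][:]   # shape [num_videos, max_frames, feature_dim]
    labels = f["labels"][:]

os.remove("demo_features.h5")     # clean up the demo file
print(features.shape)             # (2, 73, 1280)
```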
Multi-scale extraction averages features from three temporal scales (1.0×, 0.85×, 1.15×) per video. This enriches the feature representation and is one reason the 1280-dim EfficientNet-V2-S backbone generalizes well across diverse content types.
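Conceptually, the multi-scale step can be sketched as resampling each clip's timeline at the three scales and averaging the resulting per-frame features. The helper below is illustrative only: the function name is hypothetical and a random numpy array stands in for the real EfficientNet-V2-S embeddings.

```python
import numpy as np

def multiscale_average(frame_features, scales=(1.0, 0.85, 1.15)):
    """Average features resampled at several temporal scales.

    frame_features: [num_frames, feature_dim] array (stand-in for
    per-frame backbone embeddings). Illustrative sketch only.
    """
    num_frames, _ = frame_features.shape
    resampled = []
    for scale in scales:
        # Resample the timeline at this scale...
        n = max(1, int(round(num_frames * scale)))
        idx = np.linspace(0, num_frames - 1, n).round().astype(int)
        scaled = frame_features[idx]
        # ...then map back to the original length so scales can be averaged.
        back = np.linspace(0, n - 1, num_frames).round().astype(int)
        resampled.append(scaled[back])
    return np.mean(resampled, axis=0)  # [num_frames, feature_dim]

feats = np.random.rand(73, 1280).astype(np.float32)
avg = multiscale_average(feats)
print(avg.shape)  # (73, 1280)
```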

Metrics Definitions

Accuracy

Standard top-1 accuracy: the fraction of test videos whose predicted class matches the ground-truth label.
accuracy = correct_predictions / total_predictions × 100
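Applied to a toy batch of eight predictions, the formula reads:

```python
# Top-1 accuracy as defined above, on a toy batch (not project data).
predicted = [0, 1, 2, 3, 0, 1, 2, 3]
labels    = [0, 1, 2, 3, 0, 1, 2, 0]  # last prediction is wrong

correct = sum(p == y for p, y in zip(predicted, labels))
accuracy = 100.0 * correct / len(labels)
print(accuracy)  # 87.5
```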

Weighted F1 Score

Scikit-learn’s f1_score with average='weighted' weights each class’s F1 by its support (number of true instances). This accounts for class imbalance and is the primary aggregate metric used throughout training.
from sklearn.metrics import f1_score
avg_f1 = f1_score(labels_np, predicted_np, average='weighted', zero_division=0)
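To make the support-weighting concrete, the sketch below recomputes the weighted F1 by hand from the per-class scores and checks it against scikit-learn, using toy labels rather than project data:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_fscore_support

# Toy 4-class labels/predictions (not project data).
labels_np = np.array([0, 0, 0, 1, 1, 2, 2, 3])
predicted_np = np.array([0, 0, 1, 1, 1, 2, 3, 3])

# Per-class F1 and support (number of true instances per class).
_, _, f1_per_class, support = precision_recall_fscore_support(
    labels_np, predicted_np, labels=[0, 1, 2, 3],
    average=None, zero_division=0
)

# Weighted F1 = sum over classes of F1_c * support_c / total_support.
manual = float(np.sum(f1_per_class * support) / support.sum())
sklearn_value = f1_score(labels_np, predicted_np,
                         average='weighted', zero_division=0)
assert abs(manual - sklearn_value) < 1e-12
print(round(manual, 4))  # 0.75
```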

Per-Class F1

Per-class precision, recall, and F1 are extracted via precision_recall_fscore_support with average=None, giving one score per category. These values diagnose which content types are hardest to classify.
from sklearn.metrics import precision_recall_fscore_support
precision, recall, f1, _ = precision_recall_fscore_support(
    labels_np, predicted_np,
    labels=list(range(num_classes)),
    average=None,
    zero_division=0
)
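Because the per-class arrays follow the category mapping's index order, they can be paired with class names for reporting. A small sketch with toy values (the category_names list mirrors the mapping documented above; the data is illustrative):

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

category_names = ["Animation", "Flat_Content", "Gaming", "Natural_Content"]

# Toy labels/predictions for illustration (not project data).
labels_np = np.array([0, 1, 2, 3, 0, 1, 2, 3])
predicted_np = np.array([0, 1, 2, 3, 1, 1, 2, 3])

precision, recall, f1, support = precision_recall_fscore_support(
    labels_np, predicted_np,
    labels=list(range(len(category_names))),
    average=None, zero_division=0
)

# One F1 per category, keyed by class name.
report = {name: round(float(score), 3) for name, score in zip(category_names, f1)}
print(report)
```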

Metrics Computation Code

The compute_metrics method in EnhancedTemporalModelTrainer centralizes all metric calculations. It is called identically during training (per epoch) and final test evaluation.
def compute_metrics(self, outputs, labels):
    _, predicted = outputs.max(1)

    correct = predicted.eq(labels).sum().item()
    accuracy = 100. * correct / labels.size(0)

    labels_np = labels.cpu().numpy()
    predicted_np = predicted.cpu().numpy()

    precision, recall, f1, _ = precision_recall_fscore_support(
        labels_np, predicted_np,
        labels=list(range(self.num_classes)),
        average=None, zero_division=0
    )

    avg_f1 = f1_score(labels_np, predicted_np, average='weighted', zero_division=0)

    return {
        'accuracy': accuracy,
        'f1_weighted': avg_f1 * 100,
        'f1_per_class': f1 * 100
    }
The method accepts raw logit tensors (outputs) and integer label tensors. The argmax over class dimension (outputs.max(1)) produces the predicted class index. Both f1_weighted and f1_per_class are returned as percentage values (multiplied by 100).

Validation Metrics Table

Four independent models were trained with different random seeds (seeds 42–45). Each checkpoint was saved at the epoch where validation accuracy peaked. The table below reflects the saved best_val_acc and corresponding metrics stored in each checkpoint.
Model     Checkpoint                 Best Epoch   Val Accuracy   Weighted F1
Model 1   best_ensemble_model_1.pt   52           72.73%         64.76%
Model 2   best_ensemble_model_2.pt   43           92.13%         92.15%
Model 3   best_ensemble_model_3.pt   42           91.86%         91.87%
Model 4   best_ensemble_model_4.pt   40           91.86%         91.88%
Model 1 shows significantly lower metrics (72.73% accuracy, 64.76% F1) compared to models 2–4. This is caused by a near-total collapse on the Animation class during that training run (Animation F1: 1.10%). Despite this, Model 1 still contributes useful signal for non-Animation classes within the ensemble. See the Per-Class Results page for a detailed breakdown.
Shared model configuration across all four checkpoints:
Hyperparameter        Value
Feature dimension     1280
Hidden dimension      768
Num classes           4
LSTM layers           4
Attention heads       12
Dropout               0.4
Bidirectional         True
Learning rate         0.001
Batch size            48
Max epochs            150
Early-stop patience   25

Effect of Test-Time Augmentation

TTA runs each test video through the same ensemble four times, each time with a different temporal transformation. The four softmax probability vectors are averaged before taking the argmax.
Mode              Description                                        Effect
None (original)   Sequence used as-is                                Baseline prediction
reverse           torch.flip(features, dims=[0])                     Model sees frames in reverse temporal order
speed_up          Sample every 2nd frame (num_frames // 2 indices)   Compressed temporal view
speed_down        Upsample to num_frames × 1.5 via linspace          Stretched temporal view
TTA requires no additional training. The four augmentation modes are applied at inference time only. For single-video classification via the Flask deployment, TTA can be toggled with use_tta=True in SingleVideoClassifier.classify_video().
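The four modes and the averaging step can be sketched in numpy as follows. This is a stand-in for the torch-based pipeline: arrays replace tensors, and a dummy probability function replaces the trained ensemble.

```python
import numpy as np

def tta_variants(features):
    """Return the four TTA views of a [num_frames, feature_dim] sequence.

    Numpy stand-in for the torch-based pipeline described above.
    """
    num_frames = features.shape[0]
    original = features
    reverse = features[::-1]                          # torch.flip analogue
    speed_up = features[np.arange(0, num_frames, 2)]  # every 2nd frame
    idx = np.linspace(0, num_frames - 1,
                      int(num_frames * 1.5)).round().astype(int)
    speed_down = features[idx]                        # stretched temporal view
    return [original, reverse, speed_up, speed_down]

def tta_predict(predict_proba, features):
    """Average the softmax probabilities over the four views, then argmax."""
    probs = [predict_proba(view) for view in tta_variants(features)]
    return int(np.mean(probs, axis=0).argmax())

# Dummy "model" for the sketch: normalized scores from the first 4 feature dims.
def dummy_predict_proba(view):
    scores = np.abs(view.mean(axis=0)[:4]) + 1e-8
    return scores / scores.sum()

features = np.random.rand(73, 1280)
print(tta_predict(dummy_predict_proba, features))
```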

Training Convergence

All four models were trained for up to 150 epochs with early stopping (patience = 25). Models 2, 3, and 4 reached their best validation accuracy at epochs 40-43, indicating stable training dynamics under the CosineAnnealingWarmRestarts scheduler.
  • Loss function: Focal Loss with label smoothing (gamma=2.0, smoothing=0.1) plus class-weighted alpha
  • Optimizer: AdamW (weight_decay=5e-4, betas=(0.9, 0.999))
  • Scheduler: CosineAnnealingWarmRestarts (T_0=20, T_mult=2, eta_min=1e-6)
  • Regularization: Dropout (0.4), gradient clipping (max_norm=1.0), weight decay
  • Balanced sampling: WeightedRandomSampler ensures all classes are seen equally during training
  • Target accuracy: 95% (achieved with TTA on the ensemble)
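As a rough numpy illustration of the loss (not the project's torch implementation, which may differ in its smoothing convention), focal loss down-weights well-classified examples by the factor (1 - p_t)^gamma and here applies label smoothing to the cross-entropy term:

```python
import numpy as np

def focal_loss(probs, targets, gamma=2.0, smoothing=0.1, alpha=None):
    """Numpy sketch of focal loss with label smoothing.

    probs:   [batch, num_classes] softmax probabilities
    targets: [batch] integer class labels
    alpha:   optional [num_classes] per-class weights
    Illustrative only; details of the project's torch version may differ.
    """
    batch, num_classes = probs.shape
    # Smooth the one-hot targets: confidence on the true class, rest uniform.
    smooth = np.full((batch, num_classes), smoothing / (num_classes - 1))
    smooth[np.arange(batch), targets] = 1.0 - smoothing
    log_p = np.log(np.clip(probs, 1e-12, 1.0))
    p_t = probs[np.arange(batch), targets]   # probability of the true class
    focal_weight = (1.0 - p_t) ** gamma      # down-weight easy examples
    ce = -(smooth * log_p).sum(axis=1)       # smoothed cross-entropy
    if alpha is not None:
        ce = ce * np.asarray(alpha)[targets]
    return float((focal_weight * ce).mean())

probs = np.array([[0.9, 0.05, 0.03, 0.02],    # confident, correct
                  [0.25, 0.25, 0.25, 0.25]])  # maximally uncertain
targets = np.array([0, 1])
print(focal_loss(probs, targets))
```

The confident example contributes far less loss than the uncertain one, which is the point of the focal term.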
The RealTimeTrainingVisualizer generates per-epoch PNG dashboards covering loss curves, accuracy curves, F1 progression, per-class F1 bar charts, and the learning rate schedule. All plots are saved under results/training_progress/epoch_plots/.
The training script includes a 95% target line on accuracy plots to make convergence progress visually clear:
# Add 95% target line to accuracy plot
ax.axhline(y=95, color='purple', linestyle=':', linewidth=2,
           alpha=0.7, label='Target (95%)')
With the four-model ensemble and TTA, the project meets its stated 95% test-accuracy target.
