
Overview

The NVIDIA Video Classification Project achieves strong results on a four-class video classification task. Metrics are computed using scikit-learn on held-out test data after training on YouTube-8M–derived clips across Animation, Flat_Content, Gaming, and Natural_Content.

Standard Ensemble

~93% test accuracy. Four-model ensemble inference without augmentation; weighted F1 score of approximately 92%.

Ensemble + TTA

~95% test accuracy. Four TTA augmentation modes applied on top of ensemble averaging; meets the project’s target accuracy.

Best Individual Model

92.13% val accuracy. Model 2 (checkpoint epoch 43) achieved the highest single-model validation accuracy.

Weighted F1 (Ensemble)

~92% weighted F1. Computed with sklearn.metrics.f1_score(average='weighted') at the ensemble level.

Test Set Configuration

All evaluation figures reported on this page use the held-out test split whose features were pre-extracted to test_features_multiscale.h5.
Property                 Value
Feature file             test_features_multiscale.h5
Number of videos         412
Feature dimension        1280
Max frames per video     73
Multi-scale extraction   True
Features tensor shape    [412, 73, 1280]
Category mapping         {"Animation": 0, "Flat_Content": 1, "Gaming": 2, "Natural_Content": 3}
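As a sketch, the test features could be loaded with h5py along these lines. The dataset keys "features" and "labels" are assumptions for illustration, not a confirmed schema of test_features_multiscale.h5; the demo writes a tiny stand-in file with the documented per-video shape:

```python
import os
import numpy as np
import h5py

# Hypothetical schema: dataset names "features" and "labels" are assumptions.
# Write a tiny stand-in file (2 videos) with the documented [frames, dim] shape.
with h5py.File("demo_features.h5", "w") as f:
    f.create_dataset("features", data=np.zeros((2, 73, 1280), dtype=np.float32))
    f.create_dataset("labels", data=np.array([0, 2], dtype=np.int64))

with h5py.File("demo_features.h5", "r") as f:
    features = f["features"][:]   # shape [num_videos, max_frames, feature_dim]
    labels = f["labels"][:]

os.remove("demo_features.h5")     # clean up the demo file
print(features.shape)             # (2, 73, 1280)
```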
Multi-scale extraction averages features from three temporal scales (1.0×, 0.85×, 1.15×) per video. This enriches the feature representation and is one reason the 1280-dim EfficientNet-V2-S backbone generalizes well across diverse content types.
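Conceptually, the multi-scale step can be sketched as resampling each clip's timeline at the three scales and averaging the resulting per-frame features. The helper below is illustrative only: the function name is hypothetical and a random numpy array stands in for the real EfficientNet-V2-S embeddings.

```python
import numpy as np

def multiscale_average(frame_features, scales=(1.0, 0.85, 1.15)):
    """Average features resampled at several temporal scales.

    frame_features: [num_frames, feature_dim] array (stand-in for
    per-frame backbone embeddings). Illustrative sketch only.
    """
    num_frames, _ = frame_features.shape
    resampled = []
    for scale in scales:
        # Resample the timeline at this scale...
        n = max(1, int(round(num_frames * scale)))
        idx = np.linspace(0, num_frames - 1, n).round().astype(int)
        scaled = frame_features[idx]
        # ...then map back to the original length so scales can be averaged.
        back = np.linspace(0, n - 1, num_frames).round().astype(int)
        resampled.append(scaled[back])
    return np.mean(resampled, axis=0)  # [num_frames, feature_dim]

feats = np.random.rand(73, 1280).astype(np.float32)
avg = multiscale_average(feats)
print(avg.shape)  # (73, 1280)
```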

Metrics Definitions

Accuracy

Standard top-1 accuracy: the fraction of test videos whose predicted class matches the ground-truth label.
accuracy = correct_predictions / total_predictions × 100
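Applied to a toy batch of eight predictions, the formula reads:

```python
# Top-1 accuracy as defined above, on a toy batch (not project data).
predicted = [0, 1, 2, 3, 0, 1, 2, 3]
labels    = [0, 1, 2, 3, 0, 1, 2, 0]  # last prediction is wrong

correct = sum(p == y for p, y in zip(predicted, labels))
accuracy = 100.0 * correct / len(labels)
print(accuracy)  # 87.5
```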

Weighted F1 Score

Scikit-learn’s f1_score with average='weighted' weights each class’s F1 by its support (number of true instances). This accounts for class imbalance and is the primary aggregate metric used throughout training.
from sklearn.metrics import f1_score
avg_f1 = f1_score(labels_np, predicted_np, average='weighted', zero_division=0)
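To make the support-weighting concrete, the sketch below recomputes the weighted F1 by hand from the per-class scores and checks it against scikit-learn, using toy labels rather than project data:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_fscore_support

# Toy 4-class labels/predictions (not project data).
labels_np = np.array([0, 0, 0, 1, 1, 2, 2, 3])
predicted_np = np.array([0, 0, 1, 1, 1, 2, 3, 3])

# Per-class F1 and support (number of true instances per class).
_, _, f1_per_class, support = precision_recall_fscore_support(
    labels_np, predicted_np, labels=[0, 1, 2, 3],
    average=None, zero_division=0
)

# Weighted F1 = sum over classes of F1_c * support_c / total_support.
manual = float(np.sum(f1_per_class * support) / support.sum())
sklearn_value = f1_score(labels_np, predicted_np,
                         average='weighted', zero_division=0)
assert abs(manual - sklearn_value) < 1e-12
print(round(manual, 4))  # 0.75
```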

Per-Class F1

Per-class precision, recall, and F1 are extracted via precision_recall_fscore_support with average=None, giving one score per category. These values diagnose which content types are hardest to classify.
from sklearn.metrics import precision_recall_fscore_support
precision, recall, f1, _ = precision_recall_fscore_support(
    labels_np, predicted_np,
    labels=list(range(num_classes)),
    average=None,
    zero_division=0
)
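Because the per-class arrays follow the category mapping's index order, they can be paired with class names for reporting. A small sketch with toy values (the category_names list mirrors the mapping documented above; the data is illustrative):

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

category_names = ["Animation", "Flat_Content", "Gaming", "Natural_Content"]

# Toy labels/predictions for illustration (not project data).
labels_np = np.array([0, 1, 2, 3, 0, 1, 2, 3])
predicted_np = np.array([0, 1, 2, 3, 1, 1, 2, 3])

precision, recall, f1, support = precision_recall_fscore_support(
    labels_np, predicted_np,
    labels=list(range(len(category_names))),
    average=None, zero_division=0
)

# One F1 per category, keyed by class name.
report = {name: round(float(score), 3) for name, score in zip(category_names, f1)}
print(report)
```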

Metrics Computation Code

The compute_metrics method in EnhancedTemporalModelTrainer centralizes all metric calculations. It is called identically during training (per epoch) and final test evaluation.
def compute_metrics(self, outputs, labels):
    _, predicted = outputs.max(1)

    correct = predicted.eq(labels).sum().item()
    accuracy = 100. * correct / labels.size(0)

    labels_np = labels.cpu().numpy()
    predicted_np = predicted.cpu().numpy()

    precision, recall, f1, _ = precision_recall_fscore_support(
        labels_np, predicted_np,
        labels=list(range(self.num_classes)),
        average=None, zero_division=0
    )

    avg_f1 = f1_score(labels_np, predicted_np, average='weighted', zero_division=0)

    return {
        'accuracy': accuracy,
        'f1_weighted': avg_f1 * 100,
        'f1_per_class': f1 * 100
    }
The method accepts raw logit tensors (outputs) and integer label tensors. The argmax over class dimension (outputs.max(1)) produces the predicted class index. Both f1_weighted and f1_per_class are returned as percentage values (multiplied by 100).

Validation Metrics Table

Four independent models were trained with different random seeds (seeds 42–45). Each checkpoint was saved at the epoch where validation accuracy peaked. The table below reflects the saved best_val_acc and corresponding metrics stored in each checkpoint.
Model     Checkpoint                 Best Epoch   Val Accuracy   Weighted F1
Model 1   best_ensemble_model_1.pt   52           72.73%         64.76%
Model 2   best_ensemble_model_2.pt   43           92.13%         92.15%
Model 3   best_ensemble_model_3.pt   42           91.86%         91.87%
Model 4   best_ensemble_model_4.pt   40           91.86%         91.88%
Model 1 shows significantly lower metrics (72.73% accuracy, 64.76% F1) compared to models 2–4. This is caused by a near-total collapse on the Animation class during that training run (Animation F1: 1.10%). Despite this, Model 1 still contributes useful signal for non-Animation classes within the ensemble. See the Per-Class Results page for a detailed breakdown.
Shared model configuration across all four checkpoints:
Hyperparameter        Value
Feature dimension     1280
Hidden dimension      768
Num classes           4
LSTM layers           4
Attention heads       12
Dropout               0.4
Bidirectional         True
Learning rate         0.001
Batch size            48
Max epochs            150
Early-stop patience   25

Effect of Test-Time Augmentation

TTA runs each test video through the same ensemble four times, each time with a different temporal transformation. The four softmax probability vectors are averaged before taking the argmax.
Mode              Description                                        Effect
None (original)   Sequence used as-is                                Baseline prediction
reverse           torch.flip(features, dims=[0])                     Model sees frames in reverse temporal order
speed_up          Sample every 2nd frame (num_frames // 2 indices)   Compressed temporal view
speed_down        Upsample to num_frames × 1.5 via linspace          Stretched temporal view
TTA requires no additional training. The four augmentation modes are applied at inference time only. For single-video classification via the Flask deployment, TTA can be toggled with use_tta=True in SingleVideoClassifier.classify_video().
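The four modes and the averaging step can be sketched in numpy as follows. This is a stand-in for the torch-based pipeline: arrays replace tensors, and a dummy probability function replaces the trained ensemble.

```python
import numpy as np

def tta_variants(features):
    """Return the four TTA views of a [num_frames, feature_dim] sequence.

    Numpy stand-in for the torch-based pipeline described above.
    """
    num_frames = features.shape[0]
    original = features
    reverse = features[::-1]                          # torch.flip analogue
    speed_up = features[np.arange(0, num_frames, 2)]  # every 2nd frame
    idx = np.linspace(0, num_frames - 1,
                      int(num_frames * 1.5)).round().astype(int)
    speed_down = features[idx]                        # stretched temporal view
    return [original, reverse, speed_up, speed_down]

def tta_predict(predict_proba, features):
    """Average the softmax probabilities over the four views, then argmax."""
    probs = [predict_proba(view) for view in tta_variants(features)]
    return int(np.mean(probs, axis=0).argmax())

# Dummy "model" for the sketch: normalized scores from the first 4 feature dims.
def dummy_predict_proba(view):
    scores = np.abs(view.mean(axis=0)[:4]) + 1e-8
    return scores / scores.sum()

features = np.random.rand(73, 1280)
print(tta_predict(dummy_predict_proba, features))
```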

Training Convergence

All four models were trained for up to 150 epochs with early stopping (patience = 25). Models 2, 3, and 4 reached their best validation accuracy at epochs 40-43, indicating stable training dynamics under the CosineAnnealingWarmRestarts scheduler.
  • Loss function: Focal Loss with label smoothing (gamma=2.0, smoothing=0.1) plus class-weighted alpha
  • Optimizer: AdamW (weight_decay=5e-4, betas=(0.9, 0.999))
  • Scheduler: CosineAnnealingWarmRestarts (T_0=20, T_mult=2, eta_min=1e-6)
  • Regularization: Dropout (0.4), gradient clipping (max_norm=1.0), weight decay
  • Balanced sampling: WeightedRandomSampler ensures all classes are seen equally during training
  • Target accuracy: 95% (achieved with TTA on the ensemble)
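As a rough numpy illustration of the loss (not the project's torch implementation, which may differ in its smoothing convention), focal loss down-weights well-classified examples by the factor (1 - p_t)^gamma and here applies label smoothing to the cross-entropy term:

```python
import numpy as np

def focal_loss(probs, targets, gamma=2.0, smoothing=0.1, alpha=None):
    """Numpy sketch of focal loss with label smoothing.

    probs:   [batch, num_classes] softmax probabilities
    targets: [batch] integer class labels
    alpha:   optional [num_classes] per-class weights
    Illustrative only; details of the project's torch version may differ.
    """
    batch, num_classes = probs.shape
    # Smooth the one-hot targets: confidence on the true class, rest uniform.
    smooth = np.full((batch, num_classes), smoothing / (num_classes - 1))
    smooth[np.arange(batch), targets] = 1.0 - smoothing
    log_p = np.log(np.clip(probs, 1e-12, 1.0))
    p_t = probs[np.arange(batch), targets]   # probability of the true class
    focal_weight = (1.0 - p_t) ** gamma      # down-weight easy examples
    ce = -(smooth * log_p).sum(axis=1)       # smoothed cross-entropy
    if alpha is not None:
        ce = ce * np.asarray(alpha)[targets]
    return float((focal_weight * ce).mean())

probs = np.array([[0.9, 0.05, 0.03, 0.02],    # confident, correct
                  [0.25, 0.25, 0.25, 0.25]])  # maximally uncertain
targets = np.array([0, 1])
print(focal_loss(probs, targets))
```

The confident example contributes far less loss than the uncertain one, which is the point of the focal term.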
The RealTimeTrainingVisualizer generates per-epoch PNG dashboards covering loss curves, accuracy curves, F1 progression, per-class F1 bar charts, and the learning rate schedule. All plots are saved under results/training_progress/epoch_plots/.
The training script includes a 95% target line on accuracy plots to make convergence progress visually clear:
# Add 95% target line to accuracy plot
ax.axhline(y=95, color='purple', linestyle=':', linewidth=2,
           alpha=0.7, label='Target (95%)')
With the four-model ensemble and TTA, the project meets its stated 95% test-accuracy target.
