
Overview

Model evaluation measures how well your trained model performs on unseen data. The UC Intel Final platform tracks multiple metrics during training and provides comprehensive test set evaluation.

Training Metrics

During training, the platform computes and tracks metrics for both training and validation sets. Source: app/training/engine.py:63-130

Metrics Tracked Per Epoch

The training engine tracks five metrics (loss, accuracy, precision, recall, F1) for both the training and validation sets, plus the learning rate:
self.history = {
    "train_loss": [],
    "train_acc": [],
    "train_precision": [],
    "train_recall": [],
    "train_f1": [],
    "val_loss": [],
    "val_acc": [],
    "val_precision": [],
    "val_recall": [],
    "val_f1": [],
    "lr": [],
}

Training Epoch Metrics Computation

Source: app/training/engine.py:112-130
avg_loss = running_loss / total
accuracy = correct / total

# Compute precision, recall, F1 (macro average)
all_preds = np.array(all_preds)
all_targets = np.array(all_targets)
precision = precision_score(
    all_targets, all_preds, average="macro", zero_division=0
)
recall = recall_score(all_targets, all_preds, average="macro", zero_division=0)
f1 = f1_score(all_targets, all_preds, average="macro", zero_division=0)

return {
    "train_loss": avg_loss,
    "train_acc": accuracy,
    "train_precision": precision,
    "train_recall": recall,
    "train_f1": f1,
}
All classification metrics (precision, recall, F1) use macro averaging, which computes metrics for each class independently and takes the unweighted mean. This treats all classes equally regardless of size.

Core Metrics Explained

Accuracy

Definition: Percentage of correctly classified samples.
Formula: Accuracy = Correct Predictions / Total Predictions
When to use:
  • Balanced datasets (all classes have similar sample counts)
  • Quick overall performance assessment
Limitations:
  • Misleading on imbalanced datasets
  • Doesn’t show per-class performance
Example:
  • 90% accuracy on balanced 10-class dataset = excellent
  • 90% accuracy when 90% of data is one class = poor (just predicting majority)
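The second pitfall above can be reproduced in a few lines of plain Python (a standalone sketch with illustrative numbers, not platform code):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Imbalanced dataset: 90 samples of class 0, only 10 of class 1
y_true = [0] * 90 + [1] * 10

# A degenerate "model" that always predicts the majority class
y_pred = [0] * 100

print(accuracy(y_true, y_pred))  # 0.9 — looks strong, yet class 1 is never detected
```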

Precision

Definition: Of all samples predicted as a class, what percentage actually belong to that class?
Formula: Precision = True Positives / (True Positives + False Positives)
Interpretation:
  • High precision: Few false alarms, model is conservative
  • Low precision: Many false alarms, model over-predicts this class
When it matters:
  • When false positives are costly
  • Example: Flagging benign files as malware (user annoyance)

Recall (Sensitivity)

Definition: Of all samples that actually belong to a class, what percentage did we predict correctly?
Formula: Recall = True Positives / (True Positives + False Negatives)
Interpretation:
  • High recall: Few missed cases, model catches most instances
  • Low recall: Many missed cases, model is too conservative
When it matters:
  • When false negatives are costly
  • Example: Missing actual malware (security risk)

F1 Score

Definition: Harmonic mean of precision and recall.
Formula: F1 = 2 × (Precision × Recall) / (Precision + Recall)
Interpretation:
  • Balanced metric between precision and recall
  • Good when you care about both false positives and false negatives
  • More suitable than accuracy for imbalanced datasets
When to use:
  • Imbalanced datasets
  • When both precision and recall matter
  • Comparing models fairly across classes
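The three formulas above can be computed directly from raw true-positive, false-positive, and false-negative counts. A minimal sketch with illustrative counts (not platform code):

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: 8 true positives, 2 false positives, 4 false negatives
p, r, f = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Note how F1 sits between precision (0.8) and recall (0.667), pulled toward the lower of the two — the harmonic mean penalizes imbalance between them.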

Loss

Definition: Quantifies prediction error using cross-entropy.
Interpretation:
  • Lower is better
  • Should decrease during training
  • More sensitive than accuracy to prediction confidence
Use cases:
  • Primary optimization target
  • Early stopping criterion
  • Model selection (choose model with lowest validation loss)
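Why loss is more sensitive than accuracy: two predictions can both be "correct" by argmax, yet carry very different confidence. A standalone sketch of single-sample cross-entropy (illustrative probabilities):

```python
import math

def cross_entropy(prob_true_class):
    """Cross-entropy loss for one sample: -log of the probability assigned to the true class."""
    return -math.log(prob_true_class)

# Both predictions pick the correct class, so accuracy counts them identically...
confident = cross_entropy(0.90)  # model assigns 90% to the true class
barely = cross_entropy(0.51)     # model assigns 51% to the true class

print(f"confident: {confident:.3f}, barely correct: {barely:.3f}")
# ...but the loss is far lower for the confident prediction
```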

Macro vs. Weighted vs. Micro Averaging

The platform uses macro averaging by default.

Macro Averaging

Definition: Compute the metric for each class, then take the unweighted mean.
precision_score(y_true, y_pred, average="macro")
Characteristics:
  • Treats all classes equally
  • Good for imbalanced datasets
  • Shows if the model works well across all classes
Example:
  • Class A (1000 samples): Precision = 0.95
  • Class B (100 samples): Precision = 0.60
  • Macro precision = (0.95 + 0.60) / 2 = 0.775
Use when: You want equal importance for all malware families

Weighted Averaging

Definition: Compute the metric for each class, then take the mean weighted by class size.
precision_score(y_true, y_pred, average="weighted")
Characteristics:
  • Weighs classes by support (number of samples)
  • Closer to overall accuracy
  • Large classes dominate the metric
Example:
  • Class A (1000 samples): Precision = 0.95
  • Class B (100 samples): Precision = 0.60
  • Weighted precision = (0.95×1000 + 0.60×100) / 1100 ≈ 0.92
Use when: Larger classes are more important

Micro Averaging

Definition: Aggregate all predictions, then compute the metric globally.
precision_score(y_true, y_pred, average="micro")
Characteristics:
  • Equivalent to accuracy for multi-class classification
  • Every sample weighted equally
Use when: Rarely needed (use accuracy instead)
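The macro and weighted averages can be reproduced from the per-class numbers in the examples above. A standalone sketch (the platform itself delegates this to sklearn's average= parameter):

```python
def macro_average(per_class):
    """Unweighted mean of a per-class metric."""
    return sum(value for value, _ in per_class) / len(per_class)

def weighted_average(per_class):
    """Mean of a per-class metric, weighted by class support."""
    total = sum(support for _, support in per_class)
    return sum(value * support for value, support in per_class) / total

# (precision, support) pairs: Class A and Class B from the examples
per_class = [(0.95, 1000), (0.60, 100)]

print(f"macro:    {macro_average(per_class):.3f}")     # 0.775
print(f"weighted: {weighted_average(per_class):.3f}")  # 0.918 (≈ 0.92)
```

The gap between the two numbers is itself a diagnostic: a weighted average far above the macro average means small classes are underperforming.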

Training Loop Evaluation

Per-Epoch Metrics Display

Source: app/training/engine.py:254-263
print(
    f"Epoch {epoch + 1:3d}/{epochs} | "
    f"Train Loss: {train_metrics['train_loss']:.4f} | "
    f"Train Acc: {train_metrics['train_acc'] * 100:.1f}% | "
    f"Val Loss: {val_metrics['val_loss']:.4f} | "
    f"Val Acc: {val_metrics['val_acc'] * 100:.1f}% | "
    f"LR: {current_lr:.6f} | "
    f"Time: {epoch_time:.1f}s" + (" *" if is_best else "")
)
Output example:
Epoch  15/100 | Train Loss: 0.3421 | Train Acc: 89.2% | Val Loss: 0.4156 | Val Acc: 85.7% | LR: 0.000100 | Time: 45.2s *
The * indicates this epoch achieved the best validation loss so far.

Best Model Selection

Source: app/training/engine.py:240-247
# Check for best model
is_best = val_metrics["val_loss"] < self.best_val_loss
if is_best:
    self.best_val_loss = val_metrics["val_loss"]
    self.best_epoch = epoch + 1
    self.epochs_without_improvement = 0
else:
    self.epochs_without_improvement += 1
Best model criterion: Lowest validation loss (not highest accuracy).
Why loss instead of accuracy?
  • Loss is more sensitive to prediction confidence
  • Loss reflects probability calibration better
  • Loss captures near-misses that accuracy doesn’t
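The best-model check above pairs with early stopping: once epochs_without_improvement reaches the patience threshold, training halts. A minimal standalone simulation of that logic (the val_losses values are illustrative, not platform output):

```python
def simulate_early_stopping(val_losses, patience):
    """Track the best epoch; stop once `patience` epochs pass without improvement."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            return best_epoch, epoch  # (best epoch, epoch where training stopped)
    return best_epoch, len(val_losses)

best, stopped = simulate_early_stopping(
    [0.90, 0.70, 0.60, 0.65, 0.62, 0.66, 0.70], patience=3
)
print(f"best epoch: {best}, stopped at epoch: {stopped}")  # best epoch: 3, stopped at epoch: 6
```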

Test Set Evaluation

After training, run comprehensive evaluation on the held-out test set. Source: app/training/evaluator.py:15-111

Running Test Evaluation

from training.evaluator import run_test_evaluation

results = run_test_evaluation(
    experiment_id="exp_12345",
    model_config=model_config,
    dataset_config=dataset_config
)

Evaluation Pipeline

Step 1: Load Best Checkpoint

Load the model weights from the epoch with lowest validation loss
checkpoint_mgr = CheckpointManager()
checkpoint_path = checkpoint_mgr.get_best_checkpoint(experiment_id)
model = build_model(model_config)
checkpoint_mgr.load_checkpoint(checkpoint_path, model)
Step 2: Create Test DataLoader

Build DataLoader for test set with same preprocessing as training
dataloaders, class_names, _ = create_dataloaders(
    dataset_config,
    {"batch_size": 32},
    num_workers=4,
)
test_loader = dataloaders["test"]
Step 3: Run Inference

Make predictions on all test samples
model.eval()  # inference mode: disables dropout, freezes batchnorm statistics
all_preds = []
all_targets = []

with torch.no_grad():
    for inputs, targets in test_loader:
        inputs = inputs.to(device)
        outputs = model(inputs)
        _, predicted = outputs.max(1)

        all_preds.extend(predicted.cpu().numpy())
        all_targets.extend(targets.numpy())
Step 4: Compute Metrics

Calculate confusion matrix, classification report, and per-class metrics

Metrics Returned

Source: app/training/evaluator.py:99-111
return {
    "confusion_matrix": cm.tolist(),
    "classification_report": report,
    "class_names": class_names,
    "per_class": {
        "precision": precision.tolist(),
        "recall": recall.tolist(),
        "f1": f1.tolist(),
        "support": support.tolist(),
    },
    "accuracy": float(accuracy),
    "total_samples": len(all_targets),
}

Confusion Matrix

The confusion matrix shows where the model makes mistakes. Source: app/training/evaluator.py:81
cm = confusion_matrix(all_targets, all_preds)

Interpreting the Confusion Matrix

                 Predicted
              A    B    C    D
Actual   A   95    2    1    2   ← Class A: 95% correct
         B    3   88    7    2   ← Class B: 88% correct
         C    1    5   92    2   ← Class C: 92% correct
         D    4    1    3   92   ← Class D: 92% correct
Reading the matrix:
  • Diagonal: Correct predictions (want these high)
  • Off-diagonal: Misclassifications
  • Row: Shows where actual class samples were predicted
  • Column: Shows what the model predicted
Example insights:
  • Row B, Column C = 7: Seven B samples incorrectly classified as C
  • Model confuses B and C more than other pairs → investigate similarity
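Per-class recall can be read straight off the matrix: each class's recall is its diagonal cell divided by its row sum. A sketch using the example matrix above:

```python
# The example confusion matrix (rows = actual class, columns = predicted class)
cm = [
    [95,  2,  1,  2],  # actual A
    [ 3, 88,  7,  2],  # actual B
    [ 1,  5, 92,  2],  # actual C
    [ 4,  1,  3, 92],  # actual D
]

# Per-class recall = diagonal cell / row sum
recalls = [row[i] / sum(row) for i, row in enumerate(cm)]
print(recalls)  # [0.95, 0.88, 0.92, 0.92]

# The single largest confusion: 7 actual-B samples predicted as C
print(cm[1][2])  # 7
```

The same traversal over columns instead of rows yields per-class precision.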

Classification Report

Source: app/training/evaluator.py:83-89
report = classification_report(
    all_targets,
    all_preds,
    target_names=class_names,
    output_dict=True,
    zero_division=0,
)

Example Report

{
  "FamilyA": {
    "precision": 0.92,
    "recall": 0.95,
    "f1-score": 0.93,
    "support": 100
  },
  "FamilyB": {
    "precision": 0.89,
    "recall": 0.88,
    "f1-score": 0.88,
    "support": 100
  },
  "accuracy": 0.91,
  "macro avg": {
    "precision": 0.90,
    "recall": 0.91,
    "f1-score": 0.90,
    "support": 1000
  },
  "weighted avg": {
    "precision": 0.91,
    "recall": 0.91,
    "f1-score": 0.91,
    "support": 1000
  }
}
Key sections:
  • Per-class metrics: Performance on each malware family
  • Accuracy: Overall accuracy
  • Macro avg: Unweighted average across classes
  • Weighted avg: Weighted by class size
  • Support: Number of samples per class

Interpreting Training Behavior

Healthy Training

Epoch   1/100 | Train Loss: 2.3012 | Train Acc: 15.2% | Val Loss: 2.2891 | Val Acc: 16.1%
Epoch  10/100 | Train Loss: 1.1234 | Train Acc: 62.3% | Val Loss: 1.2456 | Val Acc: 58.7%
Epoch  20/100 | Train Loss: 0.5432 | Train Acc: 82.1% | Val Loss: 0.6234 | Val Acc: 78.9% *
Epoch  30/100 | Train Loss: 0.3421 | Train Acc: 89.2% | Val Loss: 0.5123 | Val Acc: 82.3% *
Epoch  40/100 | Train Loss: 0.2156 | Train Acc: 93.1% | Val Loss: 0.5089 | Val Acc: 82.8% *
Epoch  50/100 | Train Loss: 0.1543 | Train Acc: 95.4% | Val Loss: 0.5201 | Val Acc: 82.5%
Good signs:
  • Loss decreases steadily for both train and val
  • Val accuracy improves over time
  • Gap between train and val is moderate (<10%)
  • Best model found at epoch 40

Overfitting

Epoch  20/100 | Train Loss: 0.5432 | Train Acc: 82.1% | Val Loss: 0.6234 | Val Acc: 78.9% *
Epoch  30/100 | Train Loss: 0.2156 | Train Acc: 93.1% | Val Loss: 0.7123 | Val Acc: 76.3%
Epoch  40/100 | Train Loss: 0.0843 | Train Acc: 98.2% | Val Loss: 0.8901 | Val Acc: 73.1%
Epoch  50/100 | Train Loss: 0.0421 | Train Acc: 99.1% | Val Loss: 1.0234 | Val Acc: 71.2%
Warning signs:
  • Train accuracy increases but val accuracy decreases
  • Val loss increases while train loss decreases
  • Large gap between train and val performance (>15%)
Solutions:
  • Increase dropout
  • Enable/increase L2 regularization
  • Add more data augmentation
  • Use smaller model
  • Enable early stopping (would have stopped at epoch 20)

Underfitting

Epoch  20/100 | Train Loss: 1.8234 | Train Acc: 35.2% | Val Loss: 1.8456 | Val Acc: 34.1%
Epoch  40/100 | Train Loss: 1.7123 | Train Acc: 38.7% | Val Loss: 1.7234 | Val Acc: 37.9%
Epoch  60/100 | Train Loss: 1.6543 | Train Acc: 41.2% | Val Loss: 1.6891 | Val Acc: 40.1%
Epoch  80/100 | Train Loss: 1.6234 | Train Acc: 42.8% | Val Loss: 1.6543 | Val Acc: 41.7%
Warning signs:
  • Both train and val accuracy are low
  • Loss decreases very slowly
  • Performance plateaus at poor level
Solutions:
  • Increase model capacity (more layers/filters)
  • Decrease regularization (lower dropout, remove L2)
  • Increase learning rate
  • Train for more epochs
  • Check data preprocessing

Learning Rate Issues

Too high:
Epoch   1/100 | Train Loss: 2.3012 | Train Acc: 15.2%
Epoch   2/100 | Train Loss: 4.5123 | Train Acc: 12.1%
Epoch   3/100 | Train Loss: NaN | Train Acc: 10.0%
Loss explodes or oscillates wildly → Reduce LR by 10x

Too low:
Epoch  20/100 | Train Loss: 2.2891 | Train Acc: 16.3%
Epoch  40/100 | Train Loss: 2.2543 | Train Acc: 17.1%
Epoch  60/100 | Train Loss: 2.2234 | Train Acc: 18.2%
Loss barely decreases → Increase LR by 10x
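Both failure modes are visible even on a toy problem: gradient descent on f(x) = x², where the gradient is 2x and each update is x ← x − lr·2x. A standalone sketch, unrelated to the platform's training code:

```python
def gradient_descent_on_parabola(lr, steps, x=1.0):
    """Minimize f(x) = x^2; the gradient is 2x, so each step is x -= lr * 2x."""
    for _ in range(steps):
        x -= lr * 2 * x
    return abs(x)

print(gradient_descent_on_parabola(lr=1.1, steps=10))     # too high: |x| grows every step
print(gradient_descent_on_parabola(lr=0.0001, steps=10))  # too low: barely moves from 1.0
print(gradient_descent_on_parabola(lr=0.1, steps=10))     # reasonable: converges toward 0
```

With lr = 1.1 each step multiplies x by (1 − 2.2) = −1.2, so the iterate oscillates in sign while growing in magnitude — the toy analogue of an exploding, NaN-bound loss.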

Model Comparison

When comparing multiple models:

Metrics to Compare

| Metric | Priority | Use Case |
|---|---|---|
| Test Accuracy | High | Balanced datasets, quick comparison |
| Macro F1 | High | Imbalanced datasets, fair comparison |
| Per-class F1 | High | Identify which families are hard |
| Confusion Matrix | Medium | Understand error patterns |
| Training Time | Medium | Production constraints |
| Model Size | Low | Deployment on edge devices |
| Inference Speed | Low | Real-time requirements |

Example Comparison

| Model | Test Acc | Macro F1 | Train Time | Parameters | Best For |
|---|---|---|---|---|---|
| Custom CNN | 82.3% | 0.81 | 15 min | 500K | Small datasets |
| ResNet50 (frozen) | 89.7% | 0.88 | 25 min | 25M | General use |
| ResNet50 (fine-tuned) | 92.4% | 0.91 | 60 min | 25M | Best accuracy |
| EfficientNet-B3 | 93.1% | 0.92 | 80 min | 12M | Balance |
| Vision Transformer | 93.8% | 0.93 | 120 min | 86M | Maximum accuracy |

Best Practices

During Training

1. Monitor Both Train and Val: Always watch both metrics. Large divergence = overfitting.
2. Use Validation Loss for Selection: Choose the model with the lowest validation loss, not the highest accuracy.
3. Enable Early Stopping: A patience of 10-15 epochs prevents wasted training time.
4. Save Training History: Keep all metrics for later analysis and comparison.

After Training

1. Always Evaluate on Test Set: The test set gives the true performance estimate. Never report validation metrics as final results.
2. Analyze the Confusion Matrix: Identify which families are confused → they may need more data or better features.
3. Check Per-Class Metrics: Ensure no class performs significantly worse than the rest (e.g., F1 < 0.7 while others > 0.9).
4. Compare Multiple Models: Train at least 2-3 models with different architectures/hyperparameters.

Imbalanced Datasets

Never rely solely on accuracy for imbalanced datasets!
Always use:
  • Macro F1 score (primary metric)
  • Per-class precision/recall
  • Confusion matrix
A model with 95% accuracy might just be predicting the majority class.
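That 95% scenario, worked end to end in plain Python (a standalone sketch with illustrative numbers, mirroring sklearn's zero_division=0 behaviour):

```python
def per_class_f1(y_true, y_pred, cls):
    """F1 for one class; returns 0.0 when a ratio is undefined (zero_division=0)."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# 95% of samples belong to class 0; the "model" always predicts class 0
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1 = (per_class_f1(y_true, y_pred, 0) + per_class_f1(y_true, y_pred, 1)) / 2

print(f"accuracy: {accuracy:.2f}")  # 0.95 — looks great
print(f"macro F1: {macro_f1:.2f}")  # ≈ 0.49 — exposes that class 1 is never predicted
```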

Common Issues

Diagnosis: Overfitting
Solutions:
  • Increase dropout to 0.5-0.7
  • Enable L2 regularization (0.0001-0.001)
  • Add more data augmentation
  • Use smaller model
  • Increase dataset size
  • Enable early stopping
Diagnosis: Underfitting
Solutions:
  • Use larger model (more layers/filters)
  • Decrease dropout
  • Remove L2 regularization
  • Increase learning rate
  • Train longer
  • Check data preprocessing/normalization
Diagnosis: Class-specific issues
Possible causes:
  • Insufficient training samples for that class
  • Class is visually similar to others
  • Mislabeled data
Solutions:
  • Collect more data for low-performing classes
  • Use class weights or Focal Loss
  • Increase augmentation for rare classes
  • Review confusion matrix to identify confused pairs
Diagnosis: Severe class imbalance or learning failure
Solutions:
  • Use Focal Loss instead of Cross-Entropy
  • Enable weighted sampler
  • Check if dataset is extremely imbalanced
  • Verify learning rate isn’t too high
  • Check if model is actually training (loss decreasing?)
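One common way to derive the class weights mentioned above is inverse frequency: total / (num_classes × count). This is an assumption about the weighting scheme, not necessarily what the platform implements; the sketch just computes the weight list:

```python
def inverse_frequency_weights(class_counts):
    """Weight each class by total / (num_classes * count); rare classes get larger weights."""
    total = sum(class_counts)
    num_classes = len(class_counts)
    return [total / (num_classes * count) for count in class_counts]

# Example: 1000 samples of one family, 100 of another
weights = inverse_frequency_weights([1000, 100])
print(weights)  # [0.55, 5.5]
```

A list like this is the shape of argument that torch.nn.CrossEntropyLoss accepts via weight=torch.tensor(weights), scaling each class's contribution to the loss.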

Reporting Results

When documenting model performance, include:

Essential Metrics

  • Test Accuracy: Overall performance
  • Macro F1 Score: Fair comparison across classes
  • Confusion Matrix: Visual error analysis
  • Per-Class Metrics: Precision, recall, F1 for each family

Training Details

  • Model architecture and size
  • Training hyperparameters (LR, optimizer, scheduler)
  • Dataset split (train/val/test sizes)
  • Training duration and best epoch
  • Hardware used (GPU model)

Example Summary

Model: ResNet50 (Transfer Learning, Fine-tuned)
Dataset: 10 malware families, 1000 samples each
Split: 70% train, 15% val, 15% test

Training:
- Optimizer: AdamW (LR=1e-4, weight_decay=0.01)
- Scheduler: Cosine Annealing
- Loss: Focal Loss (gamma=2.0)
- Epochs: 50 (best at epoch 42)
- Duration: 65 minutes on RTX 3080

Test Set Performance:
- Accuracy: 92.4%
- Macro F1: 0.91
- Macro Precision: 0.92
- Macro Recall: 0.91

Per-Class F1 Range: 0.87 - 0.95
Worst Performing: FamilyB (F1=0.87, often confused with FamilyC)
Best Performing: FamilyA (F1=0.95)

Next Steps

Dataset Preparation

Optimize your dataset to improve model performance

Hyperparameter Tuning

Fine-tune training parameters for better results
