
Overview

Model evaluation measures how well your trained model performs on unseen data. The UC Intel Final platform tracks multiple metrics during training and provides comprehensive test set evaluation.

Training Metrics

During training, the platform computes and tracks metrics for both training and validation sets. Source: app/training/engine.py:63-130

Metrics Tracked Per Epoch

The training engine tracks five metrics (loss, accuracy, precision, recall, F1) for both the training and validation sets, plus the learning rate:
self.history = {
    "train_loss": [],
    "train_acc": [],
    "train_precision": [],
    "train_recall": [],
    "train_f1": [],
    "val_loss": [],
    "val_acc": [],
    "val_precision": [],
    "val_recall": [],
    "val_f1": [],
    "lr": [],
}

Training Epoch Metrics Computation

Source: app/training/engine.py:112-130
avg_loss = running_loss / total
accuracy = correct / total

# Compute precision, recall, F1 (macro average)
all_preds = np.array(all_preds)
all_targets = np.array(all_targets)
precision = precision_score(
    all_targets, all_preds, average="macro", zero_division=0
)
recall = recall_score(all_targets, all_preds, average="macro", zero_division=0)
f1 = f1_score(all_targets, all_preds, average="macro", zero_division=0)

return {
    "train_loss": avg_loss,
    "train_acc": accuracy,
    "train_precision": precision,
    "train_recall": recall,
    "train_f1": f1,
}
All classification metrics (precision, recall, F1) use macro averaging, which computes metrics for each class independently and takes the unweighted mean. This treats all classes equally regardless of size.

Core Metrics Explained

Accuracy

Definition: Percentage of correctly classified samples.
Formula: Accuracy = Correct Predictions / Total Predictions
When to use:
  • Balanced datasets (all classes have similar sample counts)
  • Quick overall performance assessment
Limitations:
  • Misleading on imbalanced datasets
  • Doesn’t show per-class performance
Example:
  • 90% accuracy on balanced 10-class dataset = excellent
  • 90% accuracy when 90% of data is one class = poor (just predicting majority)
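The second pitfall above can be reproduced in a few lines of plain Python (a standalone sketch with illustrative numbers, not platform code):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Imbalanced dataset: 90 samples of class 0, only 10 of class 1
y_true = [0] * 90 + [1] * 10

# A degenerate "model" that always predicts the majority class
y_pred = [0] * 100

print(accuracy(y_true, y_pred))  # 0.9 — looks strong, yet class 1 is never detected
```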

Precision

Definition: Of all samples predicted as a class, what percentage actually belong to that class?
Formula: Precision = True Positives / (True Positives + False Positives)
Interpretation:
  • High precision: Few false alarms, model is conservative
  • Low precision: Many false alarms, model over-predicts this class
When it matters:
  • When false positives are costly
  • Example: Flagging benign files as malware (user annoyance)

Recall (Sensitivity)

Definition: Of all samples that actually belong to a class, what percentage did we predict correctly?
Formula: Recall = True Positives / (True Positives + False Negatives)
Interpretation:
  • High recall: Few missed cases, model catches most instances
  • Low recall: Many missed cases, model is too conservative
When it matters:
  • When false negatives are costly
  • Example: Missing actual malware (security risk)

F1 Score

Definition: Harmonic mean of precision and recall.
Formula: F1 = 2 × (Precision × Recall) / (Precision + Recall)
Interpretation:
  • Balanced metric between precision and recall
  • Good when you care about both false positives and false negatives
  • More suitable than accuracy for imbalanced datasets
When to use:
  • Imbalanced datasets
  • When both precision and recall matter
  • Comparing models fairly across classes
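The three formulas above can be computed directly from raw true-positive, false-positive, and false-negative counts. A minimal sketch with illustrative counts (not platform code):

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: 8 true positives, 2 false positives, 4 false negatives
p, r, f = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Note how F1 sits between precision (0.8) and recall (0.667), pulled toward the lower of the two — the harmonic mean penalizes imbalance between them.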

Loss

Definition: Quantifies prediction error using cross-entropy.
Interpretation:
  • Lower is better
  • Should decrease during training
  • More sensitive than accuracy to prediction confidence
Use cases:
  • Primary optimization target
  • Early stopping criterion
  • Model selection (choose model with lowest validation loss)
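Why loss is more sensitive than accuracy: two predictions can both be "correct" by argmax, yet carry very different confidence. A standalone sketch of single-sample cross-entropy (illustrative probabilities):

```python
import math

def cross_entropy(prob_true_class):
    """Cross-entropy loss for one sample: -log of the probability assigned to the true class."""
    return -math.log(prob_true_class)

# Both predictions pick the correct class, so accuracy counts them identically...
confident = cross_entropy(0.90)  # model assigns 90% to the true class
barely = cross_entropy(0.51)     # model assigns 51% to the true class

print(f"confident: {confident:.3f}, barely correct: {barely:.3f}")
# ...but the loss is far lower for the confident prediction
```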

Macro vs. Weighted vs. Micro Averaging

The platform uses macro averaging by default.

Macro Averaging

Definition: Compute the metric for each class, then take the unweighted mean.
precision_score(y_true, y_pred, average="macro")
Characteristics:
  • Treats all classes equally
  • Good for imbalanced datasets
  • Shows if the model works well across all classes
Example:
  • Class A (1000 samples): Precision = 0.95
  • Class B (100 samples): Precision = 0.60
  • Macro precision = (0.95 + 0.60) / 2 = 0.775
Use when: You want equal importance for all malware families

Weighted Averaging

Definition: Compute the metric for each class, then take the mean weighted by class size.
precision_score(y_true, y_pred, average="weighted")
Characteristics:
  • Weighs classes by support (number of samples)
  • Closer to overall accuracy
  • Large classes dominate the metric
Example:
  • Class A (1000 samples): Precision = 0.95
  • Class B (100 samples): Precision = 0.60
  • Weighted precision = (0.95×1000 + 0.60×100) / 1100 ≈ 0.92
Use when: Larger classes are more important

Micro Averaging

Definition: Aggregate all predictions, then compute the metric globally.
precision_score(y_true, y_pred, average="micro")
Characteristics:
  • Equivalent to accuracy for multi-class classification
  • Every sample weighted equally
Use when: Rarely needed (use accuracy instead)
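The macro and weighted averages can be reproduced from the per-class numbers in the examples above. A standalone sketch (the platform itself delegates this to sklearn's average= parameter):

```python
def macro_average(per_class):
    """Unweighted mean of a per-class metric."""
    return sum(value for value, _ in per_class) / len(per_class)

def weighted_average(per_class):
    """Mean of a per-class metric, weighted by class support."""
    total = sum(support for _, support in per_class)
    return sum(value * support for value, support in per_class) / total

# (precision, support) pairs: Class A and Class B from the examples
per_class = [(0.95, 1000), (0.60, 100)]

print(f"macro:    {macro_average(per_class):.3f}")     # 0.775
print(f"weighted: {weighted_average(per_class):.3f}")  # 0.918 (≈ 0.92)
```

The gap between the two numbers is itself a diagnostic: a weighted average far above the macro average means small classes are underperforming.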

Training Loop Evaluation

Per-Epoch Metrics Display

Source: app/training/engine.py:254-263
print(
    f"Epoch {epoch + 1:3d}/{epochs} | "
    f"Train Loss: {train_metrics['train_loss']:.4f} | "
    f"Train Acc: {train_metrics['train_acc'] * 100:.1f}% | "
    f"Val Loss: {val_metrics['val_loss']:.4f} | "
    f"Val Acc: {val_metrics['val_acc'] * 100:.1f}% | "
    f"LR: {current_lr:.6f} | "
    f"Time: {epoch_time:.1f}s" + (" *" if is_best else "")
)
Output example:
Epoch  15/100 | Train Loss: 0.3421 | Train Acc: 89.2% | Val Loss: 0.4156 | Val Acc: 85.7% | LR: 0.000100 | Time: 45.2s *
The * indicates this epoch achieved the best validation loss so far.

Best Model Selection

Source: app/training/engine.py:240-247
# Check for best model
is_best = val_metrics["val_loss"] < self.best_val_loss
if is_best:
    self.best_val_loss = val_metrics["val_loss"]
    self.best_epoch = epoch + 1
    self.epochs_without_improvement = 0
else:
    self.epochs_without_improvement += 1
Best model criterion: Lowest validation loss (not highest accuracy).
Why loss instead of accuracy?
  • Loss is more sensitive to prediction confidence
  • Loss reflects probability calibration better
  • Loss captures near-misses that accuracy doesn’t
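The best-model check above pairs with early stopping: once epochs_without_improvement reaches the patience threshold, training halts. A minimal standalone simulation of that logic (the val_losses values are illustrative, not platform output):

```python
def simulate_early_stopping(val_losses, patience):
    """Track the best epoch; stop once `patience` epochs pass without improvement."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            return best_epoch, epoch  # (best epoch, epoch where training stopped)
    return best_epoch, len(val_losses)

best, stopped = simulate_early_stopping(
    [0.90, 0.70, 0.60, 0.65, 0.62, 0.66, 0.70], patience=3
)
print(f"best epoch: {best}, stopped at epoch: {stopped}")  # best epoch: 3, stopped at epoch: 6
```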

Test Set Evaluation

After training, run comprehensive evaluation on the held-out test set. Source: app/training/evaluator.py:15-111

Running Test Evaluation

from training.evaluator import run_test_evaluation

results = run_test_evaluation(
    experiment_id="exp_12345",
    model_config=model_config,
    dataset_config=dataset_config
)

Evaluation Pipeline

Step 1: Load Best Checkpoint

Load the model weights from the epoch with lowest validation loss
checkpoint_mgr = CheckpointManager()
checkpoint_path = checkpoint_mgr.get_best_checkpoint(experiment_id)
model = build_model(model_config)
checkpoint_mgr.load_checkpoint(checkpoint_path, model)
Step 2: Create Test DataLoader

Build DataLoader for test set with same preprocessing as training
dataloaders, class_names, _ = create_dataloaders(
    dataset_config,
    {"batch_size": 32},
    num_workers=4,
)
test_loader = dataloaders["test"]
Step 3: Run Inference

Make predictions on all test samples
model.eval()  # inference mode: disables dropout, freezes batchnorm statistics
all_preds = []
all_targets = []

with torch.no_grad():
    for inputs, targets in test_loader:
        inputs = inputs.to(device)
        outputs = model(inputs)
        _, predicted = outputs.max(1)

        all_preds.extend(predicted.cpu().numpy())
        all_targets.extend(targets.numpy())
Step 4: Compute Metrics

Calculate confusion matrix, classification report, and per-class metrics

Metrics Returned

Source: app/training/evaluator.py:99-111
return {
    "confusion_matrix": cm.tolist(),
    "classification_report": report,
    "class_names": class_names,
    "per_class": {
        "precision": precision.tolist(),
        "recall": recall.tolist(),
        "f1": f1.tolist(),
        "support": support.tolist(),
    },
    "accuracy": float(accuracy),
    "total_samples": len(all_targets),
}

Confusion Matrix

The confusion matrix shows where the model makes mistakes. Source: app/training/evaluator.py:81
cm = confusion_matrix(all_targets, all_preds)

Interpreting the Confusion Matrix

                 Predicted
              A    B    C    D
Actual   A   95    2    1    2   ← Class A: 95% correct
         B    3   88    7    2   ← Class B: 88% correct
         C    1    5   92    2   ← Class C: 92% correct
         D    4    1    3   92   ← Class D: 92% correct
Reading the matrix:
  • Diagonal: Correct predictions (want these high)
  • Off-diagonal: Misclassifications
  • Row: Shows where actual class samples were predicted
  • Column: Shows what the model predicted
Example insights:
  • Row B, Column C = 7: Seven B samples incorrectly classified as C
  • Model confuses B and C more than other pairs → investigate similarity
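Per-class recall can be read straight off the matrix: each class's recall is its diagonal cell divided by its row sum. A sketch using the example matrix above:

```python
# The example confusion matrix (rows = actual class, columns = predicted class)
cm = [
    [95,  2,  1,  2],  # actual A
    [ 3, 88,  7,  2],  # actual B
    [ 1,  5, 92,  2],  # actual C
    [ 4,  1,  3, 92],  # actual D
]

# Per-class recall = diagonal cell / row sum
recalls = [row[i] / sum(row) for i, row in enumerate(cm)]
print(recalls)  # [0.95, 0.88, 0.92, 0.92]

# The single largest confusion: 7 actual-B samples predicted as C
print(cm[1][2])  # 7
```

The same traversal over columns instead of rows yields per-class precision.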

Classification Report

Source: app/training/evaluator.py:83-89
report = classification_report(
    all_targets,
    all_preds,
    target_names=class_names,
    output_dict=True,
    zero_division=0,
)

Example Report

{
  "FamilyA": {
    "precision": 0.92,
    "recall": 0.95,
    "f1-score": 0.93,
    "support": 100
  },
  "FamilyB": {
    "precision": 0.89,
    "recall": 0.88,
    "f1-score": 0.88,
    "support": 100
  },
  "accuracy": 0.91,
  "macro avg": {
    "precision": 0.90,
    "recall": 0.91,
    "f1-score": 0.90,
    "support": 1000
  },
  "weighted avg": {
    "precision": 0.91,
    "recall": 0.91,
    "f1-score": 0.91,
    "support": 1000
  }
}
Key sections:
  • Per-class metrics: Performance on each malware family
  • Accuracy: Overall accuracy
  • Macro avg: Unweighted average across classes
  • Weighted avg: Weighted by class size
  • Support: Number of samples per class

Interpreting Training Behavior

Healthy Training

Epoch   1/100 | Train Loss: 2.3012 | Train Acc: 15.2% | Val Loss: 2.2891 | Val Acc: 16.1%
Epoch  10/100 | Train Loss: 1.1234 | Train Acc: 62.3% | Val Loss: 1.2456 | Val Acc: 58.7%
Epoch  20/100 | Train Loss: 0.5432 | Train Acc: 82.1% | Val Loss: 0.6234 | Val Acc: 78.9% *
Epoch  30/100 | Train Loss: 0.3421 | Train Acc: 89.2% | Val Loss: 0.5123 | Val Acc: 82.3% *
Epoch  40/100 | Train Loss: 0.2156 | Train Acc: 93.1% | Val Loss: 0.5089 | Val Acc: 82.8% *
Epoch  50/100 | Train Loss: 0.1543 | Train Acc: 95.4% | Val Loss: 0.5201 | Val Acc: 82.5%
Good signs:
  • Loss decreases steadily for both train and val
  • Val accuracy improves over time
  • Gap between train and val is moderate (<10%)
  • Best model found at epoch 40

Overfitting

Epoch  20/100 | Train Loss: 0.5432 | Train Acc: 82.1% | Val Loss: 0.6234 | Val Acc: 78.9% *
Epoch  30/100 | Train Loss: 0.2156 | Train Acc: 93.1% | Val Loss: 0.7123 | Val Acc: 76.3%
Epoch  40/100 | Train Loss: 0.0843 | Train Acc: 98.2% | Val Loss: 0.8901 | Val Acc: 73.1%
Epoch  50/100 | Train Loss: 0.0421 | Train Acc: 99.1% | Val Loss: 1.0234 | Val Acc: 71.2%
Warning signs:
  • Train accuracy increases but val accuracy decreases
  • Val loss increases while train loss decreases
  • Large gap between train and val performance (>15%)
Solutions:
  • Increase dropout
  • Enable/increase L2 regularization
  • Add more data augmentation
  • Use smaller model
  • Enable early stopping (would have stopped at epoch 20)

Underfitting

Epoch  20/100 | Train Loss: 1.8234 | Train Acc: 35.2% | Val Loss: 1.8456 | Val Acc: 34.1%
Epoch  40/100 | Train Loss: 1.7123 | Train Acc: 38.7% | Val Loss: 1.7234 | Val Acc: 37.9%
Epoch  60/100 | Train Loss: 1.6543 | Train Acc: 41.2% | Val Loss: 1.6891 | Val Acc: 40.1%
Epoch  80/100 | Train Loss: 1.6234 | Train Acc: 42.8% | Val Loss: 1.6543 | Val Acc: 41.7%
Warning signs:
  • Both train and val accuracy are low
  • Loss decreases very slowly
  • Performance plateaus at poor level
Solutions:
  • Increase model capacity (more layers/filters)
  • Decrease regularization (lower dropout, remove L2)
  • Increase learning rate
  • Train for more epochs
  • Check data preprocessing

Learning Rate Issues

Too high:
Epoch   1/100 | Train Loss: 2.3012 | Train Acc: 15.2%
Epoch   2/100 | Train Loss: 4.5123 | Train Acc: 12.1%
Epoch   3/100 | Train Loss: NaN | Train Acc: 10.0%
Loss explodes or oscillates wildly → Reduce LR by 10x

Too low:
Epoch  20/100 | Train Loss: 2.2891 | Train Acc: 16.3%
Epoch  40/100 | Train Loss: 2.2543 | Train Acc: 17.1%
Epoch  60/100 | Train Loss: 2.2234 | Train Acc: 18.2%
Loss barely decreases → Increase LR by 10x
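Both failure modes are visible even on a toy problem: gradient descent on f(x) = x², where the gradient is 2x and each update is x ← x − lr·2x. A standalone sketch, unrelated to the platform's training code:

```python
def gradient_descent_on_parabola(lr, steps, x=1.0):
    """Minimize f(x) = x^2; the gradient is 2x, so each step is x -= lr * 2x."""
    for _ in range(steps):
        x -= lr * 2 * x
    return abs(x)

print(gradient_descent_on_parabola(lr=1.1, steps=10))     # too high: |x| grows every step
print(gradient_descent_on_parabola(lr=0.0001, steps=10))  # too low: barely moves from 1.0
print(gradient_descent_on_parabola(lr=0.1, steps=10))     # reasonable: converges toward 0
```

With lr = 1.1 each step multiplies x by (1 − 2.2) = −1.2, so the iterate oscillates in sign while growing in magnitude — the toy analogue of an exploding, NaN-bound loss.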

Model Comparison

When comparing multiple models:

Metrics to Compare

| Metric | Priority | Use Case |
|---|---|---|
| Test Accuracy | High | Balanced datasets, quick comparison |
| Macro F1 | High | Imbalanced datasets, fair comparison |
| Per-class F1 | High | Identify which families are hard |
| Confusion Matrix | Medium | Understand error patterns |
| Training Time | Medium | Production constraints |
| Model Size | Low | Deployment on edge devices |
| Inference Speed | Low | Real-time requirements |

Example Comparison

| Model | Test Acc | Macro F1 | Train Time | Parameters | Best For |
|---|---|---|---|---|---|
| Custom CNN | 82.3% | 0.81 | 15 min | 500K | Small datasets |
| ResNet50 (frozen) | 89.7% | 0.88 | 25 min | 25M | General use |
| ResNet50 (fine-tuned) | 92.4% | 0.91 | 60 min | 25M | Best accuracy |
| EfficientNet-B3 | 93.1% | 0.92 | 80 min | 12M | Balance |
| Vision Transformer | 93.8% | 0.93 | 120 min | 86M | Maximum accuracy |

Best Practices

During Training

1. Monitor Both Train and Val: Always watch both metrics. Large divergence = overfitting.
2. Use Validation Loss for Selection: Choose the model with the lowest validation loss, not the highest accuracy.
3. Enable Early Stopping: A patience of 10-15 epochs prevents wasted training time.
4. Save Training History: Keep all metrics for later analysis and comparison.

After Training

1. Always Evaluate on Test Set: The test set gives the true performance estimate. Never report validation metrics as final results.
2. Analyze the Confusion Matrix: Identify which families are confused → they may need more data or better features.
3. Check Per-Class Metrics: Ensure no class performs significantly worse than the rest (e.g., F1 < 0.7 while others > 0.9).
4. Compare Multiple Models: Train at least 2-3 models with different architectures/hyperparameters.

Imbalanced Datasets

Never rely solely on accuracy for imbalanced datasets!
Always use:
  • Macro F1 score (primary metric)
  • Per-class precision/recall
  • Confusion matrix
A model with 95% accuracy might just be predicting the majority class.
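That 95% scenario, worked end to end in plain Python (a standalone sketch with illustrative numbers, mirroring sklearn's zero_division=0 behaviour):

```python
def per_class_f1(y_true, y_pred, cls):
    """F1 for one class; returns 0.0 when a ratio is undefined (zero_division=0)."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# 95% of samples belong to class 0; the "model" always predicts class 0
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1 = (per_class_f1(y_true, y_pred, 0) + per_class_f1(y_true, y_pred, 1)) / 2

print(f"accuracy: {accuracy:.2f}")  # 0.95 — looks great
print(f"macro F1: {macro_f1:.2f}")  # ≈ 0.49 — exposes that class 1 is never predicted
```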

Common Issues

Diagnosis: Overfitting
Solutions:
  • Increase dropout to 0.5-0.7
  • Enable L2 regularization (0.0001-0.001)
  • Add more data augmentation
  • Use smaller model
  • Increase dataset size
  • Enable early stopping
Diagnosis: Underfitting
Solutions:
  • Use larger model (more layers/filters)
  • Decrease dropout
  • Remove L2 regularization
  • Increase learning rate
  • Train longer
  • Check data preprocessing/normalization
Diagnosis: Class-specific issues
Possible causes:
  • Insufficient training samples for that class
  • Class is visually similar to others
  • Mislabeled data
Solutions:
  • Collect more data for low-performing classes
  • Use class weights or Focal Loss
  • Increase augmentation for rare classes
  • Review confusion matrix to identify confused pairs
Diagnosis: Severe class imbalance or learning failure
Solutions:
  • Use Focal Loss instead of Cross-Entropy
  • Enable weighted sampler
  • Check if dataset is extremely imbalanced
  • Verify learning rate isn’t too high
  • Check if model is actually training (loss decreasing?)
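One common way to derive the class weights mentioned above is inverse frequency: total / (num_classes × count). This is an assumption about the weighting scheme, not necessarily what the platform implements; the sketch just computes the weight list:

```python
def inverse_frequency_weights(class_counts):
    """Weight each class by total / (num_classes * count); rare classes get larger weights."""
    total = sum(class_counts)
    num_classes = len(class_counts)
    return [total / (num_classes * count) for count in class_counts]

# Example: 1000 samples of one family, 100 of another
weights = inverse_frequency_weights([1000, 100])
print(weights)  # [0.55, 5.5]
```

A list like this is the shape of argument that torch.nn.CrossEntropyLoss accepts via weight=torch.tensor(weights), scaling each class's contribution to the loss.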

Reporting Results

When documenting model performance, include:

Essential Metrics

  • Test Accuracy: Overall performance
  • Macro F1 Score: Fair comparison across classes
  • Confusion Matrix: Visual error analysis
  • Per-Class Metrics: Precision, recall, F1 for each family

Training Details

  • Model architecture and size
  • Training hyperparameters (LR, optimizer, scheduler)
  • Dataset split (train/val/test sizes)
  • Training duration and best epoch
  • Hardware used (GPU model)

Example Summary

Model: ResNet50 (Transfer Learning, Fine-tuned)
Dataset: 10 malware families, 1000 samples each
Split: 70% train, 15% val, 15% test

Training:
- Optimizer: AdamW (LR=1e-4, weight_decay=0.01)
- Scheduler: Cosine Annealing
- Loss: Focal Loss (gamma=2.0)
- Epochs: 50 (best at epoch 42)
- Duration: 65 minutes on RTX 3080

Test Set Performance:
- Accuracy: 92.4%
- Macro F1: 0.91
- Macro Precision: 0.92
- Macro Recall: 0.91

Per-Class F1 Range: 0.87 - 0.95
Worst Performing: FamilyB (F1=0.87, often confused with FamilyC)
Best Performing: FamilyA (F1=0.95)

Next Steps

Dataset Preparation

Optimize your dataset to improve model performance

Hyperparameter Tuning

Fine-tune training parameters for better results
