Overview
Model evaluation measures how well your trained model performs on unseen data. The UC Intel Final platform tracks multiple metrics during training and provides comprehensive test set evaluation.
Training Metrics
During training, the platform computes and tracks metrics for both training and validation sets. Source: app/training/engine.py:63-130
Metrics Tracked Per Epoch
The training engine computes five key metrics: loss, accuracy, precision, recall, and F1 score.
Training Epoch Metrics Computation
Source: app/training/engine.py:112-130
All classification metrics (precision, recall, F1) use macro averaging, which computes metrics for each class independently and takes the unweighted mean. This treats all classes equally regardless of size.
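As an illustration of macro averaging, here is a minimal pure-Python sketch of macro precision (illustrative only, not the platform's implementation in app/training/engine.py):

```python
def macro_precision(y_true, y_pred):
    """Compute precision per class, then take the unweighted mean."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if p == c and t == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if p == c and t != c)
        per_class.append(tp / (tp + fp) if (tp + fp) else 0.0)
    # Unweighted mean: every class counts equally, regardless of size
    return sum(per_class) / len(per_class)

y_true = ["A", "A", "A", "B", "B"]
y_pred = ["A", "A", "B", "B", "B"]
print(macro_precision(y_true, y_pred))  # (1.0 + 2/3) / 2 ≈ 0.833
```

Because the mean is unweighted, a small class with poor precision pulls the macro score down just as much as a large one would.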
Core Metrics Explained
Accuracy
Definition: Percentage of correctly classified samples.
Formula: Accuracy = Correct Predictions / Total Predictions
When to use:
- Balanced datasets (all classes have similar sample counts)
- Quick overall performance assessment
Limitations:
- Misleading on imbalanced datasets
- Doesn’t show per-class performance
Example:
- 90% accuracy on balanced 10-class dataset = excellent
- 90% accuracy when 90% of data is one class = poor (just predicting majority)
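The imbalanced case above can be reproduced in a few lines (an illustrative sketch, not platform code):

```python
# A "classifier" that always predicts the majority class
y_true = ["benign"] * 90 + ["malware"] * 10   # 90% of samples are one class
y_pred = ["benign"] * 100                     # model just predicts the majority

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.9 — looks good on paper

# But recall on the minority "malware" class is zero
malware_recall = sum(1 for t, p in zip(y_true, y_pred)
                     if t == "malware" and p == "malware") / 10
print(malware_recall)  # 0.0 — every malware sample is missed
```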
Precision
Definition: Of all samples predicted as a class, what percentage actually belong to that class?
Formula: Precision = True Positives / (True Positives + False Positives)
Interpretation:
- High precision: Few false alarms, model is conservative
- Low precision: Many false alarms, model over-predicts this class
Prioritize precision:
- When false positives are costly
- Example: Flagging benign files as malware (user annoyance)
Recall (Sensitivity)
Definition: Of all samples that actually belong to a class, what percentage did we predict correctly?
Formula: Recall = True Positives / (True Positives + False Negatives)
Interpretation:
- High recall: Few missed cases, model catches most instances
- Low recall: Many missed cases, model is too conservative
Prioritize recall:
- When false negatives are costly
- Example: Missing actual malware (security risk)
F1 Score
Definition: Harmonic mean of precision and recall.
Formula: F1 = 2 × (Precision × Recall) / (Precision + Recall)
Interpretation:
- Balanced metric between precision and recall
- Good when you care about both false positives and false negatives
- More suitable than accuracy for imbalanced datasets
When to use:
- Imbalanced datasets
- When both precision and recall matter
- Comparing models fairly across classes
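To make the three formulas concrete, a small sketch computing them from raw counts (illustrative only; the counts are made up):

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive,
    and false-negative counts for a single class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# 80 correct detections, 20 false alarms, 40 missed cases
p, r, f = prf1(tp=80, fp=20, fn=40)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
# precision=0.80 recall=0.67 f1=0.73
```

Note how the harmonic mean pulls F1 toward the lower of the two: precision 0.80 and recall 0.67 give F1 ≈ 0.73, not the arithmetic mean 0.73 ≈ 0.735.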
Loss
Definition: Quantifies prediction error using cross-entropy.
Interpretation:
- Lower is better
- Should decrease during training
- More sensitive than accuracy to prediction confidence
Uses:
- Primary optimization target
- Early stopping criterion
- Model selection (choose model with lowest validation loss)
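The confidence-sensitivity point can be shown directly: two predictions with the same (correct) argmax, hence identical accuracy, can have very different cross-entropy. A minimal sketch using the standard formula:

```python
import math

def cross_entropy(probs, true_idx):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[true_idx])

# Both predictions pick class 0 (correct), so accuracy is identical,
# but the confident prediction gets a much lower loss.
confident = [0.95, 0.03, 0.02]
hesitant = [0.40, 0.35, 0.25]
print(cross_entropy(confident, 0))  # ~0.051
print(cross_entropy(hesitant, 0))   # ~0.916
```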
Macro vs. Weighted vs. Micro Averaging
The platform uses macro averaging by default.
Macro Average (Used in Platform)
Definition: Compute metric for each class, then take unweighted mean.
Characteristics:
- Treats all classes equally
- Good for imbalanced datasets
- Shows if model works well across all classes
Example:
- Class A (1000 samples): Precision = 0.95
- Class B (100 samples): Precision = 0.60
- Macro precision = (0.95 + 0.60) / 2 = 0.775
Weighted Average
Definition: Compute metric for each class, then take weighted mean by class size.
Characteristics:
- Weighs classes by support (number of samples)
- Closer to overall accuracy
- Large classes dominate the metric
Example:
- Class A (1000 samples): Precision = 0.95
- Class B (100 samples): Precision = 0.60
- Weighted precision = (0.95×1000 + 0.60×100) / 1100 ≈ 0.92
Micro Average
Definition: Aggregate all predictions, then compute metric globally.
Characteristics:
- Equivalent to accuracy for multi-class
- Every sample weighted equally
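The averaging schemes can be compared numerically; this sketch reuses the per-class precisions and supports from the examples above:

```python
# Per-class (precision, support), matching the examples above
per_class = {"A": (0.95, 1000), "B": (0.60, 100)}

# Macro: unweighted mean — both classes count equally
macro = sum(p for p, _ in per_class.values()) / len(per_class)

# Weighted: mean weighted by support — class A dominates
total = sum(n for _, n in per_class.values())
weighted = sum(p * n for p, n in per_class.values()) / total

print(macro)     # 0.775
print(weighted)  # ~0.918 (≈ 0.92)
```

The gap between the two (0.775 vs ≈0.92) is itself diagnostic: a weighted score far above the macro score means small classes are underperforming.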
Training Loop Evaluation
Per-Epoch Metrics Display
Source: app/training/engine.py:254-263
* indicates this epoch achieved the best validation loss so far.
Best Model Selection
Source: app/training/engine.py:240-247
Best model criterion: Lowest validation loss (not highest accuracy)
Why loss instead of accuracy?
- Loss is more sensitive to prediction confidence
- Loss reflects probability calibration better
- Loss captures near-misses that accuracy doesn’t
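A minimal sketch of this selection criterion (an illustration only, not the engine's actual checkpointing code):

```python
def select_best_epoch(history):
    """Return the epoch with the lowest validation loss.

    history: list of (val_loss, val_acc) tuples, one per epoch.
    """
    best_epoch, best_loss = None, float("inf")
    for epoch, (val_loss, _val_acc) in enumerate(history, start=1):
        if val_loss < best_loss:
            best_loss = val_loss
            best_epoch = epoch  # a real loop would save a checkpoint here
    return best_epoch, best_loss

# Epoch 3 has the highest accuracy, but epoch 2 has the lowest loss
history = [(0.90, 0.70), (0.55, 0.82), (0.60, 0.85)]
print(select_best_epoch(history))  # (2, 0.55)
```

Note the example where loss and accuracy disagree: epoch 3's higher accuracy comes with worse-calibrated (less confident or more overconfident) probabilities, which the loss penalizes.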
Test Set Evaluation
After training, run comprehensive evaluation on the held-out test set. Source: app/training/evaluator.py:15-111
Running Test Evaluation
Evaluation Pipeline
Metrics Returned
Source: app/training/evaluator.py:99-111
Confusion Matrix
The confusion matrix shows where the model makes mistakes. Source: app/training/evaluator.py:81
Interpreting the Confusion Matrix
- Diagonal: Correct predictions (want these high)
- Off-diagonal: Misclassifications
- Row: Shows where actual class samples were predicted
- Column: Shows what the model predicted
Example:
- Row B, Column C = 7: Seven B samples incorrectly classified as C
- Model confuses B and C more than other pairs → investigate similarity
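A minimal sketch of building and reading a confusion matrix (illustrative only; the platform builds its matrix in app/training/evaluator.py):

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows = actual class, columns = predicted class."""
    index = {c: i for i, c in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

labels = ["A", "B", "C"]
y_true = ["A", "A", "B", "B", "B", "C"]
y_pred = ["A", "A", "B", "C", "C", "C"]
m = confusion_matrix(y_true, y_pred, labels)
# m[row][col]: row B, column C counts B samples misclassified as C
print(m)  # [[2, 0, 0], [0, 1, 2], [0, 0, 1]]
```

Here two of the three B samples land in column C, flagging a B→C confusion worth investigating.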
Classification Report
Source: app/training/evaluator.py:83-89
Example Report
- Per-class metrics: Performance on each malware family
- Accuracy: Overall accuracy
- Macro avg: Unweighted average across classes
- Weighted avg: Weighted by class size
- Support: Number of samples per class
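A report of this shape can be generated with scikit-learn's classification_report (a sketch; whether the evaluator uses exactly this call is an assumption, and the labels here are made up):

```python
from sklearn.metrics import classification_report

y_true = ["A", "A", "B", "B", "B", "C"]
y_pred = ["A", "A", "B", "C", "C", "C"]

# Text table with per-class precision/recall/F1/support,
# plus accuracy, macro avg, and weighted avg rows
report = classification_report(y_true, y_pred, digits=2)
print(report)
```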
Interpreting Training Behavior
Healthy Training
- Loss decreases steadily for both train and val
- Val accuracy improves over time
- Gap between train and val is moderate (<10%)
- Best model found at epoch 40
Overfitting
- Train accuracy increases but val accuracy decreases
- Val loss increases while train loss decreases
- Large gap between train and val performance (>15%)
Solutions:
- Increase dropout
- Enable/increase L2 regularization
- Add more data augmentation
- Use smaller model
- Enable early stopping (would have stopped at epoch 20)
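Patience-based early stopping is simple to sketch (illustrative only, not the platform's implementation):

```python
def should_stop(val_losses, patience=5):
    """Stop when validation loss hasn't improved for `patience` epochs."""
    if len(val_losses) <= patience:
        return False
    best = min(val_losses[:-patience])
    # No loss in the last `patience` epochs beat the earlier best
    return min(val_losses[-patience:]) >= best

# Loss improves until epoch 3, then stalls; with patience=3 we stop here
losses = [1.0, 0.8, 0.6, 0.65, 0.7, 0.66]
print(should_stop(losses, patience=3))  # True
```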
Underfitting
- Both train and val accuracy are low
- Loss decreases very slowly
- Performance plateaus at poor level
Solutions:
- Increase model capacity (more layers/filters)
- Decrease regularization (lower dropout, remove L2)
- Increase learning rate
- Train for more epochs
- Check data preprocessing
Learning Rate Issues
Too high: Loss oscillates or diverges rather than decreasing steadily.
Too low: Loss decreases very slowly and training may plateau before converging.
Model Comparison
When comparing multiple models:
Metrics to Compare
| Metric | Priority | Use Case |
|---|---|---|
| Test Accuracy | High | Balanced datasets, quick comparison |
| Macro F1 | High | Imbalanced datasets, fair comparison |
| Per-class F1 | High | Identify which families are hard |
| Confusion Matrix | Medium | Understand error patterns |
| Training Time | Medium | Production constraints |
| Model Size | Low | Deployment on edge devices |
| Inference Speed | Low | Real-time requirements |
Example Comparison
| Model | Test Acc | Macro F1 | Train Time | Parameters | Best For |
|---|---|---|---|---|---|
| Custom CNN | 82.3% | 0.81 | 15 min | 500K | Small datasets |
| ResNet50 (frozen) | 89.7% | 0.88 | 25 min | 25M | General use |
| ResNet50 (fine-tuned) | 92.4% | 0.91 | 60 min | 25M | Best accuracy |
| EfficientNet-B3 | 93.1% | 0.92 | 80 min | 12M | Balance |
| Vision Transformer | 93.8% | 0.93 | 120 min | 86M | Maximum accuracy |
Best Practices
During Training
After Training
Always Evaluate on Test Set
Test set gives true performance estimate. Never use validation metrics as final results.
Analyze Confusion Matrix
Identify which families are confused → may need more data or better features.
Check Per-Class Metrics
Ensure no class is performing significantly worse (F1 < 0.7 while others > 0.9).
Imbalanced Datasets
Common Issues
High Train Acc, Low Val Acc
Diagnosis: Overfitting
Solutions:
- Increase dropout to 0.5-0.7
- Enable L2 regularization (0.0001-0.001)
- Add more data augmentation
- Use smaller model
- Increase dataset size
- Enable early stopping
Low Acc on Both Train and Val
Diagnosis: Underfitting
Solutions:
- Use larger model (more layers/filters)
- Decrease dropout
- Remove L2 regularization
- Increase learning rate
- Train longer
- Check data preprocessing/normalization
Some Classes Have Low F1
Diagnosis: Class-specific issues
Possible causes:
- Insufficient training samples for that class
- Class is visually similar to others
- Mislabeled data
Solutions:
- Collect more data for low-performing classes
- Use class weights or Focal Loss
- Increase augmentation for rare classes
- Review confusion matrix to identify confused pairs
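Class weights are often derived from inverse class frequency; a minimal sketch (illustrative; the function name and normalization are not from the platform):

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class inversely to its sample count, so rare classes
    contribute more to the loss. Normalized so the mean weight is 1."""
    counts = Counter(labels)
    raw = {c: 1.0 / n for c, n in counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {c: w / mean for c, w in raw.items()}

labels = ["A"] * 900 + ["B"] * 100
weights = inverse_frequency_weights(labels)
print(weights)  # {'A': 0.2, 'B': 1.8} — the rare class B is upweighted 9×
```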
Model Predicts Only One Class
Diagnosis: Severe class imbalance or learning failure
Solutions:
- Use Focal Loss instead of Cross-Entropy
- Enable weighted sampler
- Check if dataset is extremely imbalanced
- Verify learning rate isn’t too high
- Check if model is actually training (loss decreasing?)
Reporting Results
When documenting model performance, include:
Essential Metrics
- Test Accuracy: Overall performance
- Macro F1 Score: Fair comparison across classes
- Confusion Matrix: Visual error analysis
- Per-Class Metrics: Precision, recall, F1 for each family
Training Details
- Model architecture and size
- Training hyperparameters (LR, optimizer, scheduler)
- Dataset split (train/val/test sizes)
- Training duration and best epoch
- Hardware used (GPU model)
Example Summary
Next Steps
Dataset Preparation
Optimize your dataset to improve model performance
Hyperparameter Tuning
Fine-tune training parameters for better results