
Overview

The Results & Evaluation page (/results) provides comprehensive analysis of completed training experiments. View training curves, test set performance, confusion matrices, and detailed per-class metrics.
Only completed experiments appear on this page. Experiments must finish training (or be manually stopped) to generate evaluation results.

Page Structure

The page displays:
  • Experiment Counter: Number of completed experiments
  • Experiment Cards: Expandable cards for each experiment (newest first)
  • Per-Experiment Analysis: Training curves and advanced metrics within each card

Experiment Selection

Completed experiments display as expandable cards.

Card Label

The collapsed card shows:
  • Experiment Name: User-defined or auto-generated name
  • Model Name: Model used for training
  • Validation Accuracy: Final validation accuracy (e.g., “Val Acc: 87.3%”)
Example:
ResNet50_Baseline | ResNet50_v1 | Val Acc: 87.3%
Cards are sorted by creation time (newest first). Recent experiments appear at the top.

Expanding Cards

Click a card to expand it and view the full analysis.

Summary Section

Top section of expanded card shows experiment overview.

Experiment Info

Three columns of experiment information, including:
  • Model Name: “ResNet50_v1”
  • Type: “Transfer Learning”, “Custom CNN”, or “Transformer”

Final Metrics Row

5 metric cards showing final validation performance:

Val Loss

Final validation loss (e.g., “0.3456”)

Val Accuracy

Classification accuracy (e.g., “87.3%”)

Val Precision

Macro-averaged precision (e.g., “86.5%”)

Val Recall

Macro-averaged recall (e.g., “85.9%”)

Val F1

Macro-averaged F1 score (e.g., “86.2%”)
Macro-averaging computes metric for each class independently, then averages. This treats all classes equally regardless of size.
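Macro-averaging can be sketched in a few lines. The class names and scores below are illustrative only, not values from a real experiment:

```python
# Macro-averaging: compute the metric per class, then take the unweighted mean.
# Class names and F1 values are illustrative.
per_class_f1 = {"Ramnit": 0.90, "Lollipop": 0.80, "Kelihos": 0.40}

macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(round(macro_f1, 2))  # 0.7
```

Note that the rare class ("Kelihos" here) pulls the macro average down just as much as a large class would — that is the point of macro-averaging.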

Training Curves Tab

First tab shows training history visualizations.

Core Training Metrics (Row 1)

Three charts side by side, including:
Train vs Validation Loss
  • Blue line: Training loss per epoch
  • Red line: Validation loss per epoch
  • X-axis: Epoch number
  • Y-axis: Loss value
What to look for:
  • Both curves should decrease over time
  • Validation loss should follow training loss
  • Gap between curves = overfitting
  • Divergence = training instability
Ideal convergence: Loss decreases smoothly, accuracy increases steadily, train/val curves stay close together.

Learning Dynamics (Row 2)

Three additional charts, including:
LR Schedule Visualization
  • Shows learning rate per epoch
  • Constant: Flat line
  • ReduceLROnPlateau: Stepped decreases
  • Cosine Annealing: Smooth cosine curve
What to look for:
  • LR reductions correlate with loss plateaus
  • Verify schedule executed as expected
These charts help diagnose training issues: overfitting, underfitting, learning rate problems, and convergence quality.
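As a reference for reading the LR chart, the cosine-annealing shape can be reproduced with the standard formula (the function name and default values here are illustrative, but the curve mirrors what schedulers like PyTorch's `CosineAnnealingLR` produce):

```python
import math

def cosine_annealing_lr(epoch, total_epochs, lr_max=1e-3, lr_min=1e-5):
    """Smooth cosine decay from lr_max at epoch 0 to lr_min at the final epoch."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))

schedule = [cosine_annealing_lr(e, total_epochs=50) for e in range(51)]
# The chart should start at lr_max and glide smoothly down to lr_min.
```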

Export Section

At the bottom of the Training Curves tab are two download buttons:
  1. Download Training History (CSV)
    • Exports all epoch-level metrics to CSV
    • Columns: epoch, train_loss, train_acc, val_loss, val_acc, learning_rate, etc.
    • Import into Excel/Python for custom analysis
  2. Download Model (.pt) (currently disabled)
    • Will export trained PyTorch model weights
    • Load with torch.load() for inference
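Once downloaded, the history CSV can be analyzed with nothing more than the standard library. The sample rows below are made up, but the column names follow the export description above:

```python
import csv, io

# Tiny stand-in for an exported training history CSV (values are illustrative).
sample = """epoch,train_loss,train_acc,val_loss,val_acc,learning_rate
1,1.20,0.55,1.10,0.58,0.001
2,0.80,0.70,0.85,0.68,0.001
3,0.55,0.80,0.70,0.74,0.0005
"""

rows = list(csv.DictReader(io.StringIO(sample)))
best = max(rows, key=lambda r: float(r["val_acc"]))
print(best["epoch"], best["val_acc"])  # 3 0.74
```

In practice you would replace `io.StringIO(sample)` with `open("training_history.csv")`.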

Advanced Metrics Tab

Second tab runs test set evaluation and displays detailed performance analysis.
Test evaluation runs automatically when you open this tab. Results are cached for subsequent views.

Test Set Performance

Accuracy Summary Cards: Displays test set metrics in card format:
  • Test Accuracy: Overall classification accuracy
  • Test Precision: Macro-averaged precision
  • Test Recall: Macro-averaged recall
  • Test F1 Score: Macro-averaged F1
Test metrics are typically slightly lower than validation metrics since the model never saw test data during training.

Confusion Matrix

Left column: Heatmap visualization

Matrix Structure

  • Rows: True labels (actual classes)
  • Columns: Predicted labels
  • Diagonal: Correct predictions (darker = more correct)
  • Off-diagonal: Misclassifications

Color Scale

  • Dark colors: High counts
  • Light colors: Low counts
  • Helps identify confusion patterns

Interpretation

  • Strong diagonal: Good performance
  • Off-diagonal clusters: Specific confusion pairs
  • Example: Ramnit confused with Lollipop (a dark off-diagonal cell)
How to use:
  • Hover over cells to see exact counts
  • Identify which classes are frequently confused
  • Diagonal dominance = good classification
Heavy off-diagonal values indicate systematic misclassification. Investigate whether those classes are visually similar.
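The matrix logic itself is simple to sketch. The labels and predictions below are hypothetical, but the row/column convention matches the chart (rows = true, columns = predicted):

```python
from collections import Counter

# Hypothetical labels for three classes; names and predictions are illustrative.
classes = ["Ramnit", "Lollipop", "Kelihos"]
y_true = ["Ramnit", "Ramnit", "Lollipop", "Lollipop", "Kelihos", "Ramnit"]
y_pred = ["Ramnit", "Lollipop", "Lollipop", "Lollipop", "Kelihos", "Ramnit"]

counts = Counter(zip(y_true, y_pred))
matrix = [[counts[(t, p)] for p in classes] for t in classes]  # rows = true, cols = predicted

# Largest off-diagonal cell = most systematically confused class pair.
worst_pair = max(((t, p) for t in classes for p in classes if t != p),
                 key=lambda tp: counts[tp])
print(matrix)      # [[2, 1, 0], [0, 2, 0], [0, 0, 1]]
print(worst_pair)  # ('Ramnit', 'Lollipop')
```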

Per-Class Metrics

Right column: Bar chart showing Precision, Recall, and F1 for each malware family.
  • Precision bars: Green
  • Recall bars: Blue
  • F1 bars: Purple
  • X-axis: Class names
  • Y-axis: Metric value (0-1)
What to look for:
  • High F1: Model performs well on this class
  • Low F1: Struggling class (investigate why)
  • High precision, low recall: Model is cautious (few predictions, mostly correct)
  • Low precision, high recall: Model is aggressive (many predictions, many wrong)
Identify underperforming classes and consider:
  • Adding more training samples
  • Increasing selective augmentation
  • Reviewing data quality

Classification Report Table

Bottom section: A detailed table gives a tabular breakdown of per-class metrics:

| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Ramnit | 0.87 | 0.89 | 0.88 | 150 |
| Lollipop | 0.82 | 0.79 | 0.80 | 142 |
| Macro Avg | 0.85 | 0.86 | 0.85 | 3000 |
| Weighted Avg | 0.86 | 0.87 | 0.86 | 3000 |
Columns:
  • Class: Malware family name
  • Precision: TP / (TP + FP)
  • Recall: TP / (TP + FN)
  • F1-Score: Harmonic mean of precision and recall
  • Support: Number of test samples for this class
Averages:
  • Macro Avg: Unweighted mean (all classes equal)
  • Weighted Avg: Weighted by support (larger classes matter more)
Use Macro Avg to evaluate overall performance treating all classes equally. Use Weighted Avg if class sizes reflect real-world deployment.
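The difference between the two averages is easy to see with a small imbalanced example (F1 scores and supports below are illustrative):

```python
# Per-class F1 and support (test sample counts); values are illustrative.
f1 = {"Ramnit": 0.88, "Lollipop": 0.80, "Kelihos": 0.60}
support = {"Ramnit": 150, "Lollipop": 142, "Kelihos": 8}

macro = sum(f1.values()) / len(f1)                      # all classes weighted equally
total = sum(support.values())
weighted = sum(f1[c] * support[c] / total for c in f1)  # larger classes dominate

print(round(macro, 3), round(weighted, 3))  # 0.76 0.835
```

The weak rare class drags the macro average well below the weighted one — a gap between the two is itself a signal of uneven per-class performance.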

Interpreting Results

Good Performance Indicators

Training curves:
  • Smooth loss decrease
  • Steady accuracy increase
  • Small train/val gap (<5%)
Test metrics:
  • Test accuracy close to validation accuracy
  • High F1 scores (>0.85)
  • Balanced precision and recall
Confusion matrix:
  • Strong diagonal
  • Minimal off-diagonal clusters

Poor Performance Indicators

Overfitting:
  • Train accuracy >> Val accuracy (gap >10%)
  • Training loss decreases but val loss increases
  • Solutions: Add regularization, reduce model size, increase dropout, more data augmentation
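The classic overfitting signature — training loss still falling while validation loss turns upward — can be detected programmatically from the history. The loss values below are made up for illustration:

```python
# Flag the first epoch where val loss rises while train loss keeps falling
# (a common overfitting signature). History values are illustrative.
train_loss = [1.00, 0.70, 0.50, 0.35, 0.25, 0.18]
val_loss   = [1.10, 0.80, 0.60, 0.55, 0.60, 0.68]

def first_overfit_epoch(train, val):
    for e in range(1, len(val)):
        if val[e] > val[e - 1] and train[e] < train[e - 1]:
            return e  # 0-indexed epoch where divergence begins
    return None

print(first_overfit_epoch(train_loss, val_loss))  # 4
```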
Underfitting:
  • Both train and val accuracy are low
  • Loss plateaus at high value
  • Solutions: Increase model capacity, train longer, increase learning rate
Class confusion:
  • Specific off-diagonal clusters in confusion matrix
  • Low recall for certain classes
  • Solutions: Collect more data for confused classes, apply selective augmentation, use class weights
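One common way to derive class weights is inverse frequency, sketched below with made-up sample counts. The resulting values are the kind typically passed to a weighted loss (e.g. PyTorch's `CrossEntropyLoss(weight=...)`):

```python
# Inverse-frequency class weights: under-represented classes get larger weights.
# Sample counts are illustrative.
counts = {"Ramnit": 1500, "Lollipop": 1400, "Kelihos": 100}

total = sum(counts.values())
weights = {c: total / (len(counts) * n) for c, n in counts.items()}
print({c: round(w, 2) for c, w in weights.items()})
# {'Ramnit': 0.67, 'Lollipop': 0.71, 'Kelihos': 10.0}
```

With these weights, each mistake on the rare class costs the model roughly as much as fifteen mistakes on a common one.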

Experiment Comparison

Compare multiple experiments to identify best configurations:

Expand Multiple Cards

Open several experiment cards side-by-side (scroll between them)

Compare Metrics

Look at Final Metrics row across experiments
  • Which has highest val accuracy?
  • Which has best F1 score?
  • Which trained fastest?

Analyze Curves

Compare training curves
  • Which converged faster?
  • Which shows better generalization?
  • Which avoided overfitting?

Review Confusion Matrices

Identify which model makes fewer critical misclassifications
Download CSV files for all experiments and create comparison plots in Python/Excel for side-by-side analysis.
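A downstream comparison can be as simple as collecting each experiment's final metrics and ranking them. Experiment names and values below are hypothetical:

```python
# Final metrics per experiment (hypothetical names and values).
experiments = {
    "ResNet50_Baseline":  {"val_acc": 0.873, "val_f1": 0.862},
    "ResNet50_Augmented": {"val_acc": 0.881, "val_f1": 0.875},
    "CustomCNN_Small":    {"val_acc": 0.840, "val_f1": 0.821},
}

best = max(experiments, key=lambda name: experiments[name]["val_f1"])
print(best)  # ResNet50_Augmented
```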

Metric Definitions

Accuracy

Formula: (TP + TN) / (TP + TN + FP + FN)
  • Percentage of correct predictions
  • Caution: Can be misleading with imbalanced data
  • Example: 95% accuracy on 95% majority class = useless model
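The majority-class pitfall is worth seeing numerically. The toy dataset below is illustrative:

```python
# A majority-class predictor on imbalanced data: high accuracy, zero utility.
y_true = ["benign"] * 95 + ["malware"] * 5
y_pred = ["benign"] * 100  # always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
malware_recall = sum(t == p == "malware" for t, p in zip(y_true, y_pred)) / 5

print(accuracy, malware_recall)  # 0.95 0.0
```

Accuracy looks excellent, yet the model catches no malware at all — which is why per-class recall and F1 matter.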

Precision

Formula: TP / (TP + FP)
  • Of all positive predictions, how many were correct?
  • High precision = few false alarms
  • Use case: When false positives are costly

Recall (Sensitivity)

Formula: TP / (TP + FN)
  • Of all actual positives, how many did we find?
  • High recall = few missed detections
  • Use case: When false negatives are costly (e.g., malware detection)

F1 Score

Formula: 2 × (Precision × Recall) / (Precision + Recall)
  • Harmonic mean of precision and recall
  • Balances both metrics
  • Use case: When you need balanced performance
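The formula above is straightforward to compute, and the harmonic mean's behavior under imbalance is the key property:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The harmonic mean punishes imbalance: the arithmetic mean of 0.9 and 0.2
# would be 0.55, but F1 is dragged down toward the weaker metric.
print(round(f1_score(0.9, 0.2), 3))  # 0.327
```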
For malware classification, recall is often more important than precision. Missing malware (false negative) is worse than flagging benign files (false positive).

Tips & Best Practices

Focus on F1 Score: It balances precision and recall, providing a single metric for model quality.
Check Overfitting Gap: Train/val gap > 10% indicates overfitting. Add regularization or more data.
Analyze Confusion Matrix: Identify specific class pairs that are confused. This guides data collection and augmentation.
Don’t rely solely on accuracy with imbalanced data. A model predicting only the majority class can have high accuracy but zero utility.
Export History CSV: Download training history for custom visualizations and deeper analysis in Jupyter/Excel.

Next Steps

After analyzing results:

Interpretability Tools

Visualize model attention with Grad-CAM and explore embeddings
