
Overview

The Results & Evaluation page (/results) provides comprehensive analysis of completed training experiments. View training curves, test set performance, confusion matrices, and detailed per-class metrics.
Only completed experiments appear on this page. Experiments must finish training (or be manually stopped) to generate evaluation results.

Page Structure

The page displays:
  • Experiment Counter: Number of completed experiments
  • Experiment Cards: Expandable cards for each experiment (newest first)
  • Per-Experiment Analysis: Training curves and advanced metrics within each card

Experiment Selection

Completed experiments display as expandable cards.

Card Label

The collapsed card shows:
  • Experiment Name: User-defined or auto-generated name
  • Model Name: Model used for training
  • Validation Accuracy: Final validation accuracy (e.g., “Val Acc: 87.3%”)
Example:
ResNet50_Baseline | ResNet50_v1 | Val Acc: 87.3%
Cards are sorted by creation time (newest first). Recent experiments appear at the top.

Expanding Cards

Click a card to expand it and view the full analysis.

Summary Section

Top section of expanded card shows experiment overview.

Experiment Info

Three columns of experiment information, including:
  • Model Name: “ResNet50_v1”
  • Type: “Transfer Learning”, “Custom CNN”, or “Transformer”

Final Metrics Row

5 metric cards showing final validation performance:

Val Loss

Final validation loss (e.g., “0.3456”)

Val Accuracy

Classification accuracy (e.g., “87.3%”)

Val Precision

Macro-averaged precision (e.g., “86.5%”)

Val Recall

Macro-averaged recall (e.g., “85.9%”)

Val F1

Macro-averaged F1 score (e.g., “86.2%”)
Macro-averaging computes metric for each class independently, then averages. This treats all classes equally regardless of size.
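Macro-averaging can be sketched in a few lines. The class names and scores below are illustrative only, not values from a real experiment:

```python
# Macro-averaging: compute the metric per class, then take the unweighted mean.
# Class names and F1 values are illustrative.
per_class_f1 = {"Ramnit": 0.90, "Lollipop": 0.80, "Kelihos": 0.40}

macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(round(macro_f1, 2))  # 0.7
```

Note that the rare class ("Kelihos" here) pulls the macro average down just as much as a large class would — that is the point of macro-averaging.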

Training Curves Tab

First tab shows training history visualizations.

Core Training Metrics (Row 1)

Three charts side by side, including:
Train vs Validation Loss
  • Blue line: Training loss per epoch
  • Red line: Validation loss per epoch
  • X-axis: Epoch number
  • Y-axis: Loss value
What to look for:
  • Both curves should decrease over time
  • Validation loss should follow training loss
  • Gap between curves = overfitting
  • Divergence = training instability
Ideal convergence: Loss decreases smoothly, accuracy increases steadily, train/val curves stay close together.

Learning Dynamics (Row 2)

Three additional charts, including:
LR Schedule Visualization
  • Shows learning rate per epoch
  • Constant: Flat line
  • ReduceLROnPlateau: Stepped decreases
  • Cosine Annealing: Smooth cosine curve
What to look for:
  • LR reductions correlate with loss plateaus
  • Verify schedule executed as expected
These charts help diagnose training issues: overfitting, underfitting, learning rate problems, and convergence quality.
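As a reference for reading the LR chart, the cosine-annealing shape can be reproduced with the standard formula (the function name and default values here are illustrative, but the curve mirrors what schedulers like PyTorch's `CosineAnnealingLR` produce):

```python
import math

def cosine_annealing_lr(epoch, total_epochs, lr_max=1e-3, lr_min=1e-5):
    """Smooth cosine decay from lr_max at epoch 0 to lr_min at the final epoch."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))

schedule = [cosine_annealing_lr(e, total_epochs=50) for e in range(51)]
# The chart should start at lr_max and glide smoothly down to lr_min.
```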

Export Section

At the bottom of the Training Curves tab are two download buttons:
  1. Download Training History (CSV)
    • Exports all epoch-level metrics to CSV
    • Columns: epoch, train_loss, train_acc, val_loss, val_acc, learning_rate, etc.
    • Import into Excel/Python for custom analysis
  2. Download Model (.pt) (currently disabled)
    • Will export trained PyTorch model weights
    • Load with torch.load() for inference
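Once downloaded, the history CSV can be analyzed with nothing more than the standard library. The sample rows below are made up, but the column names follow the export description above:

```python
import csv, io

# Tiny stand-in for an exported training history CSV (values are illustrative).
sample = """epoch,train_loss,train_acc,val_loss,val_acc,learning_rate
1,1.20,0.55,1.10,0.58,0.001
2,0.80,0.70,0.85,0.68,0.001
3,0.55,0.80,0.70,0.74,0.0005
"""

rows = list(csv.DictReader(io.StringIO(sample)))
best = max(rows, key=lambda r: float(r["val_acc"]))
print(best["epoch"], best["val_acc"])  # 3 0.74
```

In practice you would replace `io.StringIO(sample)` with `open("training_history.csv")`.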

Advanced Metrics Tab

Second tab runs test set evaluation and displays detailed performance analysis.
Test evaluation runs automatically when you open this tab. Results are cached for subsequent views.

Test Set Performance

Accuracy Summary Cards: Displays test set metrics in card format:
  • Test Accuracy: Overall classification accuracy
  • Test Precision: Macro-averaged precision
  • Test Recall: Macro-averaged recall
  • Test F1 Score: Macro-averaged F1
Test metrics are typically slightly lower than validation metrics since the model never saw test data during training.

Confusion Matrix

Left column: Heatmap visualization

Matrix Structure

  • Rows: True labels (actual classes)
  • Columns: Predicted labels
  • Diagonal: Correct predictions (darker = more correct)
  • Off-diagonal: Misclassifications

Color Scale

  • Dark colors: High counts
  • Light colors: Low counts
  • Helps identify confusion patterns

Interpretation

  • Strong diagonal: Good performance
  • Off-diagonal clusters: Specific confusion pairs
  • Example: Ramnit confused with Lollipop (a dark off-diagonal cell)
How to use:
  • Hover over cells to see exact counts
  • Identify which classes are frequently confused
  • Diagonal dominance = good classification
Heavy off-diagonal values indicate systematic misclassification. Investigate whether those classes are visually similar.
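The matrix logic itself is simple to sketch. The labels and predictions below are hypothetical, but the row/column convention matches the chart (rows = true, columns = predicted):

```python
from collections import Counter

# Hypothetical labels for three classes; names and predictions are illustrative.
classes = ["Ramnit", "Lollipop", "Kelihos"]
y_true = ["Ramnit", "Ramnit", "Lollipop", "Lollipop", "Kelihos", "Ramnit"]
y_pred = ["Ramnit", "Lollipop", "Lollipop", "Lollipop", "Kelihos", "Ramnit"]

counts = Counter(zip(y_true, y_pred))
matrix = [[counts[(t, p)] for p in classes] for t in classes]  # rows = true, cols = predicted

# Largest off-diagonal cell = most systematically confused class pair.
worst_pair = max(((t, p) for t in classes for p in classes if t != p),
                 key=lambda tp: counts[tp])
print(matrix)      # [[2, 1, 0], [0, 2, 0], [0, 0, 1]]
print(worst_pair)  # ('Ramnit', 'Lollipop')
```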

Per-Class Metrics

Right column: Bar chart showing Precision, Recall, and F1 for each malware family.
  • Precision bars: Green
  • Recall bars: Blue
  • F1 bars: Purple
  • X-axis: Class names
  • Y-axis: Metric value (0-1)
What to look for:
  • High F1: Model performs well on this class
  • Low F1: Struggling class (investigate why)
  • High precision, low recall: Model is cautious (few predictions, mostly correct)
  • Low precision, high recall: Model is aggressive (many predictions, many wrong)
Identify underperforming classes and consider:
  • Adding more training samples
  • Increasing selective augmentation
  • Reviewing data quality

Classification Report Table

Bottom section: A detailed table gives a tabular breakdown of per-class metrics:

| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Ramnit | 0.87 | 0.89 | 0.88 | 150 |
| Lollipop | 0.82 | 0.79 | 0.80 | 142 |
| Macro Avg | 0.85 | 0.86 | 0.85 | 3000 |
| Weighted Avg | 0.86 | 0.87 | 0.86 | 3000 |
Columns:
  • Class: Malware family name
  • Precision: TP / (TP + FP)
  • Recall: TP / (TP + FN)
  • F1-Score: Harmonic mean of precision and recall
  • Support: Number of test samples for this class
Averages:
  • Macro Avg: Unweighted mean (all classes equal)
  • Weighted Avg: Weighted by support (larger classes matter more)
Use Macro Avg to evaluate overall performance treating all classes equally. Use Weighted Avg if class sizes reflect real-world deployment.
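The difference between the two averages is easy to see with a small imbalanced example (F1 scores and supports below are illustrative):

```python
# Per-class F1 and support (test sample counts); values are illustrative.
f1 = {"Ramnit": 0.88, "Lollipop": 0.80, "Kelihos": 0.60}
support = {"Ramnit": 150, "Lollipop": 142, "Kelihos": 8}

macro = sum(f1.values()) / len(f1)                      # all classes weighted equally
total = sum(support.values())
weighted = sum(f1[c] * support[c] / total for c in f1)  # larger classes dominate

print(round(macro, 3), round(weighted, 3))  # 0.76 0.835
```

The weak rare class drags the macro average well below the weighted one — a gap between the two is itself a signal of uneven per-class performance.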

Interpreting Results

Good Performance Indicators

Training curves:
  • Smooth loss decrease
  • Steady accuracy increase
  • Small train/val gap (<5%)
Test metrics:
  • Test accuracy close to validation accuracy
  • High F1 scores (>0.85)
  • Balanced precision and recall
Confusion matrix:
  • Strong diagonal
  • Minimal off-diagonal clusters

Poor Performance Indicators

Overfitting:
  • Train accuracy >> Val accuracy (gap >10%)
  • Training loss decreases but val loss increases
  • Solutions: Add regularization, reduce model size, increase dropout, more data augmentation
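The classic overfitting signature — training loss still falling while validation loss turns upward — can be detected programmatically from the history. The loss values below are made up for illustration:

```python
# Flag the first epoch where val loss rises while train loss keeps falling
# (a common overfitting signature). History values are illustrative.
train_loss = [1.00, 0.70, 0.50, 0.35, 0.25, 0.18]
val_loss   = [1.10, 0.80, 0.60, 0.55, 0.60, 0.68]

def first_overfit_epoch(train, val):
    for e in range(1, len(val)):
        if val[e] > val[e - 1] and train[e] < train[e - 1]:
            return e  # 0-indexed epoch where divergence begins
    return None

print(first_overfit_epoch(train_loss, val_loss))  # 4
```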
Underfitting:
  • Both train and val accuracy are low
  • Loss plateaus at high value
  • Solutions: Increase model capacity, train longer, increase learning rate
Class confusion:
  • Specific off-diagonal clusters in confusion matrix
  • Low recall for certain classes
  • Solutions: Collect more data for confused classes, apply selective augmentation, use class weights
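One common way to derive class weights is inverse frequency, sketched below with made-up sample counts. The resulting values are the kind typically passed to a weighted loss (e.g. PyTorch's `CrossEntropyLoss(weight=...)`):

```python
# Inverse-frequency class weights: under-represented classes get larger weights.
# Sample counts are illustrative.
counts = {"Ramnit": 1500, "Lollipop": 1400, "Kelihos": 100}

total = sum(counts.values())
weights = {c: total / (len(counts) * n) for c, n in counts.items()}
print({c: round(w, 2) for c, w in weights.items()})
# {'Ramnit': 0.67, 'Lollipop': 0.71, 'Kelihos': 10.0}
```

With these weights, each mistake on the rare class costs the model roughly as much as fifteen mistakes on a common one.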

Experiment Comparison

Compare multiple experiments to identify best configurations:

Expand Multiple Cards

Open several experiment cards side-by-side (scroll between them)

Compare Metrics

Look at Final Metrics row across experiments
  • Which has highest val accuracy?
  • Which has best F1 score?
  • Which trained fastest?

Analyze Curves

Compare training curves
  • Which converged faster?
  • Which shows better generalization?
  • Which avoided overfitting?

Review Confusion Matrices

Identify which model makes fewer critical misclassifications
Download CSV files for all experiments and create comparison plots in Python/Excel for side-by-side analysis.
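A downstream comparison can be as simple as collecting each experiment's final metrics and ranking them. Experiment names and values below are hypothetical:

```python
# Final metrics per experiment (hypothetical names and values).
experiments = {
    "ResNet50_Baseline":  {"val_acc": 0.873, "val_f1": 0.862},
    "ResNet50_Augmented": {"val_acc": 0.881, "val_f1": 0.875},
    "CustomCNN_Small":    {"val_acc": 0.840, "val_f1": 0.821},
}

best = max(experiments, key=lambda name: experiments[name]["val_f1"])
print(best)  # ResNet50_Augmented
```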

Metric Definitions

Accuracy

Formula: (TP + TN) / (TP + TN + FP + FN)
  • Percentage of correct predictions
  • Caution: Can be misleading with imbalanced data
  • Example: 95% accuracy on 95% majority class = useless model
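The majority-class pitfall is worth seeing numerically. The toy dataset below is illustrative:

```python
# A majority-class predictor on imbalanced data: high accuracy, zero utility.
y_true = ["benign"] * 95 + ["malware"] * 5
y_pred = ["benign"] * 100  # always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
malware_recall = sum(t == p == "malware" for t, p in zip(y_true, y_pred)) / 5

print(accuracy, malware_recall)  # 0.95 0.0
```

Accuracy looks excellent, yet the model catches no malware at all — which is why per-class recall and F1 matter.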

Precision

Formula: TP / (TP + FP)
  • Of all positive predictions, how many were correct?
  • High precision = few false alarms
  • Use case: When false positives are costly

Recall (Sensitivity)

Formula: TP / (TP + FN)
  • Of all actual positives, how many did we find?
  • High recall = few missed detections
  • Use case: When false negatives are costly (e.g., malware detection)

F1 Score

Formula: 2 × (Precision × Recall) / (Precision + Recall)
  • Harmonic mean of precision and recall
  • Balances both metrics
  • Use case: When you need balanced performance
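The formula above is straightforward to compute, and the harmonic mean's behavior under imbalance is the key property:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The harmonic mean punishes imbalance: the arithmetic mean of 0.9 and 0.2
# would be 0.55, but F1 is dragged down toward the weaker metric.
print(round(f1_score(0.9, 0.2), 3))  # 0.327
```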
For malware classification, recall is often more important than precision. Missing malware (false negative) is worse than flagging benign files (false positive).

Tips & Best Practices

Focus on F1 Score: It balances precision and recall, providing a single metric for model quality.
Check Overfitting Gap: Train/val gap > 10% indicates overfitting. Add regularization or more data.
Analyze Confusion Matrix: Identify specific class pairs that are confused. This guides data collection and augmentation.
Don’t rely solely on accuracy with imbalanced data. A model predicting only the majority class can have high accuracy but zero utility.
Export History CSV: Download training history for custom visualizations and deeper analysis in Jupyter/Excel.

Next Steps

After analyzing results:

Interpretability Tools

Visualize model attention with Grad-CAM and explore embeddings
