Overview
The Results & Evaluation page (/results) provides comprehensive analysis of completed training experiments. View training curves, test set performance, confusion matrices, and detailed per-class metrics.
Only completed experiments appear on this page. Experiments must finish training (or be manually stopped) to generate evaluation results.
Page Structure
The page displays:

- Experiment Counter: Number of completed experiments
- Experiment Cards: Expandable cards for each experiment (newest first)
- Per-Experiment Analysis: Training curves and advanced metrics within each card
Experiment Selection
Completed experiments display as expandable cards.

Card Label

The collapsed card shows:

- Experiment Name: User-defined or auto-generated name
- Model Name: Model used for training
- Validation Accuracy: Final validation accuracy (e.g., “Val Acc: 87.3%”)
Expanding Cards
Click a card to expand it and view the full analysis.

Summary Section
The top section of the expanded card shows an experiment overview.

Experiment Info
Three columns:

- Column 1: Model
- Column 2: Training Config
- Column 3: Duration

The Model column includes:

- Model Name: e.g., “ResNet50_v1”
- Type: “Transfer Learning”, “Custom CNN”, or “Transformer”
Final Metrics Row
Five metric cards show final validation performance:

Val Loss
Final validation loss (e.g., “0.3456”)
Val Accuracy
Classification accuracy (e.g., “87.3%”)
Val Precision
Macro-averaged precision (e.g., “86.5%”)
Val Recall
Macro-averaged recall (e.g., “85.9%”)
Val F1
Macro-averaged F1 score (e.g., “86.2%”)
Macro-averaging computes metric for each class independently, then averages. This treats all classes equally regardless of size.
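As a minimal sketch of macro-averaging, with illustrative per-class recalls (not real results): each class's metric is computed independently, then the unweighted mean is taken, so a rare family counts exactly as much as a common one.

```python
# Illustrative per-class recalls; class names are examples only.
per_class_recall = {"Ramnit": 0.89, "Lollipop": 0.79, "Kelihos": 0.95}

# Macro average: unweighted mean over classes, ignoring class sizes.
macro_recall = sum(per_class_recall.values()) / len(per_class_recall)
print(round(macro_recall, 4))  # each class contributes equally
```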
Training Curves Tab
The first tab shows training history visualizations.

Core Training Metrics (Row 1)

Three charts side by side:

- Loss
- Accuracy
- Precision / Recall / F1
Train vs Validation Loss
- Blue line: Training loss per epoch
- Red line: Validation loss per epoch
- X-axis: Epoch number
- Y-axis: Loss value
- Both curves should decrease over time
- Validation loss should follow training loss
- Widening gap between curves = overfitting
- Divergence = training instability
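These signals can also be checked programmatically from the loss histories. The numbers below are illustrative; a validation loss that turns upward while training loss keeps falling (a widening gap) is the classic overfitting signature.

```python
# Illustrative loss histories (not real experiment output).
train_loss = [0.90, 0.60, 0.40, 0.28, 0.20]
val_loss   = [0.95, 0.70, 0.55, 0.52, 0.56]

# Per-epoch gap between validation and training loss.
gaps = [v - t for t, v in zip(train_loss, val_loss)]
print(gaps)

# Overfitting signature: val loss past its minimum while the gap widens.
overfitting = val_loss[-1] > min(val_loss) and gaps[-1] > gaps[0]
print(overfitting)
```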
Learning Dynamics (Row 2)
Three additional charts:

- Learning Rate
- Overfitting Gap
- Train vs Val F1
LR Schedule Visualization
- Shows learning rate per epoch
- Constant: Flat line
- ReduceLROnPlateau: Stepped decreases
- Cosine Annealing: Smooth cosine curve
- LR reductions correlate with loss plateaus
- Verify schedule executed as expected
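The three schedule shapes can be sketched as follows. The epoch count and learning rates are illustrative, and the stepped schedule below drops at fixed intervals as a stand-in for ReduceLROnPlateau, which in practice triggers only when the monitored metric plateaus.

```python
import math

epochs, base_lr = 10, 0.01  # illustrative hyperparameters

# Constant: flat line.
constant = [base_lr for _ in range(epochs)]

# Stepped: factor-of-10 drop every 4 epochs (fixed-interval stand-in
# for plateau-triggered reductions).
stepped = [base_lr * (0.1 ** (e // 4)) for e in range(epochs)]

# Cosine annealing: smooth decay from base_lr to 0.
cosine = [0.5 * base_lr * (1 + math.cos(math.pi * e / (epochs - 1)))
          for e in range(epochs)]

print(cosine[0], round(cosine[-1], 8))
```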
These charts help diagnose training issues: overfitting, underfitting, learning rate problems, and convergence quality.
Export Section
At the bottom of the Training Curves tab are two download buttons.

Download Training History (CSV)

- Exports all epoch-level metrics to CSV
- Columns: epoch, train_loss, train_acc, val_loss, val_acc, learning_rate, etc.
- Import into Excel/Python for custom analysis
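As a quick sketch of custom analysis, the exported history can be read with Python's standard library. The sample rows and values below are illustrative, and the column set in your file may be larger than the one shown.

```python
import csv
import io

# Inline stand-in for the downloaded file; column names follow the docs.
sample = """epoch,train_loss,train_acc,val_loss,val_acc,learning_rate
1,0.90,0.55,0.95,0.52,0.001
2,0.60,0.70,0.70,0.66,0.001
3,0.40,0.80,0.55,0.74,0.0005
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# Find the epoch with the lowest validation loss.
best = min(rows, key=lambda r: float(r["val_loss"]))
print(best["epoch"], best["val_loss"])
```

For a real download, replace `io.StringIO(sample)` with `open("training_history.csv")` (the actual filename may differ).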
Download Model (.pt) (currently disabled)

- Will export trained PyTorch model weights
- Load with `torch.load()` for inference
Advanced Metrics Tab
The second tab runs test set evaluation and displays detailed performance analysis.

Test evaluation runs automatically when you open this tab. Results are cached for subsequent views.
Test Set Performance
Accuracy Summary Cards: Displays test set metrics in card format:

- Test Accuracy: Overall classification accuracy
- Test Precision: Macro-averaged precision
- Test Recall: Macro-averaged recall
- Test F1 Score: Macro-averaged F1
Confusion Matrix
Left column: Heatmap visualization

Matrix Structure
- Rows: True labels (actual classes)
- Columns: Predicted labels
- Diagonal: Correct predictions (darker = more correct)
- Off-diagonal: Misclassifications
- Hover over cells to see exact counts
- Identify which classes are frequently confused
- Diagonal dominance = good classification
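The counts behind the heatmap can be reproduced in a few lines of plain Python. The labels below are toy data, not real evaluation output; rows are true classes, columns are predicted classes.

```python
from collections import Counter

classes = ["Ramnit", "Lollipop", "Kelihos"]  # illustrative families
y_true = ["Ramnit", "Ramnit", "Lollipop", "Kelihos", "Lollipop", "Kelihos"]
y_pred = ["Ramnit", "Lollipop", "Lollipop", "Kelihos", "Lollipop", "Ramnit"]

# Count (true, predicted) pairs; missing pairs default to 0.
counts = Counter(zip(y_true, y_pred))
matrix = [[counts[(t, p)] for p in classes] for t in classes]
for name, row in zip(classes, matrix):
    print(name, row)

# Diagonal sum = correct predictions; off-diagonal cells are confusions.
correct = sum(matrix[i][i] for i in range(len(classes)))
print(correct, "/", len(y_true))
```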
Per-Class Metrics
Right column: Bar chart showing Precision, Recall, and F1 for each malware family.

- Precision bars: Green
- Recall bars: Blue
- F1 bars: Purple
- X-axis: Class names
- Y-axis: Metric value (0-1)
- High F1: Model performs well on this class
- Low F1: Struggling class (investigate why)
- High precision, low recall: Model is cautious (few predictions, mostly correct)
- Low precision, high recall: Model is aggressive (many predictions, many wrong)
Classification Report Table
Bottom section: Detailed table with a tabular breakdown of per-class metrics:

| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Ramnit | 0.87 | 0.89 | 0.88 | 150 |
| Lollipop | 0.82 | 0.79 | 0.80 | 142 |
| … | … | … | … | … |
| Macro Avg | 0.85 | 0.86 | 0.85 | 3000 |
| Weighted Avg | 0.86 | 0.87 | 0.86 | 3000 |
- Class: Malware family name
- Precision: TP / (TP + FP)
- Recall: TP / (TP + FN)
- F1-Score: Harmonic mean of precision and recall
- Support: Number of test samples for this class
- Macro Avg: Unweighted mean (all classes equal)
- Weighted Avg: Weighted by support (larger classes matter more)
Use Macro Avg to evaluate overall performance treating all classes equally. Use Weighted Avg if class sizes reflect real-world deployment.
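The two averages can be reproduced from per-class values. The F1 scores and supports below mirror the two illustrative table rows above; with only two classes the averages differ slightly because Ramnit's larger support pulls the weighted average toward its F1.

```python
# Illustrative per-class F1 and test-sample counts (support).
f1 = {"Ramnit": 0.88, "Lollipop": 0.80}
support = {"Ramnit": 150, "Lollipop": 142}

# Macro: unweighted mean over classes.
macro = sum(f1.values()) / len(f1)

# Weighted: each class weighted by its share of test samples.
total = sum(support.values())
weighted = sum(f1[c] * support[c] / total for c in f1)

print(round(macro, 4), round(weighted, 4))
```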
Interpreting Results
Good Performance Indicators
✅ Training curves:

- Smooth loss decrease
- Steady accuracy increase
- Small train/val gap (<5%)

✅ Test metrics:

- Test accuracy close to validation accuracy
- High F1 scores (>0.85)
- Balanced precision and recall

✅ Confusion matrix:

- Strong diagonal
- Minimal off-diagonal clusters
Poor Performance Indicators
❌ Overfitting:

- Train accuracy >> Val accuracy (gap >10%)
- Training loss decreases but val loss increases
- Solutions: Add regularization, reduce model size, increase dropout, more data augmentation

❌ Underfitting:

- Both train and val accuracy are low
- Loss plateaus at a high value
- Solutions: Increase model capacity, train longer, increase learning rate

❌ Class confusion:

- Specific off-diagonal clusters in the confusion matrix
- Low recall for certain classes
- Solutions: Collect more data for confused classes, apply selective augmentation, use class weights
Experiment Comparison
Compare multiple experiments to identify the best configurations.

Compare Metrics

Look at the Final Metrics row across experiments:
- Which has highest val accuracy?
- Which has best F1 score?
- Which trained fastest?
Analyze Curves
Compare the training curves:
- Which converged faster?
- Which shows better generalization?
- Which avoided overfitting?
Metric Definitions
Accuracy
Formula: (TP + TN) / (TP + TN + FP + FN)
- Percentage of correct predictions
- Caution: Can be misleading with imbalanced data
- Example: 95% accuracy on 95% majority class = useless model
Precision
Formula: TP / (TP + FP)
- Of all positive predictions, how many were correct?
- High precision = few false alarms
- Use case: When false positives are costly
Recall (Sensitivity)
Formula: TP / (TP + FN)
- Of all actual positives, how many did we find?
- High recall = few missed detections
- Use case: When false negatives are costly (e.g., malware detection)
F1 Score
Formula: 2 × (Precision × Recall) / (Precision + Recall)
- Harmonic mean of precision and recall
- Balances both metrics
- Use case: When you need balanced performance
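A worked example ties the four formulas together for a single class, using illustrative confusion counts (TP/FP/FN/TN are not from a real run):

```python
# Illustrative confusion counts for one class.
TP, FP, FN, TN = 89, 13, 11, 287

accuracy  = (TP + TN) / (TP + TN + FP + FN)   # fraction of all correct
precision = TP / (TP + FP)                    # correctness of positive calls
recall    = TP / (TP + FN)                    # coverage of actual positives
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean

print(round(accuracy, 3), round(precision, 3),
      round(recall, 3), round(f1, 3))
```

Note how F1 (≈0.88) lands between precision and recall but closer to the lower of the two; the harmonic mean penalizes imbalance between them.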
For malware classification, recall is often more important than precision. Missing malware (false negative) is worse than flagging benign files (false positive).
Tips & Best Practices
Next Steps
After analyzing results:

Interpretability Tools
Visualize model attention with Grad-CAM and explore embeddings