Overview
The UC Intel Final platform handles datasets for malware image classification. Its dataset preparation system scans, splits, augments, and loads malware binary visualizations.
Dataset Structure
The platform expects datasets organized in a directory structure where each subdirectory represents a malware family.
Supported image formats: .png, .jpg, .jpeg, .bmp
Scanning Datasets
The scan_dataset() function automatically discovers images and labels from your dataset directory.
Source: app/training/dataset.py:37-60
The function returns three values:
- image_paths: List of Path objects, one per image
- labels: Integer labels (0 to num_classes-1), one per image
- class_names: Sorted list of malware family names
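The listing at app/training/dataset.py is not reproduced here. As a minimal sketch of the behavior described above, assuming one folder per family and the four supported extensions, a scanner could look like the following. The name scan_dataset_sketch and the IMAGE_EXTENSIONS constant are hypothetical, and the platform's actual scan_dataset() may differ in its details.

```python
from pathlib import Path

# Extensions listed as supported in the Dataset Structure section.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp"}


def scan_dataset_sketch(root):
    """Return (image_paths, labels, class_names) for a family-per-folder layout.

    Hypothetical stand-in for scan_dataset(); not the platform's code.
    """
    root = Path(root)
    # Sorting the folder names gives a stable class-name -> index mapping.
    class_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    image_paths, labels = [], []
    for label, name in enumerate(class_names):
        for path in sorted((root / name).iterdir()):
            # Compare suffixes case-insensitively so "B.JPG" is accepted too.
            if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS:
                image_paths.append(path)
                labels.append(label)
    return image_paths, labels, class_names
```

Because the class names are sorted before labels are assigned, the same directory always yields the same label for a given family, which keeps saved models and label maps consistent across runs.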
Creating Train/Val/Test Splits
The platform uses stratified splitting to maintain class distribution across splits.
Source: app/training/dataset.py:63-96
Split Function
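A minimal sketch of stratified splitting, assuming the split is done per class on shuffled indices with a fixed seed. The function name stratified_split_sketch and its parameters are hypothetical and are not taken from app/training/dataset.py; the platform's split function may use a library helper instead.

```python
import random
from collections import defaultdict


def stratified_split_sketch(labels, train=0.70, val=0.15, seed=42):
    """Return (train_idx, val_idx, test_idx); the test share is the remainder.

    Hypothetical illustration: each class is shuffled and cut separately,
    so every split keeps roughly the original class proportions.
    """
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx, test_idx = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(len(indices) * train)
        n_val = round(len(indices) * val)
        train_idx += indices[:n_train]
        val_idx += indices[n_train:n_train + n_val]
        test_idx += indices[n_train + n_val:]  # whatever is left over
    return train_idx, val_idx, test_idx
```

With 100 samples of one family and 20 of another, a 70/15/15 split yields 70/15/15 and 14/3/3 samples respectively, so the rare family appears in all three splits.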
Configuration Example
Data Augmentation
Augmentation helps prevent overfitting and improves model generalization. The platform supports preset and custom augmentation strategies.
Source: app/training/transforms.py:6-90
Augmentation Presets
Light Augmentation
- Random horizontal flip (50% probability)
- Random 90-degree rotations (0°, 90°, 180°, 270°)
- Suitable for datasets with moderate diversity
Moderate Augmentation
- Random horizontal flip (50%)
- Random vertical flip (50%)
- Random 90-degree rotations
- Color jitter: brightness ±10%, contrast ±10%
- Balanced approach for most use cases
Heavy Augmentation
- All moderate augmentations
- Increased color jitter: brightness ±20%, contrast ±20%
- Gaussian blur (kernel=3, sigma=0.1-0.5)
- Use with small or highly imbalanced datasets
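The geometric part of the presets above can be illustrated without any imaging library. The sketch below treats an image as a list of pixel rows and applies the flips and 90-degree rotations with the probabilities listed for each preset. The function names are hypothetical, and color jitter and Gaussian blur are deliberately left out; in the platform itself these steps are presumably built from library transforms in app/training/transforms.py.

```python
import random


def hflip(img):
    """Mirror a 2D image (list of rows) left to right."""
    return [row[::-1] for row in img]


def vflip(img):
    """Mirror a 2D image top to bottom."""
    return img[::-1]


def rot90(img, k):
    """Rotate counter-clockwise by k * 90 degrees."""
    for _ in range(k % 4):
        # Transpose, then reverse the row order: one counter-clockwise turn.
        img = [list(row) for row in zip(*img)][::-1]
    return img


def augment_geometric(img, preset="light", rng=random):
    """Apply the flip/rotation steps of a preset (hypothetical helper)."""
    if preset not in ("light", "moderate", "heavy"):
        raise ValueError(f"unknown preset: {preset}")
    if rng.random() < 0.5:  # horizontal flip, all presets
        img = hflip(img)
    if preset in ("moderate", "heavy") and rng.random() < 0.5:
        img = vflip(img)  # vertical flip, moderate and heavy only
    return rot90(img, rng.choice([0, 1, 2, 3]))  # 0, 90, 180 or 270 degrees
```

These operations only rearrange pixels, so every byte value of the original binary is still present in the augmented image; only its position changes.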
Custom Augmentation
Transform Pipeline
The create_train_transforms() function builds a transform pipeline:
Handling Class Imbalance
Malware datasets often have imbalanced class distributions. The platform provides two strategies:
1. Class Weights
Compute inverse frequency weights to penalize misclassification of rare classes more heavily.
Source: app/training/dataset.py:99-110
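As a minimal sketch of inverse frequency weighting, the function below assumes the common "balanced" normalization, weight_c = N / (K * n_c), where N is the number of samples, K the number of classes, and n_c the count for class c. That normalization and the function name are assumptions; the source only states that the weights are inverse frequencies.

```python
from collections import Counter


def inverse_frequency_weights(labels, num_classes):
    """Return one weight per class, larger for rarer classes.

    Assumed formula: total / (num_classes * count). Hypothetical helper,
    not the platform's implementation.
    """
    counts = Counter(labels)
    total = len(labels)
    weights = []
    for c in range(num_classes):
        if counts[c] == 0:
            # A class with no samples has no defined inverse frequency.
            raise ValueError(f"class {c} has no samples")
        weights.append(total / (num_classes * counts[c]))
    return weights
```

For 90 samples of class 0 and 10 of class 1 this gives weights of about 0.56 and 5.0, so an error on the rare class costs roughly nine times as much. The resulting list can be converted to a tensor and passed to a weighted loss.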
2. Weighted Random Sampler
Oversample minority classes during training to balance batch composition.
Source: app/training/dataset.py:113-126
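The idea behind weighted sampling can be sketched in plain Python, assuming each sample is weighted by the inverse of its class count and indices are drawn with replacement. The helper names are hypothetical. In PyTorch, the same per-sample weights would typically be passed to torch.utils.data.WeightedRandomSampler, which performs the equivalent draw.

```python
import random
from collections import Counter


def per_sample_weights(labels):
    """Weight each sample by 1 / (size of its class)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


def draw_indices(labels, num_samples=None, seed=0):
    """Draw sample indices with replacement, proportional to their weights.

    Hypothetical illustration of what a weighted sampler does each epoch.
    """
    rng = random.Random(seed)
    k = len(labels) if num_samples is None else num_samples
    return rng.choices(range(len(labels)), weights=per_sample_weights(labels), k=k)
```

Because every class's weights sum to the same total, each class is drawn about equally often regardless of its size. The cost is that minority samples repeat many times within an epoch, which is why this strategy pairs well with augmentation.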
When to use each strategy:
- Class Weights: Use with Cross-Entropy or Focal Loss when you want to keep natural class distribution but penalize errors on rare classes
- Weighted Sampler: Use when you want balanced batches by oversampling minority classes (can increase training time)
PyTorch Dataset and DataLoader
MalwareDataset Class
Source: app/training/dataset.py:13-34
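A map-style dataset only needs to report its length and return an (image, label) pair for an index. The sketch below shows that protocol with an injected loader so that it runs without any imaging library. The class name, the load_image parameter, and the constructor signature are hypothetical; the actual MalwareDataset presumably subclasses torch.utils.data.Dataset and opens files with an image library.

```python
class MalwareDatasetSketch:
    """Indexable, sized collection of (image, label) pairs.

    Hypothetical illustration of the map-style dataset protocol;
    not the platform's MalwareDataset.
    """

    def __init__(self, image_paths, labels, load_image, transform=None):
        if len(image_paths) != len(labels):
            raise ValueError("image_paths and labels must have the same length")
        self.image_paths = list(image_paths)
        self.labels = list(labels)
        self.load_image = load_image  # callable: path -> image object
        self.transform = transform    # optional callable: image -> image

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        # Load lazily, so only the requested image is held in memory.
        image = self.load_image(self.image_paths[idx])
        if self.transform is not None:
            image = self.transform(image)
        return image, self.labels[idx]
```

Loading inside __getitem__ rather than in the constructor keeps memory use flat and lets DataLoader workers read files in parallel.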
Creating DataLoaders
The create_dataloaders() function orchestrates the entire pipeline:
Source: app/training/dataset.py:129-249
The function returns three values:
- dataloaders: Dictionary with 'train', 'val', and 'test' DataLoader objects
- class_names: List of malware family names
- class_weights: Tensor of class weights (or None)
Best Practices
Split Ratios
- 70/15/15: Standard split for moderate-sized datasets (1000+ samples per class)
- 80/10/10: Use when you have larger datasets and want more training data
- 60/20/20: Use when validation is critical and dataset is smaller
Batch Size
- 32: Good default for most GPUs
- 64-128: Use with larger GPUs and simpler models
- 8-16: Use with limited memory or very large models
Augmentation Strategy
- Start with Light or Moderate augmentation
- Use Heavy only if you observe significant overfitting
- For malware binaries, geometric transforms (rotation, flip) are usually more important than color transforms
Random Seed
- Always set random_seed for reproducibility
- Use the same seed across experiments for fair comparison
- Document the seed in experiment logs
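Beyond the random_seed setting, a training script usually seeds every random number generator it relies on. The helper below is a minimal sketch with a hypothetical name, set_seed. It seeds Python's random module and, only if they are installed, NumPy and PyTorch. It does not by itself guarantee bit-identical results on GPU.

```python
import random


def set_seed(seed: int) -> None:
    """Seed the RNGs that commonly influence a training run (hypothetical helper)."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)  # legacy global NumPy generator
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)  # PyTorch's default generator
    except ImportError:
        pass  # PyTorch not installed; nothing to seed
```

Calling set_seed(42) once at the start of each experiment, and writing the value to the run's log, makes the split and the augmentation draws repeatable.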
Common Issues
Out of Memory Errors
Solutions:
- Reduce batch_size
- Reduce num_workers (try 2 or 0)
- Reduce target_size (e.g., from 224 to 128)
- Disable pin_memory if using MPS/CPU
Slow Data Loading
Solutions:
- Increase num_workers (try 4-8 on multi-core CPUs)
- Enable pin_memory when using CUDA
- Convert images to a faster format (PNG is good, avoid BMP)
- Use smaller image sizes if possible
Unbalanced Batches
Solutions:
- Enable stratified splitting
- Use WeightedRandomSampler with class_weights="Auto Class Weights"
- Increase batch size for better class distribution
Next Steps
Model Selection
Learn how to choose and configure model architectures
Hyperparameters
Optimize training hyperparameters for best performance