Overview
The Dataset Configuration page (/dataset) provides a comprehensive interface for preparing your malware image dataset for training. It automatically scans the dataset directory and provides tools for class selection, data splitting, augmentation, and imbalance handling.
The dataset is automatically scanned from repo/malware/ on page load. All malware families are detected and indexed with sample counts.
Page Structure
The Dataset page is organized into 4 tabs:
- Overview & Split - Dataset statistics and train/val/test configuration
- Class Distribution - Class selection and distribution visualization
- Samples & Preprocessing - Sample viewer and preprocessing preview
- Augmentation - Data augmentation settings and configuration save
Tab 1: Overview & Split
Dataset Overview
Displays key statistics about your dataset:
- Total Samples: Total number of images across all classes
- Number of Classes: Count of unique malware families
- Dataset Location: Relative path to dataset directory
- Class Imbalance Ratio: Max/min samples ratio (warns if >2x)
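The statistics above can be derived from a simple directory scan. The following is a minimal sketch, not the dashboard's actual code: it assumes one subdirectory per malware family, and the function names `scan_dataset` and `summarize` and the `IMAGE_EXTENSIONS` filter are hypothetical.

```python
from pathlib import Path
from typing import Dict

# Assumed image extensions; the dashboard's real filter is not documented here.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp"}


def scan_dataset(root: Path) -> Dict[str, int]:
    """Count image files in each immediate subdirectory (one per family)."""
    counts: Dict[str, int] = {}
    for family_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        n = sum(
            1
            for f in family_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
        )
        if n > 0:
            counts[family_dir.name] = n
    return counts


def summarize(counts: Dict[str, int]) -> Dict[str, object]:
    """Derive the overview statistics from per-class sample counts."""
    ratio = max(counts.values()) / min(counts.values())
    return {
        "total_samples": sum(counts.values()),
        "num_classes": len(counts),
        "imbalance_ratio": ratio,
        "imbalanced": ratio > 2.0,  # warning threshold (>2x) stated above
    }
```

For example, `summarize(scan_dataset(Path("repo/malware")))` would return the four values shown on the overview panel, under the assumptions above.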
Train/Validation/Test Split
Two split methods are available: Fixed Split and K-Fold Cross-Validation.
Fixed Split (Default)
Set Validation Percentage
From the data remaining after the training split, allocate a share for validation.
- Default: 50% of the remaining data (15% of total)
- The rest automatically goes to the test set
Example with 10,000 samples:
- Train: 70% → 70% of 10,000 = 7,000 samples
- Val: 50% of remaining 30% → 15% of 10,000 = 1,500 samples
- Test: Remaining 15% → 1,500 samples
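The arithmetic of the fixed split can be sketched as follows. The function name `fixed_split_sizes` and its parameter names are hypothetical, and rounding to the nearest sample is an assumption, since the dashboard's rounding rule is not documented here.

```python
from typing import Tuple


def fixed_split_sizes(
    total: int, train_pct: float = 70.0, val_pct_of_rest: float = 50.0
) -> Tuple[int, int, int]:
    """Return (train, val, test) sample counts for a fixed split."""
    n_train = round(total * train_pct / 100)
    remaining = total - n_train
    # Validation takes a percentage of what is left after training.
    n_val = round(remaining * val_pct_of_rest / 100)
    # Everything else goes to the test set.
    n_test = remaining - n_val
    return n_train, n_val, n_test


print(fixed_split_sizes(10_000))  # (7000, 1500, 1500)
```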
K-Fold Cross-Validation
Enable Use Cross-Validation for advanced validation:
- Number of Folds (K): 2-20 (typically 5 or 10)
- Stratified K-Fold: Maintains class proportions per fold
- Training per iteration: (K-1)/K of data
- Validation per iteration: 1/K of data
Cross-validation runs K training iterations, rotating validation folds. Final metrics are averaged across folds.
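To illustrate how stratified folds rotate, here is a small standard-library sketch. It is not the dashboard's implementation: the function name `stratified_k_fold` is hypothetical, and the round-robin assignment per class is just one simple way to keep class proportions roughly equal across folds.

```python
import random
from collections import defaultdict
from typing import Dict, Iterator, List, Sequence, Tuple


def stratified_k_fold(
    labels: Sequence[str], k: int = 5, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, val_indices) for each of the k folds."""
    if not 2 <= k <= 20:
        raise ValueError("k must be between 2 and 20")
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class: Dict[str, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Deal each class's shuffled samples round-robin into the k folds.
    folds: List[List[int]] = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    # Each fold serves as the validation set exactly once.
    for i in range(k):
        val = sorted(folds[i])
        train = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train, val
```

Each iteration trains on (K-1)/K of the data and validates on the remaining 1/K; the per-fold metrics would then be averaged to produce the final figure.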
Class Imbalance Handling
The dashboard provides several strategies to handle imbalanced classes:
- Auto Class Weights (Recommended)
- Selective Augmentation (H2)
- Manual Class Weights
- Oversampling (SMOTE)
- Undersampling
- No Adjustment
Auto Class Weights automatically calculates class weights inversely proportional to class frequencies:
- Classes with fewer samples get higher weights
- Balanced loss function during training
- No data duplication or removal
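As an illustration of inverse-frequency weighting, the sketch below uses the common "balanced" heuristic, weight = total / (num_classes × class_count). This exact formula is an assumption; the dashboard's own calculation is not documented here, and the function name `auto_class_weights` is hypothetical.

```python
from typing import Dict


def auto_class_weights(counts: Dict[str, int]) -> Dict[str, float]:
    """Weight each class inversely to its sample count."""
    total = sum(counts.values())
    n_classes = len(counts)
    # Assumed "balanced" heuristic: rarer classes receive larger weights.
    return {name: total / (n_classes * n) for name, n in counts.items()}


weights = auto_class_weights({"common_family": 900, "rare_family": 100})
print(weights["rare_family"])  # 5.0
```

These weights would typically be passed to a weighted loss function, so the dataset itself is left unchanged.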
Tab 2: Class Distribution
Class Selection Interface
Controls:
- Select All / Deselect All: Quick selection buttons
- Min samples per class: Filter classes by sample count threshold
- Multi-select dropdown: Search and select individual classes
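The minimum-samples filter listed above can be thought of as a threshold on per-class counts. In this sketch the function name `filter_classes` is hypothetical, and treating the threshold as inclusive is an assumption.

```python
from typing import Dict, List


def filter_classes(counts: Dict[str, int], min_samples: int) -> List[str]:
    """Return the class names with at least `min_samples` samples."""
    # Assumption: a class with exactly `min_samples` samples is kept.
    return sorted(name for name, n in counts.items() if n >= min_samples)
```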
Selection Summary
Displays metrics based on selected classes.
For Fixed Split:
- Selected Classes count
- Total Samples
- Train samples (with percentage)
- Val samples (with percentage)
- Test samples (with percentage)
For Cross-Validation:
- Selected Classes count
- Total Samples
- K-Folds count
Distribution Visualization
Original Distribution Chart
Grouped bar chart showing samples per malware family:
- Fixed Split: Stacked bars with Train/Val/Test breakdown
  - Green bars: Training samples
  - Blue bars: Validation samples
  - Orange bars: Test samples
- Cross-Validation: Single bar showing total samples used in CV
Effective Distribution (After Balancing)
Shows the effective contribution of each class after applying imbalance handling:
- Auto Class Weights: All classes contribute equally (flat bars)
- Selective Augmentation: Minority classes show increased samples
- Manual Weights: Bars scaled by custom weights
- SMOTE: Minority classes boosted to target ratio
- Undersampling: All classes reduced to smallest size
The following metrics are also reported:
- Original Ratio (before balancing)
- Effective Ratio (after balancing)
- Imbalance Reduction percentage
Top/Bottom Classes
Two columns showing:
- Most Common Selected Classes: Top 5 by sample count
- Least Common Selected Classes: Bottom 5 by sample count
Tab 3: Samples & Preprocessing
Preprocessing Preview
Family Selector: Choose the malware family to preview
Preprocessing Options:
- Target Size
- Normalization
- Color Mode
Target Size options:
- 224x224 (default)
- 256x256
- 299x299
- 512x512
Two images are shown for comparison:
- Original Image: As-is from dataset with original dimensions
- After Preprocessing: Resized and color-converted preview
Dataset Samples
Displays a 6x6 grid (36 images) showing random samples across all classes.
- Each image shows dimensions below (e.g., “128x128”)
- Grid samples are cached in session state to prevent re-randomizing on rerun
- Images displayed at 150px width
The sample grid helps verify image quality and diversity before training.
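The caching behavior described above (sampling once, then reusing the result on every rerun) can be sketched with any dictionary-like session store. Here a plain `dict` stands in for the session state, and the function name `get_grid_samples` and the key `"grid_samples"` are hypothetical.

```python
import random
from typing import List, MutableMapping, Sequence

GRID_SIZE = 36  # 6x6 grid


def get_grid_samples(
    state: MutableMapping[str, List[str]],
    paths: Sequence[str],
    key: str = "grid_samples",
) -> List[str]:
    """Pick random samples once and reuse them on later calls."""
    if key not in state:
        # First call in this session: draw the grid without replacement.
        k = min(GRID_SIZE, len(paths))
        state[key] = random.sample(list(paths), k)
    return state[key]
```

Because the selection is stored under a key, later calls with the same `state` return the same images instead of drawing a new random grid.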
Tab 4: Augmentation
Augmentation Presets
Choose from predefined presets or create a custom configuration:
- None: No augmentation applied
- Light
- Moderate
- Heavy
- Custom
Augmentation is applied during training, not during dataset preparation. This allows on-the-fly augmentation with minimal disk space.
Custom Augmentation Configuration
When the Custom preset is selected, the following options are available.
Geometric Transforms:
- Horizontal Flip: Flip left-right (safe for malware images)
- Vertical Flip: Flip top-bottom
- Orthogonal Rotation: Select from [90°, 180°, 270°]
  - Only 90° multiples (lossless, no interpolation)
Intensity and Noise Transforms:
- Brightness Adjustment: ±0-50% brightness change
- Contrast Adjustment: ±0-50% contrast change
- Gaussian Noise: Add random Gaussian noise
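The geometric transforms above are lossless because they only rearrange pixels and never interpolate between them. The sketch below shows this on a grayscale image stored as nested lists. The function names are hypothetical, and the clockwise rotation direction is an assumption, since the dashboard's convention is not documented here.

```python
from typing import List

Image = List[List[int]]  # grayscale image as rows of pixel values


def horizontal_flip(img: Image) -> Image:
    """Mirror the image left-right."""
    return [row[::-1] for row in img]


def vertical_flip(img: Image) -> Image:
    """Mirror the image top-bottom."""
    return [row[:] for row in img[::-1]]


def rotate90(img: Image, times: int = 1) -> Image:
    """Rotate clockwise by 90 degrees, `times` times (assumed direction)."""
    out = [row[:] for row in img]
    for _ in range(times % 4):
        # Reverse the row order, then transpose: a 90-degree clockwise turn.
        out = [list(row) for row in zip(*out[::-1])]
    return out
```

Every output pixel is a copy of exactly one input pixel, so repeating a rotation four times returns the original image unchanged.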
Augmentation Preview
Select Family to Preview: Choose malware family
Displays 3 side-by-side images:
- After Preprocessing: Base image (resized, normalized)
- Augmented (Example 1): Random augmentation applied
- Augmented (Example 2): Different random augmentation
Augmentations are applied randomly, so each example shows different transforms. Refresh multiple times to see the full range of augmentations.
Configuration Summary & Save
The final section (bottom of Tab 4) displays:
Summary Metrics
3 metric cards:
- Selected Classes: Number of families included
- Total Samples: Sum across selected classes
- Method: “5-Fold CV” or “Augmentation Preset”
Save Configuration Button
💾 Save Configuration (primary button, centered)
- Validates all settings
- Saves to state/workflow.py session state
- Shows success message with balloons 🎈
- Enables Model page navigation
After saving, a green checkmark (✅) appears in the sidebar next to “Dataset configured”.
Full Configuration JSON
Expandable section showing the complete config structure.
Tips & Best Practices
Next Steps
After saving your dataset configuration:
- Model Builder: Design your neural network architecture in the Model page