
Overview

The Dataset Configuration page (/dataset) provides a comprehensive interface for preparing your malware image dataset for training. It automatically scans the dataset directory and provides tools for class selection, data splitting, augmentation, and imbalance handling.
The dataset is automatically scanned from repo/malware/ on page load. All malware families are detected and indexed with sample counts.

Page Structure

The Dataset page is organized into 4 tabs:
  1. Overview & Split - Dataset statistics and train/val/test configuration
  2. Class Distribution - Class selection and distribution visualization
  3. Samples & Preprocessing - Sample viewer and preprocessing preview
  4. Augmentation - Data augmentation settings and configuration save

Tab 1: Overview & Split

Dataset Overview

Displays key statistics about your dataset:
  • Total Samples: Total number of images across all classes
  • Number of Classes: Count of unique malware families
  • Dataset Location: Relative path to dataset directory
  • Class Imbalance Ratio: Max/min samples ratio (warns if >2x)
The dashboard automatically calculates class imbalance and warns if the ratio exceeds 2:1, indicating potential training bias.
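The imbalance check reduces to a max/min ratio over per-class sample counts. A minimal sketch (the function name and sample labels are illustrative, not taken from the dashboard's code):

```python
from collections import Counter

def imbalance_ratio(labels):
    """Ratio of the largest class count to the smallest."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# e.g. 600 Ramnit vs. 200 Simda samples -> 3:1 imbalance
labels = ["Ramnit"] * 600 + ["Simda"] * 200
ratio = imbalance_ratio(labels)
if ratio > 2:
    print(f"Warning: imbalance ratio {ratio:.1f}:1 exceeds 2:1")
```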

Train/Validation/Test Split

Two split methods are available:

Fixed Split (Default)

1. Set Training Percentage
Use the slider to allocate data for training (0-100%).
  • Default: 70%

2. Set Validation Percentage
From the remaining data, allocate a portion for validation.
  • Default: 50% of remaining (15% of total)
  • The rest automatically goes to the test set

3. Configure Options
  • Stratified Split: Maintains class proportions in each split
  • Random Seed: For reproducible splits (default: 73)
Example Split:
  • Train: 70% → 70% of 10,000 = 7,000 samples
  • Val: 50% of remaining 30% → 15% of 10,000 = 1,500 samples
  • Test: Remaining 15% → 1,500 samples
A pie chart visualizes the final split distribution.
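The split arithmetic above can be sketched as a small helper (the function name is illustrative, not the dashboard's actual code):

```python
def split_counts(total, train_pct, val_pct_of_rest):
    """Fixed split: train %, then val as a % of what remains; test gets the rest."""
    train = int(total * train_pct / 100)
    remaining = total - train
    val = int(remaining * val_pct_of_rest / 100)
    test = remaining - val
    return train, val, test

# Defaults: 70% train, 50% of the remaining 30% for validation
print(split_counts(10_000, 70, 50))  # (7000, 1500, 1500)
```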

K-Fold Cross-Validation

Enable Use Cross-Validation for advanced validation:
  • Number of Folds (K): 2-20 (typically 5 or 10)
  • Stratified K-Fold: Maintains class proportions per fold
  • Training per iteration: (K-1)/K of data
  • Validation per iteration: 1/K of data
Cross-validation runs K training iterations, rotating validation folds. Final metrics are averaged across folds.
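Stratified K-fold behaves like scikit-learn's `StratifiedKFold`, shown here as an assumed illustration (the dashboard's actual implementation is not confirmed by this page):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(100).reshape(-1, 1)   # stand-in for image indices
y = np.array([0] * 80 + [1] * 20)   # imbalanced labels

# K=5: each fold validates on 1/5 of the data, trains on the other 4/5,
# with class proportions preserved in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=73)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    print(f"fold {fold}: train={len(train_idx)} val={len(val_idx)}")
```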

Class Imbalance Handling

The dashboard provides several strategies for handling imbalanced classes: Auto Class Weights (recommended), Selective Augmentation, Manual Weights, SMOTE, and Undersampling. The effect of each strategy on the class distribution is visualized in Tab 2.

Tab 2: Class Distribution

Class Selection Interface

Controls:
  • Select All / Deselect All: Quick selection buttons
  • Min samples per class: Filter classes by sample count threshold
  • Multi-select dropdown: Search and select individual classes
Use the “Min samples per class” filter to automatically select only well-represented classes. For example, set to 100 to exclude rare malware families.
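The threshold filter amounts to dropping families below the cutoff. A sketch with made-up counts (the family names and numbers are illustrative):

```python
def filter_classes(class_counts, min_samples=100):
    """Keep only families with at least min_samples images."""
    return {name: n for name, n in class_counts.items() if n >= min_samples}

counts = {"Ramnit": 1541, "Lollipop": 2478, "Simda": 42}
print(filter_classes(counts, min_samples=100))  # Simda is excluded
```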

Selection Summary

Displays metrics based on the selected classes.

For Fixed Split:
  • Selected Classes count
  • Total Samples
  • Train samples (with percentage)
  • Val samples (with percentage)
  • Test samples (with percentage)
For Cross-Validation:
  • Selected Classes count
  • Total Samples
  • K-Folds count

Distribution Visualization

Original Distribution Chart

Grouped bar chart showing samples per malware family:
  • Fixed Split: Stacked bars with Train/Val/Test breakdown
    • Green bars: Training samples
    • Blue bars: Validation samples
    • Orange bars: Test samples
  • Cross-Validation: Single bar showing total samples used in CV

Effective Distribution (After Balancing)

Shows the effective contribution of each class after applying imbalance handling:
  • Auto Class Weights: All classes contribute equally (flat bars)
  • Selective Augmentation: Minority classes show increased samples
  • Manual Weights: Bars scaled by custom weights
  • SMOTE: Minority classes boosted to target ratio
  • Undersampling: All classes reduced to smallest size
Improvement Metrics:
  • Original Ratio (before balancing)
  • Effective Ratio (after balancing)
  • Imbalance Reduction percentage
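Auto Class Weights typically uses inverse-frequency weighting, `total / (n_classes * count)`. Assuming that formulation (a common convention, not confirmed by this page), every class's effective contribution becomes equal, which is why the chart shows flat bars:

```python
def auto_class_weights(class_counts):
    """Inverse-frequency weights: weight_c = total / (n_classes * count_c)."""
    total = sum(class_counts.values())
    k = len(class_counts)
    return {c: total / (k * n) for c, n in class_counts.items()}

counts = {"Ramnit": 600, "Simda": 200}
weights = auto_class_weights(counts)
# effective contribution count_c * weight_c is (approximately) equal per class
print({c: counts[c] * w for c, w in weights.items()})
```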

Top/Bottom Classes

Two columns showing:
  • Most Common Selected Classes: Top 5 by sample count
  • Least Common Selected Classes: Bottom 5 by sample count

Tab 3: Samples & Preprocessing

Preprocessing Preview

Family Selector: Choose a malware family to preview.

Preprocessing Options (target size):
  • 224x224 (default)
  • 256x256
  • 299x299
  • 512x512
All images are resized to this dimension using LANCZOS resampling.
Side-by-side comparison:
  • Original Image: As-is from dataset with original dimensions
  • After Preprocessing: Resized and color-converted preview

Dataset Samples

Displays a 6x6 grid (36 images) showing random samples across all classes.
  • Each image shows dimensions below (e.g., “128x128”)
  • Grid samples are cached in session state to prevent re-randomizing on rerun
  • Images displayed at 150px width
The sample grid helps verify image quality and diversity before training.

Tab 4: Augmentation

Augmentation Presets

Choose from predefined presets or create a custom configuration:
  • None: No augmentation applied
Augmentation is applied during training, not during dataset preparation. This allows on-the-fly augmentation with minimal disk space.

Custom Augmentation Configuration

When the Custom preset is selected:

Geometric Transforms:
  • Horizontal Flip: Flip left-right (safe for malware images)
  • Vertical Flip: Flip top-bottom
  • Orthogonal Rotation: Select from [90°, 180°, 270°]
    • Only 90° multiples (lossless, no interpolation)
Photometric Transforms:
  • Brightness Adjustment: ±0-50% brightness change
  • Contrast Adjustment: ±0-50% contrast change
  • Gaussian Noise: Add random Gaussian noise
Orthogonal rotations (90°/180°/270°) are lossless because they don’t require interpolation. Avoid arbitrary angles for malware images.

Augmentation Preview

Select Family to Preview: Choose a malware family.

Displays 3 side-by-side images:
  1. After Preprocessing: Base image (resized, normalized)
  2. Augmented (Example 1): Random augmentation applied
  3. Augmented (Example 2): Different random augmentation
Refresh Button: Generate new random augmentation examples
Augmentations are applied randomly, so each example shows different transforms. Refresh multiple times to see the full range of augmentations.

Configuration Summary & Save

The final section (bottom of Tab 4) displays:

Summary Metrics

3 metric cards:
  • Selected Classes: Number of families included
  • Total Samples: Sum across selected classes
  • Method: “5-Fold CV” or “Augmentation Preset”

Save Configuration Button

💾 Save Configuration (primary button, centered)
  • Validates all settings
  • Saves to state/workflow.py session state
  • Shows success message with balloons 🎈
  • Enables Model page navigation
After saving, a green checkmark (✅) appears in the sidebar next to “Dataset configured”.

Full Configuration JSON

Expandable section showing complete config structure:
{
  "dataset_path": "repo/malware",
  "total_samples": 10000,
  "num_classes": 25,
  "selected_families": ["Ramnit", "Lollipop", ...],
  "split": {
    "method": "fixed_split",
    "train": 70.0,
    "val": 15.0,
    "test": 15.0,
    "stratified": true,
    "random_seed": 73
  },
  "augmentation": {
    "preset": "Moderate"
  },
  "preprocessing": {
    "target_size": [224, 224],
    "normalization": "[0,1]",
    "color_mode": "RGB"
  },
  "imbalance_handling": {
    "strategy": "Auto Class Weights (Recommended)",
    "class_weights": null,
    "smote_ratio": null
  }
}

Tips & Best Practices

Start with Auto Class Weights: This is the safest and most effective approach for handling class imbalance without data manipulation.
Use Stratified Splits: Always enable stratified splitting to maintain class proportions across train/val/test sets.
Preview Before Training: Check the augmentation preview to ensure transforms are appropriate for malware images.
Avoid SMOTE for image data. It creates unrealistic blended pixels. Use Auto Class Weights instead.

Next Steps

After saving your dataset configuration:

Model Builder

Design your neural network architecture in the Model page
