
Overview

The UC Intel Final platform's dataset preparation system handles the full data pipeline for malware image classification: scanning, splitting, augmentation, and loading of malware binary visualizations.

Dataset Structure

The platform expects datasets organized in a directory structure where each subdirectory represents a malware family:
dataset/
├── Family1/
│   ├── sample1.png
│   ├── sample2.png
│   └── ...
├── Family2/
│   ├── sample1.png
│   └── ...
└── Family3/
    └── ...
Supported image formats: .png, .jpg, .jpeg, .bmp

Scanning Datasets

The scan_dataset() function automatically discovers images and labels from your dataset directory. Source: app/training/dataset.py:37-60
def scan_dataset(
    dataset_path: Path, selected_families: list[str] | None = None
) -> tuple[list[Path], list[int], list[str]]:
    """Scan dataset directory and return image paths, labels, and class names."""
    image_paths = []
    labels = []
    class_names = []

    # Get all family directories
    family_dirs = sorted([d for d in dataset_path.iterdir() if d.is_dir()])

    # Filter if selected_families specified
    if selected_families:
        family_dirs = [d for d in family_dirs if d.name in selected_families]

    for class_idx, family_dir in enumerate(family_dirs):
        class_names.append(family_dir.name)
        # Get all images in this family
        for img_file in family_dir.iterdir():
            if img_file.suffix.lower() in [".png", ".jpg", ".jpeg", ".bmp"]:
                image_paths.append(img_file)
                labels.append(class_idx)

    return image_paths, labels, class_names
The function returns three values:
  • image_paths: List of Path objects to each image
  • labels: Integer labels (0 to num_classes-1) for each image
  • class_names: Sorted list of malware family names
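
The scan logic can be exercised end to end with a throwaway directory tree. The sketch below mirrors scan_dataset (sorted family directories, labels assigned by index, non-image files skipped); the helper name and family names are illustrative, not part of the platform:

```python
import tempfile
from pathlib import Path

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".bmp"}

def scan(dataset_path: Path) -> tuple[list[Path], list[int], list[str]]:
    """Mirror of scan_dataset: sorted family dirs -> (paths, labels, class_names)."""
    image_paths, labels, class_names = [], [], []
    family_dirs = sorted(d for d in dataset_path.iterdir() if d.is_dir())
    for class_idx, family_dir in enumerate(family_dirs):
        class_names.append(family_dir.name)
        for img_file in family_dir.iterdir():
            if img_file.suffix.lower() in IMAGE_EXTS:
                image_paths.append(img_file)
                labels.append(class_idx)
    return image_paths, labels, class_names

# Build a toy dataset: two families, three images total
root = Path(tempfile.mkdtemp())
(root / "Allaple").mkdir()
(root / "Yuner").mkdir()
for name in ["a.png", "b.jpg"]:
    (root / "Allaple" / name).touch()
(root / "Yuner" / "c.bmp").touch()
(root / "Yuner" / "notes.txt").touch()  # skipped: not an image extension

paths, labels, class_names = scan(root)
print(class_names)     # ['Allaple', 'Yuner']
print(sorted(labels))  # [0, 0, 1]
```

Note that labels depend on the sorted order of directory names, so adding or removing a family renumbers the classes; keep class_names alongside any saved model.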

Creating Train/Val/Test Splits

The platform uses stratified splitting to maintain class distribution across splits. Source: app/training/dataset.py:63-96

Split Function

def create_splits(
    image_paths: list[Path],
    labels: list[int],
    train_ratio: float = 0.7,
    val_ratio: float = 0.15,
    test_ratio: float = 0.15,
    stratified: bool = True,
    random_seed: int = 72,
) -> dict:
    """Create train/val/test splits."""
    # First split: train vs (val+test)
    train_paths, temp_paths, train_labels, temp_labels = train_test_split(
        image_paths,
        labels,
        test_size=(val_ratio + test_ratio),
        random_state=random_seed,
        stratify=labels if stratified else None,
    )

    # Second split: val vs test
    val_test_ratio = test_ratio / (val_ratio + test_ratio)
    val_paths, test_paths, val_labels, test_labels = train_test_split(
        temp_paths,
        temp_labels,
        test_size=val_test_ratio,
        random_state=random_seed,
        stratify=temp_labels if stratified else None,
    )

    return {
        "train": {"paths": train_paths, "labels": train_labels},
        "val": {"paths": val_paths, "labels": val_labels},
        "test": {"paths": test_paths, "labels": test_labels},
    }
The split proceeds in three steps:
  1. First Split: Separate training data from validation + test data using the specified train_ratio
  2. Second Split: Split the remaining data into validation and test sets
  3. Stratification: When stratified=True, maintains class distribution in all splits to prevent bias
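
One subtlety in the second stage: after the first split, only val_ratio + test_ratio of the data remains, so the second train_test_split needs a relative test size, test_ratio / (val_ratio + test_ratio). A plain-Python check of the arithmetic for a hypothetical 1,000-sample dataset:

```python
train_ratio, val_ratio, test_ratio = 0.7, 0.15, 0.15
n = 1000  # hypothetical dataset size

# Stage 1: hold out (val + test) together
n_temp = round(n * (val_ratio + test_ratio))  # 300 samples left for val/test
n_train = n - n_temp                          # 700

# Stage 2: split the held-out pool with a *relative* test size
val_test_ratio = test_ratio / (val_ratio + test_ratio)  # 0.15 / 0.30 = 0.5
n_test = round(n_temp * val_test_ratio)       # 150
n_val = n_temp - n_test                       # 150

print(n_train, n_val, n_test)  # 700 150 150
```

Passing test_ratio directly to the second split would yield only 15% of the *remaining* 30%, i.e. a 4.5% test set, which is why the ratio must be rescaled.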

Configuration Example

split_config = {
    "train": 70,        # 70% for training
    "val": 15,          # 15% for validation
    "test": 15,         # 15% for testing
    "stratified": True, # Maintain class distribution
    "random_seed": 72   # For reproducibility
}

Data Augmentation

Augmentation helps prevent overfitting and improves model generalization. The platform supports preset and custom augmentation strategies. Source: app/training/transforms.py:6-90

Augmentation Presets

Light

  • Random horizontal flip (50% probability)
  • Random 90-degree rotations (0°, 90°, 180°, 270°)
  • Suitable for datasets with moderate diversity

Moderate

  • Random horizontal flip (50%)
  • Random vertical flip (50%)
  • Random 90-degree rotations
  • Color jitter: brightness ±10%, contrast ±10%
  • Balanced approach for most use cases

Heavy

  • All Moderate augmentations
  • Increased color jitter: brightness ±20%, contrast ±20%
  • Gaussian blur (kernel=3, sigma=0.1-0.5)
  • Use with small or highly imbalanced datasets

Custom Augmentation

augmentation_config = {
    "preset": "Custom",
    "custom": {
        "horizontal_flip": True,
        "vertical_flip": True,
        "rotation": True,
        "rotation_angles": [90, 180, 270],
        "brightness_range": 20,  # ±20%
        "contrast_range": 20,    # ±20%
        "gaussian_noise": True
    }
}

Transform Pipeline

The create_train_transforms() function builds a transform pipeline:
def create_train_transforms(config: dict) -> transforms.Compose:
    # target_size, color_mode, preset, and normalization below are read from config
    transform_list = []
    
    # 1. Resize to target size
    transform_list.append(transforms.Resize(target_size))
    
    # 2. Color mode conversion (if needed)
    if color_mode == "Grayscale":
        transform_list.append(transforms.Grayscale(num_output_channels=3))
    
    # 3. Augmentation transforms (based on preset/custom)
    if preset == "Moderate":
        transform_list.extend([
            transforms.RandomHorizontalFlip(p=0.5),
            transforms.RandomVerticalFlip(p=0.5),
            transforms.RandomChoice([...]),  # Rotations
            transforms.ColorJitter(brightness=0.1, contrast=0.1),
        ])
    
    # 4. Convert to tensor
    transform_list.append(transforms.ToTensor())
    
    # 5. Normalization
    if normalization == "ImageNet Mean/Std":
        transform_list.append(
            transforms.Normalize(
                mean=[0.485, 0.456, 0.406], 
                std=[0.229, 0.224, 0.225]
            )
        )
    
    return transforms.Compose(transform_list)
Validation and test sets should NOT use augmentation. Use create_val_transforms() which only applies resizing and normalization.
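
The ImageNet normalization maps each channel value x in [0, 1] to (x - mean) / std with the constants shown above. A quick stand-alone check of that arithmetic (plain Python, mirroring what transforms.Normalize computes per channel):

```python
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

def normalize_pixel(rgb):
    """Apply (x - mean) / std per channel, as transforms.Normalize does."""
    return [(x - m) / s for x, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)]

gray = normalize_pixel([0.5, 0.5, 0.5])
print([round(v, 3) for v in gray])  # [0.066, 0.196, 0.418]

black = normalize_pixel([0.0, 0.0, 0.0])  # all channels map to about -2
white = normalize_pixel([1.0, 1.0, 1.0])  # all channels map to about +2.2 to +2.6
```

The result is roughly zero-centered with unit-ish spread, which is what pretrained ImageNet backbones expect at their input.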

Handling Class Imbalance

Malware datasets often have imbalanced class distributions. The platform provides two strategies:

1. Class Weights

Compute inverse frequency weights to penalize misclassification of rare classes more heavily. Source: app/training/dataset.py:99-110
def compute_class_weights(labels: list[int], num_classes: int) -> torch.Tensor:
    """Compute inverse frequency class weights for imbalanced data."""
    counter = Counter(labels)
    total = len(labels)

    weights = []
    for i in range(num_classes):
        count = counter.get(i, 1)
        weight = total / (num_classes * count)
        weights.append(weight)

    return torch.tensor(weights, dtype=torch.float32)
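
The formula total / (num_classes * count) gives a weight of 1.0 to a perfectly balanced class and scales up as a class gets rarer. The same arithmetic without torch, on a hypothetical 90/10 class split (names here are illustrative):

```python
from collections import Counter

def inverse_frequency_weights(labels, num_classes):
    """total / (num_classes * count) per class, matching compute_class_weights."""
    counter = Counter(labels)
    total = len(labels)
    return [total / (num_classes * counter.get(i, 1)) for i in range(num_classes)]

# 90 samples of class 0, 10 samples of class 1
labels = [0] * 90 + [1] * 10
weights = inverse_frequency_weights(labels, num_classes=2)
print(weights)  # class 1 is 9x rarer, so it gets 9x the weight
```

Passed to a loss such as CrossEntropyLoss(weight=...), these weights make a mistake on the rare class cost nine times as much as one on the common class.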

2. Weighted Random Sampler

Oversample minority classes during training to balance batch composition. Source: app/training/dataset.py:113-126
def create_weighted_sampler(
    labels: list[int], num_classes: int
) -> WeightedRandomSampler:
    """Create weighted random sampler for imbalanced data."""
    class_weights = compute_class_weights(labels, num_classes)

    # Assign weight to each sample
    sample_weights = [class_weights[label].item() for label in labels]

    return WeightedRandomSampler(
        weights=sample_weights,
        num_samples=len(labels),
        replacement=True,
    )
When to use each strategy:
  • Class Weights: Use with Cross-Entropy or Focal Loss when you want to keep natural class distribution but penalize errors on rare classes
  • Weighted Sampler: Use when you want balanced batches by oversampling minority classes (can increase training time)
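
Why the sampler balances batches: each sample is drawn with probability proportional to its weight, and a class's total probability is count x weight, which the inverse-frequency formula makes equal across classes. A plain-Python check on the same hypothetical 90/10 data:

```python
from collections import Counter

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1
labels = [0] * 90 + [1] * 10
counts = Counter(labels)
total, num_classes = len(labels), 2

# Per-class inverse-frequency weights, as compute_class_weights produces
class_w = {c: total / (num_classes * counts[c]) for c in counts}

# Per-sample weights, as create_weighted_sampler assigns them
sample_w = [class_w[lab] for lab in labels]
z = sum(sample_w)

# Total draw probability per class under the sampler
mass = {c: sum(w for lab, w in zip(labels, sample_w) if lab == c) / z
        for c in counts}
print(mass)  # both classes get 0.5: batches are balanced in expectation
```

Because replacement=True, minority samples are drawn repeatedly within an epoch, which is the oversampling the bullet above refers to.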

PyTorch Dataset and DataLoader

MalwareDataset Class

Source: app/training/dataset.py:13-34
class MalwareDataset(Dataset):
    """PyTorch Dataset for malware images."""

    def __init__(self, image_paths: list[Path], labels: list[int], transform=None):
        self.image_paths = image_paths
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img_path = self.image_paths[idx]
        label = self.labels[idx]

        # Load image
        image = Image.open(img_path).convert("RGB")

        if self.transform:
            image = self.transform(image)

        return image, label
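
The Dataset contract is only __len__ and __getitem__; batching, shuffling, and workers belong to the DataLoader. A torch-free sketch of the same pattern, with a string stub standing in for the PIL image load:

```python
class TinyDataset:
    """Minimal Dataset-style class: index -> (sample, label)."""

    def __init__(self, paths, labels, transform=None):
        self.paths = paths
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Stub for Image.open(path).convert("RGB")
        sample = f"pixels:{self.paths[idx]}"
        if self.transform:
            sample = self.transform(sample)
        return sample, self.labels[idx]

ds = TinyDataset(["a.png", "b.png"], [0, 1], transform=str.upper)
print(len(ds))  # 2
print(ds[1])    # ('PIXELS:B.PNG', 1)
```

Loading images lazily in __getitem__, as MalwareDataset does, keeps memory flat regardless of dataset size and lets DataLoader workers parallelize the decoding.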

Creating DataLoaders

The create_dataloaders() function orchestrates the entire pipeline. Source: app/training/dataset.py:129-249
dataloaders, class_names, class_weights = create_dataloaders(
    dataset_config={
        "dataset_path": "dataset",
        "selected_families": None,  # or ["Family1", "Family2"]
        "split": {
            "train": 70,
            "val": 15,
            "test": 15,
            "stratified": True,
            "random_seed": 72
        },
        "preprocessing": {
            "target_size": (224, 224),
            "normalization": "ImageNet Mean/Std",
            "color_mode": "RGB"
        },
        "augmentation": {
            "preset": "Moderate"
        }
    },
    training_config={
        "batch_size": 32,
        "class_weights": "Auto Class Weights"
    },
    num_workers=4
)
Returns:
  • dataloaders: Dictionary with 'train', 'val', and 'test' DataLoader objects
  • class_names: List of malware family names
  • class_weights: Tensor of class weights (or None)
Internally, create_dataloaders() runs a six-step pipeline:
  1. Scan Dataset: Discover all images and create label mappings
  2. Create Splits: Split data into train/val/test with stratification
  3. Create Transforms: Build augmentation pipelines for training and validation
  4. Create Datasets: Instantiate PyTorch Dataset objects for each split
  5. Compute Class Weights: Calculate weights for imbalanced data handling (if enabled)
  6. Create DataLoaders: Build DataLoader objects with proper batching and sampling

Best Practices

Split Ratios

  • 70/15/15: Standard split for moderate-sized datasets (1000+ samples per class)
  • 80/10/10: Use when you have larger datasets and want more training data
  • 60/20/20: Use when validation is critical and dataset is smaller

Batch Size

  • 32: Good default for most GPUs
  • 64-128: Use with larger GPUs and simpler models
  • 8-16: Use with limited memory or very large models

Augmentation Strategy

  • Start with Light or Moderate augmentation
  • Use Heavy only if you observe significant overfitting
  • For malware binaries, geometric transforms (rotation, flip) are usually more important than color transforms

Random Seed

  • Always set random_seed for reproducibility
  • Use the same seed across experiments for fair comparison
  • Document the seed in experiment logs
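
The reproducibility claim is easy to verify: the same seed yields the same shuffle order. A stdlib sketch using the platform's default seed of 72 (the real code passes the seed to train_test_split as random_state):

```python
import random

def shuffled(items, seed):
    rng = random.Random(seed)  # isolated RNG, leaves global random state untouched
    out = list(items)
    rng.shuffle(out)
    return out

a = shuffled(range(10), seed=72)
b = shuffled(range(10), seed=72)
print(a == b)  # True: same seed reproduces the same order
```

Using a dedicated random.Random instance (rather than seeding the global module) keeps the split deterministic even if other code consumes random numbers in between.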

Common Issues

Out of Memory Errors

Solutions:
  • Reduce batch_size
  • Reduce num_workers (try 2 or 0)
  • Reduce target_size (e.g., from 224 to 128)
  • Disable pin_memory if using MPS/CPU

Slow Data Loading

Solutions:
  • Increase num_workers (try 4-8 on multi-core CPUs)
  • Enable pin_memory when using CUDA
  • Convert images to a faster format (PNG is good, avoid BMP)
  • Use smaller image sizes if possible

Poor Performance on Minority Classes

Solutions:
  • Enable stratified splitting
  • Use WeightedRandomSampler with class_weights="Auto Class Weights"
  • Increase batch size for better class distribution

Next Steps

Model Selection

Learn how to choose and configure model architectures

Hyperparameters

Optimize training hyperparameters for best performance
