Experimental Setup

All experiments were executed under controlled conditions to ensure reproducibility and valid hypothesis testing.

Execution Environment

Hardware: NVIDIA T4 GPU with 16GB VRAM
Framework: PyTorch 2.x with CUDA 12.x
Reproducibility: Fixed random seed (seed=42) for all executions
Early Stopping: Triggered on validation loss (patience of 15-20 epochs depending on the architecture; see the hyperparameter table)
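The fixed-seed setup above can be sketched in PyTorch as a small helper (the function name `set_seed` is ours; the cuDNN flags trade some speed for deterministic convolutions):

```python
import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Fix every RNG source used during training (hypothetical helper)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op when no GPU is present
    # Trade speed for determinism in cuDNN convolution algorithms.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)
```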

Dataset Configuration

MalImg Dataset:
  • Total samples: 9,339
  • Malware families: 25
  • Image format: Grayscale
  • Input size: 224×224 pixels
Data Partitioning:
  • Training: 70% (6,537 samples)
  • Validation: 15% (1,401 samples)
  • Test: 15% (1,401 samples)
  • Stratification: Yes, maintaining class proportions
Class Imbalance: 5 families identified with fewer than 100 samples (minority classes), motivating the data augmentation evaluation in H2.
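The stratified 70/15/15 partition can be reproduced with scikit-learn by splitting twice (a sketch; the random placeholder labels stand in for the real MalImg family annotations):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder labels: one family id (0-24) per image.
labels = np.random.randint(0, 25, size=9339)
idx = np.arange(len(labels))

# First carve out 70% for training, stratified by family.
train_idx, rest_idx = train_test_split(
    idx, train_size=0.70, stratify=labels, random_state=42)

# Split the remaining 30% evenly into validation and test (15% each),
# again preserving class proportions.
val_idx, test_idx = train_test_split(
    rest_idx, train_size=0.50, stratify=labels[rest_idx], random_state=42)
```

With 9,339 samples this yields exactly the 6,537 / 1,401 / 1,401 counts listed above.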

Experiment 1: Architecture Comparison (H1)

Objective

Verify hypothesis H1: “Pre-trained ResNet50 will outperform both custom CNN and Vision Transformer in accuracy and macro F1-score.”

Model Configurations

Four architectures were evaluated to compare different deep learning approaches:

1. Conventional CNN (Baseline)

  • Architecture: 2 convolutional blocks (standard course configuration)
  • Structure: Conv2D + MaxPool + Conv2D + MaxPool + Flatten + Dense
  • Purpose: Establish baseline comparison
  • Training time: 39 minutes
  • Best epoch: 9

2. VGG-Mini-H1 (5 blocks)

  • Architecture: 5 convolutional blocks (32→64→128→256→512 filters)
  • Each block: Conv2D + BatchNorm + ReLU + MaxPool(2×2)
  • Classifier: GlobalAvgPool + Dropout(0.5) + Dense(256) + Dropout(0.3) + Output
  • Parameters: ~210,000 trainable
  • Training time: 397 minutes
  • Best epoch: 10
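A minimal PyTorch sketch of VGG-Mini-H1 as described above (3x3 kernels with "same" padding and the ReLU inside the classifier are assumptions not stated in the text):

```python
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """One VGG-Mini-H1 block: Conv2D + BatchNorm + ReLU + MaxPool(2x2)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class VGGMiniH1(nn.Module):
    def __init__(self, num_classes: int = 25):
        super().__init__()
        widths = [32, 64, 128, 256, 512]
        blocks, in_ch = [], 1  # grayscale input
        for w in widths:
            blocks.append(conv_block(in_ch, w))
            in_ch = w
        self.features = nn.Sequential(*blocks)
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),  # GlobalAvgPool
            nn.Flatten(),
            nn.Dropout(0.5),
            nn.Linear(512, 256),
            nn.ReLU(inplace=True),
            nn.Dropout(0.3),
            nn.Linear(256, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))
```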

3. ResNet50 (Fine-tuning)

  • Backbone: ResNet50 pre-trained on ImageNet
  • Strategy: Partial fine-tuning (last 20-30 layers unfrozen)
  • Classifier: Dropout(0.5) + Dense(512) + Output
  • Learning rate: 0.0001 (lower than custom CNN)
  • Optimizer: Adam
  • LR Scheduler: Cosine Annealing
  • Training time: 57 minutes
  • Best epoch: 6

4. Vision Transformer (ViT-Small)

  • Patch size: 16×16
  • Embedding dimension: 384
  • Depth: 12 transformer blocks
  • Attention heads: 6
  • MLP ratio: 4.0
  • Dropout: 0.1
  • Optimizer: AdamW
  • Learning rate: 0.0001
  • Weight decay: 0.01
  • Training time: 76 minutes
  • Best epoch: 10

Hyperparameters

Parameter        CNN/Baseline        ResNet50           ViT-Small
Optimizer        Adam                Adam               AdamW
Learning rate    0.001               0.0001             0.0001
LR Scheduler     ReduceLROnPlateau   Cosine Annealing   Cosine Annealing
Batch size       32                  32                 32
Max epochs       100                 100                100
Early stopping   15 epochs           15 epochs          20 epochs
Weight decay     0.0001              0.0001             0.01
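The table translates to PyTorch optimizer/scheduler setup roughly as follows (a sketch; `make_optimizer` and the architecture tags are our names, and the plateau patience is an assumption):

```python
import torch

def make_optimizer(model: torch.nn.Module, arch: str):
    """Optimizer + LR schedule per architecture, following the table above."""
    if arch == "vit_small":
        opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
        sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=100)
    elif arch == "resnet50":
        opt = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)
        sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=100)
    else:  # custom CNN baselines
        opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
        # patience value is an assumption; the table only names the scheduler
        sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, patience=5)
    return opt, sched
```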

Experiment 2: Data Augmentation Impact (H2)

Objective

Verify hypothesis H2: “Data augmentation will improve minority class recall by ≥15 percentage points without substantially degrading global accuracy.”

Configuration

The best model from H1 (ResNet50) served as the base architecture, and two versions were trained:
  1. Without augmentation - baseline
  2. With moderate augmentation - experimental condition

Augmentation Techniques Applied

The augmentation strategy focuses on transformations that preserve the semantic meaning of malware binary representations.
Transformations:
  • Orthogonal rotations: 90°, 180°, 270° (random selection)
    • Rationale: Preserves byte-to-pixel correspondence
  • Horizontal flip: 50% probability
  • Vertical flip: 50% probability
  • Brightness adjustment: ±15%
    • Rationale: Maintains relative pixel relationships
  • Contrast adjustment: ±15%
Application:
  • Applied on-the-fly during training
  • Random selection per sample per epoch
  • No additional storage required
Important Consideration: Unlike natural images, malware visualizations require careful augmentation. Arbitrary rotations requiring interpolation are avoided as they modify pixel values and destroy the direct byte representation.

Target Metrics

  • Primary: Recall improvement for 5 minority classes (families with <100 samples)
  • Secondary: Impact on global accuracy (acceptable degradation: ≤2%)
  • Tertiary: Macro F1-score to assess overall balanced performance
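The primary and secondary criteria could be checked programmatically, for example (a sketch; `h2_criteria` is a hypothetical helper operating on label arrays):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

def h2_criteria(y_true, y_pred_base, y_pred_aug, minority_classes) -> bool:
    """True when H2 holds: minority recall up >= 15 pp, accuracy drop <= 2 pp."""
    rec_base = recall_score(y_true, y_pred_base, labels=minority_classes,
                            average=None, zero_division=0)
    rec_aug = recall_score(y_true, y_pred_aug, labels=minority_classes,
                           average=None, zero_division=0)
    recall_gain = float((rec_aug - rec_base).mean()) * 100  # percentage points
    acc_drop = (accuracy_score(y_true, y_pred_base)
                - accuracy_score(y_true, y_pred_aug)) * 100
    return recall_gain >= 15.0 and acc_drop <= 2.0
```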

Experiment 3: CNN Depth Effect (H3)

Objective

Verify hypothesis H3: “Increasing CNN depth from 3 to 5 blocks will improve F1-score by ≥8 points, but with diminishing returns and higher computational cost.”

Architecture Comparison

Two custom CNN architectures were compared with varying depth:

Shallow CNN (3 blocks)

Structure:
  • Block 1: Conv(32) + Conv(32) + MaxPool + Dropout(0.25)
  • Block 2: Conv(64) + Conv(64) + MaxPool + Dropout(0.25)
  • Block 3: Conv(128) + Conv(128) + MaxPool + Dropout(0.25)
  • Classifier: Flatten + Dense(512) + Dropout(0.5) + Dense(256) + Output
Parameters: ~150,000 trainable

Deep CNN (5 blocks) - H2_MOD.A

Structure:
  • Block 1: Conv(32) + Conv(32) + MaxPool
  • Block 2: Conv(64) + Conv(64) + MaxPool
  • Block 3: Conv(128) + Conv(128) + MaxPool + Dropout(0.25)
  • Block 4: Conv(256) + MaxPool
  • Block 5: Conv(512) + MaxPool
  • Classifier: Flatten + Dense(512) + Dropout(0.5) + Dense(256) + Output
Parameters: ~210,000 trainable (+40% vs 3-block)
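Both depth variants can be expressed with a single builder (a sketch; 3x3 kernels with "same" padding are assumptions, and `nn.LazyLinear` infers the flattened size on the first forward pass):

```python
import torch
import torch.nn as nn

def build_cnn(depth: int, num_classes: int = 25) -> nn.Sequential:
    """Builds the 3- or 5-block custom CNN from the spec above.
    Each tuple: (filters, convs per block, dropout or None)."""
    specs = {
        3: [(32, 2, 0.25), (64, 2, 0.25), (128, 2, 0.25)],
        5: [(32, 2, None), (64, 2, None), (128, 2, 0.25),
            (256, 1, None), (512, 1, None)],
    }[depth]
    layers, in_ch = [], 1  # grayscale input
    for out_ch, n_convs, drop in specs:
        for _ in range(n_convs):
            layers += [nn.Conv2d(in_ch, out_ch, 3, padding=1),
                       nn.ReLU(inplace=True)]
            in_ch = out_ch
        layers.append(nn.MaxPool2d(2))
        if drop is not None:
            layers.append(nn.Dropout(drop))
    layers += [
        nn.Flatten(),
        nn.LazyLinear(512), nn.ReLU(inplace=True), nn.Dropout(0.5),
        nn.Linear(512, 256), nn.ReLU(inplace=True),
        nn.Linear(256, num_classes),
    ]
    return nn.Sequential(*layers)
```

Everything except the `specs` entry is shared between the two conditions, matching the ceteris paribus design of H3.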

Controlled Variables

Both models trained with identical hyperparameters:
  • Optimizer: Adam (lr=0.001)
  • Epochs: 10
  • Batch size: 32
  • Loss function: Categorical Cross-Entropy
  • Activation: ReLU
  • No augmentation applied to isolate depth effect

Measured Aspects

  1. Performance Metrics:
    • Validation accuracy
    • Test accuracy
    • Macro F1-score
    • Train/validation loss
  2. Generalization Analysis:
    • Generalization gap (train accuracy - validation accuracy)
    • Train/Val loss ratio
    • Learning curve stability
  3. Computational Cost:
    • Training time per epoch
    • Total training time
    • GPU memory usage
    • FLOPs (floating-point operations)

Experimental Design Rationale

The experiment follows the ceteris paribus principle (all else constant), where only network depth varies between conditions, allowing isolation of its specific effect on performance and efficiency.

Reproducibility Considerations

All experiments were designed for reproducibility:

Fixed Elements

  • Random seed: 42
  • Dataset splits saved and reused
  • Hyperparameters documented
  • Model architectures version-controlled

Tracked Metrics

  • Training/validation loss per epoch
  • Training/validation accuracy per epoch
  • Precision, recall, F1-score (macro and weighted)
  • Confusion matrices
  • Learning curves
  • Training time
  • GPU memory consumption

Code Organization

# Example experiment tracking structure
experiments/
├── h1_architecture_comparison/
│   ├── baseline_cnn/
│   ├── vgg_mini_h1/
│   ├── resnet50_finetuned/
│   └── vit_small/
├── h2_augmentation_impact/
│   ├── without_augmentation/
│   └── with_augmentation/
└── h3_depth_effect/
    ├── cnn_3_blocks/
    └── cnn_5_blocks/

Data Collection

For each experiment, the following data was systematically collected:

Training Metrics

  • Loss and accuracy curves (training and validation)
  • Best epoch identification
  • Early stopping triggers
  • Learning rate adjustments

Evaluation Metrics

  • Test set accuracy
  • Per-class precision, recall, F1-score
  • Confusion matrix
  • Macro and weighted averages
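All of these come directly from scikit-learn; for illustration, with synthetic predictions standing in for real model output:

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, f1_score

# Synthetic test-set predictions over the 25 families (illustration only).
rng = np.random.default_rng(42)
y_true = rng.integers(0, 25, size=1401)        # one label per test sample
y_pred = y_true.copy()
flip = rng.random(1401) < 0.05                 # corrupt ~5% of predictions
y_pred[flip] = rng.integers(0, 25, size=int(flip.sum()))

macro_f1 = f1_score(y_true, y_pred, average="macro")
weighted_f1 = f1_score(y_true, y_pred, average="weighted")
cm = confusion_matrix(y_true, y_pred, labels=range(25))
report = classification_report(y_true, y_pred, zero_division=0)  # per-class P/R/F1
```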

Computational Metrics

  • Training time (total and per epoch)
  • Inference time
  • GPU memory usage
  • Model size (parameters and disk storage)

Qualitative Analysis

  • Activation maps (Grad-CAM)
  • Feature visualizations
  • Misclassification analysis
  • t-SNE embeddings of learned features
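The t-SNE projection might be computed as follows (a sketch with random placeholder features, kept small for speed; in practice the features come from the trained model's penultimate layer on the test set):

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder features: 200 samples x 64 dims (real features would be
# extracted from the network, e.g. the Dense(256) activations).
features = np.random.default_rng(42).normal(size=(200, 64)).astype(np.float32)

# Project to 2-D for visual inspection of family clusters.
embedding = TSNE(n_components=2, perplexity=30.0, init="pca",
                 random_state=42).fit_transform(features)
```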

All experimental results, including detailed tables, figures, and statistical analysis, are presented in the Results section.
