Experimental Setup

All experiments were executed under controlled conditions to ensure reproducibility and valid hypothesis testing.

Execution Environment

Hardware: NVIDIA T4 GPU with 16GB VRAM
Framework: PyTorch 2.x with CUDA 12.x
Reproducibility: Fixed random seed (seed=42) for all executions
Early Stopping: Triggered on validation loss (patience of 15-20 epochs depending on the architecture; see the hyperparameter table)
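The fixed-seed setup above can be sketched in PyTorch as a small helper (the function name `set_seed` is ours; the cuDNN flags trade some speed for deterministic convolutions):

```python
import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Fix every RNG source used during training (hypothetical helper)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op when no GPU is present
    # Trade speed for determinism in cuDNN convolution algorithms.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)
```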

Dataset Configuration

MalImg Dataset:
  • Total samples: 9,339
  • Malware families: 25
  • Image format: Grayscale
  • Input size: 224×224 pixels
Data Partitioning:
  • Training: 70% (6,537 samples)
  • Validation: 15% (1,401 samples)
  • Test: 15% (1,401 samples)
  • Stratification: Yes, maintaining class proportions
Class Imbalance: 5 families identified with fewer than 100 samples (minority classes), motivating the data augmentation evaluation in H2.
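The stratified 70/15/15 partition can be reproduced with scikit-learn by splitting twice (a sketch; the random placeholder labels stand in for the real MalImg family annotations):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder labels: one family id (0-24) per image.
labels = np.random.randint(0, 25, size=9339)
idx = np.arange(len(labels))

# First carve out 70% for training, stratified by family.
train_idx, rest_idx = train_test_split(
    idx, train_size=0.70, stratify=labels, random_state=42)

# Split the remaining 30% evenly into validation and test (15% each),
# again preserving class proportions.
val_idx, test_idx = train_test_split(
    rest_idx, train_size=0.50, stratify=labels[rest_idx], random_state=42)
```

With 9,339 samples this yields exactly the 6,537 / 1,401 / 1,401 counts listed above.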

Experiment 1: Architecture Comparison (H1)

Objective

Verify hypothesis H1: “Pre-trained ResNet50 will outperform both custom CNN and Vision Transformer in accuracy and macro F1-score.”

Model Configurations

Four architectures were evaluated to compare different deep learning approaches:

1. Conventional CNN (Baseline)

  • Architecture: 2 convolutional blocks (standard course configuration)
  • Structure: Conv2D + MaxPool + Conv2D + MaxPool + Flatten + Dense
  • Purpose: Establish baseline comparison
  • Training time: 39 minutes
  • Best epoch: 9

2. VGG-Mini-H1 (5 blocks)

  • Architecture: 5 convolutional blocks (32→64→128→256→512 filters)
  • Each block: Conv2D + BatchNorm + ReLU + MaxPool(2×2)
  • Classifier: GlobalAvgPool + Dropout(0.5) + Dense(256) + Dropout(0.3) + Output
  • Parameters: ~210,000 trainable
  • Training time: 397 minutes
  • Best epoch: 10
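A minimal PyTorch sketch of VGG-Mini-H1 as described above (3x3 kernels with "same" padding and the ReLU inside the classifier are assumptions not stated in the text):

```python
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """One VGG-Mini-H1 block: Conv2D + BatchNorm + ReLU + MaxPool(2x2)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class VGGMiniH1(nn.Module):
    def __init__(self, num_classes: int = 25):
        super().__init__()
        widths = [32, 64, 128, 256, 512]
        blocks, in_ch = [], 1  # grayscale input
        for w in widths:
            blocks.append(conv_block(in_ch, w))
            in_ch = w
        self.features = nn.Sequential(*blocks)
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),  # GlobalAvgPool
            nn.Flatten(),
            nn.Dropout(0.5),
            nn.Linear(512, 256),
            nn.ReLU(inplace=True),
            nn.Dropout(0.3),
            nn.Linear(256, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))
```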

3. ResNet50 (Fine-tuning)

  • Backbone: ResNet50 pre-trained on ImageNet
  • Strategy: Partial fine-tuning (last 20-30 layers unfrozen)
  • Classifier: Dropout(0.5) + Dense(512) + Output
  • Learning rate: 0.0001 (lower than custom CNN)
  • Optimizer: Adam
  • LR Scheduler: Cosine Annealing
  • Training time: 57 minutes
  • Best epoch: 6

4. Vision Transformer (ViT-Small)

  • Patch size: 16×16
  • Embedding dimension: 384
  • Depth: 12 transformer blocks
  • Attention heads: 6
  • MLP ratio: 4.0
  • Dropout: 0.1
  • Optimizer: AdamW
  • Learning rate: 0.0001
  • Weight decay: 0.01
  • Training time: 76 minutes
  • Best epoch: 10

Hyperparameters

Parameter        CNN/Baseline        ResNet50           ViT-Small
Optimizer        Adam                Adam               AdamW
Learning rate    0.001               0.0001             0.0001
LR Scheduler     ReduceLROnPlateau   Cosine Annealing   Cosine Annealing
Batch size       32                  32                 32
Max epochs       100                 100                100
Early stopping   15 epochs           15 epochs          20 epochs
Weight decay     0.0001              0.0001             0.01
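The table translates to PyTorch optimizer/scheduler setup roughly as follows (a sketch; `make_optimizer` and the architecture tags are our names, and the plateau patience is an assumption):

```python
import torch

def make_optimizer(model: torch.nn.Module, arch: str):
    """Optimizer + LR schedule per architecture, following the table above."""
    if arch == "vit_small":
        opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
        sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=100)
    elif arch == "resnet50":
        opt = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)
        sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=100)
    else:  # custom CNN baselines
        opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
        # patience value is an assumption; the table only names the scheduler
        sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, patience=5)
    return opt, sched
```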

Experiment 2: Data Augmentation Impact (H2)

Objective

Verify hypothesis H2: “Data augmentation will improve minority class recall by ≥15 percentage points without substantially degrading global accuracy.”

Configuration

The best model from H1 (ResNet50) served as the base architecture, and two versions were trained:
  1. Without augmentation - baseline
  2. With moderate augmentation - experimental condition

Augmentation Techniques Applied

The augmentation strategy focuses on transformations that preserve the semantic meaning of malware binary representations.
Transformations:
  • Orthogonal rotations: 90°, 180°, 270° (random selection)
    • Rationale: Preserves byte-to-pixel correspondence
  • Horizontal flip: 50% probability
  • Vertical flip: 50% probability
  • Brightness adjustment: ±15%
    • Rationale: Maintains relative pixel relationships
  • Contrast adjustment: ±15%
Application:
  • Applied on-the-fly during training
  • Random selection per sample per epoch
  • No additional storage required
Important Consideration: Unlike natural images, malware visualizations require careful augmentation. Arbitrary rotations requiring interpolation are avoided as they modify pixel values and destroy the direct byte representation.

Target Metrics

  • Primary: Recall improvement for 5 minority classes (families with <100 samples)
  • Secondary: Impact on global accuracy (acceptable degradation: ≤2%)
  • Tertiary: Macro F1-score to assess overall balanced performance
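The primary and secondary criteria could be checked programmatically, for example (a sketch; `h2_criteria` is a hypothetical helper operating on label arrays):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

def h2_criteria(y_true, y_pred_base, y_pred_aug, minority_classes) -> bool:
    """True when H2 holds: minority recall up >= 15 pp, accuracy drop <= 2 pp."""
    rec_base = recall_score(y_true, y_pred_base, labels=minority_classes,
                            average=None, zero_division=0)
    rec_aug = recall_score(y_true, y_pred_aug, labels=minority_classes,
                           average=None, zero_division=0)
    recall_gain = float((rec_aug - rec_base).mean()) * 100  # percentage points
    acc_drop = (accuracy_score(y_true, y_pred_base)
                - accuracy_score(y_true, y_pred_aug)) * 100
    return recall_gain >= 15.0 and acc_drop <= 2.0
```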

Experiment 3: CNN Depth Effect (H3)

Objective

Verify hypothesis H3: “Increasing CNN depth from 3 to 5 blocks will improve F1-score by ≥8 points, but with diminishing returns and higher computational cost.”

Architecture Comparison

Two custom CNN architectures were compared with varying depth:

Shallow CNN (3 blocks)

Structure:
  • Block 1: Conv(32) + Conv(32) + MaxPool + Dropout(0.25)
  • Block 2: Conv(64) + Conv(64) + MaxPool + Dropout(0.25)
  • Block 3: Conv(128) + Conv(128) + MaxPool + Dropout(0.25)
  • Classifier: Flatten + Dense(512) + Dropout(0.5) + Dense(256) + Output
Parameters: ~150,000 trainable

Deep CNN (5 blocks) - H2_MOD.A

Structure:
  • Block 1: Conv(32) + Conv(32) + MaxPool
  • Block 2: Conv(64) + Conv(64) + MaxPool
  • Block 3: Conv(128) + Conv(128) + MaxPool + Dropout(0.25)
  • Block 4: Conv(256) + MaxPool
  • Block 5: Conv(512) + MaxPool
  • Classifier: Flatten + Dense(512) + Dropout(0.5) + Dense(256) + Output
Parameters: ~210,000 trainable (+40% vs 3-block)
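Both depth variants can be expressed with a single builder (a sketch; 3x3 kernels with "same" padding are assumptions, and `nn.LazyLinear` infers the flattened size on the first forward pass):

```python
import torch
import torch.nn as nn

def build_cnn(depth: int, num_classes: int = 25) -> nn.Sequential:
    """Builds the 3- or 5-block custom CNN from the spec above.
    Each tuple: (filters, convs per block, dropout or None)."""
    specs = {
        3: [(32, 2, 0.25), (64, 2, 0.25), (128, 2, 0.25)],
        5: [(32, 2, None), (64, 2, None), (128, 2, 0.25),
            (256, 1, None), (512, 1, None)],
    }[depth]
    layers, in_ch = [], 1  # grayscale input
    for out_ch, n_convs, drop in specs:
        for _ in range(n_convs):
            layers += [nn.Conv2d(in_ch, out_ch, 3, padding=1),
                       nn.ReLU(inplace=True)]
            in_ch = out_ch
        layers.append(nn.MaxPool2d(2))
        if drop is not None:
            layers.append(nn.Dropout(drop))
    layers += [
        nn.Flatten(),
        nn.LazyLinear(512), nn.ReLU(inplace=True), nn.Dropout(0.5),
        nn.Linear(512, 256), nn.ReLU(inplace=True),
        nn.Linear(256, num_classes),
    ]
    return nn.Sequential(*layers)
```

Everything except the `specs` entry is shared between the two conditions, matching the ceteris paribus design of H3.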

Controlled Variables

Both models trained with identical hyperparameters:
  • Optimizer: Adam (lr=0.001)
  • Epochs: 10
  • Batch size: 32
  • Loss function: Categorical Cross-Entropy
  • Activation: ReLU
  • No augmentation applied to isolate depth effect

Measured Aspects

  1. Performance Metrics:
    • Validation accuracy
    • Test accuracy
    • Macro F1-score
    • Train/validation loss
  2. Generalization Analysis:
    • Generalization gap (train accuracy - validation accuracy)
    • Train/Val loss ratio
    • Learning curve stability
  3. Computational Cost:
    • Training time per epoch
    • Total training time
    • GPU memory usage
    • FLOPs (floating-point operations)

Experimental Design Rationale

The experiment follows the ceteris paribus principle (all else constant), where only network depth varies between conditions, allowing isolation of its specific effect on performance and efficiency.

Reproducibility Considerations

All experiments were designed for reproducibility:

Fixed Elements

  • Random seed: 42
  • Dataset splits saved and reused
  • Hyperparameters documented
  • Model architectures version-controlled

Tracked Metrics

  • Training/validation loss per epoch
  • Training/validation accuracy per epoch
  • Precision, recall, F1-score (macro and weighted)
  • Confusion matrices
  • Learning curves
  • Training time
  • GPU memory consumption

Code Organization

# Example experiment tracking structure
experiments/
├── h1_architecture_comparison/
│   ├── baseline_cnn/
│   ├── vgg_mini_h1/
│   ├── resnet50_finetuned/
│   └── vit_small/
├── h2_augmentation_impact/
│   ├── without_augmentation/
│   └── with_augmentation/
└── h3_depth_effect/
    ├── cnn_3_blocks/
    └── cnn_5_blocks/

Data Collection

For each experiment, the following data was systematically collected:

Training Metrics

  • Loss and accuracy curves (training and validation)
  • Best epoch identification
  • Early stopping triggers
  • Learning rate adjustments

Evaluation Metrics

  • Test set accuracy
  • Per-class precision, recall, F1-score
  • Confusion matrix
  • Macro and weighted averages
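All of these come directly from scikit-learn; for illustration, with synthetic predictions standing in for real model output:

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, f1_score

# Synthetic test-set predictions over the 25 families (illustration only).
rng = np.random.default_rng(42)
y_true = rng.integers(0, 25, size=1401)        # one label per test sample
y_pred = y_true.copy()
flip = rng.random(1401) < 0.05                 # corrupt ~5% of predictions
y_pred[flip] = rng.integers(0, 25, size=int(flip.sum()))

macro_f1 = f1_score(y_true, y_pred, average="macro")
weighted_f1 = f1_score(y_true, y_pred, average="weighted")
cm = confusion_matrix(y_true, y_pred, labels=range(25))
report = classification_report(y_true, y_pred, zero_division=0)  # per-class P/R/F1
```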

Computational Metrics

  • Training time (total and per epoch)
  • Inference time
  • GPU memory usage
  • Model size (parameters and disk storage)

Qualitative Analysis

  • Activation maps (Grad-CAM)
  • Feature visualizations
  • Misclassification analysis
  • t-SNE embeddings of learned features
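The t-SNE projection might be computed as follows (a sketch with random placeholder features, kept small for speed; in practice the features come from the trained model's penultimate layer on the test set):

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder features: 200 samples x 64 dims (real features would be
# extracted from the network, e.g. the Dense(256) activations).
features = np.random.default_rng(42).normal(size=(200, 64)).astype(np.float32)

# Project to 2-D for visual inspection of family clusters.
embedding = TSNE(n_components=2, perplexity=30.0, init="pca",
                 random_state=42).fit_transform(features)
```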

All experimental results, including detailed tables, figures, and statistical analysis, are presented in the Results section.
