Experimental Setup
All experiments were executed under controlled conditions to ensure reproducibility and valid hypothesis testing.
Execution Environment
Hardware: NVIDIA T4 GPU with 16GB VRAM
Framework: PyTorch 2.x with CUDA 12.x
Reproducibility: Fixed random seed (seed=42) for all executions
Early Stopping: Patience of 10 epochs on validation loss
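The fixed-seed setup above can be sketched as follows (the helper name `set_seed` and the cuDNN determinism flags are our additions; the flags trade some speed for repeatable kernels):

```python
import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Fix the RNGs relevant to a PyTorch training run."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # also seeds CUDA generators in recent PyTorch
    torch.backends.cudnn.deterministic = True  # repeatable conv kernels
    torch.backends.cudnn.benchmark = False

set_seed(42)
```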
Dataset Configuration
MalImg Dataset:
- Total samples: 9,339
- Malware families: 25
- Image format: Grayscale
- Input size: 224×224 pixels
- Training: 70% (6,537 samples)
- Validation: 15% (1,401 samples)
- Test: 15% (1,401 samples)
- Stratification: Yes, maintaining class proportions
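The 70/15/15 stratified split can be obtained with two chained `train_test_split` calls; this sketch uses dummy data (1,000 samples, 40 per class) in place of the 9,339 MalImg images:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy stand-ins for the MalImg samples and their 25 family labels.
X = np.arange(1000).reshape(-1, 1)
y = np.repeat(np.arange(25), 40)

# First carve out 70% for training, stratified by family.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
# Split the remaining 30% evenly into validation and test (15% each overall),
# again stratified so class proportions are preserved.
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 700 150 150
```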
Class Imbalance: 5 families identified with fewer than 100 samples (minority classes), motivating the data augmentation evaluation in H2.
Experiment 1: Architecture Comparison (H1)
Objective
Verify hypothesis H1: “Pre-trained ResNet50 will outperform both custom CNN and Vision Transformer in accuracy and macro F1-score.”
Model Configurations
Four architectures were evaluated to compare different deep learning approaches:
1. Conventional CNN (Baseline)
- Architecture: 2 convolutional blocks (standard course configuration)
- Structure: Conv2D + MaxPool + Conv2D + MaxPool + Flatten + Dense
- Purpose: Establish baseline comparison
- Training time: 39 minutes
- Best epoch: 9
2. VGG-Mini-H1 (5 blocks)
- Architecture: 5 convolutional blocks (32→64→128→256→512 filters)
- Each block: Conv2D + BatchNorm + ReLU + MaxPool(2×2)
- Classifier: GlobalAvgPool + Dropout(0.5) + Dense(256) + Dropout(0.3) + Output
- Parameters: ~210,000 trainable
- Training time: 397 minutes
- Best epoch: 10
3. ResNet50 (Fine-tuning)
- Backbone: ResNet50 pre-trained on ImageNet
- Strategy: Partial fine-tuning (last 20-30 layers unfrozen)
- Classifier: Dropout(0.5) + Dense(512) + Output
- Learning rate: 0.0001 (lower than custom CNN)
- Optimizer: Adam
- LR Scheduler: Cosine Annealing
- Training time: 57 minutes
- Best epoch: 6
4. Vision Transformer (ViT-Small)
- Patch size: 16×16
- Embedding dimension: 384
- Depth: 12 transformer blocks
- Attention heads: 6
- MLP ratio: 4.0
- Dropout: 0.1
- Optimizer: AdamW
- Learning rate: 0.0001
- Weight decay: 0.01
- Training time: 76 minutes
- Best epoch: 10
Hyperparameters
| Parameter | CNN/Baseline | ResNet50 | ViT-Small |
|---|---|---|---|
| Optimizer | Adam | Adam | AdamW |
| Learning rate | 0.001 | 0.0001 | 0.0001 |
| LR Scheduler | ReduceLROnPlateau | Cosine Annealing | Cosine Annealing |
| Batch size | 32 | 32 | 32 |
| Max epochs | 100 | 100 | 100 |
| Early stopping | 15 epochs | 15 epochs | 20 epochs |
| Weight decay | 0.0001 | 0.0001 | 0.01 |
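The ResNet50 column of the table maps onto PyTorch as follows (the stand-in model and the empty epoch loop are placeholders; the commented-out line shows the ViT-Small variant):

```python
import torch
import torch.nn as nn

net = nn.Linear(10, 25)  # stand-in for any of the networks above

# ResNet50 column: Adam, lr=1e-4, weight decay 1e-4, cosine annealing
# over the 100-epoch budget.
optimizer = torch.optim.Adam(net.parameters(), lr=1e-4, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

# ViT-Small column would swap in AdamW with weight decay 0.01:
# optimizer = torch.optim.AdamW(net.parameters(), lr=1e-4, weight_decay=0.01)

for epoch in range(3):
    # ... per-batch forward/backward/optimizer.step() would go here ...
    scheduler.step()  # one scheduler step per epoch
```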
Experiment 2: Data Augmentation Impact (H2)
Objective
Verify hypothesis H2: “Data augmentation will improve minority class recall by ≥15 percentage points without substantially degrading global accuracy.”
Configuration
The best model from H1 (ResNet50) was used as the base architecture, trained in two versions:
- Without augmentation (baseline)
- With moderate augmentation (experimental condition)
Augmentation Techniques Applied
The augmentation strategy focuses on transformations that preserve the semantic meaning of malware binary representations.
Transformations:
- Orthogonal rotations: 90°, 180°, 270° (random selection)
  - Rationale: Preserves byte-to-pixel correspondence
- Horizontal flip: 50% probability
- Vertical flip: 50% probability
- Brightness adjustment: ±15%
  - Rationale: Maintains relative pixel relationships
- Contrast adjustment: ±15%
- Applied on-the-fly during training
- Random selection per sample per epoch
- No additional storage required
Target Metrics
- Primary: Recall improvement for 5 minority classes (families with <100 samples)
- Secondary: Impact on global accuracy (acceptable degradation: ≤2%)
- Tertiary: Macro F1-score to assess overall balanced performance
Experiment 3: CNN Depth Effect (H3)
Objective
Verify hypothesis H3: “Increasing CNN depth from 3 to 5 blocks will improve F1-score by ≥8 points, but with diminishing returns and higher computational cost.”
Architecture Comparison
Two custom CNN architectures were compared with varying depth:
Shallow CNN (3 blocks)
Structure:
- Block 1: Conv(32) + Conv(32) + MaxPool + Dropout(0.25)
- Block 2: Conv(64) + Conv(64) + MaxPool + Dropout(0.25)
- Block 3: Conv(128) + Conv(128) + MaxPool + Dropout(0.25)
- Classifier: Flatten + Dense(512) + Dropout(0.5) + Dense(256) + Output
Deep CNN (5 blocks) - H2_MOD.A
Structure:
- Block 1: Conv(32) + Conv(32) + MaxPool
- Block 2: Conv(64) + Conv(64) + MaxPool
- Block 3: Conv(128) + Conv(128) + MaxPool + Dropout(0.25)
- Block 4: Conv(256) + MaxPool
- Block 5: Conv(512) + MaxPool
- Classifier: Flatten + Dense(512) + Dropout(0.5) + Dense(256) + Output
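The two variants above can be sketched with a single depth-parameterized builder (kernel size 3, padding 1, and the ReLUs in the classifier are our assumptions; dropout placement follows each variant's block list):

```python
import torch
import torch.nn as nn

def conv_block(cin, cout, double=True, drop=0.0):
    """One Conv(+Conv)+MaxPool(+Dropout) block as listed above."""
    layers = [nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU()]
    if double:
        layers += [nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU()]
    layers.append(nn.MaxPool2d(2))
    if drop:
        layers.append(nn.Dropout(drop))
    return layers

def build_cnn(depth=5, num_classes=25):
    """depth=3: shallow variant; depth=5: deep variant (H2_MOD.A)."""
    if depth == 3:  # dropout after every block
        cfg = [(1, 32, True, 0.25), (32, 64, True, 0.25),
               (64, 128, True, 0.25)]
    else:           # dropout only after block 3
        cfg = [(1, 32, True, 0.0), (32, 64, True, 0.0),
               (64, 128, True, 0.25),
               (128, 256, False, 0.0), (256, 512, False, 0.0)]
    layers = []
    for cin, cout, double, drop in cfg:
        layers += conv_block(cin, cout, double, drop)
    # 224 px halves once per block: 28x28 (3 blocks) or 7x7 (5 blocks).
    spatial = 224 // (2 ** depth)
    layers += [
        nn.Flatten(),
        nn.Linear(cfg[-1][1] * spatial * spatial, 512), nn.ReLU(),
        nn.Dropout(0.5),
        nn.Linear(512, 256), nn.ReLU(),
        nn.Linear(256, num_classes),
    ]
    return nn.Sequential(*layers)
```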
Controlled Variables
Both models were trained with identical hyperparameters:
- Optimizer: Adam (lr=0.001)
- Epochs: 10
- Batch size: 32
- Loss function: Categorical Cross-Entropy
- Activation: ReLU
- No augmentation applied to isolate depth effect
Measured Aspects
- Performance Metrics:
  - Validation accuracy
  - Test accuracy
  - Macro F1-score
  - Train/validation loss
- Generalization Analysis:
  - Generalization gap (train accuracy - validation accuracy)
  - Train/Val loss ratio
  - Learning curve stability
- Computational Cost:
  - Training time per epoch
  - Total training time
  - GPU memory usage
  - FLOPs (floating-point operations)
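The two generalization indicators can be computed from per-run scalars; a small sketch (the helper name is ours, and we read “Train/Val loss ratio” literally as train loss over validation loss, so values well below 1.0 indicate overfitting):

```python
def generalization_metrics(train_acc, val_acc, train_loss, val_loss):
    """Overfitting indicators compared between the 3- and 5-block CNNs."""
    return {
        "generalization_gap": train_acc - val_acc,  # accuracy points
        "loss_ratio": train_loss / val_loss,        # train / validation
    }

m = generalization_metrics(train_acc=0.99, val_acc=0.95,
                           train_loss=0.05, val_loss=0.20)
```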
Experimental Design Rationale
The experiment follows the ceteris paribus principle (all else constant), where only network depth varies between conditions, allowing isolation of its specific effect on performance and efficiency.
Reproducibility Considerations
All experiments were designed for reproducibility:
Fixed Elements
- Random seed: 42
- Dataset splits saved and reused
- Hyperparameters documented
- Model architectures version-controlled
Tracked Metrics
- Training/validation loss per epoch
- Training/validation accuracy per epoch
- Precision, recall, F1-score (macro and weighted)
- Confusion matrices
- Learning curves
- Training time
- GPU memory consumption
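The precision/recall/F1 tracking (macro and weighted) and the confusion matrices map directly onto scikit-learn; a sketch with toy labels standing in for the 25-family predictions:

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, f1_score

# Toy labels standing in for test-set ground truth and predictions.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

macro_f1 = f1_score(y_true, y_pred, average="macro")
weighted_f1 = f1_score(y_true, y_pred, average="weighted")
cm = confusion_matrix(y_true, y_pred)                    # per-class errors
report = classification_report(y_true, y_pred, digits=3)  # per-class P/R/F1
```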
Code Organization
Data Collection
For each experiment, the following data was systematically collected:
Training Metrics
- Loss and accuracy curves (training and validation)
- Best epoch identification
- Early stopping triggers
- Learning rate adjustments
Evaluation Metrics
- Test set accuracy
- Per-class precision, recall, F1-score
- Confusion matrix
- Macro and weighted averages
Computational Metrics
- Training time (total and per epoch)
- Inference time
- GPU memory usage
- Model size (parameters and disk storage)
Qualitative Analysis
- Activation maps (Grad-CAM)
- Feature visualizations
- Misclassification analysis
- t-SNE embeddings of learned features
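The t-SNE projection of learned features can be sketched with scikit-learn (random features stand in for penultimate-layer activations; the 512-dimensional feature size and perplexity of 30 are assumptions):

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for penultimate-layer features extracted from test samples.
rng = np.random.default_rng(42)
features = rng.normal(size=(100, 512))

# Project to 2-D for visualization of per-family clusters.
embedding = TSNE(n_components=2, perplexity=30, init="pca",
                 random_state=42).fit_transform(features)
```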
All experimental results, including detailed tables, figures, and statistical analysis, are presented in the Results section.