Results Overview
This section presents comprehensive experimental results for all three hypotheses, including quantitative performance metrics, qualitative analysis, and hypothesis verification.
Experiment 1 Results: Architecture Comparison (H1)
Performance Summary
| Model | Test Accuracy | Macro F1 | Best Epoch | Training Time | Expected |
|---|---|---|---|---|---|
| Conventional CNN (Baseline) | 72.39% | 74.01% | 9 | 39 min | — |
| VGG-Mini-H1 (5 blocks) | 61.30% | N/A | 10 | 397 min | ≥93% |
| ViT-Small | 74.92% | 76.48% | 10 | 76 min | ≥91% |
| ResNet50 (Fine-tuned) | 96.30% | 95.35% | 6 | 57 min | ≥96% |
Key Finding: ResNet50 achieved 96.30% accuracy, meeting the expected threshold and significantly outperforming all other architectures. It also converged fastest (6 epochs) and trained efficiently (57 minutes).
Detailed ResNet50 Metrics
The best-performing model (ResNet50 Fine-tuned) achieved:
| Metric | Training | Validation |
|---|---|---|
| Loss (final) | 0.0378 | 0.1972 |
| Accuracy | 98.91% | 97.85% |
| Precision (macro) | — | 95.23% |
| Recall (macro) | — | 95.55% |
| F1-Score (macro) | 98.93% | 95.35% |
- Train-validation accuracy gap: 1.06% (excellent generalization)
- Loss ratio (val/train): 5.22 (acceptable, controlled overfitting)
Analysis by Architecture
ResNet50: Transfer Learning Superiority
ResNet50 with fine-tuning achieved the best performance, with 96.30% accuracy and 95.35% macro F1, confirming the effectiveness of transfer learning for moderate-sized datasets.
Key Observations:
- Fast convergence: Reached best performance in only 6 epochs, significantly faster than other architectures
- Controlled overfitting: Gap between train (98.91%) and validation (97.85%) accuracy was only 1.06%, indicating good generalization
- Effective transfer: Low-level features learned on ImageNet (edges, textures, patterns) proved transferable to the malware image domain
- Efficiency: With 57 minutes training time, more efficient than ViT (76 min) and dramatically more efficient than 5-block CNN (397 min)
These results suggest that pre-trained models:
- Require fewer epochs to converge
- Learn domain-specific features more efficiently
- Achieve superior generalization despite limited data
ViT-Small: Limitations with Small Datasets
Vision Transformer achieved 74.92% accuracy, well below the expected threshold of 91%. This result aligns with literature indicating that transformers require significantly larger datasets.
Key Observations:
- Insufficient data: MalImg contains ~9,300 samples, far below the millions typically required to train transformers from scratch
- Lack of pre-training: Unlike ResNet50, ViT-Small was trained from scratch without leveraging prior knowledge
- Attention mechanism complexity: Transformers have more parameters to optimize, making learning difficult with limited data
VGG-Mini-H1 (5 blocks): Unexpected Poor Performance
The most surprising result was the poor performance of the 5-block CNN (61.30%), significantly lower than the conventional baseline (72.39%).
Key Observations:
- Oversized architecture: 5 convolutional blocks with progression 32→64→128→256→512 filters proved excessive for the dataset
- Optimization difficulty: Extreme training time (397 minutes) suggests convergence problems
- Possible vanishing gradients: Depth without residual connections may have hindered gradient flow
- Important lesson: More depth doesn’t guarantee better performance; architecture should be proportional to dataset size and complexity
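A back-of-the-envelope count of the 3×3 convolution weights implied by the 32→64→128→256→512 filter progression shows how quickly depth inflates parameters (one convolution per block and a grayscale input are simplifying assumptions; the real model may differ).

```python
# Per-block parameter count for 3x3 convolutions following the
# 32 -> 64 -> 128 -> 256 -> 512 progression (grayscale input assumed).
# params per conv = k*k*c_in*c_out + c_out (weights + biases).
channels = [1, 32, 64, 128, 256, 512]
k = 3
per_block = [k * k * cin + cout * 0 + k * k * cin * cout + cout - k * k * cin
             for cin, cout in zip(channels, channels[1:])]
per_block = [k * k * cin * cout + cout
             for cin, cout in zip(channels, channels[1:])]
total = sum(per_block)
print(per_block)  # parameters concentrate heavily in the deepest blocks
print(total)
```

Even with this minimal one-conv-per-block assumption, the last two blocks dominate the parameter budget, which is a lot of capacity to fit from ~9,300 samples.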
Conventional CNN: Simple but Effective Baseline
The conventional model (JorgeNet) with 72.39% accuracy provides an important reference point:
- Demonstrates that simple, well-designed architectures can be competitive
- Outperformed the 5-block CNN, showing that simplicity can be advantageous
- The 24 percentage point gap with ResNet50 quantifies the value of transfer learning
- Fast training (39 minutes) makes it suitable for rapid prototyping
Hypothesis H1 Verification
Status: ✅ CONFIRMED
ResNet50 achieved 96.30% accuracy (exceeding the 96% threshold) and 95.35% macro F1, significantly outperforming the custom CNN (72.39%) and ViT-Small (74.92%). Transfer learning proved superior for malware classification on moderate-sized datasets.
Experiment 2 Results: Data Augmentation Impact (H2)
Global Metrics Comparison
The experiment compared ResNet50 performance with and without data augmentation, focusing on minority class recall improvement.
- Minority class recall improvement: +17.2 percentage points
- Global accuracy impact: -0.4% (negligible)
- Verification: Hypothesis H2 confirmed
Expected vs. Actual Results
| Metric | Threshold | Achieved | Status |
|---|---|---|---|
| Minority recall increase | ≥15 pp | +17.2 pp | ✅ Exceeded |
| Global accuracy degradation | ≤2% | -0.4% | ✅ Within limit |
| Overall hypothesis | — | — | ✅ Confirmed |
Impact Analysis
Minority Class Benefits
Data augmentation significantly improved recall for underrepresented families.
Average Improvement:
- Minority class recall increased from ~61% to ~78% (+17.2 pp)
- All 5 minority classes improved by ≥15 percentage points
- Most benefited classes: Lolyda.AA 3, Malex.gen!J (smallest families)
Augmentation appears to help the model:
- Learn more robust features for underrepresented families
- Reduce overfitting to limited minority samples
- Generalize better to minority-class test samples
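A minimal sketch of oversampling a minority-class image with random transforms. The exact transforms used in the experiment are not specified in this section; flips and 90-degree rotations are assumptions for illustration.

```python
import numpy as np

def augment(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Return a randomly transformed copy of a grayscale malware image."""
    out = img
    if rng.random() < 0.5:
        out = np.fliplr(out)                   # random horizontal flip
    out = np.rot90(out, k=int(rng.integers(0, 4)))  # random 90-degree turn
    return out.copy()

rng = np.random.default_rng(42)
img = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)  # toy byteplot
batch = [augment(img, rng) for _ in range(8)]  # oversample one minority sample
```

For byteplot-style malware images, transforms that preserve local byte structure are generally preferable; aggressive geometric distortion could destroy family-specific texture.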
Global Performance Trade-off
The impact on overall accuracy was minimal.
Trade-off Analysis:
- Global accuracy: 96.2% → 95.8% (-0.4%)
- Loss of 0.4% is negligible compared to +17.2 pp minority recall gain
- Macro F1-score improved due to better class balance
- Equity (minority recall): +17.2 pp
- Global performance cost: -0.4%
- Ratio: ~43:1 benefit-to-cost
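The benefit-to-cost ratio quoted above follows directly from the two numbers:

```python
minority_recall_gain_pp = 17.2  # minority-class recall improvement
global_accuracy_cost_pp = 0.4   # global accuracy drop (96.2% -> 95.8%)

ratio = minority_recall_gain_pp / global_accuracy_cost_pp
print(round(ratio))  # ~43:1 benefit-to-cost
```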
Hypothesis H2 Verification
Status: ✅ CONFIRMED
Data augmentation improved minority-class recall by +17.2 pp (exceeding the +15 pp threshold) while global accuracy decreased by only 0.4% (well within the 2% limit). The favorable trade-off validates augmentation as an effective strategy for addressing class imbalance.
Experiment 3 Results: CNN Depth Effect (H3)
Performance Comparison
| Architecture | Val Accuracy | Test Accuracy | Macro F1 | Train Time | Parameters |
|---|---|---|---|---|---|
| H2_MOD.A (9 layers) | 85.29% | N/A | N/A | 33m 48s | ~210,000 |
| 12-layer CNN | 83.45% | 85.58% | 83.54% | ~45m | ~280,000 (+33%) |
Detailed Metrics Analysis
| Category | Metric | 9 Layers | 12 Layers | Interpretation |
|---|---|---|---|---|
| Accuracy | Val Accuracy | 85.29% | 83.45% | Decreased with depth |
| | Test Accuracy | N/A | 85.58% | Moderate generalization |
| Loss | Val Loss | 0.3677 | 0.4095 | Higher residual error |
| | Train Loss | 0.2644 | 0.3061 | Optimization difficulty |
| Generalization | Gap (pp) | 2.86 | 4.13 | Increased overfitting |
| | Val/Train Loss Ratio | 1.39 | 1.33 | Lower stability |
| Metrics | Macro F1 | N/A | 83.54% | Unbalanced performance |
| | Weighted F1 | N/A | 85.54% | Dominated by majority |
| Efficiency | Training Time | 33m 48s | ~45m | +33% cost |
Key Finding: Increasing depth from 9 to 12 layers degraded validation accuracy (-1.84 pp), increased training time (+33%), and worsened the generalization gap (+44.4%), demonstrating diminishing returns with increased depth for this dataset size.
Analysis by Hypothesis Component
Performance Improvement (Expected but NOT Achieved)
Expected: F1-score improvement of +8 percentage points
Actual: Validation accuracy decreased by 1.84 pp
Analysis:
- Deeper model (12 layers) actually performed worse than shallower model (9 layers)
- The dataset size (~9,300 samples) may be insufficient to benefit from very deep architectures
- More parameters (+33%) did not translate to better performance
- Negative marginal return: roughly -0.026 pp per 1,000 additional parameters (-1.84 pp over ~70,000 extra parameters)
- Overfitting: Generalization gap increased from 2.86 pp to 4.13 pp (+44.4%)
- Vanishing gradients: Deeper network without residual connections struggled with gradient flow
- Co-adaptation: More parameters led to feature co-adaptation with poor generalization
- Insufficient regularization: Dropout alone insufficient for very deep networks
Diminishing Returns (CONFIRMED)
Evidence:
- 33% increase in parameters yielded negative performance return
- Loss increased, accuracy decreased
- Optimal depth appears to be around 9 layers for this configuration
- Parameter increase: +70,000 (+33%)
- Accuracy change: -1.84 pp
- Efficiency: Negative marginal productivity
Computational Cost (CONFIRMED)
Expected: ~40% increase in training time
Actual: +33% training time, +38% memory, +35% FLOPs
Computational Analysis:
- Forward Pass FLOPs: ~85 MFLOPs → ~115 MFLOPs (+35%)
- Memory Usage: ~45 MB → ~62 MB (+38%)
- Time per Epoch: 3.38 min → 4.57 min (+35%)
- Total Training Time: 33m 48s → ~45m (+33%)
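The relative-cost percentages follow from the absolute measurements; illustrative arithmetic only:

```python
def pct_increase(before: float, after: float) -> float:
    """Relative increase in percent, rounded to one decimal."""
    return round((after - before) / before * 100, 1)

flops = pct_increase(85, 115)          # MFLOPs per forward pass, ~+35%
epoch_time = pct_increase(3.38, 4.57)  # minutes per epoch, ~+35%
total_time = pct_increase(33.8, 45)    # total minutes (33m 48s = 33.8m), ~+33%
```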
Generalization Gap Analysis
The generalization gap increased by 44.4% (2.86 pp → 4.13 pp), attributable to:
- Unfavorable parameter-to-data ratio: More parameters relative to training samples
- Partial gradient vanishing: Without residual connections, deeper networks struggle
- Feature co-adaptation: More layers can lead to co-adapted features with poor generalization
Learning Curve Observations
- 9-layer model: Stable convergence, consistent validation performance
- 12-layer model: Higher volatility, irregular validation loss curves, signs of optimization difficulty
Hypothesis H3 Verification
Status: ⚠️ PARTIALLY CONFIRMED
- ✅ Diminishing returns: Confirmed (negative returns observed)
- ✅ Computational cost increase: Confirmed (+33% time, aligned with ~40% expectation)
- ❌ Performance improvement: Rejected (accuracy decreased instead of improving +8 pp)
Cross-Experiment Insights
Key Findings Summary
- Transfer learning is superior: ResNet50 (96.30%) dramatically outperformed custom architectures, validating pre-training value
- Data augmentation effective for imbalance: +17.2 pp minority recall with only -0.4% global accuracy cost demonstrates effective imbalance mitigation
- Depth has limits: Simply increasing network depth without architectural innovations (like residual connections) can harm performance
- Dataset size matters: ~9,300 samples insufficient for very deep networks (5-block CNN, ViT) but adequate for transfer learning
Discriminative Features
Learned Feature Analysis (Grad-CAM)
Visualizations using Grad-CAM showed that models focus on:
- Dense code regions: .text section containing characteristic instructions for each family
- Resource sections: Import tables and data sections varying between families
- Structural patterns: Models learn to ignore padding regions (uniform areas), indicating learned features are relevant, not noise
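The Grad-CAM procedure behind these visualizations can be sketched minimally in PyTorch. A toy CNN stands in for the actual models, and the hook-free plumbing here (returning the feature map from `forward`) is a simplification; the point is only the core computation: weight each feature map by its spatially pooled gradient, sum, and rectify.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCNN(nn.Module):
    """Toy stand-in: conv features + global-average-pool classifier."""
    def __init__(self, n_classes=25):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(16, n_classes)

    def forward(self, x):
        fmap = self.features(x)                    # (N, C, H, W)
        logits = self.head(fmap.mean(dim=(2, 3)))  # GAP then linear
        return logits, fmap

def grad_cam(model, x, cls):
    """Class-activation map for class `cls` on a single input."""
    model.eval()
    logits, fmap = model(x)
    fmap.retain_grad()                 # keep gradients of the feature map
    logits[0, cls].backward()          # backprop the target-class score
    weights = fmap.grad.mean(dim=(2, 3), keepdim=True)  # pooled gradients
    cam = F.relu((weights * fmap).sum(dim=1))           # weighted sum, ReLU
    return cam / (cam.max() + 1e-8)    # normalize to [0, 1]
```

High values in the resulting map mark regions (e.g. dense `.text`-like areas) that drove the class score; near-zero values correspond to ignored regions such as padding.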
Performance Comparison Across All Experiments
Best Configuration
Optimal Model: ResNet50 Fine-tuned with Data Augmentation
- Test Accuracy: ~95.8%
- Macro F1-Score: ~95.0%
- Minority Class Recall: +17.2 pp improvement
- Training Time: 57 minutes
- Convergence: 6 epochs
Architecture Ranking
- ResNet50 (fine-tuned): 96.30% - Transfer learning winner
- ViT-Small: 74.92% - Limited by dataset size
- Conventional CNN: 72.39% - Simple but effective
- VGG-Mini-H1 (5 blocks): 61.30% - Oversized for dataset
Practical Recommendations
Based on experimental results:
For Similar Projects:
- Use transfer learning (ResNet50, EfficientNet) for datasets <100k samples
- Apply moderate data augmentation to address class imbalance
- Avoid very deep custom CNNs without residual connections
- Start with simpler architectures and increase complexity only if justified by data scale
- Vision Transformers require datasets with millions of samples for competitive performance
Statistical Significance
All results are based on:
- Fixed train/validation/test splits (stratified)
- Fixed random seed (42) for reproducibility
- Early stopping to prevent overfitting
- Multiple metrics (accuracy, precision, recall, F1) for robust evaluation
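The fixed, seeded, stratified splitting protocol can be sketched with scikit-learn. The 80/10/10 ratio and the toy data are assumptions for illustration; only the stratification and the fixed seed (42) come from the protocol above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the dataset: 1000 samples with imbalanced labels.
rng = np.random.default_rng(42)
y = rng.choice([0, 1, 2], size=1000, p=[0.7, 0.2, 0.1])
X = rng.random((1000, 8))

# Stratified, seeded splits (80/10/10 is an assumed ratio).
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_val, X_te, y_val, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)
```

Stratification keeps each family's proportion nearly identical across splits, which matters for the minority-recall comparisons in Experiment 2.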