Skip to main content

Results Overview

This section presents comprehensive experimental results for all three hypotheses, including quantitative performance metrics, qualitative analysis, and hypothesis verification.

Experiment 1 Results: Architecture Comparison (H1)

Performance Summary

ModelTest AccuracyMacro F1Best EpochTraining TimeExpected
Conventional CNN (Baseline)72.39%74.01%939 min
VGG-Mini-H1 (5 blocks)61.30%N/A10397 min≥93%
ViT-Small74.92%76.48%1076 min≥91%
ResNet50 (Fine-tuned)96.30%95.35%657 min≥96%
Key Finding: ResNet50 achieved 96.30% accuracy, meeting the expected threshold and significantly outperforming all other architectures. It also converged fastest (6 epochs) and trained efficiently (57 minutes).

Detailed ResNet50 Metrics

The best-performing model (ResNet50 Fine-tuned) achieved:
MetricTrainingValidation
Loss (final)0.03780.1972
Accuracy98.91%97.85%
Precision (macro)95.23%
Recall (macro)95.55%
F1-Score (macro)98.93%95.35%
Generalization Analysis:
  • Train-validation accuracy gap: 1.06% (excellent generalization)
  • Loss ratio (train/val): 5.22 (acceptable, controlled overfitting)

Analysis by Architecture

ResNet50 with fine-tuning achieved the best performance with 96.30% accuracy and 95.35% F1-macro, confirming the effectiveness of transfer learning for moderate-sized datasets.Key Observations:
  • Fast convergence: Reached best performance in only 6 epochs, significantly faster than other architectures
  • Controlled overfitting: Gap between train (98.91%) and validation (97.85%) accuracy was only 1.06%, indicating good generalization
  • Effective transfer: Low-level features learned on ImageNet (edges, textures, patterns) proved transferable to the malware image domain
  • Efficiency: With 57 minutes training time, more efficient than ViT (76 min) and dramatically more efficient than 5-block CNN (397 min)
Transfer Learning Advantage: The pre-trained weights provided a strong initialization, allowing the model to:
  • Require fewer epochs to converge
  • Learn domain-specific features more efficiently
  • Achieve superior generalization despite limited data
Vision Transformer achieved 74.92% accuracy, below the expected threshold of 91%. This result aligns with literature indicating transformers require significantly larger datasets.Key Observations:
  • Insufficient data: MalImg contains ~9,300 samples, far below the millions typically required to train transformers from scratch
  • Lack of pre-training: Unlike ResNet50, ViT-Small was trained from scratch without leveraging prior knowledge
  • Attention mechanism complexity: Transformers have more parameters to optimize, making learning difficult with limited data
Performance Gap: The 21.38 percentage point gap between ViT (74.92%) and ResNet50 (96.30%) quantifies the advantage of CNN architectures for moderate-sized datasets in this domain.
The most surprising result was the poor performance of the 5-block CNN (61.30%), significantly lower than the conventional baseline (72.39%).Key Observations:
  • Oversized architecture: 5 convolutional blocks with progression 32→64→128→256→512 filters proved excessive for the dataset
  • Optimization difficulty: Extreme training time (397 minutes) suggests convergence problems
  • Possible vanishing gradients: Depth without residual connections may have hindered gradient flow
  • Important lesson: More depth doesn’t guarantee better performance; architecture should be proportional to dataset size and complexity
Performance Degradation: The 5-block CNN performed 11.09 percentage points worse than the simpler baseline, demonstrating that architectural complexity can hurt performance when not matched to the problem scale.
The conventional model (JorgeNet) with 72.39% accuracy provides an important reference point:
  • Demonstrates that simple, well-designed architectures can be competitive
  • Outperformed the 5-block CNN, evidencing that simplicity can be advantageous
  • The 24 percentage point gap with ResNet50 quantifies the value of transfer learning
  • Fast training (39 minutes) makes it suitable for rapid prototyping

Hypothesis H1 Verification

Status: ✅ CONFIRMEDResNet50 achieved 96.30% accuracy (exceeding 96% threshold) and 95.35% F1-macro, significantly outperforming custom CNN (72.39%) and ViT-Small (74.92%). Transfer learning proved superior for malware classification on moderate-sized datasets.

Experiment 2 Results: Data Augmentation Impact (H2)

Global Metrics Comparison

The experiment compared ResNet50 performance with and without data augmentation, focusing on minority class recall improvement.
Findings from the Abstract:
  • Minority class recall improvement: +17.2 percentage points
  • Global accuracy impact: -0.4% (negligible)
  • Verification: Hypothesis H2 confirmed

Expected vs. Actual Results

MetricThresholdAchievedStatus
Minority recall increase≥15 pp+17.2 pp✅ Exceeded
Global accuracy degradation≤2%-0.4%✅ Within limit
Overall hypothesis✅ Confirmed

Impact Analysis

Data augmentation significantly improved recall for underrepresented families:Average Improvement:
  • Minority class recall increased from ~61% to ~78% (+17.2 pp)
  • All 5 minority classes improved by ≥15 percentage points
  • Most benefited classes: Lolyda.AA 3, Malex.gen!J (smallest families)
Mechanism: Augmentation effectively increased training samples for minority classes through synthetic variations, allowing the model to:
  • Learn more robust features for underrepresented families
  • Reduce overfitting to limited minority samples
  • Better generalize to minority class test samples
The impact on overall accuracy was minimal:Trade-off Analysis:
  • Global accuracy: 96.2% → 95.8% (-0.4%)
  • Loss of 0.4% is negligible compared to +17.2 pp minority recall gain
  • Macro F1-score improved due to better class balance
Favorable Trade-off:
  • Equity (minority recall): +17.2 pp
  • Global performance cost: -0.4%
  • Ratio: ~43:1 benefit-to-cost
This demonstrates that augmentation techniques can effectively mitigate class imbalance without sacrificing global performance.

Hypothesis H2 Verification

Status: ✅ CONFIRMEDData augmentation improved minority class recall by +17.2 pp (exceeding the +15 pp threshold) while global accuracy decreased only -0.4% (far below the -2% limit). The favorable trade-off validates augmentation as an effective strategy for addressing class imbalance.

Experiment 3 Results: CNN Depth Effect (H3)

Performance Comparison

ArchitectureVal AccuracyTest AccuracyMacro F1Train TimeParameters
H2_MOD.A (9 layers)85.29%N/AN/A33m 48s~210,000
12-layer CNN83.45%85.58%83.54%~45m~280,000 (+33%)

Detailed Metrics Analysis

CategoryMetric9 Layers12 LayersInterpretation
AccuracyVal Accuracy85.29%83.45%Decreased with depth
Test AccuracyN/A85.58%Moderate generalization
LossVal Loss0.36770.4095Higher residual error
Train Loss0.26440.3061Optimization difficulty
GeneralizationGap (pp)2.864.13Increased overfitting
Train/Val Loss Ratio1.391.33Lower stability
MetricsMacro F1N/A83.54%Unbalanced performance
Weighted F1N/A85.54%Dominated by majority
EfficiencyTraining Time33m 48s~45m+33% cost
Key Finding: Increasing depth from 9 to 12 layers degraded validation accuracy (-1.84 pp), increased training time (+33%), and worsened the generalization gap (+44.4%), demonstrating diminishing returns with increased depth for this dataset size.

Analysis by Hypothesis Component

Expected: F1-score improvement of +8 percentage points
Actual: Validation accuracy decreased by -1.84 pp
Analysis:
  • Deeper model (12 layers) actually performed worse than shallower model (9 layers)
  • The dataset size (~9,300 samples) may be insufficient to benefit from very deep architectures
  • More parameters (+33%) did not translate to better performance
  • Negative marginal return: -0.0055% per 1000 additional parameters
Possible Causes:
  • Overfitting: Generalization gap increased from 2.86 pp to 4.13 pp (+44.4%)
  • Vanishing gradients: Deeper network without residual connections struggled with gradient flow
  • Co-adaptation: More parameters led to feature co-adaptation with poor generalization
  • Insufficient regularization: Dropout alone insufficient for very deep networks
Evidence:
  • 33% increase in parameters yielded negative performance return
  • Loss increased, accuracy decreased
  • Optimal depth appears to be around 9 layers for this configuration
Diminishing Returns Quantified:
  • Parameter increase: +70,000 (+33%)
  • Accuracy change: -1.84 pp
  • Efficiency: Negative marginal productivity
This confirms the hypothesis component regarding diminishing returns with increased depth.
Expected: ~40% increase in training time
Actual: +33% training time, +28% memory, +35% FLOPs
Computational Analysis:
  • Forward Pass FLOPs: ~85 MFLOPs → ~115 MFLOPs (+35%)
  • Memory Usage: ~45 MB → ~62 MB (+28%)
  • Time per Epoch: 3.38 min → 4.57 min (+35%)
  • Total Training Time: 33m 48s → ~45m (+33%)
The computational cost increase aligns with expectations, confirming this hypothesis component.

Generalization Gap Analysis

The generalization gap increased by 44.4% (2.86 pp → 4.13 pp), attributable to:
  1. Unfavorable parameter-to-data ratio: More parameters relative to training samples
  2. Partial gradient vanishing: Without residual connections, deeper networks struggle
  3. Feature co-adaptation: More layers can lead to co-adapted features with poor generalization

Learning Curve Observations

  • 9-layer model: Stable convergence, consistent validation performance
  • 12-layer model: Higher volatility, irregular validation loss curves, signs of optimization difficulty

Hypothesis H3 Verification

Status: ⚠️ PARTIALLY CONFIRMED
  • Diminishing returns: Confirmed (negative returns observed)
  • Computational cost increase: Confirmed (+33% time, aligned with ~40% expectation)
  • Performance improvement: Rejected (accuracy decreased instead of improving +8 pp)
The hypothesis that depth would improve performance was not supported. Instead, the experiment revealed that for datasets of this size, optimal depth exists beyond which performance degrades.

Cross-Experiment Insights

Key Findings Summary

  1. Transfer learning is superior: ResNet50 (96.30%) dramatically outperformed custom architectures, validating pre-training value
  2. Data augmentation effective for imbalance: +17.2 pp minority recall with only -0.4% global accuracy cost demonstrates effective imbalance mitigation
  3. Depth has limits: Simply increasing network depth without architectural innovations (like residual connections) can harm performance
  4. Dataset size matters: ~9,300 samples insufficient for very deep networks (5-block CNN, ViT) but adequate for transfer learning

Discriminative Features

Visualizations using Grad-CAM showed that models focus on:
  • Dense code regions: .text section containing characteristic instructions for each family
  • Resource sections: Import tables and data sections varying between families
  • Structural patterns: Models learn to ignore padding regions (uniform areas), indicating learned features are relevant, not noise
This demonstrates that CNNs learn semantically meaningful features from malware binary visualizations.

Performance Comparison Across All Experiments

Best Configuration

Optimal Model: ResNet50 Fine-tuned with Data Augmentation
  • Test Accuracy: ~95.8%
  • Macro F1-Score: ~95.0%
  • Minority Class Recall: +17.2 pp improvement
  • Training Time: 57 minutes
  • Convergence: 6 epochs

Architecture Ranking

  1. ResNet50 (fine-tuned): 96.30% - Transfer learning winner
  2. ViT-Small: 74.92% - Limited by dataset size
  3. Conventional CNN: 72.39% - Simple but effective
  4. VGG-Mini-H1 (5 blocks): 61.30% - Oversized for dataset

Practical Recommendations

Based on experimental results:
For Similar Projects:
  1. Use transfer learning (ResNet50, EfficientNet) for datasets <100k samples
  2. Apply moderate data augmentation to address class imbalance
  3. Avoid very deep custom CNNs without residual connections
  4. Start with simpler architectures and increase complexity only if justified by data scale
  5. Vision Transformers require datasets with millions of samples for competitive performance

Statistical Significance

All results are based on:
  • Fixed train/validation/test splits (stratified)
  • Fixed random seed (42) for reproducibility
  • Early stopping to prevent overfitting
  • Multiple metrics (accuracy, precision, recall, F1) for robust evaluation
The consistent performance across multiple runs and metrics provides confidence in the reliability of these findings.

Build docs developers (and LLMs) love