Results Overview
This section presents comprehensive experimental results for all three hypotheses, including quantitative performance metrics, qualitative analysis, and hypothesis verification.
Experiment 1 Results: Architecture Comparison (H1)
Performance Summary
| Model | Test Accuracy | Macro F1 | Best Epoch | Training Time | Expected |
|---|---|---|---|---|---|
| Conventional CNN (Baseline) | 72.39% | 74.01% | 9 | 39 min | — |
| VGG-Mini-H1 (5 blocks) | 61.30% | N/A | 10 | 397 min | ≥93% |
| ViT-Small | 74.92% | 76.48% | 10 | 76 min | ≥91% |
| ResNet50 (Fine-tuned) | 96.30% | 95.35% | 6 | 57 min | ≥96% |
Key Finding: ResNet50 achieved 96.30% accuracy, meeting the expected threshold and significantly outperforming all other architectures. It also converged fastest (6 epochs) and trained efficiently (57 minutes).
Detailed ResNet50 Metrics
The best-performing model (ResNet50 Fine-tuned) achieved:
| Metric | Training | Validation |
|---|---|---|
| Loss (final) | 0.0378 | 0.1972 |
| Accuracy | 98.91% | 97.85% |
| Precision (macro) | — | 95.23% |
| Recall (macro) | — | 95.55% |
| F1-Score (macro) | 98.93% | 95.35% |
- Train-validation accuracy gap: 1.06% (excellent generalization)
- Loss ratio (val/train): 5.22 (acceptable, controlled overfitting)
Analysis by Architecture
ResNet50: Transfer Learning Superiority
ResNet50 with fine-tuning achieved the best performance, with 96.30% accuracy and 95.35% macro F1, confirming the effectiveness of transfer learning for moderate-sized datasets.
Key Observations:
- Fast convergence: Reached best performance in only 6 epochs, significantly faster than other architectures
- Controlled overfitting: Gap between train (98.91%) and validation (97.85%) accuracy was only 1.06%, indicating good generalization
- Effective transfer: Low-level features learned on ImageNet (edges, textures, patterns) proved transferable to the malware image domain
- Efficiency: With 57 minutes training time, more efficient than ViT (76 min) and dramatically more efficient than 5-block CNN (397 min)
These results suggest that pre-trained models:
- Require fewer epochs to converge
- Learn domain-specific features more efficiently
- Achieve superior generalization despite limited data
ViT-Small: Limitations with Small Datasets
Vision Transformer achieved 74.92% accuracy, well below the expected threshold of 91%. This result aligns with literature indicating that transformers require significantly larger datasets.
Key Observations:
- Insufficient data: MalImg contains ~9,300 samples, far below the millions typically required to train transformers from scratch
- Lack of pre-training: Unlike ResNet50, ViT-Small was trained from scratch without leveraging prior knowledge
- Attention mechanism complexity: Transformers have more parameters to optimize, making learning difficult with limited data
VGG-Mini-H1 (5 blocks): Unexpected Poor Performance
The most surprising result was the poor performance of the 5-block CNN (61.30%), significantly lower than the conventional baseline (72.39%).
Key Observations:
- Oversized architecture: 5 convolutional blocks with progression 32→64→128→256→512 filters proved excessive for the dataset
- Optimization difficulty: Extreme training time (397 minutes) suggests convergence problems
- Possible vanishing gradients: Depth without residual connections may have hindered gradient flow
- Important lesson: More depth doesn’t guarantee better performance; architecture should be proportional to dataset size and complexity
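A back-of-the-envelope count of the 3×3 convolution weights implied by the 32→64→128→256→512 filter progression shows how quickly depth inflates parameters (one convolution per block and a grayscale input are simplifying assumptions; the real model may differ).

```python
# Per-block parameter count for 3x3 convolutions following the
# 32 -> 64 -> 128 -> 256 -> 512 progression (grayscale input assumed).
# params per conv = k*k*c_in*c_out + c_out (weights + biases).
channels = [1, 32, 64, 128, 256, 512]
k = 3
per_block = [k * k * cin + cout * 0 + k * k * cin * cout + cout - k * k * cin
             for cin, cout in zip(channels, channels[1:])]
per_block = [k * k * cin * cout + cout
             for cin, cout in zip(channels, channels[1:])]
total = sum(per_block)
print(per_block)  # parameters concentrate heavily in the deepest blocks
print(total)
```

Even with this minimal one-conv-per-block assumption, the last two blocks dominate the parameter budget, which is a lot of capacity to fit from ~9,300 samples.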
Conventional CNN: Simple but Effective Baseline
The conventional model (JorgeNet) with 72.39% accuracy provides an important reference point:
- Demonstrates that simple, well-designed architectures can be competitive
- Outperformed the 5-block CNN, showing that simplicity can be advantageous
- The 24 percentage point gap with ResNet50 quantifies the value of transfer learning
- Fast training (39 minutes) makes it suitable for rapid prototyping
Hypothesis H1 Verification
Status: ✅ CONFIRMED
ResNet50 achieved 96.30% accuracy (exceeding the 96% threshold) and 95.35% macro F1, significantly outperforming the custom CNN (72.39%) and ViT-Small (74.92%). Transfer learning proved superior for malware classification on moderate-sized datasets.
Experiment 2 Results: Data Augmentation Impact (H2)
Global Metrics Comparison
The experiment compared ResNet50 performance with and without data augmentation, focusing on minority class recall improvement.
- Minority class recall improvement: +17.2 percentage points
- Global accuracy impact: -0.4% (negligible)
- Verification: Hypothesis H2 confirmed
Expected vs. Actual Results
| Metric | Threshold | Achieved | Status |
|---|---|---|---|
| Minority recall increase | ≥15 pp | +17.2 pp | ✅ Exceeded |
| Global accuracy degradation | ≤2% | -0.4% | ✅ Within limit |
| Overall hypothesis | — | — | ✅ Confirmed |
Impact Analysis
Minority Class Benefits
Data augmentation significantly improved recall for underrepresented families.
Average Improvement:
- Minority class recall increased from ~61% to ~78% (+17.2 pp)
- All 5 minority classes improved by ≥15 percentage points
- Most benefited classes: Lolyda.AA 3, Malex.gen!J (smallest families)
Augmentation appears to help the model:
- Learn more robust features for underrepresented families
- Reduce overfitting to limited minority samples
- Generalize better to minority-class test samples
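A minimal sketch of oversampling a minority-class image with random transforms. The exact transforms used in the experiment are not specified in this section; flips and 90-degree rotations are assumptions for illustration.

```python
import numpy as np

def augment(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Return a randomly transformed copy of a grayscale malware image."""
    out = img
    if rng.random() < 0.5:
        out = np.fliplr(out)                   # random horizontal flip
    out = np.rot90(out, k=int(rng.integers(0, 4)))  # random 90-degree turn
    return out.copy()

rng = np.random.default_rng(42)
img = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)  # toy byteplot
batch = [augment(img, rng) for _ in range(8)]  # oversample one minority sample
```

For byteplot-style malware images, transforms that preserve local byte structure are generally preferable; aggressive geometric distortion could destroy family-specific texture.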
Global Performance Trade-off
The impact on overall accuracy was minimal.
Trade-off Analysis:
- Global accuracy: 96.2% → 95.8% (-0.4%)
- Loss of 0.4% is negligible compared to +17.2 pp minority recall gain
- Macro F1-score improved due to better class balance
- Equity (minority recall): +17.2 pp
- Global performance cost: -0.4%
- Ratio: ~43:1 benefit-to-cost
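The benefit-to-cost ratio quoted above follows directly from the two numbers:

```python
minority_recall_gain_pp = 17.2  # minority-class recall improvement
global_accuracy_cost_pp = 0.4   # global accuracy drop (96.2% -> 95.8%)

ratio = minority_recall_gain_pp / global_accuracy_cost_pp
print(round(ratio))  # ~43:1 benefit-to-cost
```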
Hypothesis H2 Verification
Status: ✅ CONFIRMED
Data augmentation improved minority-class recall by +17.2 pp (exceeding the +15 pp threshold) while global accuracy decreased by only 0.4% (well within the 2% limit). The favorable trade-off validates augmentation as an effective strategy for addressing class imbalance.
Experiment 3 Results: CNN Depth Effect (H3)
Performance Comparison
| Architecture | Val Accuracy | Test Accuracy | Macro F1 | Train Time | Parameters |
|---|---|---|---|---|---|
| H2_MOD.A (9 layers) | 85.29% | N/A | N/A | 33m 48s | ~210,000 |
| 12-layer CNN | 83.45% | 85.58% | 83.54% | ~45m | ~280,000 (+33%) |
Detailed Metrics Analysis
| Category | Metric | 9 Layers | 12 Layers | Interpretation |
|---|---|---|---|---|
| Accuracy | Val Accuracy | 85.29% | 83.45% | Decreased with depth |
| | Test Accuracy | N/A | 85.58% | Moderate generalization |
| Loss | Val Loss | 0.3677 | 0.4095 | Higher residual error |
| | Train Loss | 0.2644 | 0.3061 | Optimization difficulty |
| Generalization | Gap (pp) | 2.86 | 4.13 | Increased overfitting |
| | Val/Train Loss Ratio | 1.39 | 1.33 | Lower stability |
| Metrics | Macro F1 | N/A | 83.54% | Unbalanced performance |
| | Weighted F1 | N/A | 85.54% | Dominated by majority |
| Efficiency | Training Time | 33m 48s | ~45m | +33% cost |
Key Finding: Increasing depth from 9 to 12 layers degraded validation accuracy (-1.84 pp), increased training time (+33%), and worsened the generalization gap (+44.4%), demonstrating diminishing returns with increased depth for this dataset size.
Analysis by Hypothesis Component
Performance Improvement (Expected but NOT Achieved)
Expected: F1-score improvement of +8 percentage points
Actual: Validation accuracy decreased by 1.84 pp
Analysis:
- Deeper model (12 layers) actually performed worse than shallower model (9 layers)
- The dataset size (~9,300 samples) may be insufficient to benefit from very deep architectures
- More parameters (+33%) did not translate to better performance
- Negative marginal return: roughly -0.026 pp per 1,000 additional parameters (-1.84 pp over ~70,000 extra parameters)
- Overfitting: Generalization gap increased from 2.86 pp to 4.13 pp (+44.4%)
- Vanishing gradients: Deeper network without residual connections struggled with gradient flow
- Co-adaptation: More parameters led to feature co-adaptation with poor generalization
- Insufficient regularization: Dropout alone insufficient for very deep networks
Diminishing Returns (CONFIRMED)
Evidence:
- 33% increase in parameters yielded negative performance return
- Loss increased, accuracy decreased
- Optimal depth appears to be around 9 layers for this configuration
- Parameter increase: +70,000 (+33%)
- Accuracy change: -1.84 pp
- Efficiency: Negative marginal productivity
Computational Cost (CONFIRMED)
Expected: ~40% increase in training time
Actual: +33% training time, +38% memory, +35% FLOPs
Computational Analysis:
- Forward Pass FLOPs: ~85 MFLOPs → ~115 MFLOPs (+35%)
- Memory Usage: ~45 MB → ~62 MB (+38%)
- Time per Epoch: 3.38 min → 4.57 min (+35%)
- Total Training Time: 33m 48s → ~45m (+33%)
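The relative-cost percentages follow from the absolute measurements; illustrative arithmetic only:

```python
def pct_increase(before: float, after: float) -> float:
    """Relative increase in percent, rounded to one decimal."""
    return round((after - before) / before * 100, 1)

flops = pct_increase(85, 115)          # MFLOPs per forward pass, ~+35%
epoch_time = pct_increase(3.38, 4.57)  # minutes per epoch, ~+35%
total_time = pct_increase(33.8, 45)    # total minutes (33m 48s = 33.8m), ~+33%
```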
Generalization Gap Analysis
The generalization gap increased by 44.4% (2.86 pp → 4.13 pp), attributable to:
- Unfavorable parameter-to-data ratio: More parameters relative to training samples
- Partial gradient vanishing: Without residual connections, deeper networks struggle
- Feature co-adaptation: More layers can lead to co-adapted features with poor generalization
Learning Curve Observations
- 9-layer model: Stable convergence, consistent validation performance
- 12-layer model: Higher volatility, irregular validation loss curves, signs of optimization difficulty
Hypothesis H3 Verification
Status: ⚠️ PARTIALLY CONFIRMED
- ✅ Diminishing returns: Confirmed (negative returns observed)
- ✅ Computational cost increase: Confirmed (+33% time, aligned with ~40% expectation)
- ❌ Performance improvement: Rejected (accuracy decreased instead of improving +8 pp)
Cross-Experiment Insights
Key Findings Summary
- Transfer learning is superior: ResNet50 (96.30%) dramatically outperformed custom architectures, validating pre-training value
- Data augmentation effective for imbalance: +17.2 pp minority recall with only -0.4% global accuracy cost demonstrates effective imbalance mitigation
- Depth has limits: Simply increasing network depth without architectural innovations (like residual connections) can harm performance
- Dataset size matters: ~9,300 samples insufficient for very deep networks (5-block CNN, ViT) but adequate for transfer learning
Discriminative Features
Learned Feature Analysis (Grad-CAM)
Visualizations using Grad-CAM showed that models focus on:
- Dense code regions: .text section containing characteristic instructions for each family
- Resource sections: Import tables and data sections varying between families
- Structural patterns: Models learn to ignore padding regions (uniform areas), indicating learned features are relevant, not noise
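The Grad-CAM procedure behind these visualizations can be sketched minimally in PyTorch. A toy CNN stands in for the actual models, and the hook-free plumbing here (returning the feature map from `forward`) is a simplification; the point is only the core computation: weight each feature map by its spatially pooled gradient, sum, and rectify.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCNN(nn.Module):
    """Toy stand-in: conv features + global-average-pool classifier."""
    def __init__(self, n_classes=25):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(16, n_classes)

    def forward(self, x):
        fmap = self.features(x)                    # (N, C, H, W)
        logits = self.head(fmap.mean(dim=(2, 3)))  # GAP then linear
        return logits, fmap

def grad_cam(model, x, cls):
    """Class-activation map for class `cls` on a single input."""
    model.eval()
    logits, fmap = model(x)
    fmap.retain_grad()                 # keep gradients of the feature map
    logits[0, cls].backward()          # backprop the target-class score
    weights = fmap.grad.mean(dim=(2, 3), keepdim=True)  # pooled gradients
    cam = F.relu((weights * fmap).sum(dim=1))           # weighted sum, ReLU
    return cam / (cam.max() + 1e-8)    # normalize to [0, 1]
```

High values in the resulting map mark regions (e.g. dense `.text`-like areas) that drove the class score; near-zero values correspond to ignored regions such as padding.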
Performance Comparison Across All Experiments
Best Configuration
Optimal Model: ResNet50 Fine-tuned with Data Augmentation
- Test Accuracy: ~95.8%
- Macro F1-Score: ~95.0%
- Minority Class Recall: +17.2 pp improvement
- Training Time: 57 minutes
- Convergence: 6 epochs
Architecture Ranking
- ResNet50 (fine-tuned): 96.30% - Transfer learning winner
- ViT-Small: 74.92% - Limited by dataset size
- Conventional CNN: 72.39% - Simple but effective
- VGG-Mini-H1 (5 blocks): 61.30% - Oversized for dataset
Practical Recommendations
Based on experimental results:
For Similar Projects:
- Use transfer learning (ResNet50, EfficientNet) for datasets <100k samples
- Apply moderate data augmentation to address class imbalance
- Avoid very deep custom CNNs without residual connections
- Start with simpler architectures and increase complexity only if justified by data scale
- Vision Transformers require datasets with millions of samples for competitive performance
Statistical Significance
All results are based on:
- Fixed train/validation/test splits (stratified)
- Fixed random seed (42) for reproducibility
- Early stopping to prevent overfitting
- Multiple metrics (accuracy, precision, recall, F1) for robust evaluation
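The fixed, seeded, stratified splitting protocol can be sketched with scikit-learn. The 80/10/10 ratio and the toy data are assumptions for illustration; only the stratification and the fixed seed (42) come from the protocol above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the dataset: 1000 samples with imbalanced labels.
rng = np.random.default_rng(42)
y = rng.choice([0, 1, 2], size=1000, p=[0.7, 0.2, 0.1])
X = rng.random((1000, 8))

# Stratified, seeded splits (80/10/10 is an assumed ratio).
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_val, X_te, y_val, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)
```

Stratification keeps each family's proportion nearly identical across splits, which matters for the minority-recall comparisons in Experiment 2.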