Project Overview
This research project addresses automated malware family classification using Deep Learning techniques, specifically Convolutional Neural Networks (CNNs) and Vision Transformers applied to visual representations of malware executables.
Academic Context
Institution: Universidad de Caldas
Course: Sistemas Inteligentes II
Professor: Jorge Alberto Jaramillo Garzón
Research Problem
The contemporary cybersecurity landscape faces unprecedented challenges with exponential growth in malware volume and sophistication. Traditional antivirus systems based on static signatures and heuristic analysis are insufficient against modern polymorphic and obfuscated malware.
Key Challenge: Developing automated, efficient, and robust methods capable of identifying and classifying malware with high precision, even when encountering previously unseen samples.
Research Hypotheses
This project proposes three specific, quantifiable hypotheses that were experimentally verified:
H1: Architecture Comparison
Hypothesis: “In the malware classification task on the MalImg dataset, a ResNet50 model pre-trained on ImageNet with fine-tuning will outperform both a custom CNN and a Vision Transformer (ViT-Small) in accuracy and macro F1-score, due to transferable low-level features from ImageNet and the limited size of the malware dataset.”
Variables:
- Independent: Model architecture (custom CNN, fine-tuned ResNet50, ViT-Small)
- Dependent: Accuracy, macro F1-score, epochs to convergence, training time
- Control: Dataset (MalImg), maximum epochs, early stopping, base learning rate
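The two headline metrics for H1 behave differently under class imbalance, which is why both are reported. The following is a minimal sketch in plain Python, not the project's actual evaluation code; the function names `accuracy` and `macro_f1` are illustrative, and the sketch assumes macro F1 is averaged over the classes that appear in the true labels.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true.

    Assumption: classes that only appear in y_pred are not averaged in.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = []
    for c in sorted(set(y_true)):
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # A classifier that always predicts the majority class
    y_true = ["majority"] * 8 + ["minority"] * 2
    y_pred = ["majority"] * 10
    print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```

In the toy case above, accuracy stays high while macro F1 drops sharply because the minority class contributes an F1 of zero. This is the behavior that makes macro F1 informative on a dataset with small families.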
H2: Data Augmentation Impact
Hypothesis: “The application of moderate data augmentation (rotation, horizontal flip, brightness/contrast variation) will significantly improve recall for underrepresented malware families without substantially degrading global model accuracy.”
Variables:
- Independent: Application of data augmentation (with/without)
- Dependent: Minority class recall, global accuracy, macro F1-score
- Control: Architecture (best model from H1), training hyperparameters
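To make the augmentation operations concrete, here is a small sketch in plain Python that applies a random horizontal flip and a brightness/contrast jitter to a grayscale image stored as a list of rows. It is illustrative only: the jitter ranges (`max_shift`, `max_contrast`) and the flip probability are assumptions, not the values used in the experiments, and rotation is left out because it requires interpolation that an image library would normally provide.

```python
import random


def hflip(img):
    """Mirror each row of a grayscale image (list of rows of 0-255 ints)."""
    return [row[::-1] for row in img]


def adjust_brightness_contrast(img, brightness=0.0, contrast=1.0):
    """Scale pixel values around mid-grey (128), shift them, clip to [0, 255]."""
    return [
        [min(255, max(0, round((p - 128) * contrast + 128 + brightness))) for p in row]
        for row in img
    ]


def augment(img, rng, p_flip=0.5, max_shift=20, max_contrast=0.2):
    """Apply a random flip and a random brightness/contrast jitter.

    All ranges here are hypothetical defaults chosen for illustration.
    """
    if rng.random() < p_flip:
        img = hflip(img)
    brightness = rng.uniform(-max_shift, max_shift)
    contrast = 1.0 + rng.uniform(-max_contrast, max_contrast)
    return adjust_brightness_contrast(img, brightness, contrast)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    image = [[0, 64, 128], [192, 255, 32]]
    print(augment(image, rng))
```

Augmentation of this kind would be applied to training samples only, so that validation and test metrics are measured on unmodified images.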
H3: CNN Depth Effect
Hypothesis: “Increasing the depth of a custom CNN (from 3 to 5 convolutional blocks) will improve model performance in terms of macro F1-score, but with diminishing returns and higher computational cost.”
Variables:
- Independent: Number of convolutional blocks (3 vs 5)
- Dependent: Macro F1-score, accuracy, training time
- Control: Dataset, batch size, learning rate, maximum epochs
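The cost side of H3 can be estimated before any training by counting convolutional parameters. The sketch below assumes a hypothetical block design that the source does not specify: one 3×3 convolution per block, channel widths doubling from 32, and a 2×2 pooling step that halves the spatial size. It also assumes a single-channel 224×224 input and ignores the classifier head.

```python
def conv_stack_summary(num_blocks, in_channels=1, base_channels=32,
                       input_size=224, kernel=3):
    """Count conv weights and biases for a stack of blocks.

    Assumed block: kernel x kernel conv (same padding) followed by 2x2 pooling.
    Channel widths (32, 64, 128, ...) are illustrative, not the project's.
    """
    params = 0
    size = input_size
    c_in = in_channels
    for i in range(num_blocks):
        c_out = base_channels * 2 ** i
        params += kernel * kernel * c_in * c_out + c_out  # weights + biases
        size //= 2  # pooling halves height and width
        c_in = c_out
    return {"params": params, "feature_map": size, "channels": c_in}


if __name__ == "__main__":
    for depth in (3, 5):
        print(depth, conv_stack_summary(depth))
```

Under these assumptions the 3-block stack has 92,672 convolutional parameters and the 5-block stack has 1,568,000, so most of the added cost comes from the two deepest blocks. Real figures depend on the actual layer configuration.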
Research Objectives
General Objective
Develop and implement a Deep Learning-based malware classification system that uses visual representations of executables to automatically identify malware families with high precision and efficiency.
Specific Objectives
- Data Preparation: Preprocess the MalImg dataset, implementing the pipeline for converting executables to visual representations with normalization to 224×224 pixels and stratified partitioning (70% training, 15% validation, 15% test)
- Architecture Implementation (H1): Design and implement three classification architectures:
  - Custom CNN with 5 convolutional blocks
  - Pre-trained ResNet50 with partial fine-tuning strategy
  - Vision Transformer (ViT-Small) adapted for malware images
- Architecture Experiment (H1): Train and evaluate the three architectures under controlled conditions, comparing accuracy, macro F1-score, convergence time, and parameter count
- Augmentation Experiment (H2): Evaluate the impact of moderate data augmentation on minority class recall using the best architecture from H1
- Depth Experiment (H3): Compare the performance of the custom CNN with 3 vs 5 convolutional blocks, analyzing the trade-off between F1-score and training time
- Analysis and Interpretation: Generate visualizations of learned features (activation maps, t-SNE) to interpret what structural patterns distinguish malware families
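The conversion step named in the Data Preparation objective can be sketched as follows: each byte of the executable becomes one grayscale pixel, the byte stream is wrapped into rows, and the result is resized to 224×224. This is a minimal illustration in plain Python, not the project's pipeline. The fixed row width of 256 and the nearest-neighbour resize are assumptions; a real pipeline would likely use an image library and may choose the width from the file size.

```python
def bytes_to_grayscale(data: bytes, width: int = 256):
    """Map each byte to one pixel (0-255) and wrap into rows of `width` pixels.

    The last row is zero-padded. The default width is an assumption.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    rows = []
    for start in range(0, len(data), width):
        chunk = data[start:start + width]
        rows.append(list(chunk) + [0] * (width - len(chunk)))
    return rows


def resize_nearest(img, size=224):
    """Resize a non-empty 2D image to size x size with nearest-neighbour sampling."""
    h, w = len(img), len(img[0])
    return [
        [img[r * h // size][c * w // size] for c in range(size)]
        for r in range(size)
    ]


if __name__ == "__main__":
    fake_binary = bytes(range(256)) * 64  # stand-in for an executable's bytes
    image = resize_nearest(bytes_to_grayscale(fake_binary))
    print(len(image), len(image[0]))
```

Because the mapping is a direct byte-to-pixel copy, sections of the binary with similar byte patterns appear as visually similar textures, which is what the classifiers operate on.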
Dataset: MalImg
The MalImg dataset contains 9,339 samples across 25 malware families, with images derived from Windows malware executables converted to grayscale visualizations.
Dataset Distribution
- Training Set: 70% (6,537 samples)
- Validation Set: 15% (1,401 samples)
- Test Set: 15% (1,401 samples)
- Stratification: Yes, maintaining class proportions
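A stratified 70/15/15 partition can be sketched by splitting each family separately, so that every subset keeps the same class proportions. The function below is an illustrative plain-Python version; the name `stratified_split`, the seed, and the per-class rounding rule are assumptions, and a library routine could be used instead.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Return (train, val, test) index lists, split class by class.

    Per-class rounding is an assumed rule; the test set takes the remainder.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_train = round(len(idxs) * fractions[0])
        n_val = round(len(idxs) * fractions[1])
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]
    return train, val, test


if __name__ == "__main__":
    labels = ["family_a"] * 100 + ["family_b"] * 20
    train, val, test = stratified_split(labels)
    print(len(train), len(val), len(test))
```

Because rounding happens per family, the subset totals from a routine like this can differ by a few samples from the exact 6,537 / 1,401 / 1,401 counts listed above.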
Key Characteristics
- Grayscale images derived from malware executables
- 25 different Windows malware families
- Class imbalance: 5 families with fewer than 100 samples (minority classes)
- Families include Trojans, Worms, Backdoors, Ransomware, Adware/Spyware
Justification
The adoption of Deep Learning techniques for malware analysis is justified by:
- Automatic Feature Learning: Unlike traditional methods requiring manual feature engineering, CNNs automatically learn hierarchical discriminative representations directly from raw data.
- Scalability: Once trained, the model can classify new samples in near real-time, enabling processing of large data volumes.
- Robustness to Variations: Visual features captured by CNNs can be invariant to certain obfuscation techniques that alter code but preserve fundamental structures.
- Transferability: Models trained on certain datasets can be adapted (fine-tuned) to new datasets with lower computational cost.
- Practical Applicability: The proposed approach can be integrated into real threat detection systems, digital forensic analysis, and security incident response.
Scope and Limitations
Scope
- Classification of known malware families in selected datasets
- Static analysis through visual representations (no dynamic execution)
- Evaluation in a controlled environment with labeled samples
- Standard CNN architectures and pre-trained variants
Limitations
- Dependence on public datasets whose distribution may differ from real-world threats
- Limited to malware families present in training data (zero-day detection would require additional approaches)
- Focus on Windows malware (limited by available datasets)
- Does not consider dynamic behavior analysis or hybrid techniques
Evaluation Criteria
The project was evaluated based on five criteria:
- Experiment Design - Clear hypothesis formulation, dataset selection and justification
- Experiment Development - Correct and reproducible implementation of training/validation pipeline
- Data and Results Analysis - Analysis of numerical results, learning curves, confusion matrices
- Engineering Judgment for Recommendations - Discussion of practical implications, justified recommendations
- Results Communication - Clear, structured technical report with readable figures and tables
This research contributes to cybersecurity by demonstrating that transfer learning is superior for malware classification on moderate-sized datasets, and that augmentation techniques effectively mitigate class imbalance without sacrificing global performance.