Project Overview

This research project addresses automated malware family classification using Deep Learning techniques, specifically focusing on Convolutional Neural Networks (CNNs) and Vision Transformers applied to visual representations of malware executables.

Academic Context

Institution: Universidad de Caldas
Course: Sistemas Inteligentes II
Professor: Jorge Alberto Jaramillo Garzón

Research Problem

The contemporary cybersecurity landscape faces unprecedented challenges with exponential growth in malware volume and sophistication. Traditional antivirus systems based on static signatures and heuristic analysis are insufficient against modern polymorphic and obfuscated malware.
Key Challenge: Developing automated, efficient, and robust methods capable of identifying and classifying malware with high precision, even when encountering previously unseen samples.

Research Hypotheses

This project proposes three specific, quantifiable hypotheses that were experimentally tested:

H1: Architecture Comparison

Hypothesis: “In the malware classification task on the MalImg dataset, a ResNet50 model pre-trained on ImageNet with fine-tuning will outperform both a custom CNN and a Vision Transformer (ViT-Small) in accuracy and macro F1-score, due to transferable low-level features from ImageNet and the limited size of the malware dataset.”

Variables:
  • Independent: Model architecture (custom CNN, fine-tuned ResNet50, ViT-Small)
  • Dependent: Accuracy, macro F1-score, epochs to convergence, training time
  • Control: Dataset (MalImg), maximum epochs, early stopping, base learning rate

H2: Data Augmentation Impact

Hypothesis: “The application of moderate data augmentation (rotation, horizontal flip, brightness/contrast variation) will significantly improve recall for underrepresented malware families without substantially degrading global model accuracy.”

Variables:
  • Independent: Application of data augmentation (with/without)
  • Dependent: Minority class recall, global accuracy, macro F1-score
  • Control: Architecture (best model from H1), training hyperparameters

H3: CNN Depth Effect

Hypothesis: “Increasing the depth of a custom CNN (from 3 to 5 convolutional blocks) will improve model performance in terms of macro F1-score, but with diminishing returns and higher computational cost.”

Variables:
  • Independent: Number of convolutional blocks (3 vs 5)
  • Dependent: Macro F1-score, accuracy, training time
  • Control: Dataset, batch size, learning rate, maximum epochs
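A depth comparison like H3's is easiest to run when the block count is a parameter. A minimal sketch of a configurable-depth CNN (the block layout and channel widths are assumptions, not the project's actual architecture):

```python
import torch
import torch.nn as nn

def make_cnn(num_blocks: int, num_classes: int = 25, in_channels: int = 1) -> nn.Sequential:
    """Custom CNN with a configurable number of conv blocks (H3 compares 3 vs 5).
    Each block: Conv3x3 -> BatchNorm -> ReLU -> MaxPool2 (halves spatial size).
    Channel widths double per block; these figures are illustrative."""
    layers, ch = [], in_channels
    for i in range(num_blocks):
        out_ch = 32 * (2 ** i)
        layers += [
            nn.Conv2d(ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        ]
        ch = out_ch
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(ch, num_classes)]
    return nn.Sequential(*layers)

# Comparing parameter counts makes the computational-cost side of H3 concrete:
cnn3, cnn5 = make_cnn(3), make_cnn(5)
params3 = sum(p.numel() for p in cnn3.parameters())
params5 = sum(p.numel() for p in cnn5.parameters())
```

With doubling channel widths, most of the extra parameters sit in the two added blocks, which is what drives the higher training cost the hypothesis predicts.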

Research Objectives

General Objective

Develop and implement a Deep Learning-based malware classification system that uses visual representations of executables to automatically identify malware families with high precision and efficiency.

Specific Objectives

  1. Data Preparation: Preprocess the MalImg dataset, implementing the pipeline for converting executables to visual representations with normalization to 224×224 pixels and stratified partitioning (70% training, 15% validation, 15% test)
  2. Architecture Implementation (H1): Design and implement three classification architectures:
    • Custom CNN with 5 convolutional blocks
    • Pre-trained ResNet50 with partial fine-tuning strategy
    • Vision Transformer (ViT-Small) adapted for malware images
  3. Architecture Experiment (H1): Train and evaluate the three architectures under controlled conditions, comparing accuracy, macro F1-score, convergence time, and parameter count
  4. Augmentation Experiment (H2): Evaluate the impact of moderate data augmentation on minority class recall using the best architecture from H1
  5. Depth Experiment (H3): Compare the performance of custom CNN with 3 vs 5 convolutional blocks, analyzing the trade-off between F1-score and training time
  6. Analysis and Interpretation: Generate visualizations of learned features (activation maps, t-SNE) to interpret what structural patterns distinguish malware families
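The stratified 70/15/15 partitioning from Objective 1 can be done in two stages with scikit-learn; a minimal sketch (function name and seed are illustrative):

```python
from sklearn.model_selection import train_test_split

def stratified_70_15_15(paths, labels, seed=42):
    """Two-stage stratified split: 70% train, then the remaining 30%
    halved into 15% validation / 15% test, preserving class proportions."""
    X_tr, X_tmp, y_tr, y_tmp = train_test_split(
        paths, labels, test_size=0.30, stratify=labels, random_state=seed)
    X_va, X_te, y_va, y_te = train_test_split(
        X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=seed)
    return (X_tr, y_tr), (X_va, y_va), (X_te, y_te)
```

Stratifying both splits keeps every family's proportion intact in all three partitions, which matters given the minority families with fewer than 100 samples.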

Dataset: MalImg

The MalImg dataset contains 9,339 samples across 25 malware families, with images derived from Windows malware executables converted to grayscale visualizations.
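The executable-to-image conversion behind MalImg treats each byte of the binary as one 8-bit grayscale pixel, laid out row by row at a fixed width. A minimal NumPy sketch (the width of 256 is an illustrative choice; MalImg-style pipelines typically pick the width based on file size):

```python
import numpy as np

def bytes_to_grayscale(data: bytes, width: int = 256) -> np.ndarray:
    """Interpret a binary's raw bytes as an 8-bit grayscale image:
    each byte (0-255) becomes one pixel. The trailing partial row,
    if any, is dropped."""
    arr = np.frombuffer(data, dtype=np.uint8)
    height = len(arr) // width
    return arr[: height * width].reshape(height, width)
```

Structural regions of an executable (code, data, resources) produce visually distinct textures in this representation, which is what makes family classification from images plausible.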

Dataset Distribution

  • Training Set: 70% (6,537 samples)
  • Validation Set: 15% (1,401 samples)
  • Test Set: 15% (1,401 samples)
  • Stratification: Yes, maintaining class proportions

Key Characteristics

  • Images in grayscale derived from malware executables
  • 25 different Windows malware families
  • Class imbalance: 5 families with fewer than 100 samples (minority classes)
  • Families include Trojans, Worms, Backdoors, Ransomware, Adware/Spyware
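Because of this imbalance, H2 tracks per-family (minority) recall alongside macro F1, which weighs every family equally regardless of size. A minimal scikit-learn sketch of computing both (the label vectors here are hypothetical, standing in for test-set predictions):

```python
import numpy as np
from sklearn.metrics import f1_score, recall_score

# Hypothetical predictions; in practice y_true / y_pred come from the test set.
y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2])

macro_f1 = f1_score(y_true, y_pred, average="macro")
per_class_recall = recall_score(y_true, y_pred, average=None)
# per_class_recall[c] is the recall of family c; H2 tracks it for the
# minority families, while macro_f1 summarizes all families equally.
```

Global accuracy alone would be dominated by the large families, so a model could score well while missing the five minority families entirely; these two metrics expose that failure mode.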

Justification

The adoption of Deep Learning techniques for malware analysis is justified by:

  • Automatic Feature Learning: Unlike traditional methods requiring manual feature engineering, CNNs automatically learn hierarchical discriminative representations directly from raw data.
  • Scalability: Once trained, the model can classify new samples in near real-time, enabling processing of large data volumes.
  • Robustness to Variations: Visual features captured by CNNs can be invariant to certain obfuscation techniques that alter code but preserve fundamental structures.
  • Transferability: Models trained on certain datasets can be adapted (fine-tuned) to new datasets at lower computational cost.
  • Practical Applicability: The proposed approach can be integrated into real threat detection systems, digital forensic analysis, and security incident response.

Scope and Limitations

Scope

  • Classification of known malware families in selected datasets
  • Static analysis through visual representations (no dynamic execution)
  • Evaluation in controlled environment with labeled samples
  • Standard CNN architectures and pre-trained variants

Limitations

  • Dependence on public datasets with potentially different distribution from real-world threats
  • Limited to malware families present in training data (zero-day detection would require additional approaches)
  • Focus on Windows malware (limited by available datasets)
  • Does not consider dynamic behavior analysis or hybrid techniques

Evaluation Criteria

The project was evaluated based on five criteria:
  1. Experiment Design - Clear hypothesis formulation, dataset selection and justification
  2. Experiment Development - Correct and reproducible implementation of training/validation pipeline
  3. Data and Results Analysis - Analysis of numerical results, learning curves, confusion matrices
  4. Engineering Judgment for Recommendations - Discussion of practical implications, justified recommendations
  5. Results Communication - Clear, structured technical report with readable figures and tables

This research contributes to cybersecurity by demonstrating that transfer learning is superior for malware classification on moderate-sized datasets, and that augmentation techniques effectively mitigate class imbalance without sacrificing global performance.