Project Overview

This research project addresses automated malware family classification using Deep Learning techniques, specifically focusing on Convolutional Neural Networks (CNNs) and Vision Transformers applied to visual representations of malware executables.

Academic Context

Institution: Universidad de Caldas
Course: Sistemas Inteligentes II
Professor: Jorge Alberto Jaramillo Garzón

Research Problem

The contemporary cybersecurity landscape faces unprecedented challenges with exponential growth in malware volume and sophistication. Traditional antivirus systems based on static signatures and heuristic analysis are insufficient against modern polymorphic and obfuscated malware.
Key Challenge: Developing automated, efficient, and robust methods capable of identifying and classifying malware with high precision, even when encountering previously unseen samples.

Research Hypotheses

This project proposes three specific, quantifiable hypotheses that were experimentally tested:

H1: Architecture Comparison

Hypothesis: “In the malware classification task on the MalImg dataset, a ResNet50 model pre-trained on ImageNet with fine-tuning will outperform both a custom CNN and a Vision Transformer (ViT-Small) in accuracy and macro F1-score, due to transferable low-level features from ImageNet and the limited size of the malware dataset.”

Variables:
  • Independent: Model architecture (custom CNN, fine-tuned ResNet50, ViT-Small)
  • Dependent: Accuracy, macro F1-score, epochs to convergence, training time
  • Control: Dataset (MalImg), maximum epochs, early stopping, base learning rate

H2: Data Augmentation Impact

Hypothesis: “The application of moderate data augmentation (rotation, horizontal flip, brightness/contrast variation) will significantly improve recall for underrepresented malware families without substantially degrading global model accuracy.”

Variables:
  • Independent: Application of data augmentation (with/without)
  • Dependent: Minority class recall, global accuracy, macro F1-score
  • Control: Architecture (best model from H1), training hyperparameters

H3: CNN Depth Effect

Hypothesis: “Increasing the depth of a custom CNN (from 3 to 5 convolutional blocks) will improve model performance in terms of macro F1-score, but with diminishing returns and higher computational cost.”

Variables:
  • Independent: Number of convolutional blocks (3 vs 5)
  • Dependent: Macro F1-score, accuracy, training time
  • Control: Dataset, batch size, learning rate, maximum epochs
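A depth comparison like H3's is easiest to run when the block count is a parameter. A minimal sketch of a configurable-depth CNN (the block layout and channel widths are assumptions, not the project's actual architecture):

```python
import torch
import torch.nn as nn

def make_cnn(num_blocks: int, num_classes: int = 25, in_channels: int = 1) -> nn.Sequential:
    """Custom CNN with a configurable number of conv blocks (H3 compares 3 vs 5).
    Each block: Conv3x3 -> BatchNorm -> ReLU -> MaxPool2 (halves spatial size).
    Channel widths double per block; these figures are illustrative."""
    layers, ch = [], in_channels
    for i in range(num_blocks):
        out_ch = 32 * (2 ** i)
        layers += [
            nn.Conv2d(ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        ]
        ch = out_ch
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(ch, num_classes)]
    return nn.Sequential(*layers)

# Comparing parameter counts makes the computational-cost side of H3 concrete:
cnn3, cnn5 = make_cnn(3), make_cnn(5)
params3 = sum(p.numel() for p in cnn3.parameters())
params5 = sum(p.numel() for p in cnn5.parameters())
```

With doubling channel widths, most of the extra parameters sit in the two added blocks, which is what drives the higher training cost the hypothesis predicts.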

Research Objectives

General Objective

Develop and implement a Deep Learning-based malware classification system that uses visual representations of executables to automatically identify malware families with high precision and efficiency.

Specific Objectives

  1. Data Preparation: Preprocess the MalImg dataset, implementing the pipeline for converting executables to visual representations with normalization to 224×224 pixels and stratified partitioning (70% training, 15% validation, 15% test)
  2. Architecture Implementation (H1): Design and implement three classification architectures:
    • Custom CNN with 5 convolutional blocks
    • Pre-trained ResNet50 with partial fine-tuning strategy
    • Vision Transformer (ViT-Small) adapted for malware images
  3. Architecture Experiment (H1): Train and evaluate the three architectures under controlled conditions, comparing accuracy, macro F1-score, convergence time, and parameter count
  4. Augmentation Experiment (H2): Evaluate the impact of moderate data augmentation on minority class recall using the best architecture from H1
  5. Depth Experiment (H3): Compare the performance of custom CNN with 3 vs 5 convolutional blocks, analyzing the trade-off between F1-score and training time
  6. Analysis and Interpretation: Generate visualizations of learned features (activation maps, t-SNE) to interpret what structural patterns distinguish malware families
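The stratified 70/15/15 partitioning from Objective 1 can be done in two stages with scikit-learn; a minimal sketch (function name and seed are illustrative):

```python
from sklearn.model_selection import train_test_split

def stratified_70_15_15(paths, labels, seed=42):
    """Two-stage stratified split: 70% train, then the remaining 30%
    halved into 15% validation / 15% test, preserving class proportions."""
    X_tr, X_tmp, y_tr, y_tmp = train_test_split(
        paths, labels, test_size=0.30, stratify=labels, random_state=seed)
    X_va, X_te, y_va, y_te = train_test_split(
        X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=seed)
    return (X_tr, y_tr), (X_va, y_va), (X_te, y_te)
```

Stratifying both splits keeps every family's proportion intact in all three partitions, which matters given the minority families with fewer than 100 samples.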

Dataset: MalImg

The MalImg dataset contains 9,339 samples across 25 malware families, with images derived from Windows malware executables converted to grayscale visualizations.
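The executable-to-image conversion behind MalImg treats each byte of the binary as one 8-bit grayscale pixel, laid out row by row at a fixed width. A minimal NumPy sketch (the width of 256 is an illustrative choice; MalImg-style pipelines typically pick the width based on file size):

```python
import numpy as np

def bytes_to_grayscale(data: bytes, width: int = 256) -> np.ndarray:
    """Interpret a binary's raw bytes as an 8-bit grayscale image:
    each byte (0-255) becomes one pixel. The trailing partial row,
    if any, is dropped."""
    arr = np.frombuffer(data, dtype=np.uint8)
    height = len(arr) // width
    return arr[: height * width].reshape(height, width)
```

Structural regions of an executable (code, data, resources) produce visually distinct textures in this representation, which is what makes family classification from images plausible.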

Dataset Distribution

  • Training Set: 70% (6,537 samples)
  • Validation Set: 15% (1,401 samples)
  • Test Set: 15% (1,401 samples)
  • Stratification: Yes, maintaining class proportions

Key Characteristics

  • Images in grayscale derived from malware executables
  • 25 different Windows malware families
  • Class imbalance: 5 families with fewer than 100 samples (minority classes)
  • Families include Trojans, Worms, Backdoors, Ransomware, Adware/Spyware
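Because of this imbalance, H2 tracks per-family (minority) recall alongside macro F1, which weighs every family equally regardless of size. A minimal scikit-learn sketch of computing both (the label vectors here are hypothetical, standing in for test-set predictions):

```python
import numpy as np
from sklearn.metrics import f1_score, recall_score

# Hypothetical predictions; in practice y_true / y_pred come from the test set.
y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2])

macro_f1 = f1_score(y_true, y_pred, average="macro")
per_class_recall = recall_score(y_true, y_pred, average=None)
# per_class_recall[c] is the recall of family c; H2 tracks it for the
# minority families, while macro_f1 summarizes all families equally.
```

Global accuracy alone would be dominated by the large families, so a model could score well while missing the five minority families entirely; these two metrics expose that failure mode.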

Justification

The adoption of Deep Learning techniques for malware analysis is justified by:

  • Automatic Feature Learning: Unlike traditional methods requiring manual feature engineering, CNNs automatically learn hierarchical discriminative representations directly from raw data.
  • Scalability: Once trained, the model can classify new samples in near real-time, enabling processing of large data volumes.
  • Robustness to Variations: Visual features captured by CNNs can be invariant to certain obfuscation techniques that alter code but preserve fundamental structures.
  • Transferability: Models trained on certain datasets can be adapted (fine-tuned) to new datasets at lower computational cost.
  • Practical Applicability: The proposed approach can be integrated into real threat detection systems, digital forensic analysis, and security incident response.

Scope and Limitations

Scope

  • Classification of known malware families in selected datasets
  • Static analysis through visual representations (no dynamic execution)
  • Evaluation in controlled environment with labeled samples
  • Standard CNN architectures and pre-trained variants

Limitations

  • Dependence on public datasets with potentially different distribution from real-world threats
  • Limited to malware families present in training data (zero-day detection would require additional approaches)
  • Focus on Windows malware (limited by available datasets)
  • Does not consider dynamic behavior analysis or hybrid techniques

Evaluation Criteria

The project was evaluated based on five criteria:
  1. Experiment Design - Clear hypothesis formulation, dataset selection and justification
  2. Experiment Development - Correct and reproducible implementation of training/validation pipeline
  3. Data and Results Analysis - Analysis of numerical results, learning curves, confusion matrices
  4. Engineering Judgment for Recommendations - Discussion of practical implications, justified recommendations
  5. Results Communication - Clear, structured technical report with readable figures and tables

This research contributes to cybersecurity by demonstrating that transfer learning is superior for malware classification on moderate-sized datasets, and that augmentation techniques effectively mitigate class imbalance without sacrificing global performance.