
Process Overview

The methodological process follows a structured pipeline:
  1. Dataset Acquisition: Download and organization of public malware datasets
  2. Preprocessing: Conversion of executables to images and normalization
  3. Exploratory Analysis: Visualization and statistics of malware families
  4. Architecture Design: Implementation of CNN models
  5. Training: Hyperparameter configuration and optimization
  6. Evaluation: Performance measurement using standard metrics
  7. Comparative Analysis: Comparison between different models

Dataset: MalImg

Description

Source: Available on Kaggle (manaswinisunkari/malimg-dataset90)
Composition: 9,339 malware samples
Families: 25 different Windows malware families
Format: Grayscale images derived from binary executables

Malware Family Distribution

The dataset contains diverse malware families including:
  • Trojans: Alureon, VB, Agent
  • Worms: Kelihos, Autorun, Worm
  • Backdoors: Rbot, Bifrose
  • Ransomware: Locker, Cryptolocker
  • Adware/Spyware: Winwebsec, FakeRean
The class distribution is noticeably imbalanced, with 5 families having fewer than 100 samples. This is addressed in the training strategy through balancing and weighting techniques.

Data Preprocessing

Executable to Image Conversion

The binary-to-image conversion process follows these steps:
  1. Binary Reading: The executable file is read as a sequence of bytes (values 0-255)
  2. Pixel Mapping: Each byte is interpreted as pixel intensity in grayscale
  3. Dimension Determination:
    • Calculate total file size in bytes
    • Determine optimal image width (typically power of 2)
    • Height calculated as: height = ⌈size_bytes / width⌉
  4. Matrix Construction: Bytes are organized into a 2D matrix with calculated dimensions
  5. Padding: If necessary, padding pixels are added to complete the last row
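The five steps above can be sketched in NumPy. The width-selection heuristic (a power of two near the square root of the file size) is one illustrative choice consistent with the description, not a fixed standard:

```python
import numpy as np

def binary_to_image(raw, width=None):
    """Convert a binary file's raw bytes into a 2-D grayscale image array."""
    data = np.frombuffer(raw, dtype=np.uint8)          # steps 1-2: bytes as 0-255 intensities
    if width is None:
        # step 3: pick a power-of-two width close to the square root of the size
        width = 2 ** max(int(np.log2(max(len(data), 1) ** 0.5)), 1)
    height = -(-len(data) // width)                    # height = ceil(size_bytes / width)
    padded = np.zeros(height * width, dtype=np.uint8)  # step 5: zero-pad the last row
    padded[: len(data)] = data
    return padded.reshape(height, width)               # step 4: 2-D matrix
```

Passing an explicit `width` reproduces a fixed-width layout; omitting it applies the heuristic.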

Normalization and Resizing

To ensure training uniformity:
  • Resizing: All images resized to fixed size (224×224 or 256×256) via bilinear interpolation
  • Pixel Normalization: Pixel values scaled to range [0, 1] by dividing by 255
  • Standardization: Optionally, per-channel normalization using mean and standard deviation of training dataset
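A minimal preprocessing sketch using Pillow (listed in the software stack below) for the bilinear resize and NumPy for the scaling; the 224×224 target follows the sizes mentioned above:

```python
import numpy as np
from PIL import Image

def preprocess(img_array, size=(224, 224)):
    """Resize a grayscale byte image and scale pixel values to [0, 1]."""
    img = Image.fromarray(img_array, mode="L")
    img = img.resize(size, resample=Image.BILINEAR)  # bilinear interpolation
    x = np.asarray(img, dtype=np.float32) / 255.0    # normalize to [0, 1]
    # optional standardization with training-set statistics would follow here,
    # e.g. x = (x - train_mean) / train_std
    return x
```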

Data Augmentation

To improve generalization and mitigate overfitting, data augmentation techniques are applied during training. However, unlike natural images, malware visualizations require special considerations, since each pixel directly represents a byte of the original executable.

Lossless Transformations

Arbitrary rotations (e.g., ±15°) require pixel interpolation, which modifies the original values and destroys the byte-to-pixel correspondence. Therefore, only orthogonal rotations (90°, 180°, 270°) are employed; these are pure permutations of pixel positions that alter no values. This decision is based on two critical aspects:
  1. Semantic Preservation: Orthogonal rotations maintain intact the structural information of the binary
  2. Evasion Robustness: Recent research demonstrates that malware CNN classifiers are vulnerable to adversarial attacks based on image transformations. Training with orthogonal rotations improves resistance to these evasion techniques

Applied Augmentation Techniques

  • Orthogonal rotations: 90°, 180°, 270° (random selection)
  • Flips: Horizontal and vertical
  • Brightness adjustments: ±10-20% (preserves relative relationships between pixels)
  • Contrast adjustments: ±10-20%
These transformations are applied on-the-fly during training, generating random variants in each epoch without additional storage.
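The lossless transformations above can be written directly in NumPy, where `rot90` and `flip` are pure index permutations (the rotation draw below includes the identity as one of the four equally likely outcomes, a common convention):

```python
import numpy as np

rng = np.random.default_rng()

def augment(x):
    """Random orthogonal rotation plus optional flips: a pure pixel permutation."""
    x = np.rot90(x, k=int(rng.integers(0, 4)))  # 0°, 90°, 180° or 270°
    if rng.random() < 0.5:
        x = np.flip(x, axis=1)                  # horizontal flip
    if rng.random() < 0.5:
        x = np.flip(x, axis=0)                  # vertical flip
    # brightness/contrast jitter (±10-20%) on normalized float images would be
    # e.g. np.clip(x * rng.uniform(0.8, 1.2), 0.0, 1.0) -- no longer lossless
    return x
```

Because only permutations are applied, the multiset of pixel values is preserved exactly.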

Data Split

Datasets are divided following a stratified partitioning strategy to maintain class proportions:
  • Training Set: 70% of samples (6,537) - used for weight optimization
  • Validation Set: 15% of samples (1,401) - used for hyperparameter tuning and overfitting detection
  • Test Set: 15% of samples (1,401) - used exclusively for final evaluation
Test samples are never seen during training or validation, ensuring unbiased performance evaluation.
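The 70/15/15 stratified split can be obtained with two chained calls to scikit-learn's `train_test_split` (the second call splits the held-out 30% in half); the seed value is an arbitrary illustration:

```python
from sklearn.model_selection import train_test_split

def stratified_split(X, y, seed=42):
    """70/15/15 stratified split: each subset keeps the class proportions."""
    X_train, X_tmp, y_train, y_tmp = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=seed)
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```

Stratifying the second split on `y_tmp` keeps class proportions intact in both the validation and test sets.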

Neural Network Architectures

Custom CNN Architecture

A baseline CNN architecture adapted to malware image characteristics:
Layer Structure:
  1. Convolutional Block 1:
    • Conv2D: 32 filters, 3×3 kernel, ReLU activation
    • Conv2D: 32 filters, 3×3 kernel, ReLU activation
    • MaxPooling2D: 2×2 pool size
    • Dropout: 25%
  2. Convolutional Block 2:
    • Conv2D: 64 filters, 3×3 kernel, ReLU activation
    • Conv2D: 64 filters, 3×3 kernel, ReLU activation
    • MaxPooling2D: 2×2 pool size
    • Dropout: 25%
  3. Convolutional Block 3:
    • Conv2D: 128 filters, 3×3 kernel, ReLU activation
    • Conv2D: 128 filters, 3×3 kernel, ReLU activation
    • MaxPooling2D: 2×2 pool size
    • Dropout: 25%
  4. Fully Connected Layers:
    • Flatten
    • Dense: 512 units, ReLU activation
    • Dropout: 50%
    • Dense: 256 units, ReLU activation
    • Dropout: 50%
    • Dense: N units (number of classes), Softmax activation
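The layer structure above maps directly onto a Keras `Sequential` model; this is a minimal sketch, with the 224×224 grayscale input and 25 classes taken from the values used elsewhere in this document:

```python
from tensorflow.keras import layers, models

def build_custom_cnn(input_shape=(224, 224, 1), num_classes=25):
    """Baseline CNN: three conv blocks followed by two dense layers."""
    model = models.Sequential([layers.Input(shape=input_shape)])
    for filters in (32, 64, 128):                         # blocks 1-3
        model.add(layers.Conv2D(filters, 3, activation="relu"))
        model.add(layers.Conv2D(filters, 3, activation="relu"))
        model.add(layers.MaxPooling2D(2))
        model.add(layers.Dropout(0.25))
    model.add(layers.Flatten())                           # block 4: classifier head
    model.add(layers.Dense(512, activation="relu"))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(256, activation="relu"))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(num_classes, activation="softmax"))
    return model
```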

Pre-trained Models with Transfer Learning

ResNet50

  • Architecture with residual connections
  • Effective gradient handling in deep networks
  • Transfer of low and high-level features
  • Adaptation of final classification layer
  • Strategy: Partial fine-tuning (last 20-30 layers unfrozen)
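A sketch of the partial fine-tuning setup in Keras: the classification head is replaced and all but the last N backbone layers are frozen. In practice `weights="imagenet"` would load the pre-trained features; `weights=None` is used here only to keep the sketch self-contained (no download), and grayscale inputs would be replicated to three channels first:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

def build_resnet50(num_classes=25, unfreeze_last=30):
    """ResNet50 backbone with a new softmax head; only the last
    `unfreeze_last` backbone layers remain trainable."""
    base = ResNet50(weights=None, include_top=False,
                    input_shape=(224, 224, 3), pooling="avg")
    for layer in base.layers[:-unfreeze_last]:
        layer.trainable = False                       # freeze early layers
    head = layers.Dense(num_classes, activation="softmax")(base.output)
    return models.Model(base.input, head)
```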

Vision Transformer (ViT-Small)

  • Patch size: 16×16
  • Embedding dimension: 384
  • Depth: 12 transformer blocks
  • Attention heads: 6
  • MLP ratio: 4.0
  • Dropout: 0.1

Training Configuration

Hyperparameters

Loss Function: Categorical Cross-Entropy (for multi-class classification)
Optimizer: Adam (Adaptive Moment Estimation)
  • Initial learning rate: 0.001 (CNN) / 0.0001 (ResNet50, ViT)
  • β₁ = 0.9, β₂ = 0.999
  • ε = 10⁻⁷
Batch size: 32
Epochs: Up to 100 with early stopping

Regularization

  • Dropout: 0.25 (convolutional layers), 0.5 (dense layers)
  • L2 regularization: λ = 0.0001

Optimization Strategies

Learning Rate Scheduling:
  • ReduceLROnPlateau: Reduces learning rate when validation loss stagnates
  • Reduction factor: 0.5
  • Patience: 5 epochs
Early Stopping:
  • Monitoring validation loss
  • Patience: 10-15 epochs
  • Restoration of best weights
Model Checkpointing:
  • Saving model with best validation performance
  • Monitoring metric: Accuracy or F1-score
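The three strategies above correspond to standard Keras callbacks; this sketch uses a patience of 12 (within the 10-15 range stated above) and a hypothetical checkpoint filename:

```python
from tensorflow.keras.callbacks import (EarlyStopping, ModelCheckpoint,
                                        ReduceLROnPlateau)

callbacks = [
    # halve the learning rate after 5 stagnant epochs on validation loss
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=5),
    # stop after 12 stagnant epochs and roll back to the best weights
    EarlyStopping(monitor="val_loss", patience=12, restore_best_weights=True),
    # keep only the model with the best validation accuracy
    ModelCheckpoint("best_model.keras", monitor="val_accuracy",
                    save_best_only=True),
]
```

The list is passed to `model.fit(..., callbacks=callbacks)`.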

Class Imbalance Handling

To address imbalance in class distribution:
  • Class Weighting: Assignment of weights inversely proportional to class frequency
  • Stratified Sampling: Stratified sampling in data splitting
  • Focal Loss: (optional) Loss function emphasizing hard-to-classify examples
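Inverse-frequency class weights can be computed in a few lines; the normalization below (the same formula as scikit-learn's `"balanced"` mode) makes a perfectly balanced dataset yield a weight of 1.0 for every class:

```python
import numpy as np

def inverse_frequency_weights(y):
    """Class weights inversely proportional to class frequency."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)   # n_samples / (n_classes * count_c)
    return dict(zip(classes.tolist(), weights.tolist()))
```

The resulting dictionary can be passed as `class_weight` to Keras's `model.fit`.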

Experimental Environment

Hardware and Software

Hardware:
  • GPU: NVIDIA T4 (16GB VRAM)
  • Training optimized for GPU computation with CUDA
Software:
  • Python: 3.8+
  • Deep Learning Framework: TensorFlow 2.x / PyTorch
  • Additional libraries:
    • NumPy, Pandas (data manipulation)
    • Matplotlib, Seaborn (visualization)
    • Scikit-learn (metrics and preprocessing)
    • OpenCV / Pillow (image processing)

Implementation Structure

project/
├── data/          # Download and preprocessing scripts
├── models/        # CNN architecture definitions
├── training/      # Training scripts and callbacks
├── evaluation/    # Model evaluation and metrics generation
├── notebooks/     # Jupyter notebooks for exploratory analysis
└── utils/         # Auxiliary functions (visualization, logging)

Evaluation Metrics

Primary Metrics

  • Accuracy: (TP + TN) / (TP + TN + FP + FN)
  • Precision: TP / (TP + FP)
  • Recall: TP / (TP + FN)
  • F1-Score: 2 × (Precision × Recall) / (Precision + Recall)
Where TP = True Positives, TN = True Negatives, FP = False Positives, FN = False Negatives.

Multi-Class Metrics

For multi-class classification, averages are calculated:
  • Macro-Average: Simple average across all classes (equal weight per class)
  • Weighted-Average: Weighted average by number of samples per class
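The difference between the two averages is easy to see with scikit-learn on a small imbalanced toy example (the labels below are illustrative, not dataset values):

```python
from sklearn.metrics import f1_score

# toy predictions for 3 classes with supports 4, 2 and 4
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 2, 2, 0]

macro = f1_score(y_true, y_pred, average="macro")        # equal weight per class
weighted = f1_score(y_true, y_pred, average="weighted")  # weighted by class support
```

The two values differ whenever per-class F1 scores vary and the classes are imbalanced, which is why both are reported.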

Additional Analysis

  • Confusion Matrix: Visualizes misclassification patterns and confusion between similar families
  • ROC Curves and AUC: Calculated for multi-class using One-vs-Rest strategy
  • Learning Curves: Track training and validation metrics across epochs

Ethical and Security Considerations

This project handles real malware samples, requiring precautions:
  • Isolated Environment: All experiments conducted in isolated virtual machines without network access
  • Public Datasets: Exclusively using public datasets for research purposes
  • No Execution: Analysis is completely static, without executing malicious binaries
  • Secure Storage: Datasets stored in encrypted partitions
  • Responsible Use: Trained models employed exclusively for educational and research purposes

Methodology Summary

This methodology encompasses:
  • Selection and description of public malware datasets
  • Pipeline for preprocessing and converting binaries to images
  • Design of custom CNN architectures and use of transfer learning
  • Hyperparameter configuration and training strategies
  • Evaluation metrics for multi-class classification
  • Ethical and security considerations
The following sections present experimental results obtained by applying this methodology.
