Process Overview
The methodological process follows a structured pipeline:
- Dataset Acquisition: Download and organization of public malware datasets
- Preprocessing: Conversion of executables to images and normalization
- Exploratory Analysis: Visualization and statistics of malware families
- Architecture Design: Implementation of CNN models
- Training: Hyperparameter configuration and optimization
- Evaluation: Performance measurement using standard metrics
- Comparative Analysis: Comparison between different models
Dataset: MalImg
Description
Source: Available on Kaggle (manaswinisunkari/malimg-dataset90)
Composition: Approximately 9,339 malware samples
Families: 25 different Windows malware families
Format: Grayscale images derived from binary executables
Malware Family Distribution
The dataset contains diverse malware families including:
- Trojans: Alureon, VB, Agent
- Worms: Kelihos, Autorun, Worm
- Backdoors: Rbot, Bifrose
- Ransomware: Locker, Cryptolocker
- Adware/Spyware: Winwebsec, FakeRean
The class distribution is notably imbalanced, with 5 families having fewer than 100 samples each. This is addressed in the training strategy through balancing and weighting techniques.
Data Preprocessing
Executable to Image Conversion
The binary-to-image conversion process follows these steps:
- Binary Reading: The executable file is read as a sequence of bytes (values 0-255)
- Pixel Mapping: Each byte is interpreted as pixel intensity in grayscale
- Dimension Determination:
  - Calculate total file size in bytes
  - Determine optimal image width (typically a power of 2)
  - Height calculated as: height = ⌈size_bytes / width⌉
- Matrix Construction: Bytes are organized into a 2D matrix with the calculated dimensions
- Padding: If necessary, padding bytes are added to complete the last row
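The steps above can be sketched in NumPy; the function name `binary_to_image` and the default width of 256 are illustrative choices, not part of the original pipeline:

```python
import math
import numpy as np

def binary_to_image(data: bytes, width: int = 256) -> np.ndarray:
    """Map raw bytes to a 2D grayscale image: one byte becomes one pixel."""
    arr = np.frombuffer(data, dtype=np.uint8)
    height = math.ceil(len(arr) / width)        # height = ceil(size_bytes / width)
    padded = np.zeros(height * width, dtype=np.uint8)
    padded[:len(arr)] = arr                     # zero-pad to complete the last row
    return padded.reshape(height, width)
```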
Normalization and Resizing
To ensure training uniformity:
- Resizing: All images resized to a fixed size (224×224 or 256×256) via bilinear interpolation
- Pixel Normalization: Pixel values scaled to range [0, 1] by dividing by 255
- Standardization: Optionally, per-channel normalization using mean and standard deviation of training dataset
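A minimal sketch of the resizing and [0, 1] scaling steps, assuming Pillow and NumPy (the `preprocess` helper and the 224×224 default are illustrative):

```python
import numpy as np
from PIL import Image

def preprocess(img: np.ndarray, size: int = 224) -> np.ndarray:
    """Resize with bilinear interpolation and scale pixel values to [0, 1]."""
    resized = Image.fromarray(img).resize((size, size), Image.BILINEAR)
    return np.asarray(resized, dtype=np.float32) / 255.0
```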
Data Augmentation
For improving generalization and mitigating overfitting, data augmentation techniques are applied during training. However, unlike natural images, malware visualizations require special considerations since each pixel directly represents a byte of the original executable.
Lossless Transformations
Arbitrary rotations (e.g., ±15°) require pixel interpolation, which modifies the original values and destroys the byte-to-pixel correspondence. Therefore, only orthogonal rotations (90°, 180°, 270°) are employed, as these are pure permutations of pixel positions with no value alteration. This decision is based on two critical aspects:
- Semantic Preservation: Orthogonal rotations keep the structural information of the binary intact
- Evasion Robustness: Recent research demonstrates that malware CNN classifiers are vulnerable to adversarial attacks based on image transformations. Training with orthogonal rotations improves resistance to these evasion techniques
Applied Augmentation Techniques
- Orthogonal rotations: 90°, 180°, 270° (random selection)
- Flips: Horizontal and vertical
- Brightness adjustments: ±10-20% (preserves relative relationships between pixels)
- Contrast adjustments: ±10-20%
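The lossless subset of these augmentations can be sketched with NumPy, since `np.rot90` and `np.flip` only permute pixel positions; brightness and contrast adjustments change pixel values and would be applied separately (the `lossless_augment` helper is illustrative):

```python
import random
import numpy as np

def lossless_augment(img: np.ndarray) -> np.ndarray:
    """Random lossless augmentation: orthogonal rotation and/or flip.

    Both operations only permute pixel positions, so every byte value of the
    original executable is preserved exactly.
    """
    img = np.rot90(img, k=random.choice([0, 1, 2, 3]))  # 0°, 90°, 180°, 270°
    if random.random() < 0.5:
        img = np.flip(img, axis=random.choice([0, 1]))  # vertical or horizontal
    return img
```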
Data Split
Datasets are divided following a stratified partitioning strategy to maintain class proportions:
- Training Set: 70% of samples (6,537) - used for weight optimization
- Validation Set: 15% of samples (1,401) - used for hyperparameter tuning and overfitting detection
- Test Set: 15% of samples (1,401) - used exclusively for final evaluation
Test samples are never seen during training or validation, ensuring unbiased performance evaluation.
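A sketch of the 70/15/15 stratified split using scikit-learn's `train_test_split` (the `stratified_split` helper and the fixed seed are illustrative):

```python
from sklearn.model_selection import train_test_split

def stratified_split(X, y, seed=42):
    """70/15/15 stratified split that preserves class proportions per subset."""
    X_train, X_tmp, y_train, y_tmp = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=seed)          # 70% train
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=seed)  # 15/15
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```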
Neural Network Architectures
Custom CNN Architecture
A baseline CNN architecture adapted to malware image characteristics:
Layer Structure:
- Convolutional Block 1:
  - Conv2D: 32 filters, 3×3 kernel, ReLU activation
  - Conv2D: 32 filters, 3×3 kernel, ReLU activation
  - MaxPooling2D: 2×2 pool size
  - Dropout: 25%
- Convolutional Block 2:
  - Conv2D: 64 filters, 3×3 kernel, ReLU activation
  - Conv2D: 64 filters, 3×3 kernel, ReLU activation
  - MaxPooling2D: 2×2 pool size
  - Dropout: 25%
- Convolutional Block 3:
  - Conv2D: 128 filters, 3×3 kernel, ReLU activation
  - Conv2D: 128 filters, 3×3 kernel, ReLU activation
  - MaxPooling2D: 2×2 pool size
  - Dropout: 25%
- Fully Connected Layers:
  - Flatten
  - Dense: 512 units, ReLU activation
  - Dropout: 50%
  - Dense: 256 units, ReLU activation
  - Dropout: 50%
  - Dense: N units (number of classes), Softmax activation
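The layer structure above can be sketched in Keras, assuming TensorFlow 2.x (the helper name and the 224×224 grayscale input default are illustrative assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_custom_cnn(input_shape=(224, 224, 1), num_classes=25):
    """Baseline CNN mirroring the layer structure listed above."""
    model = models.Sequential([tf.keras.Input(shape=input_shape)])
    for filters in (32, 64, 128):                       # three convolutional blocks
        model.add(layers.Conv2D(filters, 3, activation="relu"))
        model.add(layers.Conv2D(filters, 3, activation="relu"))
        model.add(layers.MaxPooling2D(2))
        model.add(layers.Dropout(0.25))
    model.add(layers.Flatten())
    model.add(layers.Dense(512, activation="relu"))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(256, activation="relu"))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(num_classes, activation="softmax"))
    return model
```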
Pre-trained Models with Transfer Learning
ResNet50
- Architecture with residual connections
- Effective gradient handling in deep networks
- Transfer of low and high-level features
- Adaptation of final classification layer
- Strategy: Partial fine-tuning (20-30 last layers unfrozen)
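A Keras sketch of the partial fine-tuning strategy; the helper name, the choice of 25 unfrozen layers within the stated 20-30 range, and `weights=None` (download-free) are assumptions. In practice `weights="imagenet"` would be used, with grayscale images replicated to three channels:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_resnet50_transfer(num_classes=25, unfrozen_layers=25):
    """ResNet50 with partial fine-tuning: only the last layers stay trainable."""
    base = tf.keras.applications.ResNet50(
        include_top=False, weights=None,      # weights="imagenet" in practice
        input_shape=(224, 224, 3), pooling="avg")
    for layer in base.layers[:-unfrozen_layers]:
        layer.trainable = False               # freeze early feature extractors
    outputs = layers.Dense(num_classes, activation="softmax")(base.output)
    return models.Model(base.input, outputs)
```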
Vision Transformer (ViT-Small)
- Patch size: 16×16
- Embedding dimension: 384
- Depth: 12 transformer blocks
- Attention heads: 6
- MLP ratio: 4.0
- Dropout: 0.1
Training Configuration
Hyperparameters
Loss Function: Categorical Cross-Entropy (for multi-class classification)
Optimizer: Adam (Adaptive Moment Estimation)
- Initial learning rate: 0.001 (CNN) / 0.0001 (ResNet50, ViT)
- β₁ = 0.9, β₂ = 0.999
- ε = 10⁻⁷
Epochs: Up to 100 with early stopping
Regularization
- Dropout: 0.25 (convolutional layers), 0.5 (dense layers)
- L2 regularization: λ = 0.0001
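These hyperparameters map directly onto a Keras `Adam` instance; the L2 penalty is attached per layer via `kernel_regularizer` (a sketch, assuming TensorFlow 2.x):

```python
import tensorflow as tf

# Adam configured with the hyperparameters listed above; the L2 penalty
# (lambda = 1e-4) is attached per layer when the model is built, e.g.
# layers.Dense(512, kernel_regularizer=l2_penalty).
optimizer = tf.keras.optimizers.Adam(
    learning_rate=1e-3,                  # 1e-4 for ResNet50 / ViT
    beta_1=0.9, beta_2=0.999, epsilon=1e-7)
l2_penalty = tf.keras.regularizers.l2(1e-4)
```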
Optimization Strategies
Learning Rate Scheduling:
- ReduceLROnPlateau: Reduces learning rate when validation loss stagnates
- Reduction factor: 0.5
- Patience: 5 epochs
Early Stopping:
- Monitoring validation loss
- Patience: 10-15 epochs
- Restoration of best weights
Model Checkpointing:
- Saving model with best validation performance
- Monitoring metric: Accuracy or F1-score
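These strategies correspond to standard Keras callbacks (a sketch; the checkpoint filename and the choice of 15 within the 10-15 patience range are illustrative):

```python
import tensorflow as tf

# Keras callbacks implementing the scheduling, early-stopping, and
# checkpointing strategies; "best_model.keras" is an illustrative path.
callbacks = [
    tf.keras.callbacks.ReduceLROnPlateau(
        monitor="val_loss", factor=0.5, patience=5),
    tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=15, restore_best_weights=True),
    tf.keras.callbacks.ModelCheckpoint(
        "best_model.keras", monitor="val_accuracy", save_best_only=True),
]
```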
Class Imbalance Handling
To address imbalance in class distribution:
- Class Weighting: Assignment of weights inversely proportional to class frequency
- Stratified Sampling: Class proportions preserved when splitting the data
- Focal Loss (optional): Loss function emphasizing hard-to-classify examples
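Weights inversely proportional to class frequency can be computed with scikit-learn's "balanced" heuristic (the `class_weights` helper is illustrative):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

def class_weights(y) -> dict:
    """Per-class weights inversely proportional to class frequency."""
    classes = np.unique(y)
    weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
    return dict(zip(classes.tolist(), weights.tolist()))
```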
Experimental Environment
Hardware and Software
Hardware:
- GPU: NVIDIA T4 (16GB VRAM)
- Training optimized for GPU computation with CUDA
Software:
- Python: 3.8+
- Deep Learning Framework: TensorFlow 2.x / PyTorch
- Additional libraries:
  - NumPy, Pandas (data manipulation)
  - Matplotlib, Seaborn (visualization)
  - Scikit-learn (metrics and preprocessing)
  - OpenCV / Pillow (image processing)
Implementation Structure
Evaluation Metrics
Primary Metrics
Accuracy: (TP + TN) / (TP + TN + FP + FN)
Precision: TP / (TP + FP)
Recall: TP / (TP + FN)
F1-Score: 2 × (Precision × Recall) / (Precision + Recall)
Where TP = True Positives, TN = True Negatives, FP = False Positives, FN = False Negatives.
Multi-Class Metrics
For multi-class classification, averages are calculated:
- Macro-Average: Simple average across all classes (equal weight per class)
- Weighted-Average: Weighted average by number of samples per class
Additional Analysis
- Confusion Matrix: Visualizes misclassification patterns and confusion between similar families
- ROC Curves and AUC: Calculated for multi-class using One-vs-Rest strategy
- Learning Curves: Track training and validation metrics across epochs
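A sketch of the multi-class evaluation using scikit-learn (the `evaluate` helper is illustrative; ROC/AUC via One-vs-Rest would additionally use `roc_auc_score` with per-class probability scores):

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

def evaluate(y_true, y_pred) -> dict:
    """Macro and weighted multi-class metrics plus the confusion matrix."""
    return {
        "precision_macro": precision_score(y_true, y_pred, average="macro"),
        "recall_macro": recall_score(y_true, y_pred, average="macro"),
        "f1_macro": f1_score(y_true, y_pred, average="macro"),
        "f1_weighted": f1_score(y_true, y_pred, average="weighted"),
        "confusion": confusion_matrix(y_true, y_pred),
    }
```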
Ethical and Security Considerations
This project handles real malware samples, requiring precautions:
- Isolated Environment: All experiments conducted in isolated virtual machines without network access
- Public Datasets: Exclusively using public datasets for research purposes
- No Execution: Analysis is completely static, without executing malicious binaries
- Secure Storage: Datasets stored in encrypted partitions
- Responsible Use: Trained models employed exclusively for educational and research purposes
Methodology Summary
This methodology encompasses:
- Selection and description of public malware datasets
- Pipeline for preprocessing and converting binaries to images
- Design of custom CNN architectures and use of transfer learning
- Hyperparameter configuration and training strategies
- Evaluation metrics for multi-class classification
- Ethical and security considerations