
UC Intel Final - Malware Classification Platform

An advanced ensemble machine learning platform for classifying malware using deep learning techniques with PyTorch. This project provides a professional, multi-page Streamlit dashboard for building, training, and evaluating malware classification models.

What is Malware Image Classification?

Malware binaries can be visualized as grayscale images: each byte (a value from 0 to 255) is mapped to the intensity of one pixel. This visual representation lets deep learning models detect patterns and classify malware families by their structural characteristics.
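The byte-to-pixel mapping can be sketched in a few lines. This is a minimal illustration, not the platform's own loader: the function name `bytes_to_grayscale`, the fixed row width, and the zero-padding of the last row are all assumptions made for the example.

```python
import math


def bytes_to_grayscale(data: bytes, width: int = 256) -> list[list[int]]:
    """Arrange raw bytes into rows of `width` pixels (0 = black, 255 = white)."""
    if width <= 0:
        raise ValueError("width must be positive")
    height = math.ceil(len(data) / width)
    # Zero-pad the tail so the last row is complete (an assumption; cropping also works).
    padded = data.ljust(height * width, b"\x00")
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]


sample = bytes(range(10))  # stand-in for open(path, "rb").read()
image = bytes_to_grayscale(sample, width=4)
# image is a 3x4 grid; the last row ends with two padding zeros
```

The nested list can then be handed to an imaging or array library to save it as a picture or feed it to a model.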

Platform Overview

The UC Intel Final platform provides a complete end-to-end solution for:

Dataset Management

Configure train/validation/test splits, apply preprocessing, and set up data augmentation pipelines

Model Builder

Design custom CNNs, use transfer learning with pre-trained models, or build transformer architectures

Training Pipeline

Train with customizable hyperparameters, live monitoring, and automatic checkpointing

Model Interpretability

Visualize model decisions with Grad-CAM, analyze misclassifications, and explore embeddings
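For readers unfamiliar with Grad-CAM, its core step weights each feature map by the spatial mean of the class-score gradient, sums the weighted maps, and keeps only the positive part. The sketch below shows that arithmetic on plain nested lists. The function name and data layout are illustrative assumptions; in a real model the activations and gradients would typically be captured from the last convolutional layer.

```python
def grad_cam(activations, gradients):
    """Combine feature maps into a heatmap.

    activations, gradients: [channel][row][col] nested lists of equal shape.
    """
    channels = len(activations)
    rows, cols = len(activations[0]), len(activations[0][0])
    # Channel weight = spatial mean of the class-score gradient for that channel.
    weights = [sum(map(sum, g)) / (rows * cols) for g in gradients]
    cam = [[0.0] * cols for _ in range(rows)]
    for k in range(channels):
        for i in range(rows):
            for j in range(cols):
                cam[i][j] += weights[k] * activations[k][i][j]
    # ReLU keeps only regions that push the class score up.
    cam = [[max(0.0, v) for v in row] for row in cam]
    peak = max(max(row) for row in cam)
    # Normalise to [0, 1] for display; an all-zero map is left unchanged.
    return [[v / peak for v in row] for row in cam] if peak > 0 else cam


acts = [[[1, 0], [0, 0]], [[0, 0], [0, 1]]]
grads = [[[1, 1], [1, 1]], [[-1, -1], [-1, -1]]]
heatmap = grad_cam(acts, grads)
```

Here the first channel has a positive mean gradient and the second a negative one, so only the top-left location survives the ReLU.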

Key Features

Professional Streamlit Dashboard

  • Multi-page architecture with self-contained modules
  • Theme customization with color presets and CSS injection
  • Session management for saving and resuming work
  • Real-time training monitoring with live metrics updates

Flexible Model Architectures

# Build custom architectures layer by layer
from models.pytorch.cnn_builder import CustomCNNBuilder

config = {
    "cnn_config": {
        "layers": [
            {"type": "Conv2D", "filters": 64, "kernel_size": 3, "activation": "relu"},
            {"type": "MaxPool", "pool_size": 2},
            {"type": "Conv2D", "filters": 128, "kernel_size": 3, "activation": "relu"},
            {"type": "Flatten"},
            {"type": "Dense", "units": 256, "activation": "relu"},
        ]
    },
    "num_classes": 9
}

builder = CustomCNNBuilder(config)
model = builder.build()

Comprehensive Training Engine

The training pipeline includes:
  • Multiple optimizers: Adam, AdamW, SGD with Momentum, RMSprop
  • Learning rate schedulers: ReduceLROnPlateau, Cosine Annealing, Step Decay, Exponential
  • Class imbalance handling: Auto class weights, Focal Loss
  • Early stopping with configurable patience
  • Automatic checkpointing for best models
  • Real-time metrics: Loss, accuracy, precision, recall, F1-score
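To make the class imbalance options concrete, here is a framework-agnostic sketch of the two ideas. It assumes that "auto class weights" means inverse-frequency weighting (the same heuristic as scikit-learn's "balanced" mode) and uses the standard focal loss formula; the function names are hypothetical, and the platform's own implementation may differ.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def focal_loss(p_true, gamma=2.0, weight=1.0):
    """Focal loss for one sample, given the predicted probability of its true class.

    With gamma = 0 this reduces to weighted cross-entropy.
    """
    return -weight * (1.0 - p_true) ** gamma * math.log(p_true)


labels = [0] * 8 + [1] * 2          # 80/20 imbalance
weights = balanced_class_weights(labels)
easy = focal_loss(0.9, weight=weights[0])   # confident, majority class
hard = focal_loss(0.3, weight=weights[1])   # uncertain, minority class
```

The rare class receives the larger weight, and the `(1 - p_true) ** gamma` factor shrinks the contribution of samples the model already classifies confidently.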

Advanced Data Augmentation

The platform provides three built-in augmentation presets:

Light Augmentation
  • Rotation: ±10°
  • Horizontal flip: 50%
  • Brightness: ±10%

Moderate Augmentation
  • Rotation: ±20°
  • Horizontal flip: 50%
  • Vertical flip: 30%
  • Brightness: ±20%
  • Contrast: ±20%

Heavy Augmentation
  • Rotation: ±30°
  • Horizontal & vertical flip: 50%
  • Brightness: ±30%
  • Contrast: ±30%
  • Gaussian noise: 5%
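One way to read these presets is as a table of probabilities and ranges that drives random transforms. The sketch below encodes the values listed above and applies only the flips and the brightness jitter to a nested-list image; rotation, contrast, and noise are stored but not applied, to keep the example short. The key names, the `augment` function, and the reading of "±10%" as a multiplicative factor drawn from [0.9, 1.1] are assumptions.

```python
import random

# Preset values taken from the list above; the key names are hypothetical.
PRESETS = {
    "light": {"rotation": 10, "hflip_p": 0.5, "vflip_p": 0.0, "brightness": 0.10},
    "moderate": {"rotation": 20, "hflip_p": 0.5, "vflip_p": 0.3,
                 "brightness": 0.20, "contrast": 0.20},
    "heavy": {"rotation": 30, "hflip_p": 0.5, "vflip_p": 0.5,
              "brightness": 0.30, "contrast": 0.30, "noise": 0.05},
}


def augment(image, preset, rng):
    """Apply random flips and brightness jitter to a grayscale image (list of rows)."""
    cfg = PRESETS[preset]
    out = [row[:] for row in image]           # copy so the input stays untouched
    if rng.random() < cfg["hflip_p"]:         # mirror left-right
        out = [row[::-1] for row in out]
    if rng.random() < cfg["vflip_p"]:         # mirror top-bottom
        out = out[::-1]
    # Scale all pixels by a factor in [1 - b, 1 + b], then clamp to 0..255.
    factor = 1.0 + rng.uniform(-cfg["brightness"], cfg["brightness"])
    return [[min(255, max(0, round(px * factor))) for px in row] for row in out]


rng = random.Random(0)                        # seeded for repeatable results
augmented = augment([[10, 20], [200, 210]], "light", rng)
```

In a PyTorch pipeline the same table would more likely be translated into torchvision transforms than applied by hand.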

Who is This For?

Researchers & Students

Ideal for academic projects and experiments in:
  • Deep learning for cybersecurity
  • Malware analysis and classification
  • Computer vision applications
  • Model interpretability research

ML Engineers

Provides a production-ready framework for:
  • Rapid prototyping of CNN architectures
  • Transfer learning experimentation
  • Hyperparameter tuning and optimization
  • Model performance benchmarking

Security Analysts

Enables security teams to:
  • Build custom malware classifiers
  • Analyze model predictions with Grad-CAM
  • Identify misclassification patterns
  • Evaluate model robustness

Architecture Principles

1. Self-Contained Pages

Each page in the content/ directory is fully self-contained, with its own folder structure.

2. State Management

All session state access goes through the abstraction layers in the state/ module; pages never touch st.session_state directly.

3. Tab-Based Organization

Complex pages split their content into multiple tab files for better code organization.

4. Flat Components

Shared UI components stay in a flat components/ directory, not nested.
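The state management principle can be illustrated with a small accessor class. This is a hypothetical sketch, not the contents of state/workflow.py: the class name, method names, and key prefix are invented for the example, and a plain dict stands in for the dict-like st.session_state so the code runs without Streamlit.

```python
from typing import Any, MutableMapping


class WorkflowState:
    """Named accessors over a session-state mapping (hypothetical API)."""

    _PREFIX = "workflow."

    def __init__(self, store: MutableMapping[str, Any]):
        # In the Streamlit app, `store` would be st.session_state.
        self._store = store

    def get_model_config(self) -> dict:
        # Return an empty dict when nothing has been saved yet.
        return self._store.get(self._PREFIX + "model_config", {})

    def set_model_config(self, config: dict) -> None:
        self._store[self._PREFIX + "model_config"] = config

    def is_model_configured(self) -> bool:
        return bool(self.get_model_config())


store: dict = {}                      # stand-in for st.session_state
state = WorkflowState(store)
state.set_model_config({"num_classes": 9})
```

Routing every read and write through accessors like these keeps key names in one place, so pages cannot drift apart on spelling or defaults.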

Project Structure

app/
├── main.py                      # Entry point + navigation
├── content/                     # Self-contained page modules
│   ├── home/                   # Home & session setup
│   ├── dataset/                # Dataset configuration (4 tabs)
│   ├── model/                  # Model architecture builder
│   ├── training/               # Training configuration
│   ├── monitor/                # Live training monitor
│   ├── results/                # Results & evaluation
│   └── interpret/              # Model interpretability
├── components/                  # Shared UI components
│   ├── header.py               # App header with session info
│   ├── sidebar.py              # Configuration status
│   ├── theme.py                # Theme customization
│   └── utils.py                # GPU detection, system info
├── state/                       # Session state management
│   ├── workflow.py             # ML workflow state
│   ├── ui.py                   # UI preferences
│   └── cache.py                # Cached data
├── models/                      # Model builders
│   └── pytorch/
│       ├── cnn_builder.py      # Custom CNN builder
│       ├── transfer.py         # Transfer learning
│       └── transformer.py      # Transformer models
├── training/                    # Training infrastructure
│   ├── engine.py               # Core training loop
│   ├── dataset.py              # PyTorch datasets
│   ├── transforms.py           # Data augmentation
│   └── optimizers.py           # Optimizer configuration
└── utils/                       # Utility functions
    ├── dataset_utils.py        # Dataset scanning
    └── dataset_viz.py          # Visualization helpers

Technology Stack

PyTorch

Deep learning framework for building and training neural networks

Streamlit

Interactive dashboard for the complete ML workflow

torchvision

Pre-trained models and image transformations

scikit-learn

Metrics calculation and evaluation tools

Plotly

Interactive visualizations and charts

UMAP

Dimensionality reduction for embedding visualization

Next Steps

Quick Start

Get up and running in 5 minutes

Installation

Detailed installation instructions

This platform was developed as part of the Sistemas Inteligentes II course at Universidad de Caldas, taught by Professor Jorge Alberto Jaramillo Garzón.
