Overview
The UC Intel Final platform supports three main categories of models for malware classification:

- Custom CNN: Build convolutional neural networks from scratch
- Transfer Learning: Leverage pre-trained models (ResNet, EfficientNet, Vision Transformer)
- Transformer: Custom vision transformer architectures
Model Building System
All models are built using a builder pattern that converts configuration dictionaries into PyTorch modules. Source: app/training/worker.py:29-42
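The actual builder lives in worker.py; as a rough illustration of the pattern (the function name and config keys below are assumptions, not the platform's real schema), a configuration dictionary can be turned into a module like this:

```python
import torch
import torch.nn as nn

def build_custom_cnn(config: dict) -> nn.Module:
    """Sketch of a config-dict-to-module builder. Keys (conv_filters,
    num_classes, in_channels) are illustrative only."""
    layers = []
    in_ch = config.get("in_channels", 3)
    for out_ch in config["conv_filters"]:
        layers += [
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        ]
        in_ch = out_ch
    layers += [
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(in_ch, config["num_classes"]),
    ]
    return nn.Sequential(*layers)

model = build_custom_cnn({"conv_filters": [32, 64], "num_classes": 10})
```

The builder returns a plain `nn.Module`, so the rest of the training loop does not need to know which category of model the configuration described.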
Custom CNN
When to Use
Use Custom CNN when:
- You have a small dataset (<1000 images per class)
- You want full control over architecture
- You need a lightweight model for deployment
- You want to experiment with novel architectures
- Transfer learning is overkill for your problem
Architecture Components
Custom CNNs are built with configurable convolutional blocks.

Design Guidelines
Increase Filters Gradually
Use a progression such as 32 → 64 → 128 → 256, so that each successive block extracts higher-level features.
Example Architectures
Lightweight CNN (Small Dataset)
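A configuration along these lines (key names are illustrative, not the platform's actual schema) might describe the lightweight variant:

```python
# Hypothetical lightweight CNN config: two conv blocks, small input,
# moderate dropout. Suitable for <1K images per class.
lightweight_cnn = {
    "model_type": "custom_cnn",
    "conv_filters": [32, 64],
    "dense_units": [128],
    "dropout": 0.3,
    "input_size": 128,   # smaller inputs train faster
    "num_classes": 10,
}
```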
Medium CNN (Moderate Dataset)
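For moderate datasets, a sketch with one more conv block and a larger dense layer (again, key names are assumed for illustration):

```python
# Hypothetical medium CNN config: three conv blocks following the
# 32 -> 64 -> 128 filter progression recommended above.
medium_cnn = {
    "model_type": "custom_cnn",
    "conv_filters": [32, 64, 128],
    "dense_units": [256],
    "dropout": 0.4,
    "input_size": 224,
    "num_classes": 10,
}
```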
Deep CNN (Large Dataset)
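And for large datasets, a deeper sketch that completes the filter progression (key names remain illustrative):

```python
# Hypothetical deep CNN config: four conv blocks (32 -> 64 -> 128 -> 256)
# with heavier dropout to offset the larger capacity.
deep_cnn = {
    "model_type": "custom_cnn",
    "conv_filters": [32, 64, 128, 256],
    "dense_units": [512, 256],
    "dropout": 0.5,
    "input_size": 224,
    "num_classes": 10,
}
```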
Transfer Learning
When to Use
Use Transfer Learning when:
- You have limited training data (pre-trained features help at any dataset size)
- You want state-of-the-art performance
- You need faster convergence
- You have access to GPU resources
- You want to leverage features learned from ImageNet
Available Architectures
The platform supports multiple pre-trained backbones:

| Architecture | Parameters | Speed | Accuracy | Best For |
|---|---|---|---|---|
| ResNet50 | 25M | Fast | Good | Balanced performance, general use |
| ResNet101 | 44M | Medium | Better | When you need higher accuracy |
| EfficientNet-B0 | 5M | Fast | Good | Limited GPU memory |
| EfficientNet-B3 | 12M | Medium | Better | Balanced efficiency/accuracy |
| EfficientNet-B7 | 66M | Slow | Best | Maximum accuracy, large GPU |
| Vision Transformer (ViT) | 86M | Slow | Best | Large datasets, cutting-edge |
Configuration
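An illustrative transfer-learning configuration (key names are assumed; check the actual schema in app/training/worker.py):

```python
# Hypothetical transfer-learning config.
transfer_config = {
    "model_type": "transfer",
    "backbone": "resnet50",     # or an EfficientNet / ViT variant
    "pretrained": True,         # start from ImageNet weights
    "freeze_backbone": True,    # feature extraction; False to fine-tune
    "num_classes": 10,
}
```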
Fine-Tuning Strategies
Feature Extraction (Recommended Start)
Set freeze_backbone=True and train only the final classification layer.

Pros: Fast training, works with small datasets, prevents overfitting

Use when: Dataset < 1000 images per class

Partial Fine-Tuning

Freeze early layers and unfreeze later layers, allowing the backbone to adapt to your data.

Pros: Better accuracy, moderate training time

Use when: Dataset 1000-5000 images per class
Architecture Selection Guide
ResNet (Recommended for Most Use Cases)
When to use:
- General-purpose malware classification
- Good balance of speed and accuracy
- Mature, well-tested architecture
- Standard datasets with moderate complexity
- Limited GPU memory (< 8GB): prefer ResNet50
- Complex datasets with many classes: prefer ResNet101
- When accuracy is a priority over speed: prefer ResNet101
EfficientNet (Best Efficiency)
When to use:
- Limited GPU resources
- Need fast inference for deployment
- Want best accuracy-to-parameters ratio
- Deployment on edge devices or very limited GPU memory (< 4GB): prefer EfficientNet-B0
- Maximum accuracy from an efficient architecture, with moderate to large GPU memory available: prefer EfficientNet-B7
Vision Transformer (State-of-the-Art)
When to use:
- Large datasets (5000+ images per class)
- Maximum possible accuracy is required
- You have powerful GPU (8GB+ VRAM)
- Training time is not a constraint
Considerations:
- ViT requires more data than CNNs to perform well
- Training is significantly slower than ResNet/EfficientNet
- Pair with strong data augmentation
Transformer Models
Custom Vision Transformer
Build custom vision transformer architectures with configurable attention mechanisms.

Model Initialization
During training, the model is initialized and moved to the appropriate device. Source: app/training/worker.py:84-103
The platform automatically detects available hardware:
- CUDA: NVIDIA GPUs (preferred)
- MPS: Apple Silicon GPUs (M1/M2/M3)
- CPU: Fallback for systems without GPU
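The detection order above corresponds to standard PyTorch device-selection logic, roughly like the following (a sketch of the behavior, not a verbatim copy of worker.py):

```python
import torch

# Pick the best available device: CUDA first, then Apple MPS, then CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

# The model is then moved onto the selected device before training,
# e.g. model = model.to(device).
```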
Model Comparison
Decision Tree
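The decision flow can be summarized as a small helper; the thresholds come from the guidance and performance table on this page, but the function itself is illustrative, not part of the platform:

```python
def recommend_model(images_per_class: int, gpu_vram_gb: float) -> str:
    """Rough model-selection heuristic based on this page's guidance."""
    if images_per_class < 1000:
        # Small datasets: frozen transfer learning if memory allows,
        # otherwise a lightweight custom CNN.
        if gpu_vram_gb >= 6:
            return "ResNet50 (freeze_backbone=True)"
        return "Custom CNN (lightweight)"
    if gpu_vram_gb < 4:
        return "EfficientNet-B0"
    if images_per_class >= 10000 and gpu_vram_gb >= 16:
        return "Vision Transformer"
    if images_per_class >= 5000 and gpu_vram_gb >= 8:
        return "ResNet50 (fine-tuned) or EfficientNet-B3"
    return "ResNet50 (freeze_backbone=True)"
```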
Performance Comparison
| Model | Dataset Size | Training Time | Memory | Accuracy |
|---|---|---|---|---|
| Custom CNN (Light) | 500-1K/class | 5-10 min/epoch | 2GB | 75-85% |
| Custom CNN (Medium) | 1K-5K/class | 10-20 min/epoch | 4GB | 80-88% |
| ResNet50 (frozen) | Any | 15-30 min/epoch | 6GB | 85-92% |
| ResNet50 (fine-tuned) | 5K+/class | 30-60 min/epoch | 8GB | 90-95% |
| EfficientNet-B3 | 5K+/class | 40-80 min/epoch | 8GB | 91-96% |
| Vision Transformer | 10K+/class | 60-120 min/epoch | 16GB | 92-97% |
Best Practices
Starting Point
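A sensible first run, following the feature-extraction recommendation earlier on this page, is a pretrained ResNet50 with a frozen backbone (config keys below are illustrative, not the platform's actual schema):

```python
# Hypothetical starting-point configuration.
starting_point = {
    "model_type": "transfer",
    "backbone": "resnet50",
    "pretrained": True,
    "freeze_backbone": True,
    "learning_rate": 1e-3,
    "batch_size": 32,
    "epochs": 10,
}
```

Once this baseline converges, move to partial fine-tuning or a larger backbone only if validation accuracy plateaus.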
Regularization
- Dropout: 0.5 is standard for dense layers, 0.25-0.3 for conv layers
- L2 Decay: Use 0.0001-0.001 with AdamW optimizer
- Data Augmentation: Essential for preventing overfitting
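The dropout placement and L2 decay above translate into PyTorch roughly as follows (the layer sizes are placeholders for illustration):

```python
import torch
import torch.nn as nn

# Dropout per the guidelines above: 0.25-0.3 after conv blocks,
# 0.5 before dense layers. L2 decay goes through AdamW's weight_decay.
head = nn.Sequential(
    nn.Dropout(0.3),            # conv-level dropout
    nn.Flatten(),
    nn.Linear(256 * 4 * 4, 128),
    nn.ReLU(),
    nn.Dropout(0.5),            # dense-level dropout
    nn.Linear(128, 10),
)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3, weight_decay=1e-4)
```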
Input Size
- 224×224: Standard size, works with all pre-trained models
- 256×256: Use for high-resolution malware visualizations
- 128×128: Faster training, good for resource-constrained environments
Troubleshooting
Model Not Learning (Loss Plateaus)
Possible causes:
- Learning rate too high or too low → Try 1e-3 to 1e-5
- Backbone frozen but needs fine-tuning → Set freeze_backbone=False
- Model too simple → Try larger architecture
- Data quality issues → Check dataset preprocessing
Overfitting (Train Acc >> Val Acc)
Solutions:
- Increase dropout (try 0.5-0.7)
- Add more data augmentation
- Use smaller model
- Enable L2 regularization
- Reduce number of epochs
- Use early stopping
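Early stopping can be as simple as tracking validation loss and halting when it stops improving; a minimal sketch (not platform code):

```python
def should_stop(val_losses, patience=5):
    """Stop when the last `patience` validation losses have not improved
    on the best loss seen before that window."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return all(loss >= best_before for loss in val_losses[-patience:])
```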
Out of Memory Errors
Solutions:
- Reduce batch size (try 16 or 8)
- Use smaller input size (128×128)
- Switch to lighter model (EfficientNet-B0)
- Enable gradient checkpointing (advanced)
Slow Training
Solutions:
- Reduce input size
- Use fewer data augmentation transforms
- Increase num_workers in DataLoader
- Use mixed precision training (advanced)
- Switch to faster model (ResNet50 instead of ViT)
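Mixed precision on CUDA can roughly halve step time; a sketch of a training step using PyTorch's AMP utilities (model, optimizer, and loss function names are placeholders, and AMP is simply disabled on CPU/MPS):

```python
import torch

def make_train_step(device: torch.device):
    """Build a training-step function with autocast + loss scaling on CUDA."""
    use_amp = device.type == "cuda"
    scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

    def train_step(model, optimizer, loss_fn, images, labels):
        optimizer.zero_grad()
        with torch.autocast(device_type=device.type, enabled=use_amp):
            loss = loss_fn(model(images), labels)
        scaler.scale(loss).backward()   # scaling is a no-op when AMP is off
        scaler.step(optimizer)
        scaler.update()
        return loss.item()

    return train_step
```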
Next Steps
Hyperparameters
Learn how to tune learning rate, optimizers, and schedulers
Model Evaluation
Understand metrics and evaluate your trained models