Training Pipeline Overview
ML Defender uses scikit-learn RandomForest models trained on real malware datasets and converted to ONNX for embedded C++ inference.
- Dataset: CTU-13 Neris botnet (492K events)
- Accuracy: 97.6% ransomware detection
- Features: 83+ flow-based features
Dataset Preparation
CTU-13 Neris Botnet Dataset
Source: Czech Technical University - Malware Capture Facility
Used in ML Defender: Ransomware behavior validation (source/README.md:289-292)
Dataset Characteristics:
Download and Extract
From source/ml-training/README.md:42-51:
Additional Datasets
From source/ml-training/README.md:27-40:
- CIC-IDS-2017
- CIC-DDoS-2019
Description: Intrusion Detection dataset with 7 attack types
Size: ~1.1 GB (CSV)
Flows: ~2.8 million
Classes: BENIGN, DoS, DDoS, PortScan, Infiltration, Web Attack, Botnet
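Before training on a labelled flow CSV, it helps to check how the classes are distributed, since intrusion datasets are usually dominated by benign traffic. The following is a minimal sketch using only the standard library. The `Label` column name and the sample rows are assumptions for illustration, not values taken from the ML Defender repository.

```python
import csv
import io
from collections import Counter


def label_distribution(csv_file, label_column="Label"):
    """Count how many flows carry each class label in a flow CSV."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        # Some CSV exports pad header names with spaces, so normalise the keys.
        cleaned = {key.strip(): value for key, value in row.items()}
        counts[cleaned[label_column].strip()] += 1
    return counts


# Tiny in-memory stand-in for a real flow CSV (values are made up).
sample = io.StringIO(
    "Flow Duration, Label\n"
    "1200,BENIGN\n"
    "35,DDoS\n"
    "980,BENIGN\n"
    "12,PortScan\n"
)
distribution = label_distribution(sample)
print(distribution.most_common())
```

For a real dataset, pass an open file handle instead of the `io.StringIO` sample.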
Feature Engineering
83-Feature Pipeline
From source/README.md:76: ML Defender extracts 83 flow-based features per packet for RandomForest inference.
Feature Categories:
Basic Flow Features (15)
- Source/Destination IP, Port
- Protocol (TCP/UDP/ICMP)
- Packet length (min, max, mean, std)
- Flow duration
- Flow IAT (inter-arrival time)
Forward/Backward Statistics (20)
- Forward packet count, byte count
- Backward packet count, byte count
- Forward/backward packet length (min, max, mean, std)
- Forward/backward IAT (min, max, mean, std)
TCP Flags (8)
- FIN, SYN, RST, PSH, ACK, URG, ECE, CWR counts
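The eight flag counts can be derived from the flags byte of each TCP header in a flow. The sketch below assumes the flags byte has already been parsed out of each packet. The bit positions are the standard TCP header values, while the function and variable names are hypothetical.

```python
# Standard bit positions of the TCP header flags.
TCP_FLAG_BITS = {
    "FIN": 0x01, "SYN": 0x02, "RST": 0x04, "PSH": 0x08,
    "ACK": 0x10, "URG": 0x20, "ECE": 0x40, "CWR": 0x80,
}


def count_tcp_flags(flag_bytes):
    """Count how many packets in a flow set each TCP flag."""
    counts = {name: 0 for name in TCP_FLAG_BITS}
    for flags in flag_bytes:
        for name, bit in TCP_FLAG_BITS.items():
            if flags & bit:
                counts[name] += 1
    return counts


# Flags bytes for: SYN, SYN+ACK, ACK, PSH+ACK, FIN+ACK
handshake = [0x02, 0x12, 0x10, 0x18, 0x11]
flag_counts = count_tcp_flags(handshake)
print(flag_counts)
```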
Packet Size Features (12)
- Subflow forward packets/bytes
- Subflow backward packets/bytes
- Init window size (forward/backward)
- Active/Idle time statistics
Advanced Features (28)
- Flow bytes/s, packets/s
- Down/Up ratio
- Average packet size
- Segment size average
- Header length statistics
- Bulk transfer features
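To make the categories above concrete, here is a minimal sketch that computes a handful of the listed features (flow duration, packet length and IAT statistics, forward/backward counts, byte and packet rates, down/up ratio) for a single flow. The packet tuple layout, the feature names, and the definition of down/up ratio as backward packets divided by forward packets are assumptions for illustration. The actual extractor may differ.

```python
from statistics import mean, pstdev


def summarize(values):
    """Return min, max, mean and population std, or zeros for an empty list."""
    if not values:
        return {"min": 0.0, "max": 0.0, "mean": 0.0, "std": 0.0}
    return {
        "min": float(min(values)),
        "max": float(max(values)),
        "mean": float(mean(values)),
        "std": float(pstdev(values)),
    }


def flow_features(packets):
    """packets: list of (timestamp_seconds, length_bytes, direction),
    where direction is 'fwd' or 'bwd' (hypothetical layout)."""
    if not packets:
        raise ValueError("a flow needs at least one packet")
    packets = sorted(packets)  # order by timestamp
    times = [t for t, _, _ in packets]
    lengths = [n for _, n, _ in packets]
    duration = times[-1] - times[0]
    # Inter-arrival times between consecutive packets.
    iats = [later - earlier for earlier, later in zip(times, times[1:])]
    fwd = [n for _, n, d in packets if d == "fwd"]
    bwd = [n for _, n, d in packets if d == "bwd"]
    return {
        "flow_duration": duration,
        "pkt_len": summarize(lengths),
        "flow_iat": summarize(iats),
        "fwd_packets": len(fwd),
        "fwd_bytes": sum(fwd),
        "bwd_packets": len(bwd),
        "bwd_bytes": sum(bwd),
        "flow_bytes_per_s": sum(lengths) / duration if duration > 0 else 0.0,
        "flow_packets_per_s": len(packets) / duration if duration > 0 else 0.0,
        "down_up_ratio": len(bwd) / len(fwd) if fwd else 0.0,
    }


example = [(0.0, 60, "fwd"), (0.5, 1500, "bwd"), (1.0, 40, "fwd"), (2.0, 400, "bwd")]
features = flow_features(example)
print(features["flow_bytes_per_s"], features["pkt_len"])
```

Zero-duration flows (a single packet) return a rate of 0.0 here rather than dividing by zero. How the real pipeline handles that case is not stated in the source.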
Feature Extraction Script
From source/ml-training/scripts/:
scripts/extract_features.py
Training RandomForest Models
4-Model Architecture
From source/README.md:74-81: ML Defender deploys 4 embedded RandomForest models:
- DDoS Detection (97.6% accuracy)
- Ransomware Detection (97.6% on CTU-13)
- Traffic Classification (normal vs. anomalous)
- Internal Anomaly Detection (lateral movement)
Training Script
From source/ml-training/scripts/train_level2_ddos.py:
scripts/train_ransomware_detector.py
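The script itself is not reproduced here. As a rough sketch of what training a RandomForest on 83 flow features can look like with scikit-learn, the example below uses synthetic data in place of the extracted dataset. The hyperparameters, the random data, and the class-shift rule are all assumptions for illustration, not the project's actual configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

N_FEATURES = 83  # matches the 83-feature pipeline described in this document

# Synthetic stand-in data: a real script would load extracted flow features.
rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=2000)
X = rng.normal(size=(2000, N_FEATURES)).astype(np.float32)
# Hypothetical signal: shift the first 10 features for the "malicious" class.
X[y == 1, :10] += 2.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Illustrative hyperparameters, not the values used by ML Defender.
model = RandomForestClassifier(
    n_estimators=100, max_depth=10, n_jobs=-1, random_state=42
)
model.fit(X_train, y_train)

test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"held-out accuracy: {test_accuracy:.3f}")
```

Training on `float32` inputs keeps the Python side consistent with the single-precision tensors that ONNX Runtime typically consumes at inference time.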
Expected Training Output
ONNX Conversion
Why ONNX?
ONNX (Open Neural Network Exchange) enables scikit-learn models to run in C++ via ONNX Runtime.
- Portability: Train in Python, deploy in C++
- Performance: Optimized inference (10x faster)
- No Dependencies: No scikit-learn in production
Conversion Script
From source/ml-training/scripts/convert_to_onnx.py:
scripts/convert_to_onnx.py
Output
Model Evaluation
Validation Metrics
From source/ml-training/README.md:120-133:
Target Metrics:
- General Attack Detector
- Ransomware Specialist
- DDoS Specialist
- Accuracy: >95%
- Precision: >93%
- Recall: >92%
- F1-Score: >92%
- False Positive Rate: <5%
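The target metrics listed above can be computed directly from the confusion counts of a binary detector. The sketch below is a minimal, standard-library illustration. The function names and the example predictions are hypothetical, and the thresholds are the ones listed above.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, F1 and false positive rate
    for binary labels, where 1 means attack and 0 means benign."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }


# Thresholds from the target metrics listed above.
TARGETS = {"accuracy": 0.95, "precision": 0.93, "recall": 0.92, "f1": 0.92}
MAX_FALSE_POSITIVE_RATE = 0.05


def meets_targets(metrics):
    """True when every metric clears its target and the FPR stays below the cap."""
    above = all(metrics[name] > floor for name, floor in TARGETS.items())
    return above and metrics["false_positive_rate"] < MAX_FALSE_POSITIVE_RATE


# Made-up predictions: 50 attacks (2 missed) and 50 benign flows (1 false alarm).
y_true = [1] * 50 + [0] * 50
y_pred = [1] * 48 + [0] * 2 + [0] * 49 + [1] * 1
metrics = binary_metrics(y_true, y_pred)
print(metrics, meets_targets(metrics))
```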
Validation Script
scripts/validate_models.py
Deployment to C++
Copy Models to ml-detector
C++ ONNX Runtime Integration
From ml-detector/src/onnx_inference.cpp:
Retraining Workflow
When to Retrain
Retraining Script
scripts/retrain_pipeline.sh
Next Steps
- Stress Testing: Validate model performance under load
- eBPF/XDP: Understand feature extraction pipeline
- Performance: Optimize inference latency
- API Reference: Integrate models in C++