Overview
ML Defender is a production-ready system with proven stability (17-hour stress test, 2M+ packets processed). However, like all security systems, it has specific architectural limitations that must be understood for proper deployment planning.
IPSet Capacity Limits
Maximum Realistic Capacity
ML Defender uses IPSet hash tables for kernel-level IP blocking.
Theoretical limits:
- Linux kernel IPSet: 65,536 entries per set by default (maxelem, tunable at set creation)
- ML Defender default: 1,000 IPs (conservative for testing)
- Maximum realistic: 500,000 IPs (with performance tuning)
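These limits surface directly at the ipset layer. A sketch (the set name `mld-blacklist` and the values are illustrative, not ML Defender's actual configuration; requires root):

```shell
# Create a hash:ip set sized beyond the 65,536 default by passing
# maxelem explicitly; entries expire after `timeout` seconds.
ipset create mld-blacklist hash:ip maxelem 500000 timeout 3600

# Print only the set header, confirming the configured capacity.
ipset list mld-blacklist -t
```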
Capacity Exceeded Behavior
Stress test results (Day 52):
- ✅ Graceful degradation - No crashes or memory leaks
- ✅ Error logging - Each failure logged with metrics
- ⚠️ Oldest entries evicted - FIFO (First In, First Out) with timeout
- ⚠️ No persistence - Evicted IPs are not stored anywhere
Mitigation Strategies
Priority 1 roadmap includes multi-tier storage to address capacity limitations.
- Increase IPSet size (up to 500K entries)
- Aggressive timeouts (evict IPs faster)
- Whitelist critical IPs (prevent accidental blocks)
- Multi-tier storage architecture:
  - Tier 1: IPSet (hot storage, <10 ms blocking)
  - Tier 2: SQLite (warm storage, query historical blocks)
  - Tier 3: Parquet (cold storage, forensic analysis)
- Automatic eviction policy: LRU with recidivism tracking
- Capacity monitoring: Prometheus alerts at 80% capacity
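At the ipset level, the first three mitigations might look like the following sketch (set names are assumptions; note that `maxelem` is fixed at creation time, so growing a set means creating a larger one and swapping it in; requires root):

```shell
# Increase IPSet size: build a larger set, then atomically swap it in.
ipset create mld-blacklist-big hash:ip maxelem 500000 timeout 3600
ipset swap mld-blacklist-big mld-blacklist
ipset destroy mld-blacklist-big    # now holds the old, smaller set

# Aggressive timeout: add entries with a shorter per-entry timeout.
ipset add mld-blacklist 203.0.113.7 timeout 600

# Whitelist critical IPs: keep an allow set that iptables checks
# before any blacklist rule.
ipset create mld-whitelist hash:ip
ipset add mld-whitelist 198.51.100.10
iptables -I INPUT -m set --match-set mld-whitelist src -j ACCEPT
```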
Performance Impact of Large IPSets
Benchmarks (estimated, not yet tested):
| IPSet Size | Lookup Latency | Memory Usage | Performance |
|---|---|---|---|
| 1,000 IPs | <1 μs | 128 KB | ✅ Optimal |
| 10,000 IPs | <5 μs | 1.2 MB | ✅ Good |
| 100,000 IPs | <50 μs | 12 MB | ⚠️ Acceptable |
| 500,000 IPs | <250 μs | 60 MB | ⚠️ Maximum |
Single-Node Deployment
No High Availability (Yet)
ML Defender currently operates as a single-node system.
Limitations:
- ❌ No automatic failover if the node crashes
- ❌ No load balancing across multiple instances
- ❌ No distributed state synchronization
- ❌ Single point of failure (SPOF)
Tested single-node stability:
- Uptime tested: 17+ hours continuous operation
- Crash rate: 0 crashes in 2M+ packets processed
- Memory stability: 4.5 MB footprint (zero growth over 17h)
- Recovery time: manual restart required (~5 seconds)
Workarounds for High Availability
1. Process monitoring with systemd: automatic restart on crash.
2. Active-passive pair with keepalived:
  - Deploy ML Defender on two nodes
  - Use a virtual IP (VIP) with keepalived
  - Manual failover via VIP migration (~10 seconds)
  - Shared etcd cluster for configuration sync
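For the systemd workaround, a minimal unit with automatic restart might look like this sketch (unit name, binary path, and timings are assumptions):

```ini
# /etc/systemd/system/ml-defender.service (illustrative path)
[Unit]
Description=ML Defender firewall ACL agent
After=network-online.target

[Service]
ExecStart=/usr/local/bin/firewall-acl-agent
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

`RestartSec=5` mirrors the ~5-second manual recovery time measured above, but turns it into an automatic restart.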
Roadmap for High Availability (Priority 3)
Planned features:
- Multi-node clustering with etcd-based leader election
- Distributed IPSet synchronization (gossip protocol)
- Health check endpoints for Kubernetes liveness/readiness probes
- Horizontal scaling with consistent hashing
- Stateless firewall agents (all state in etcd)
No Persistence Layer
Evicted IPs Are Lost
Current behavior:
- IPSet entries have a timeout (default: 1 hour)
- After the timeout expires, the IP is automatically removed from the blacklist
- No historical record of blocked IPs (except logs)
- Restarting firewall-acl-agent clears all IPSet entries
Impact on Security Operations
Forensic analysis challenges:
- ❌ Cannot query “all IPs blocked in last 7 days”
- ❌ Cannot identify repeat offenders (recidivism analysis)
- ❌ Cannot correlate blocked IPs with incidents
- ⚠️ Log parsing required for historical queries
Operational gaps:
- ⚠️ No protection against “slow burn” attacks (IP rotates every hour)
- ⚠️ Cannot implement a “permanent ban” for known malicious IPs
- ⚠️ Manual IPSet management required after restart
Workarounds
1. RAG ingester for log-based queries
Roadmap for Persistence (Priority 1.1)
Multi-tier storage architecture:
- ✅ Unlimited historical storage
- ✅ Fast queries for recent blocks
- ✅ Automatic eviction policy
- ✅ Recidivism tracking (repeat offenders)
- ✅ Compliance and audit trails
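To illustrate the Tier 2 idea, a warm store could record each block before the IPSet entry expires. This is a hypothetical sketch (class name and schema invented for illustration), not the planned implementation:

```python
import sqlite3
import time


class WarmBlockStore:
    """Hypothetical SQLite-backed record of IPs blocked via IPSet."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS blocks (ip TEXT, blocked_at REAL)"
        )

    def record(self, ip, ts=None):
        # Called once per block action, before the IPSet entry can expire.
        self.db.execute(
            "INSERT INTO blocks VALUES (?, ?)", (ip, ts or time.time())
        )

    def blocked_since(self, days):
        # Answers "all IPs blocked in the last N days".
        cutoff = time.time() - days * 86400
        rows = self.db.execute(
            "SELECT DISTINCT ip FROM blocks WHERE blocked_at >= ?", (cutoff,)
        )
        return sorted(r[0] for r in rows)

    def recidivists(self, min_blocks=2):
        # Repeat offenders: IPs blocked at least min_blocks times.
        rows = self.db.execute(
            "SELECT ip FROM blocks GROUP BY ip HAVING COUNT(*) >= ?",
            (min_blocks,),
        )
        return sorted(r[0] for r in rows)
```

Recording every block as it happens makes queries like “all IPs blocked in last 7 days” and recidivism lookups trivial, at the cost of one insert per block action.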
Manual Capacity Management
No Automatic Monitoring
Current state:
- ❌ No built-in capacity alerts
- ❌ No automatic scaling of IPSet size
- ❌ No proactive eviction strategies
- ⚠️ Manual monitoring required via logs/metrics
Manual Capacity Management Tasks
1. Periodic capacity checks
Roadmap for Automatic Management (Priority 1.3)
Planned features:
- Prometheus metrics exporter
- Grafana dashboards with alerts:
  - Warning at 80% capacity
  - Critical at 90% capacity
  - Auto-page on-call engineer
- Auto-eviction policies:
  - LRU (Least Recently Used)
  - LFU (Least Frequently Used)
  - Confidence-based (evict low-confidence blocks first)
  - Recidivism-aware (keep repeat offenders longer)
- Dynamic capacity scaling:
  - Monitor utilization trends
  - Auto-resize IPSet when consistently above 80%
  - Graceful migration of entries
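Until the exporter and alerts exist, a periodic capacity check can be scripted by parsing the terse `ipset list -t` output. A sketch (the sample header mirrors ipset's terse listing format, but verify the field names against your ipset version):

```python
import re


def ipset_utilization(terse_listing):
    """Return entries/maxelem utilization from `ipset list -t` output."""
    maxelem = int(re.search(r"maxelem (\d+)", terse_listing).group(1))
    entries = int(
        re.search(r"Number of entries: (\d+)", terse_listing).group(1)
    )
    return entries / maxelem


# Sample terse listing (illustrative set name and counts).
sample = """Name: mld-blacklist
Type: hash:ip
Header: family inet hashsize 1024 maxelem 65536 timeout 3600
Size in memory: 1176
Number of entries: 52429
"""
# 52429 / 65536 ≈ 0.80 → right at the proposed 80% warning threshold
```

Run from cron, such a check can log a warning or page someone when utilization crosses the 80% and 90% thresholds listed above.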
Other Known Limitations
Performance Constraints
CPU overhead with crypto enabled:
- Baseline (no crypto): ~45% CPU @ 364 events/sec
- With crypto: ~54% CPU @ 364 events/sec
- Overhead: +9% (acceptable, but limits headroom)
Memory footprint:
- Normal operation: 4.5 MB (ml-detector + firewall-acl-agent)
- With crypto: 7.5 MB (includes libsodium + buffers)
- IPSet storage: ~128 KB per 1,000 IPs
Protocol Limitations
IPv6 support:
- ❌ Not currently implemented
- ⚠️ IPv4-only IPSet configuration
- Roadmap: Priority 3 (IPv6 support)
Encapsulated and encrypted traffic:
- ⚠️ Limited visibility into encapsulated traffic (GRE, IPsec, WireGuard)
- ⚠️ Cannot analyze encrypted VPN payloads
- Mitigation: deploy ML Defender inside the VPN termination point
Proxied traffic:
- ⚠️ X-Forwarded-For headers not inspected (only the load balancer's IP is seen)
- Mitigation: deploy behind the load balancer and recover client IPs from its logs
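When client IPs are recovered from load-balancer access logs, the left-most X-Forwarded-For entry is the original client. A minimal, hypothetical helper (the header is client-supplied and spoofable, so validate the proxy chain before trusting it):

```python
def client_ip_from_xff(header_value):
    """Return the left-most (original client) address from an
    X-Forwarded-For value such as 'client, proxy1, proxy2'.

    Returns None for a missing/empty header."""
    if not header_value:
        return None
    return header_value.split(",")[0].strip()
```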
Configuration Complexity
JSON-only configuration:
- ❌ No GUI for configuration management
- ⚠️ Manual editing required (error-prone)
- Roadmap: Priority 2 (web-based admin interface)
Restart required for changes:
- ⚠️ Most config changes require a component restart
- ⚠️ No hot reload (except etcd-based config sync)
- Roadmap: Priority 2 (runtime config updates)
Deployment Recommendations
Capacity Planning Guidelines
Small deployment (home/lab):
- IPSet size: 1,000 IPs
- Expected load: <50 events/sec
- Hardware: 2 CPU cores, 4 GB RAM
Medium deployment:
- IPSet size: 10,000 IPs
- Expected load: <200 events/sec
- Hardware: 4 CPU cores, 8 GB RAM
Large deployment:
- IPSet size: 100,000 IPs
- Expected load: <500 events/sec
- Hardware: 8 CPU cores, 16 GB RAM
- Multi-node with load balancing (when available)
Monitoring Requirements
Minimum monitoring (production):
- IPSet capacity utilization
- Crypto error rate
- Component health (etcd, firewall-acl-agent, ml-detector)
- Disk space for logs
- Network latency (ZMQ connections)
Recommended tooling:
- Prometheus + Grafana dashboards
- SIEM integration (Splunk, ELK)
- Incident response automation (SOAR)
- Forensic analysis (RAG ingester)
Roadmap Summary
Priority 1: Production Scale (2 weeks)
- P1.1 - Multi-tier storage (IPSet → SQLite → Parquet)
- P1.2 - Async queue + worker pool (1K+ events/sec)
- P1.3 - Capacity monitoring + auto-eviction
Priority 2: Observability (1 week)
- P2.1 - Prometheus metrics exporter
- P2.2 - Grafana dashboards
- P2.3 - Health check endpoints (Kubernetes)
- P2.4 - Runtime config via etcd
Priority 3: High Availability (2 weeks)
- P3.1 - Multi-node clustering (etcd-based)
- P3.2 - Distributed IPSet synchronization
- P3.3 - Load balancing and failover
- P3.4 - IPv6 support
Conclusion
ML Defender is production-ready within its design constraints. Understanding these limitations allows you to:
- Plan capacity appropriately for your deployment
- Deploy monitoring to track utilization
- Architect HA solutions for critical environments
- Integrate complementary security tools (TLS inspection, EDR, SIEM)