
Overview

ML Defender is a production-ready system with proven stability (17-hour stress test, 2M+ packets processed). However, like all security systems, it has specific architectural limitations that must be understood for proper deployment planning.
Understanding limitations is critical for security architecture. Deploy ML Defender as part of a layered defense strategy, not as a standalone solution.

IPSet Capacity Limits

Maximum Realistic Capacity

ML Defender uses IPSet hash tables for kernel-level IP blocking.
Theoretical limits:
  • Linux kernel IPSet: 65,536 entries per set by default (raisable via maxelem)
  • ML Defender default: 1,000 IPs (conservative, for testing)
  • Maximum realistic: 500,000 IPs (with performance tuning and a raised maxelem)
Current configuration:
// firewall-acl-agent/config/firewall.json
{
  "ipsets": {
    "blacklist": {
      "set_name": "ml_defender_blacklist_test",
      "max_elements": 1000,
      "hash_size": 1024,
      "timeout": 3600
    }
  }
}
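The relationship between max_elements and hash_size can be sanity-checked before deployment. The helper below is purely illustrative (check_ipset_config is not part of ML Defender); the power-of-two and elements-per-bucket heuristics are common IPSet tuning guidance, not project requirements:

```python
import json

def check_ipset_config(cfg: dict) -> list:
    """Sanity-check the ipsets block of a firewall.json-style config.
    Hypothetical helper; returns a list of human-readable warnings."""
    warnings = []
    for name, ipset in cfg.get("ipsets", {}).items():
        max_elements = ipset.get("max_elements", 0)
        hash_size = ipset.get("hash_size", 0)
        # hash_size should be a power of two (the kernel rounds it otherwise)
        if hash_size & (hash_size - 1) != 0:
            warnings.append(f"{name}: hash_size {hash_size} is not a power of two")
        # A very high entries-to-buckets ratio means long hash chains and slow lookups
        if hash_size and max_elements / hash_size > 64:
            warnings.append(f"{name}: max_elements/hash_size ratio is high; "
                            f"consider raising hash_size")
    return warnings

config = json.loads("""
{
  "ipsets": {
    "blacklist": {
      "set_name": "ml_defender_blacklist_test",
      "max_elements": 1000,
      "hash_size": 1024,
      "timeout": 3600
    }
  }
}
""")
print(check_ipset_config(config))  # → [] (the default config passes both checks)
```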

Capacity Exceeded Behavior

Stress test results (Day 52):
ipset_successes: 118          # IPs added to the IPSet before capacity was reached
ipset_failures: 16,681        # Attempts after capacity reached
max_queue_depth: 16,690       # Backpressure handled gracefully
What happens when capacity is exceeded:
  1. Graceful degradation - No crashes or memory leaks
  2. Error logging - Each failure logged with metrics
  3. ⚠️ Oldest entries evicted - FIFO (First In, First Out) with timeout
  4. ⚠️ No persistence - Evicted IPs are not stored anywhere
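The degradation behavior above can be modeled in a few lines. This is a toy Python sketch of the semantics (bounded capacity, timeout expiry, counted failures), not the agent's actual implementation:

```python
class BoundedBlocklist:
    """Toy model of capacity-exceeded behavior: inserts fail gracefully
    when the set is full, and entries expire after a timeout."""
    def __init__(self, max_elements: int, timeout_s: float):
        self.max_elements = max_elements
        self.timeout_s = timeout_s
        self.entries = {}  # ip -> expiry timestamp
        self.successes = 0
        self.failures = 0

    def block(self, ip: str, now: float) -> bool:
        # Expire stale entries first (timeout-based eviction)
        self.entries = {k: v for k, v in self.entries.items() if v > now}
        if ip in self.entries or len(self.entries) < self.max_elements:
            self.entries[ip] = now + self.timeout_s
            self.successes += 1
            return True
        self.failures += 1  # graceful degradation: count and log, never crash
        return False

bl = BoundedBlocklist(max_elements=3, timeout_s=3600)
for i in range(5):
    bl.block(f"10.0.0.{i}", now=0.0)
print(bl.successes, bl.failures)  # → 3 2
```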

Mitigation Strategies

Priority 1 roadmap includes multi-tier storage to address capacity limitations.
Short-term mitigations:
  1. Increase IPSet size (up to 500K entries):
    {
      "ipsets": {
        "blacklist": {
          "max_elements": 100000,
          "hash_size": 16384
        }
      }
    }
    
  2. Aggressive timeout (evict IPs faster):
    {
      "ipsets": {
        "blacklist": {
          "timeout": 1800  // 30 minutes instead of 1 hour
        }
      }
    }
    
  3. Whitelist critical IPs (prevent accidental blocks):
    {
      "validation": {
        "allowed_ip_ranges": [
          "192.168.1.0/24",  // Internal network
          "10.0.0.0/8"       // Corporate VPN
        ],
        "block_localhost": false,
        "block_gateway": false
      }
    }
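The allowed_ip_ranges validation above amounts to a CIDR membership test, which can be sketched with Python's standard ipaddress module (the function name is illustrative):

```python
import ipaddress

ALLOWED_RANGES = [
    ipaddress.ip_network("192.168.1.0/24"),  # internal network
    ipaddress.ip_network("10.0.0.0/8"),      # corporate VPN
]

def is_whitelisted(ip: str) -> bool:
    """Return True if the IP falls inside any allowed range and must never be blocked."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in ALLOWED_RANGES)

print(is_whitelisted("192.168.1.50"))  # → True  (never block)
print(is_whitelisted("203.0.113.7"))   # → False (eligible for blocking)
```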
    
Long-term solution (Priority 1.1):
  • Multi-tier storage architecture:
    • Tier 1: IPSet (hot storage, <10 ms blocking)
    • Tier 2: SQLite (warm storage, query historical blocks)
    • Tier 3: Parquet (cold storage, forensic analysis)
  • Automatic eviction policy: LRU with recidivism tracking
  • Capacity monitoring: Prometheus alerts at 80% capacity

Performance Impact of Large IPSets

Benchmarks (estimated, not yet tested):
| IPSet Size | Lookup Latency | Memory Usage | Performance |
| --- | --- | --- | --- |
| 1,000 IPs | <1 μs | 128 KB | ✅ Optimal |
| 10,000 IPs | <5 μs | 1.2 MB | ✅ Good |
| 100,000 IPs | <50 μs | 12 MB | ⚠️ Acceptable |
| 500,000 IPs | <250 μs | 60 MB | ⚠️ Maximum |
IPSets larger than 100,000 entries may cause noticeable latency on low-end hardware. Test performance in your environment before production deployment.

Single-Node Deployment

No High Availability (Yet)

ML Defender currently operates as a single-node system.
Limitations:
  • ❌ No automatic failover if node crashes
  • ❌ No load balancing across multiple instances
  • ❌ No distributed state synchronization
  • ❌ Single point of failure (SPOF)
Availability characteristics:
  • Uptime tested: 17+ hours continuous operation
  • Crash rate: 0 crashes in 2M+ packets processed
  • Memory stability: 4.5 MB footprint (zero growth over 17h)
  • Recovery time: Manual restart required (~5 seconds)

Workarounds for High Availability

1. Process monitoring with systemd:
# /etc/systemd/system/ml-defender-firewall.service
[Unit]
Description=ML Defender Firewall ACL Agent
After=network.target etcd-server.service

[Service]
Type=simple
User=root
WorkingDirectory=/opt/ml-defender/firewall-acl-agent/build
ExecStart=/opt/ml-defender/firewall-acl-agent/build/firewall-acl-agent \
  -c /opt/ml-defender/firewall-acl-agent/config/firewall.json
Restart=always
RestartSec=5s

[Install]
WantedBy=multi-user.target
2. etcd-server auto-restart feature:
// etcd-server/config/etcd-server.json
{
  "auto_restart": {
    "enabled": true,
    "max_attempts": 3,
    "delay_seconds": 5,
    "commands": {
      "firewall-acl-agent": "cd /vagrant/firewall-acl-agent/build && sudo ./firewall-acl-agent -c ../config/firewall.json"
    }
  }
}
3. Dual-node active-passive (manual failover):
  • Deploy ML Defender on two nodes
  • Use virtual IP (VIP) with keepalived
  • Manual failover via VIP migration (~10 seconds)
  • Shared etcd cluster for configuration sync
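A minimal keepalived configuration for this active-passive setup might look like the following; the interface name, virtual IP, and password are placeholders to adapt to your environment:

```
# /etc/keepalived/keepalived.conf (active node; use state BACKUP and a
# lower priority on the passive node)
vrrp_instance ML_DEFENDER_VIP {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass changeme
    }
    virtual_ipaddress {
        192.0.2.10/24
    }
}
```

On failure of the active node, keepalived moves the VIP to the passive node; ML Defender itself remains unaware of the failover.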

Roadmap for High Availability (Priority 2)

Planned features:
  • Multi-node clustering with etcd-based leader election
  • Distributed IPSet synchronization (gossip protocol)
  • Health check endpoints for Kubernetes liveness/readiness probes
  • Horizontal scaling with consistent hashing
  • Stateless firewall agents (all state in etcd)

No Persistence Layer

Evicted IPs Are Lost

Current behavior:
  • IPSet entries have a timeout (default: 1 hour)
  • After timeout expires, IP is automatically removed from blacklist
  • No historical record of blocked IPs (except logs)
  • Restarting firewall-acl-agent clears all IPSet entries
Stress test example:
ipset_successes: 118          # Only 118 IPs persisted in IPSet
ipset_failures: 16,681        # 16,681 IPs were dropped (no storage)

Impact on Security Operations

Forensic analysis challenges:
  • ❌ Cannot query “all IPs blocked in last 7 days”
  • ❌ Cannot identify repeat offenders (recidivism analysis)
  • ❌ Cannot correlate blocked IPs with incidents
  • ⚠️ Log parsing required for historical queries
Operational challenges:
  • ⚠️ No protection against “slow burn” attacks (IP rotates every hour)
  • ⚠️ Cannot implement “permanent ban” for known malicious IPs
  • ⚠️ Manual IPSet management required after restart

Workarounds

1. RAG ingester for log-based queries:
# Query blocked IPs from logs
cd rag/
python rag_query.py "Show me all IPs blocked in the last 24 hours"

# Output:
# 192.168.1.100 (blocked 47 times, last: 2026-03-01 14:32:15)
# 10.0.50.23 (blocked 12 times, last: 2026-03-01 14:30:45)
# ...
2. External database integration (custom):
# Example: Log blocked IPs to PostgreSQL
import psycopg2

def log_blocked_ip(ip, confidence, timestamp):
    # Connection string is illustrative; reuse a pooled connection in production
    with psycopg2.connect("dbname=ml_defender user=admin") as conn:
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO blocked_ips (ip, confidence, timestamp) VALUES (%s, %s, %s)",
                (ip, confidence, timestamp),
            )
        # exiting the connection's with-block commits the transaction
3. IPSet flush prevention:
// firewall-acl-agent/config/firewall.json
{
  "ipsets": {
    "blacklist": {
      "flush_on_startup": false  // Preserve entries on restart
    }
  }
}

Roadmap for Persistence (Priority 1.1)

Multi-tier storage architecture:
┌─────────────────────────────────────────────────────┐
│  Tier 1: IPSet (Hot Storage)                        │
│    - Capacity: 1K-10K IPs                           │
│    - Latency: <10 ms                                │
│    - Use case: Real-time blocking                   │
└─────────────────────────────────────────────────────┘
              ↓ (eviction after 1 hour)
┌─────────────────────────────────────────────────────┐
│  Tier 2: SQLite (Warm Storage)                      │
│    - Capacity: 1M IPs                               │
│    - Latency: <100 ms                               │
│    - Use case: Recidivism detection, queries        │
└─────────────────────────────────────────────────────┘
              ↓ (archival after 30 days)
┌─────────────────────────────────────────────────────┐
│  Tier 3: Parquet (Cold Storage)                     │
│    - Capacity: Unlimited (S3/MinIO)                 │
│    - Latency: <1 second                             │
│    - Use case: Forensic analysis, compliance        │
└─────────────────────────────────────────────────────┘
Benefits:
  • ✅ Unlimited historical storage
  • ✅ Fast queries for recent blocks
  • ✅ Automatic eviction policy
  • ✅ Recidivism tracking (repeat offenders)
  • ✅ Compliance and audit trails
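The Tier 1 → Tier 2 hand-off can be sketched with Python's standard sqlite3 bindings. The schema and function names below are illustrative assumptions, not the planned implementation:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for a real warm store
conn.execute("""
    CREATE TABLE blocked_ips (
        ip TEXT NOT NULL,
        confidence REAL,
        blocked_at TEXT NOT NULL
    )
""")

def archive_evicted(ip: str, confidence: float, blocked_at: str) -> None:
    """Called when an entry ages out of the IPSet hot tier."""
    conn.execute("INSERT INTO blocked_ips VALUES (?, ?, ?)",
                 (ip, confidence, blocked_at))

def recidivism_count(ip: str) -> int:
    """How many times has this IP been blocked before? (repeat-offender check)"""
    row = conn.execute("SELECT COUNT(*) FROM blocked_ips WHERE ip = ?",
                       (ip,)).fetchone()
    return row[0]

archive_evicted("198.51.100.7", 0.92, "2026-03-01T14:32:15")
archive_evicted("198.51.100.7", 0.88, "2026-03-01T16:02:41")
print(recidivism_count("198.51.100.7"))  # → 2
```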

Manual Capacity Management

No Automatic Monitoring

Current state:
  • ❌ No built-in capacity alerts
  • ❌ No automatic scaling of IPSet size
  • ❌ No proactive eviction strategies
  • ⚠️ Manual monitoring required via logs/metrics
Capacity indicators:
# Check current IPSet usage (count member lines, skipping the header)
sudo ipset list ml_defender_blacklist_test | grep -c "^[0-9]"
# Output: 987 (out of 1000 max)

# Monitor capacity in metrics file
cat /vagrant/logs/lab/firewall-metrics.json | jq '.ipset_capacity'
# Output: {"used": 987, "max": 1000, "utilization": 0.987}

Manual Capacity Management Tasks

1. Periodic capacity checks:
#!/bin/bash
# /usr/local/bin/check_ipset_capacity.sh

MAX_CAPACITY=1000
WARN_THRESHOLD=800  # 80%

USED=$(sudo ipset list ml_defender_blacklist_test | grep -c "^[0-9]")
UTILIZATION=$(echo "scale=1; $USED * 100 / $MAX_CAPACITY" | bc)

if [ "$USED" -gt "$WARN_THRESHOLD" ]; then
  echo "[WARN] IPSet capacity: $USED/$MAX_CAPACITY (${UTILIZATION}%)"
  # Send alert (email, Slack, PagerDuty, etc.)
fi
2. Manual IPSet flush (when needed):
# Flush all entries (CAUTION: removes all blocks)
sudo ipset flush ml_defender_blacklist_test

# Or remove specific IPs
sudo ipset del ml_defender_blacklist_test 192.168.1.100
3. Dynamic IPSet resizing (requires restart):
# Stop firewall-acl-agent
sudo systemctl stop ml-defender-firewall

# Destroy old IPSet
sudo ipset destroy ml_defender_blacklist_test

# Edit config (increase max_elements to 10000)
vim /opt/ml-defender/firewall-acl-agent/config/firewall.json

# Restart (will recreate IPSet with new size)
sudo systemctl start ml-defender-firewall

Roadmap for Automatic Management (Priority 1.3)

Planned features:
  • Prometheus metrics exporter:
    ml_defender_ipset_capacity{set="blacklist"} 1000
    ml_defender_ipset_used{set="blacklist"} 987
    ml_defender_ipset_utilization{set="blacklist"} 0.987
    
  • Grafana dashboards with alerts:
    • Warning at 80% capacity
    • Critical at 90% capacity
    • Auto-page on-call engineer
  • Auto-eviction policies:
    • LRU (Least Recently Used)
    • LFU (Least Frequently Used)
    • Confidence-based (evict low-confidence blocks first)
    • Recidivism-aware (keep repeat offenders longer)
  • Dynamic capacity scaling:
    • Monitor utilization trends
    • Auto-resize IPSet when consistently above 80%
    • Graceful migration of entries
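As an illustration, confidence-based eviction reduces to sorting candidates by score; the entry format below is hypothetical:

```python
def evict_lowest_confidence(entries: dict, free_slots_needed: int) -> list:
    """Pick the entries to evict first: those blocked with the lowest
    model confidence. `entries` maps ip -> confidence score."""
    by_confidence = sorted(entries.items(), key=lambda kv: kv[1])
    victims = [ip for ip, _ in by_confidence[:free_slots_needed]]
    for ip in victims:
        del entries[ip]
    return victims

entries = {"10.0.0.1": 0.99, "10.0.0.2": 0.61, "10.0.0.3": 0.75}
print(evict_lowest_confidence(entries, 1))  # → ['10.0.0.2']
```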

Other Known Limitations

Performance Constraints

CPU overhead with crypto enabled:
  • Baseline (no crypto): ~45% CPU @ 364 events/sec
  • With crypto: ~54% CPU @ 364 events/sec
  • Overhead: +9% (acceptable, but limits headroom)
Memory footprint:
  • Normal operation: 4.5 MB (ml-detector + firewall-acl-agent)
  • With crypto: 7.5 MB (includes libsodium + buffers)
  • IPSet storage: ~128 KB per 1000 IPs
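Using the ~128 KB per 1,000 IPs figure above, IPSet memory can be estimated with a rough linear extrapolation (actual usage also depends on hash_size):

```python
def ipset_memory_kb(num_ips: int, kb_per_thousand: float = 128.0) -> float:
    """Rough linear estimate of IPSet memory from entry count."""
    return num_ips / 1000 * kb_per_thousand

print(ipset_memory_kb(1_000))    # → 128.0
print(ipset_memory_kb(100_000))  # → 12800.0 (≈12.5 MB)
```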

Protocol Limitations

IPv6 support:
  • ❌ Not currently implemented
  • ⚠️ IPv4-only IPSet configuration
  • Roadmap: Priority 3 (IPv6 support)
VPN/Tunnel traffic:
  • ⚠️ Limited visibility into encapsulated traffic (GRE, IPSec, WireGuard)
  • ⚠️ Cannot analyze encrypted VPN payloads
  • Mitigation: Deploy ML Defender inside VPN termination point
Load balancer scenarios:
  • ⚠️ X-Forwarded-For headers not inspected (only sees load balancer IP)
  • Mitigation: Deploy behind load balancer, inspect client IP from logs
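At the application layer behind the load balancer, the original client IP can still be recovered from X-Forwarded-For. A common pattern, sketched here (not an ML Defender feature), walks the header right to left past your trusted proxies:

```python
def client_ip(xff_header: str, trusted_proxies: set) -> str:
    """Return the right-most X-Forwarded-For hop that is NOT one of our
    trusted proxies; scanning right to left avoids trusting entries the
    client itself may have spoofed on the left."""
    hops = [h.strip() for h in xff_header.split(",")]
    for hop in reversed(hops):
        if hop not in trusted_proxies:
            return hop
    return hops[0]  # everything trusted: fall back to the left-most entry

# Client 203.0.113.7 -> proxy 10.0.0.5 -> load balancer 10.0.0.1
print(client_ip("203.0.113.7, 10.0.0.5, 10.0.0.1", {"10.0.0.1", "10.0.0.5"}))
# → 203.0.113.7
```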

Configuration Complexity

JSON-only configuration:
  • ❌ No GUI for configuration management
  • ⚠️ Manual editing required (error-prone)
  • Roadmap: Priority 2 (web-based admin interface)
Restart required for config changes:
  • ⚠️ Most config changes require component restart
  • ⚠️ No hot reload (except etcd-based config sync)
  • Roadmap: Priority 2 (runtime config updates)

Deployment Recommendations

Capacity Planning Guidelines

Small deployment (home/lab):
  • IPSet size: 1,000 IPs
  • Expected load: <50 events/sec
  • Hardware: 2 CPU cores, 4 GB RAM
Medium deployment (small business):
  • IPSet size: 10,000 IPs
  • Expected load: <200 events/sec
  • Hardware: 4 CPU cores, 8 GB RAM
Large deployment (enterprise):
  • IPSet size: 100,000 IPs
  • Expected load: <500 events/sec
  • Hardware: 8 CPU cores, 16 GB RAM
  • Multi-node with load balancing (when available)

Monitoring Requirements

Minimum monitoring (production):
  1. IPSet capacity utilization
  2. Crypto error rate
  3. Component health (etcd, firewall-acl-agent, ml-detector)
  4. Disk space for logs
  5. Network latency (ZMQ connections)
Advanced monitoring (enterprise):
  1. Prometheus + Grafana dashboards
  2. SIEM integration (Splunk, ELK)
  3. Incident response automation (SOAR)
  4. Forensic analysis (RAG ingester)

Roadmap Summary

Priority 1: Production Scale (2 weeks)

  • P1.1 - Multi-tier storage (IPSet → SQLite → Parquet)
  • P1.2 - Async queue + worker pool (1K+ events/sec)
  • P1.3 - Capacity monitoring + auto-eviction

Priority 2: Observability (1 week)

  • P2.1 - Prometheus metrics exporter
  • P2.2 - Grafana dashboards
  • P2.3 - Health check endpoints (Kubernetes)
  • P2.4 - Runtime config via etcd

Priority 3: High Availability (2 weeks)

  • P3.1 - Multi-node clustering (etcd-based)
  • P3.2 - Distributed IPSet synchronization
  • P3.3 - Load balancing and failover
  • P3.4 - IPv6 support

Conclusion

ML Defender is production-ready within its design constraints. Understanding these limitations allows you to:
  1. Plan capacity appropriately for your deployment
  2. Deploy monitoring to track utilization
  3. Architect HA solutions for critical environments
  4. Integrate complementary security tools (TLS inspection, EDR, SIEM)
The roadmap addresses all major limitations in a methodical, prioritized manner.
Never deploy ML Defender as a standalone security solution. Use it as part of a defense-in-depth strategy alongside firewalls, IDS/IPS, endpoint protection, and SIEM.
