RAG System

Overview

The RAG System (Retrieval-Augmented Generation) provides natural language intelligence over ML Defender’s security events. Using TinyLlama for language understanding and FAISS for vector search, it enables forensic queries like “Show me all ransomware detections from 10.0.0.50 last week” without SQL or log parsing.

Components

RAG Ingester: Log parsing + vector embeddings
RAG Server: TinyLlama + FAISS query engine
4 FAISS Indices: Temporal, semantic, benign, malicious
etcd Integration: Service discovery

Capabilities

Natural Language Queries: Ask questions in English
Temporal Analysis: “Last week”, “Yesterday morning”
Pattern Recognition: “Similar to this attack”
ML Retraining Data: Export feature vectors

Architecture

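The RAG System consists of two cooperating services: the RAG Ingester, which parses logs and builds the vector indices, and the RAG Server, which answers queries against them.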

RAG Ingester

Multi-Index Strategy

The Ingester maintains 4 specialized FAISS indices for different query patterns:

Chronos Index (Temporal)
SBERT Index (Semantic)
Entity Benign Index
Entity Malicious Index

Chronos Index (Temporal) details:

Dimensions: 128
Purpose: Time-series queries
Optimized For:

“Show me attacks from last week”
“What happened on Monday between 2-4 PM?”
“Hourly attack trends”

Embedding Model: Chronos temporal encoder (ONNX)
PCA: 512d → 128d reduction for efficient storage

Eventual Consistency

The Ingester uses best-effort commits for high availability:

// MultiIndexManager commits indices independently
void MultiIndexManager::commit_all() {
    std::vector<std::future<bool>> futures;
    
    // Parallel commits (non-blocking)
    futures.push_back(std::async(std::launch::async, 
        [this] { return chronos_index_->commit(); }));
    futures.push_back(std::async(std::launch::async, 
        [this] { return sbert_index_->commit(); }));
    futures.push_back(std::async(std::launch::async, 
        [this] { return benign_index_->commit(); }));
    futures.push_back(std::async(std::launch::async, 
        [this] { return malicious_index_->commit(); }));
    
    // Wait for every commit to finish; one failed index does not abort the others
    int successes = 0;
    for (auto& future : futures) {
        if (future.get()) successes++;
    }
    
    // Availability > Consistency: Better 3/4 than 0/4
    if (successes == 4) {
        logger_->info("Commit successful: {}/4 indices", successes);
    } else if (successes > 0) {
        logger_->warn("Partial commit: {}/4 indices", successes);
    } else {
        logger_->error("Commit failed: 0/4 indices");
    }
}

Design Philosophy: Availability over Consistency. Better to have 3/4 indices working than to block and have 0/4.

Configuration

{
  "service": {
    "id": "rag-ingester-default",
    "version": "0.1.0",
    "etcd": {
      "endpoints": ["127.0.0.1:2379"],
      "heartbeat_interval_sec": 10,
      "partner_detector": "ml-detector-default"
    }
  },
  
  "ingester": {
    "input": {
      "source": "file_watch",
      "directory": "/vagrant/logs/rag/synthetic/artifacts",
      "pattern": "*.pb.enc",
      "encrypted": true,
      "compressed": true,
      "delete_after_process": false
    },
    
    "threading": {
      "mode": "single",
      "embedding_workers": 1,
      "indexing_workers": 1
    },
    
    "embedders": {
      "chronos": {
        "enabled": true,
        "onnx_path": "/vagrant/rag-ingester/models/onnx/chronos.onnx",
        "input_dim": 83,
        "output_dim": 512
      },
      "sbert": {
        "enabled": true,
        "onnx_path": "/vagrant/rag-ingester/models/onnx/sbert.onnx",
        "input_dim": 83,
        "output_dim": 384
      },
      "attack": {
        "enabled": true,
        "onnx_path": "/vagrant/rag-ingester/models/onnx/attack.onnx",
        "input_dim": 83,
        "output_dim": 256,
        "benign_sample_rate": 0.1
      }
    },
    
    "pca": {
      "enabled": true,
      "chronos_model": "/vagrant/rag-ingester/models/pca/chronos_512_128.faiss",
      "sbert_model": "/vagrant/rag-ingester/models/pca/sbert_384_96.faiss",
      "attack_model": "/vagrant/rag-ingester/models/pca/attack_256_64.faiss"
    },
    
    "faiss": {
      "index_type": "Flat",
      "metric": "L2",
      "persist_path": "/shared/faiss_indexes",
      "checkpoint_interval_events": 1000
    },
    
    "health": {
      "cv_warning_threshold": 0.20,
      "cv_critical_threshold": 0.15,
      "report_to_etcd": true
    }
  }
}
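For readers unfamiliar with the `faiss` settings above, a `Flat` index with the `L2` metric performs exhaustive nearest-neighbour search: the query is compared against every stored vector. The pure-Python sketch below mimics that behaviour on toy data. It is not FAISS itself, and `flat_l2_search` is an illustrative name. Like FAISS's flat L2 index, it ranks by squared Euclidean distance.

```python
from typing import List, Sequence, Tuple


def flat_l2_search(
    vectors: Sequence[Sequence[float]],
    query: Sequence[float],
    k: int,
) -> List[Tuple[int, float]]:
    """Return the k nearest stored vectors as (position, squared L2 distance)."""
    scored = []
    for position, vector in enumerate(vectors):
        if len(vector) != len(query):
            raise ValueError("dimension mismatch between query and stored vector")
        distance = sum((a - b) ** 2 for a, b in zip(vector, query))
        scored.append((position, distance))
    # Exhaustive scan: rank every candidate, nearest first.
    scored.sort(key=lambda item: item[1])
    return scored[:k]


if __name__ == "__main__":
    stored = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]  # toy 2-d vectors
    print(flat_l2_search(stored, [0.9, 1.2], k=2))
```

Exact search of this kind returns the true nearest neighbours, at a cost that grows linearly with the number of indexed events.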

Threading Modes

Single-threaded (Raspberry Pi)

{
  "threading": {
    "mode": "single",
    "embedding_workers": 1,
    "indexing_workers": 1
  }
}

Memory: ~310MB
Use Case: Resource-constrained environments

Multi-threaded (Server)

{
  "threading": {
    "mode": "parallel",
    "embedding_workers": 3,
    "indexing_workers": 4
  }
}

Memory: ~2GB
Use Case: 64-core servers, high-throughput ingestion

RAG Server (TinyLlama)

Natural Language Query Processing

The RAG Server uses TinyLlama (1.1B parameters) for query understanding:

Query Understanding

User Query: “Show me all ransomware detections from 10.0.0.50 last week”

TinyLlama Extracts:

{
  "intent": "search",
  "attack_type": "ransomware",
  "source_ip": "10.0.0.50",
  "time_range": {
    "start": "2025-11-01T00:00:00Z",
    "end": "2025-11-08T00:00:00Z"
  },
  "index_strategy": ["entity_malicious", "chronos"]
}
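Because the structure above is produced by a language model, it is worth validating before it drives any index search. The sketch below shows one possible check, assuming the JSON shape shown above. `ParsedQuery`, `parse_llm_output` and the `KNOWN_INDICES` names (in particular `sbert` and `entity_benign`, which the example output does not show) are assumptions made for illustration.

```python
import ipaddress
import json
from dataclasses import dataclass
from datetime import datetime
from typing import List

# Assumed index identifiers; only "chronos" and "entity_malicious"
# appear in the example output above.
KNOWN_INDICES = {"chronos", "sbert", "entity_benign", "entity_malicious"}


@dataclass(frozen=True)
class ParsedQuery:
    intent: str
    attack_type: str
    source_ip: str
    start: datetime
    end: datetime
    index_strategy: List[str]


def _parse_utc(value: str) -> datetime:
    # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def parse_llm_output(raw: str) -> ParsedQuery:
    """Validate the model's JSON before it is used to build index queries."""
    data = json.loads(raw)
    ipaddress.ip_address(data["source_ip"])  # raises ValueError if malformed
    start = _parse_utc(data["time_range"]["start"])
    end = _parse_utc(data["time_range"]["end"])
    if start >= end:
        raise ValueError("time_range.start must precede time_range.end")
    unknown = set(data["index_strategy"]) - KNOWN_INDICES
    if unknown:
        raise ValueError(f"unknown indices: {sorted(unknown)}")
    return ParsedQuery(
        intent=data["intent"],
        attack_type=data["attack_type"],
        source_ip=data["source_ip"],
        start=start,
        end=end,
        index_strategy=list(data["index_strategy"]),
    )
```

Rejecting malformed output at this point keeps a hallucinated IP address or an inverted time range from silently returning an empty result set.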

Vector Search

FAISS Queries (parallel):

Entity Malicious Index: Find all events from 10.0.0.50
Chronos Index: Filter by time range (last week)

Results: 47 matching events
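One way to combine the two parallel searches above is to intersect the event IDs that each index returns. The sketch assumes every index search yields a collection of integer event IDs, which the source does not specify. `search_and_intersect` and the sample IDs are illustrative only.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Set


def search_and_intersect(searches: Iterable[Callable[[], Set[int]]]) -> List[int]:
    """Run each index search in parallel and keep the IDs returned by all."""
    pending = list(searches)
    if not pending:
        return []
    with ThreadPoolExecutor(max_workers=len(pending)) as pool:
        result_sets = [set(ids) for ids in pool.map(lambda search: search(), pending)]
    # An event must match every filter (source IP AND time range).
    return sorted(set.intersection(*result_sets))


if __name__ == "__main__":
    # Hypothetical stand-ins for the two index lookups.
    events_from_ip = lambda: {101, 102, 103, 250}
    events_in_range = lambda: {102, 103, 250, 400}
    print(search_and_intersect([events_from_ip, events_in_range]))
```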

Result Aggregation

TinyLlama Summarizes:

Found 47 ransomware detections from 10.0.0.50 last week:

- Nov 1, 14:23: Initial C&C callback (15 external IPs contacted)
- Nov 1, 14:25: SMB lateral movement (8 hosts infected)
- Nov 1, 14:30: Encryption started (payload entropy 7.9)
- Nov 2-7: Daily C&C check-ins (total 39 events)

Blocked: Yes (added to IPSet on Nov 1, 14:23)
Recidivism: 39 attempts after block

Example Queries

"What attacks happened yesterday?"
"Show me DDoS events from last Monday"
"Hourly attack trends for the past week"
"Traffic patterns between 2-4 AM"

ML Retraining Data Export

The RAG System can export feature vectors for ML model retraining:

import pandas as pd

# Query via RAG API (rag_client is assumed to be an initialized client object)
query = """
Export all ransomware detections from the past 30 days
with ground truth labels (blocked = positive, 
                          false_positive = negative)
"""

response = rag_client.query(query)

# Returns Parquet file with:
# - 83 features per event
# - Ground truth labels
# - Metadata (timestamp, IP, attack_type)
df = pd.read_parquet(response.export_path)

print(df.shape)  # (12847, 86)
# 83 features + ground_truth + timestamp + source_ip

Use Cases:

Model drift detection: Compare new data distribution vs training data
Incremental training: Retrain RandomForest on recent attacks
False positive analysis: Identify mislabeled events
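As an illustration of the drift-detection use case, the sketch below flags features whose recent mean has moved away from the training mean, measured in training standard deviations. This is one simple heuristic among many, not a method the source prescribes. The function name, the threshold and the feature names are hypothetical.

```python
from statistics import fmean, pstdev
from typing import Dict, List, Sequence


def drifted_features(
    training: Dict[str, Sequence[float]],
    recent: Dict[str, Sequence[float]],
    threshold: float = 0.5,
) -> List[str]:
    """Return features whose recent mean shifted by more than `threshold`
    training standard deviations."""
    flagged = []
    for name, train_values in training.items():
        train_mean = fmean(train_values)
        recent_mean = fmean(recent[name])
        spread = pstdev(train_values)
        if spread == 0.0:
            # Constant in training: any change at all counts as drift.
            if recent_mean != train_mean:
                flagged.append(name)
            continue
        if abs(recent_mean - train_mean) / spread > threshold:
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    # Toy columns; real input would be columns of the exported Parquet file.
    training = {"pkt_rate": [10, 12, 11, 9, 13], "entropy": [7.0, 7.1, 6.9, 7.0, 7.0]}
    recent = {"pkt_rate": [30, 28, 31, 29, 32], "entropy": [7.0, 7.05, 6.95, 7.0, 7.0]}
    print(drifted_features(training, recent))
```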

Deployment

Prerequisites

sudo apt-get install -y \
    build-essential cmake \
    libzmq3-dev libprotobuf-dev \
    liblz4-dev nlohmann-json3-dev libspdlog-dev

# FAISS (compile from source)
git clone https://github.com/facebookresearch/faiss.git
cd faiss
cmake -B build -DFAISS_ENABLE_GPU=OFF .
make -C build -j$(nproc)
sudo make -C build install

# ONNX Runtime
wget https://github.com/microsoft/onnxruntime/releases/download/v1.16.0/onnxruntime-linux-x64-1.16.0.tgz
tar -xzf onnxruntime-linux-x64-1.16.0.tgz
sudo cp -r onnxruntime-linux-x64-1.16.0/lib/* /usr/local/lib/
sudo ldconfig  # refresh the linker cache so the new libraries are found

Build RAG Ingester

Navigate

cd /vagrant/rag-ingester
mkdir -p build && cd build

Configure

cmake .. -DCMAKE_BUILD_TYPE=Release

Compile

make -j$(nproc)

Run RAG Ingester

./rag-ingester /vagrant/rag-ingester/config/rag-ingester.json

Real-time Output:

[RAG-INGESTER] 🚀 Starting RAG Ingester v0.1.0
[RAG-INGESTER] 📁 Watching directory: /vagrant/logs/rag/synthetic/artifacts
[RAG-INGESTER] 🧠 Loaded 3 ONNX models:
  - Chronos: 83 → 512 dimensions
  - SBERT: 83 → 384 dimensions
  - Attack: 83 → 256 dimensions
[RAG-INGESTER] 📊 PCA enabled: 512→128, 384→96, 256→64
[RAG-INGESTER] 📂 Created 4 FAISS indices (Flat, L2 metric)
[RAG-INGESTER] ✅ Ready for ingestion

[INGESTION] File: ml_detector_2025-11-01_14-23-15.pb.enc
[DECRYPT] ChaCha20-Poly1305 decryption: OK
[DECOMPRESS] LZ4 decompression: OK (1024 bytes)
[PARSE] Protobuf parsed: 47 events
[EMBED] Chronos: 47 vectors (512d)
[EMBED] SBERT: 47 vectors (384d)
[EMBED] Attack: 47 vectors (256d)
[PCA] Dimensionality reduction: 512→128, 384→96, 256→64
[INDEX] Added to 4 FAISS indices
[COMMIT] Checkpoint: 1000 events indexed

Run RAG Server

cd /vagrant/rag
python rag_server.py --config config/rag_config.json

Troubleshooting

FAISS Index Not Found

# Verify index files exist
ls -lh /shared/faiss_indexes/

# Should see:
# chronos_index.faiss
# sbert_index.faiss
# benign_index.faiss
# malicious_index.faiss

# Rebuild indices if missing
./rag-ingester --rebuild-indices

ONNX Model Loading Fails

# Verify ONNX Runtime installation
ldconfig -p | grep onnxruntime

# Check model files exist
ls -lh /vagrant/rag-ingester/models/onnx/*.onnx

# Validate ONNX models
python -c "import onnx; onnx.checker.check_model('/vagrant/rag-ingester/models/onnx/chronos.onnx')"

TinyLlama Out of Memory

Symptom: OOM error during query processing
Solution: Use 4-bit quantization:

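from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit loading requires the bitsandbytes and accelerate packages
model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto"
)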

Memory: 2GB → 600MB

File Watcher Not Detecting Files

# Check inotify limits
cat /proc/sys/fs/inotify/max_user_watches

# Increase if needed
echo 524288 | sudo tee /proc/sys/fs/inotify/max_user_watches

# Make persistent
echo "fs.inotify.max_user_watches=524288" | \
  sudo tee -a /etc/sysctl.conf
sudo sysctl -p

Roadmap

Priority 1.1: Firewall Log Parsing

Goal: Ingest firewall-agent logs for ground truth linking

Detection ↔ Block Linking

Link ML Detector events to Firewall Agent blocks:

[ML-DETECTOR] 10.0.0.50 → Ransomware (14:23:15)
      ↓ (5ms latency)
[FIREWALL] 10.0.0.50 added to IPSet (14:23:15)

Cross-component Queries

“Show me all detections that were NOT blocked”
“What’s the latency between detection and blocking?”
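Once firewall logs are ingested, the linking and the two cross-component queries above could be computed along these lines. This is a sketch of a planned feature, so every name here (`Detection`, `unblocked_detections`, `block_latencies`) and the assumption that blocks are keyed by source IP are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass(frozen=True)
class Detection:
    source_ip: str
    detected_at: datetime


def unblocked_detections(
    detections: List[Detection], blocks: Dict[str, datetime]
) -> List[Detection]:
    """Detections whose source IP never appears in the firewall block log."""
    return [d for d in detections if d.source_ip not in blocks]


def block_latencies(
    detections: List[Detection], blocks: Dict[str, datetime]
) -> List[timedelta]:
    """Delay between each detection and the block it triggered.

    `blocks` maps an IP to the time it was added to the IPSet (assumed shape).
    Detections that arrive after the IP was already blocked are skipped,
    since they are repeat attempts rather than new blocks.
    """
    latencies = []
    for d in detections:
        blocked_at = blocks.get(d.source_ip)
        if blocked_at is not None and blocked_at >= d.detected_at:
            latencies.append(blocked_at - d.detected_at)
    return latencies


if __name__ == "__main__":
    t0 = datetime(2025, 11, 1, 14, 23, 15)
    detections = [Detection("10.0.0.50", t0), Detection("10.0.0.9", t0)]
    blocks = {"10.0.0.50": t0 + timedelta(milliseconds=5)}
    print(unblocked_detections(detections, blocks))
    print(block_latencies(detections, blocks))
```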

Priority 1.2: Temporal Queries

Goal: Natural language time expressions

"Yesterday morning" → 2025-11-07 06:00-12:00
"Last Monday" → 2025-11-03 00:00-23:59
"Past 3 hours" → now - 3h to now

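The three mappings above can be reproduced with a small rule-based resolver. The sketch below handles only those three expressions, relative to an explicit `now`. `resolve_time_expression` is a hypothetical name, and a production parser would need many more patterns plus time zone handling.

```python
import re
from datetime import datetime, timedelta
from typing import Tuple


def resolve_time_expression(text: str, now: datetime) -> Tuple[datetime, datetime]:
    """Map a small set of English time expressions to (start, end)."""
    phrase = text.strip().lower()
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)

    if phrase == "yesterday morning":
        day = midnight - timedelta(days=1)
        return day + timedelta(hours=6), day + timedelta(hours=12)

    if phrase == "last monday":
        # Most recent Monday strictly before today (Monday is weekday 0).
        days_back = now.weekday() or 7
        day = midnight - timedelta(days=days_back)
        return day, day + timedelta(hours=23, minutes=59)

    match = re.fullmatch(r"past (\d+) hours?", phrase)
    if match:
        return now - timedelta(hours=int(match.group(1))), now

    raise ValueError(f"unsupported time expression: {text!r}")


if __name__ == "__main__":
    now = datetime(2025, 11, 8, 15, 30)  # a Saturday, matching the examples
    for expression in ("Yesterday morning", "Last Monday", "Past 3 hours"):
        print(expression, "->", resolve_time_expression(expression, now))
```
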
Priority 1.3: Aggregation & Statistics

Goal: Summary queries

"Top 10 malicious IPs this month"
"Hourly attack distribution"
"Recidivism rate (% of blocked IPs that retry)"

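The three summary queries above reduce to simple counting once events are retrieved. The sketch below assumes each event is available as a `(source_ip, timestamp)` pair and that blocked and retrying IPs are available as sets. These shapes and all function names are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, List, Set, Tuple

Event = Tuple[str, datetime]  # (source_ip, timestamp), an assumed shape


def top_malicious_ips(events: Iterable[Event], n: int = 10) -> List[Tuple[str, int]]:
    """Most frequent source IPs with their event counts."""
    return Counter(ip for ip, _ in events).most_common(n)


def hourly_distribution(events: Iterable[Event]) -> List[int]:
    """Number of events per hour of day, as a 24-element list."""
    counts = Counter(ts.hour for _, ts in events)
    return [counts.get(hour, 0) for hour in range(24)]


def recidivism_rate(blocked_ips: Set[str], retry_ips: Set[str]) -> float:
    """Percentage of blocked IPs that were seen again after the block."""
    if not blocked_ips:
        return 0.0
    return 100.0 * len(blocked_ips & retry_ips) / len(blocked_ips)


if __name__ == "__main__":
    events = [
        ("10.0.0.50", datetime(2025, 11, 1, 14, 23)),
        ("10.0.0.50", datetime(2025, 11, 1, 14, 25)),
        ("10.0.0.7", datetime(2025, 11, 2, 3, 0)),
    ]
    print(top_malicious_ips(events, n=1))
    print(hourly_distribution(events))
    print(recidivism_rate({"10.0.0.50", "10.0.0.7"}, {"10.0.0.50"}))
```
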
Next Steps

Sniffer

Configure network packet capture

ML Detector

Set up ML inference pipeline

Firewall Agent

Deploy autonomous blocking

Model Training

Retrain models with RAG-exported data
