
Overview

Pensar Apex supports local vLLM models for offline penetration testing without relying on cloud API providers. This enables:
  • Air-gapped testing - Run pentests in isolated networks without internet access
  • Cost savings - No per-token API charges
  • Data privacy - All inference happens locally
  • Custom models - Use fine-tuned models optimized for security testing
Anthropic models (Claude) provide the best performance for penetration testing. vLLM is recommended for specialized use cases where cloud APIs are not feasible.

What is vLLM?

vLLM is a high-throughput inference engine for large language models. It provides:
  • Fast inference with PagedAttention and continuous batching
  • OpenAI-compatible API - drop-in replacement for OpenAI endpoints
  • Quantization support - run large models on consumer GPUs
  • Multi-GPU support - scale inference across multiple GPUs

Quick Start

1. Install vLLM

Install vLLM with GPU support:
# CUDA 12.1
pip install vllm

# Or with Docker
docker pull vllm/vllm-openai:latest

2. Start vLLM Server

Launch the vLLM server with your chosen model:
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 2  # For multi-GPU
Or with Docker:
docker run --gpus all \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-70B-Instruct

3. Configure Apex

Set the LOCAL_MODEL_URL environment variable:
export LOCAL_MODEL_URL="http://localhost:8000/v1"
Or add to .env:
# .env
LOCAL_MODEL_URL=http://localhost:8000/v1
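A malformed URL here is a common source of silent failures. The following is an illustrative Python sketch (not part of Apex) that catches the usual mistakes, such as a missing /v1 suffix or a missing scheme:

```python
import os
from urllib.parse import urlparse

def validate_local_model_url(url: str) -> list[str]:
    """Return a list of likely problems with a LOCAL_MODEL_URL value."""
    problems = []
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        problems.append("URL must start with http:// or https://")
    if not parsed.netloc:
        problems.append("URL is missing a host (e.g. localhost:8000)")
    if not parsed.path.rstrip("/").endswith("/v1"):
        problems.append("OpenAI-compatible endpoints usually end in /v1")
    return problems

# Read the variable the way Apex would, defaulting to the value above
url = os.environ.get("LOCAL_MODEL_URL", "http://localhost:8000/v1")
print(validate_local_model_url(url))  # an empty list means the URL looks well-formed
```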

4. Select Local Model in Apex

In the Apex TUI, navigate to the Models screen:
  1. Enter your model name in the “Custom local model (vLLM)” input (for example, meta-llama/Llama-3.1-70B-Instruct)
  2. The model will then appear in the model selection list

5. Run Pentest with Local Model

pensar pentest \
  --target https://example.com \
  --model meta-llama/Llama-3.1-70B-Instruct

Configuration

Environment Variables

Apex detects local model configuration via LOCAL_MODEL_URL:
// From src/core/providers/utils.ts
export function isProviderConfigured(
  providerId: ProviderType,
  config: Config,
): boolean {
  switch (providerId) {
    case "local":
      return !!(
        config.localModelUrl ||
        config.localModelName ||
        process.env.LOCAL_MODEL_URL
      );
    // ...
  }
}
Options:
| Variable | Description | Example |
| --- | --- | --- |
| LOCAL_MODEL_URL | vLLM server endpoint | http://localhost:8000/v1 |
| localModelName | Model identifier | meta-llama/Llama-3.1-70B-Instruct |
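The TypeScript check above boils down to a simple precedence rule: the local provider counts as configured if any of the config fields or the environment variable supplies a value. A Python sketch of the same logic (the function name and dict-based config are illustrative, not Apex's API):

```python
import os

def is_local_provider_configured(config: dict, env=os.environ) -> bool:
    """Mirror of Apex's isProviderConfigured("local", ...) check:
    any one of the three sources marks the provider as configured."""
    return bool(
        config.get("localModelUrl")
        or config.get("localModelName")
        or env.get("LOCAL_MODEL_URL")
    )

print(is_local_provider_configured({}, env={}))  # False
print(is_local_provider_configured({"localModelUrl": "http://localhost:8000/v1"}, env={}))  # True
```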

Provider Configuration

// From src/core/providers/types.ts
{
  id: "local",
  name: "Local LLM",
  description: "OpenAI-compatible local model (vLLM, LM Studio, Ollama)",
  requiresAPIKey: false,
}
vLLM endpoints are OpenAI-compatible, so no API key is required.

Model Selection

Apex automatically adds your configured local model to the available models list:
// From src/core/providers/utils.ts
if (isProviderConfigured("local", config) && config.localModelName) {
  models.push({
    id: config.localModelName,
    name: config.localModelName,
    provider: "local",
  });
}
The model name you provide in the TUI becomes the model ID used for API requests.

For Penetration Testing

These open-source models perform well for security testing:
| Model | Size | VRAM Required | Best For |
| --- | --- | --- | --- |
| Llama 3.1 70B Instruct | 70B | ~140GB (FP16) | Comprehensive pentesting |
| Llama 3.1 8B Instruct | 8B | ~16GB (FP16) | Quick scans, resource-constrained |
| DeepSeek Coder 33B | 33B | ~66GB (FP16) | Code analysis (whitebox) |
| Qwen 2.5 Coder 32B | 32B | ~64GB (FP16) | Code security reviews |
| Mistral Large 2 | 123B | ~246GB (FP16) | Enterprise pentesting |
Open-source models do not match Claude 4.5 Sonnet in security reasoning and vulnerability detection. Use Anthropic models when possible.

Quantization Options

For consumer GPUs, use quantized models:
| Quantization | VRAM Savings | Quality Loss |
| --- | --- | --- |
| FP16 | Baseline | None |
| INT8 | ~50% | Minimal |
| INT4 | ~75% | Moderate |
Example with quantization:
vllm serve TheBloke/Llama-2-70B-Chat-AWQ \
  --quantization awq \
  --dtype half
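The tables above follow from a rule of thumb: weight memory is roughly parameter count times bytes per parameter (2 for FP16, 1 for INT8, 0.5 for INT4). This sketch estimates weights only; actual usage is higher once vLLM allocates the KV cache:

```python
def estimate_weight_vram_gb(params_billion: float, quantization: str = "fp16") -> float:
    """Rough VRAM for model weights alone (excludes KV cache, activations,
    and CUDA overhead): parameter count times bytes per parameter."""
    bytes_per_param = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}[quantization.lower()]
    return params_billion * bytes_per_param

# A 70B model: ~140 GB in FP16, ~70 GB in INT8, ~35 GB in INT4 (weights only)
for q in ("fp16", "int8", "int4"):
    print(q, estimate_weight_vram_gb(70, q))
```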

vLLM Server Setup

Standalone Server

# Install vLLM
pip install vllm

# Download model (first run)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --download-dir ~/.cache/vllm

# Serve model
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 4  # Use 4 GPUs

Docker Deployment

# Pull vLLM image
docker pull vllm/vllm-openai:latest

# Run with GPU access
docker run --gpus all \
  --name vllm-server \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2

Docker Compose for vLLM + Apex

# docker-compose.yml
services:
  vllm:
    image: vllm/vllm-openai:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    command:
      - --model
      - meta-llama/Llama-3.1-70B-Instruct
      - --host
      - 0.0.0.0
      - --port
      - "8000"

  kali-apex:
    build: ./container
    depends_on:
      - vllm
    environment:
      - LOCAL_MODEL_URL=http://vllm:8000/v1
Start both services:
docker compose up -d
docker compose exec kali-apex bash
pensar pentest --target https://example.com --model meta-llama/Llama-3.1-70B-Instruct

Multi-Node Setup

For large models across multiple machines:
# Node 1 (head node)
vllm serve meta-llama/Llama-3.1-405B-Instruct \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 2

# Apex points to head node
export LOCAL_MODEL_URL="http://head-node:8000/v1"

Verifying vLLM Setup

Test the Endpoint

curl http://localhost:8000/v1/models
Expected output:
{
  "object": "list",
  "data": [
    {
      "id": "meta-llama/Llama-3.1-70B-Instruct",
      "object": "model",
      "created": 1234567890,
      "owned_by": "vllm"
    }
  ]
}
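The `id` field in this response is the exact string Apex must send as the model name, so it is worth extracting programmatically. A small illustrative sketch that parses the response body shown above:

```python
import json

# The JSON body returned by GET /v1/models, as shown above
response_body = """
{
  "object": "list",
  "data": [
    {"id": "meta-llama/Llama-3.1-70B-Instruct", "object": "model",
     "created": 1234567890, "owned_by": "vllm"}
  ]
}
"""

def served_model_ids(body: str) -> list[str]:
    """Extract the exact model IDs the server advertises; pass one of
    these verbatim to `pensar pentest --model`."""
    return [m["id"] for m in json.loads(body)["data"]]

print(served_model_ids(response_body))
```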

Test Completion

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "What is SQL injection?"}],
    "max_tokens": 100
  }'
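The same request can be issued from a script. This sketch only builds the JSON body (the helper name is illustrative); POST it to `{LOCAL_MODEL_URL}/chat/completions` with any HTTP client:

```python
import json

def build_chat_request(model: str, prompt: str, max_tokens: int = 100) -> dict:
    """Payload for POST /v1/chat/completions, identical in shape to the
    curl example above."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = build_chat_request("meta-llama/Llama-3.1-70B-Instruct", "What is SQL injection?")
print(json.dumps(payload))
```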

Check Apex Detection

# Run Apex doctor
pensar doctor
Expected output:
AI Provider Configuration:
- vLLM (local): ✓ Configured (http://localhost:8000/v1)

Performance Tuning

GPU Memory Optimization

# --gpu-memory-utilization 0.9  → use 90% of VRAM
# --max-model-len 4096          → reduce context length
# --tensor-parallel-size 2      → split across 2 GPUs
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --gpu-memory-utilization 0.9 \
  --max-model-len 4096 \
  --tensor-parallel-size 2

Batch Processing

# --max-num-batched-tokens 8192 → larger batches
# --max-num-seqs 256            → more concurrent requests
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 256

Quantization

# AWQ quantization (4-bit)
vllm serve TheBloke/Llama-2-70B-Chat-AWQ \
  --quantization awq

# GPTQ quantization (4-bit)
vllm serve TheBloke/Llama-2-70B-Chat-GPTQ \
  --quantization gptq

Troubleshooting

“LOCAL_MODEL_URL not detected”

Cause: Environment variable not set. Solution:
export LOCAL_MODEL_URL="http://localhost:8000/v1"
# Or add to ~/.bashrc or container/.env

“Connection refused” to vLLM

Cause: vLLM server not running or wrong port. Solution:
  1. Check vLLM is running: curl http://localhost:8000/v1/models
  2. Verify port: netstat -tuln | grep 8000
  3. Check Docker networking if using containers:
    docker network inspect bridge
    

“Out of memory” on vLLM

Cause: Model too large for available VRAM. Solutions:
  1. Use a smaller model (8B instead of 70B)
  2. Enable quantization (INT8 or INT4)
  3. Reduce --gpu-memory-utilization to 0.8
  4. Reduce --max-model-len to limit context size
  5. Add more GPUs with --tensor-parallel-size

“Model not found” in Apex

Cause: Model name mismatch between vLLM and Apex. Solution:
  1. Check the exact model ID from vLLM:
    curl http://localhost:8000/v1/models
    
  2. Use that exact ID in Apex:
    pensar pentest --target https://example.com --model <exact-id>
    

Slow inference

Cause: Model too large, insufficient GPU compute. Solutions:
  1. Use a smaller/faster model (8B or 13B)
  2. Enable quantization
  3. Increase --tensor-parallel-size to use more GPUs
  4. Check GPU utilization: nvidia-smi

Comparing Cloud vs Local Models

| Feature | Anthropic (Cloud) | vLLM (Local) |
| --- | --- | --- |
| Performance | ⭐⭐⭐⭐⭐ Best | ⭐⭐⭐ Good |
| Setup | Simple (API key) | Complex (GPU server) |
| Cost | Per-token charges | Hardware + electricity |
| Privacy | Data sent to Anthropic | 100% local |
| Availability | Requires internet | Air-gapped OK |
| Latency | ~2-5s per request | ~1-10s (varies by GPU) |
| Context Length | 200k tokens | Model dependent (4k-128k) |
Use Anthropic for production pentests. vLLM is best for:
  • Air-gapped environments
  • High-volume testing (cost savings)
  • Custom fine-tuned models
  • Data sovereignty requirements
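The cost row in the table above can be made concrete with back-of-envelope arithmetic. Every number in this sketch is a hypothetical placeholder; substitute your actual API pricing, GPU power draw, and electricity rate:

```python
def cloud_cost_usd(total_tokens: int, price_per_mtok: float) -> float:
    """Cloud cost: tokens billed per million."""
    return total_tokens / 1_000_000 * price_per_mtok

def local_cost_usd(hours: float, gpu_power_kw: float, usd_per_kwh: float) -> float:
    """Local marginal cost: electricity only (hardware amortization omitted)."""
    return hours * gpu_power_kw * usd_per_kwh

# Hypothetical figures for illustration only
cloud = cloud_cost_usd(50_000_000, price_per_mtok=10.0)          # 50M tokens at $10/Mtok
local = local_cost_usd(24, gpu_power_kw=1.4, usd_per_kwh=0.15)   # 2 GPUs for a day
print(cloud, local)
```

The break-even point depends heavily on utilization: local hardware only pays off when it is kept busy, since the capital cost (omitted above) is fixed.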

Advanced: Fine-Tuning for Security

To create a custom security-focused model:

1. Prepare Training Data

Collect vulnerability reports, exploit descriptions, and security documentation:
[
  {
    "prompt": "Analyze this code for SQL injection vulnerabilities:",
    "completion": "The query concatenates user input directly..."
  }
]
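Fine-tuning frameworks typically expect JSONL (one JSON object per line) rather than a JSON array. A small conversion sketch, using the field names from the example above (confirm the schema your framework actually expects):

```python
import json

# Records in the shape shown above
examples = [
    {
        "prompt": "Analyze this code for SQL injection vulnerabilities:",
        "completion": "The query concatenates user input directly...",
    }
]

def to_jsonl(records: list[dict]) -> str:
    """Serialize one JSON object per line, the layout most trainers expect."""
    return "\n".join(json.dumps(r) for r in records)

with open("train.jsonl", "w") as fh:
    fh.write(to_jsonl(examples))
```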

2. Fine-Tune Base Model

Use frameworks like Axolotl or LLaMA Factory:
axolotl train config.yml

3. Serve Fine-Tuned Model

vllm serve ./my-security-model \
  --host 0.0.0.0 \
  --port 8000

4. Use in Apex

export LOCAL_MODEL_URL="http://localhost:8000/v1"
pensar pentest --target https://example.com --model my-security-model

Alternative Local Inference Engines

vLLM is recommended, but Apex also supports other OpenAI-compatible servers:

Ollama

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model
ollama pull llama3.1:70b

# Serve with OpenAI compatibility (Ollama binds via the OLLAMA_HOST variable)
OLLAMA_HOST=0.0.0.0:11434 ollama serve

# Configure Apex
export LOCAL_MODEL_URL="http://localhost:11434/v1"

LM Studio

  1. Download LM Studio
  2. Load a model
  3. Start the local server (port 1234 by default)
  4. Configure Apex:
    export LOCAL_MODEL_URL="http://localhost:1234/v1"
    

Text Generation Inference (TGI)

docker run --gpus all \
  -p 8000:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-70B-Instruct

export LOCAL_MODEL_URL="http://localhost:8000/v1"

Next Steps

Blackbox Testing

Run blackbox pentests with your local model

Whitebox Testing

Analyze source code using local inference

Docker Setup

Deploy vLLM + Apex in containers

Authentication

Test auth flows with local models
