
Overview

Optimizing PentAGI performance requires balancing LLM provider costs, local resource usage, and response latency. This guide covers practical strategies for different deployment scenarios.

System Requirements

Minimum Requirements

  • CPU: 2 vCPU
  • RAM: 4GB
  • Storage: 20GB free disk space
  • Network: Internet access for LLM APIs

Recommended Requirements

  • CPU: 4-8 vCPU
  • RAM: 16GB+
  • Storage: 100GB SSD
  • GPU: Optional; needed only for local Ollama models
  • Network: Low-latency connection to LLM provider

LLM Provider Performance

Response Time by Provider

| Provider    | Model             | Avg Latency | Use Case             |
|-------------|-------------------|-------------|----------------------|
| OpenAI      | GPT-4 Turbo       | 2-4s        | Balanced performance |
| OpenAI      | GPT-4o-mini       | 1-2s        | Fast iteration       |
| Anthropic   | Claude 3.5 Sonnet | 2-5s        | Complex reasoning    |
| Anthropic   | Claude 3.5 Haiku  | 1-2s        | Quick responses      |
| Google      | Gemini 2.5 Flash  | 1-3s        | Cost-effective speed |
| AWS Bedrock | Claude Sonnet     | 3-6s        | Enterprise deployment |
| Ollama      | Llama 3.1 8B      | 0.5-2s      | Local, GPU-dependent |

Cost Optimization

Strategy: Use cheaper models for simple tasks

```yaml
# Use GPT-4o-mini for simple completions
simple:
  model: "gpt-4o-mini"
  max_tokens: 2000

# Use GPT-4 Turbo for complex analysis
pentester:
  model: "gpt-4-turbo"
  max_tokens: 8000
```

Expected Savings: 60-80% cost reduction compared to using GPT-4 for all tasks.
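A back-of-the-envelope check of that savings claim. The per-million-token prices below are illustrative assumptions, not official figures; verify against current provider pricing before relying on the exact percentage.

```python
# Rough cost comparison: route simple tasks to a cheaper model.
# Prices are illustrative USD per million tokens (verify current pricing).
def cost_usd(tokens_in: int, tokens_out: int, price_in: float, price_out: float) -> float:
    """Cost of one workload at the given per-million-token prices."""
    return tokens_in / 1e6 * price_in + tokens_out / 1e6 * price_out

# A 100K-token workload: 80K input, 20K output, all on GPT-4 Turbo.
all_gpt4 = cost_usd(80_000, 20_000, 10.00, 30.00)

# Route 70% of the workload (simple tasks) to GPT-4o-mini instead.
mixed = (cost_usd(56_000, 14_000, 0.15, 0.60)      # GPT-4o-mini share
         + cost_usd(24_000, 6_000, 10.00, 30.00))  # GPT-4 Turbo share

savings = 1 - mixed / all_gpt4
print(f"{savings:.0%}")  # roughly 69%, inside the quoted 60-80% range
```

The exact figure shifts with the input/output split and the fraction of tasks that are truly "simple", which is why the guide quotes a range rather than a single number.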

Docker Configuration

Container Resource Limits

Optimize Docker resource allocation in docker-compose.yml:

```yaml
services:
  pentagi:
    deploy:
      resources:
        limits:
          cpus: '4.0'
          memory: 8G
        reservations:
          cpus: '2.0'
          memory: 4G
```

Network Optimization

For production deployments, use separate networks:

```yaml
networks:
  pentagi-network:
    driver: bridge
    ipam:
      config:
        - subnet: 172.28.0.0/16
```

Volume Performance

Back named volumes with fast storage (here, an SSD bind mount) for better I/O performance:

```yaml
volumes:
  pentagi-data:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /mnt/ssd/pentagi-data
```

Database Optimization

PostgreSQL with pgvector

Tune PostgreSQL for vector operations:

```sql
-- Increase shared memory
ALTER SYSTEM SET shared_buffers = '4GB';
ALTER SYSTEM SET effective_cache_size = '12GB';
ALTER SYSTEM SET maintenance_work_mem = '1GB';
ALTER SYSTEM SET work_mem = '256MB';

-- Optimize for vector search
ALTER SYSTEM SET max_parallel_workers_per_gather = 4;
ALTER SYSTEM SET random_page_cost = 1.1;
```
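Note that `ALTER SYSTEM` only writes the settings; they do not apply until the configuration is reloaded, and `shared_buffers` in particular only changes on a restart:

```sql
-- Most of the settings above take effect after a configuration reload:
SELECT pg_reload_conf();

-- shared_buffers requires a full server restart
-- (restart the database container to apply it).
```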

Embedding Batch Size

Optimize embedding generation:

```bash
# For fast embedding providers (OpenAI)
EMBEDDING_BATCH_SIZE=100

# For slower providers or rate limits
EMBEDDING_BATCH_SIZE=25

# Strip newlines to reduce token count
EMBEDDING_STRIP_NEW_LINES=true
```
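The batch size trades throughput against rate-limit pressure: larger batches mean fewer API round trips. A minimal sketch of how client-side batching of this kind typically works; the helper below is illustrative, not PentAGI's actual implementation:

```python
# Illustrative client-side batching: split documents into chunks of
# EMBEDDING_BATCH_SIZE and optionally strip newlines before embedding.
import os

BATCH_SIZE = int(os.environ.get("EMBEDDING_BATCH_SIZE", "100"))
STRIP_NEWLINES = os.environ.get("EMBEDDING_STRIP_NEW_LINES", "true") == "true"

def make_batches(texts: list[str], size: int = BATCH_SIZE) -> list[list[str]]:
    """Group texts into consecutive batches of at most `size` items."""
    if STRIP_NEWLINES:
        texts = [t.replace("\n", " ") for t in texts]
    return [texts[i:i + size] for i in range(0, len(texts), size)]

# 250 documents at batch size 100 -> 3 API calls instead of 250.
batches = make_batches(["doc\ntext"] * 250, size=100)
print([len(b) for b in batches])  # [100, 100, 50]
```

With a slow provider or tight rate limits, dropping the size to 25 quarters the per-request payload at the cost of more round trips.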

Summarization Performance

Optimize Summarizer Settings

```bash
# Faster summarization with smaller sections
SUMMARIZER_LAST_SEC_BYTES=40960  # 40KB (faster)
SUMMARIZER_MAX_BP_BYTES=8192     # 8KB (more frequent)

# Slower but more context-aware
SUMMARIZER_LAST_SEC_BYTES=102400 # 100KB (slower)
SUMMARIZER_MAX_BP_BYTES=32768    # 32KB (less frequent)
```
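Rough intuition for the trade-off, assuming a body part is summarized roughly each time it grows past SUMMARIZER_MAX_BP_BYTES (a deliberate simplification of the real trigger logic):

```python
# Rough estimate of how many summarization passes a transcript triggers
# under each threshold (simplified; not the actual trigger algorithm).
def est_passes(transcript_bytes: int, max_bp_bytes: int) -> int:
    return transcript_bytes // max_bp_bytes

transcript = 100 * 1024  # a 100KB conversation

print(est_passes(transcript, 8192))   # 12 passes: smaller, cheaper calls, more often
print(est_passes(transcript, 32768))  # 3 passes: fewer calls, more context per pass
```

Smaller thresholds keep each summarization call fast but invoke the LLM more often; larger thresholds do the opposite.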

Concurrent Summarization

The summarizer uses goroutines for parallel processing. Increase system resources for better performance:

```yaml
services:
  pentagi:
    environment:
      GOMAXPROCS: 4  # Match CPU cores
```

Monitoring and Observability

Langfuse Performance Tracking

Enable detailed performance metrics:

```bash
LANGFUSE_BASE_URL=http://langfuse-web:3000
LANGFUSE_PUBLIC_KEY=your_public_key
LANGFUSE_SECRET_KEY=your_secret_key
```

Monitor in the Langfuse UI:
  • Token usage per agent type
  • Latency distribution
  • Error rates
  • Cost per operation

OpenTelemetry Integration

Track system-level metrics:

```bash
OTEL_HOST=otelcol:8148
```

Access Grafana dashboards at http://localhost:3000 for:
  • Request throughput
  • Response times
  • Resource utilization
  • Error tracking

Scaling Strategies

Vertical Scaling

1. Identify the bottleneck: use monitoring to determine whether CPU, RAM, or network is the constraint.

2. Increase resources: adjust Docker resource limits or upgrade the VM instance.

3. Validate the improvement: measure latency and throughput after each change.

Horizontal Scaling

For production workloads, use a distributed architecture.

Worker Node Isolation

For security-sensitive deployments, isolate worker operations:
```bash
# Main node
DOCKER_HOST=tcp://worker-node:2376
DOCKER_TLS_VERIFY=1
DOCKER_CERT_PATH=/path/to/certs

# Worker node
DOCKER_INSIDE=true
DOCKER_NET_ADMIN=true
```
See Worker Node Setup for detailed configuration.

Network Performance

Proxy Configuration

For isolated environments:

```bash
# Global proxy for all LLM providers
PROXY_URL=http://proxy.internal:8080

# SSL configuration
EXTERNAL_SSL_CA_PATH=/opt/pentagi/ssl/ca-bundle.pem
EXTERNAL_SSL_INSECURE=false
```

Scraper Optimization

```bash
# Public scraper for external URLs
SCRAPER_PUBLIC_URL=https://public-scraper.example.com

# Private scraper for internal targets
SCRAPER_PRIVATE_URL=https://user:pass@scraper-internal/
LOCAL_SCRAPER_MAX_CONCURRENT_SESSIONS=10
```

Provider-Specific Tuning

OpenAI

```bash
# Use streaming for faster perceived response
OPENAI_STREAM=true

# Adjust timeout for large responses
OPENAI_TIMEOUT=60
```

```yaml
# Use GPT-4o-mini for cost savings
simple:
  model: "gpt-4o-mini"
```

Anthropic

```yaml
# Claude 3.5 Haiku for speed
simple:
  model: "claude-3-5-haiku-20241022"

# Claude 4 Opus for complex tasks
pentester:
  model: "claude-4-opus-20250514"
```

AWS Bedrock

```bash
# Use provisioned throughput for consistent latency
BEDROCK_REGION=us-east-1

# Request quota increases for high throughput
# Default: 2 req/min for Claude Sonnet
# Recommended: 50+ req/min
```

Ollama

```bash
# Disable model loading for faster startup
OLLAMA_SERVER_LOAD_MODELS_ENABLED=false

# Increase pull timeout for large models
OLLAMA_SERVER_PULL_MODELS_TIMEOUT=900

# Use quantized models for speed
OLLAMA_SERVER_MODEL=llama3.1:8b-instruct-q8_0
```

Best Practices

Monitor First

Always enable monitoring before optimization to establish baseline metrics

Incremental Changes

Make one change at a time and measure impact before proceeding

Cost vs Performance

Balance response time with API costs based on use case requirements

Test Under Load

Simulate production workloads during testing to validate optimizations

Performance Benchmarks

Typical Workflow Metrics

| Scenario                    | Duration    | Token Usage     | Recommended Config   |
|-----------------------------|-------------|-----------------|----------------------|
| Simple port scan            | 2-5 min     | 10-20K tokens   | GPT-4o-mini, Haiku   |
| Web vulnerability scan      | 10-20 min   | 40-80K tokens   | GPT-4 Turbo, Sonnet  |
| Network penetration test    | 30-60 min   | 100-150K tokens | Claude 4, Gemini 2.5 |
| Complex exploit development | 60-120 min  | 150-250K tokens | Opus, Gemini 2.5 Pro |

Troubleshooting Performance Issues

Slow Responses

  1. Check network connectivity to the LLM provider
  2. Verify Docker resource limits aren't constraining CPU/RAM
  3. Review Langfuse metrics for slow operations
  4. Consider switching to a faster model (e.g., Haiku, Flash)

High Token Costs

  1. Use smaller models for simple tasks
  2. Enable aggressive summarization
  3. Implement caching for repeated queries
  4. Consider a hybrid approach with local Ollama

High Memory Usage

  1. Increase Docker memory limits
  2. Reduce EMBEDDING_BATCH_SIZE
  3. Decrease SUMMARIZER_KEEP_QA_SECTIONS
  4. Clear old vector store data with etester flush

Rate Limiting

  1. Increase provider quotas (AWS Bedrock)
  2. Add retry logic with exponential backoff
  3. Distribute load across multiple API keys
  4. Switch to a provider with higher limits
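The retry-with-backoff suggestion above can be sketched as follows; the RateLimitError class is a placeholder for whatever exception your client raises on HTTP 429:

```python
import random
import time

class RateLimitError(Exception):
    """Placeholder for your client's rate-limit (HTTP 429) exception."""

def call_with_backoff(fn, max_attempts=5, base=0.5, cap=30.0, sleep=time.sleep):
    """Retry fn on rate-limit errors with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
            # Exponential delay (0.5s, 1s, 2s, ...) capped at `cap`,
            # with jitter so concurrent clients don't retry in lockstep.
            delay = min(cap, base * 2 ** attempt) * random.uniform(0.5, 1.0)
            sleep(delay)
```

Wrap each provider call, e.g. `call_with_backoff(lambda: client.chat(...))`; combine it with per-key load distribution for the best effect.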

Related Guides

  • Context Management: Optimize token usage and memory
  • Custom Models: Build Ollama models for local inference
  • Chain Summarization: Reduce context size efficiently
