
Overview

Optimizing PentAGI performance requires balancing LLM provider costs, local resource usage, and response latency. This guide covers practical strategies for different deployment scenarios.

System Requirements

Minimum Requirements

  • CPU: 2 vCPU
  • RAM: 4GB
  • Storage: 20GB free disk space
  • Network: Internet access for LLM APIs

Recommended Requirements

  • CPU: 4-8 vCPU
  • RAM: 16GB+
  • Storage: 100GB SSD
  • GPU: Optional; needed only for local Ollama models
  • Network: Low-latency connection to LLM provider

LLM Provider Performance

Response Time by Provider

| Provider    | Model             | Avg Latency | Use Case             |
|-------------|-------------------|-------------|----------------------|
| OpenAI      | GPT-4 Turbo       | 2-4s        | Balanced performance |
| OpenAI      | GPT-4o-mini       | 1-2s        | Fast iteration       |
| Anthropic   | Claude 3.5 Sonnet | 2-5s        | Complex reasoning    |
| Anthropic   | Claude 3.5 Haiku  | 1-2s        | Quick responses      |
| Google      | Gemini 2.5 Flash  | 1-3s        | Cost-effective speed |
| AWS Bedrock | Claude Sonnet     | 3-6s        | Enterprise deployment |
| Ollama      | Llama 3.1 8B      | 0.5-2s      | Local, GPU-dependent |

Cost Optimization

Strategy: Use cheaper models for simple tasks

```yaml
# Use GPT-4o-mini for simple completions
simple:
  model: "gpt-4o-mini"
  max_tokens: 2000

# Use GPT-4 Turbo for complex analysis
pentester:
  model: "gpt-4-turbo"
  max_tokens: 8000
```

Expected Savings: 60-80% cost reduction compared to using GPT-4 for all tasks.
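A back-of-the-envelope check of that savings claim. The per-million-token prices below are illustrative assumptions, not official figures; verify against current provider pricing before relying on the exact percentage.

```python
# Rough cost comparison: route simple tasks to a cheaper model.
# Prices are illustrative USD per million tokens (verify current pricing).
def cost_usd(tokens_in: int, tokens_out: int, price_in: float, price_out: float) -> float:
    """Cost of one workload at the given per-million-token prices."""
    return tokens_in / 1e6 * price_in + tokens_out / 1e6 * price_out

# A 100K-token workload: 80K input, 20K output, all on GPT-4 Turbo.
all_gpt4 = cost_usd(80_000, 20_000, 10.00, 30.00)

# Route 70% of the workload (simple tasks) to GPT-4o-mini instead.
mixed = (cost_usd(56_000, 14_000, 0.15, 0.60)      # GPT-4o-mini share
         + cost_usd(24_000, 6_000, 10.00, 30.00))  # GPT-4 Turbo share

savings = 1 - mixed / all_gpt4
print(f"{savings:.0%}")  # roughly 69%, inside the quoted 60-80% range
```

The exact figure shifts with the input/output split and the fraction of tasks that are truly "simple", which is why the guide quotes a range rather than a single number.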

Docker Configuration

Container Resource Limits

Optimize Docker resource allocation in docker-compose.yml:

```yaml
services:
  pentagi:
    deploy:
      resources:
        limits:
          cpus: '4.0'
          memory: 8G
        reservations:
          cpus: '2.0'
          memory: 4G
```

Network Optimization

For production deployments, use separate networks:

```yaml
networks:
  pentagi-network:
    driver: bridge
    ipam:
      config:
        - subnet: 172.28.0.0/16
```

Volume Performance

Back named volumes with fast storage (here, an SSD bind mount) for better I/O performance:

```yaml
volumes:
  pentagi-data:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /mnt/ssd/pentagi-data
```

Database Optimization

PostgreSQL with pgvector

Tune PostgreSQL for vector operations:

```sql
-- Increase shared memory
ALTER SYSTEM SET shared_buffers = '4GB';
ALTER SYSTEM SET effective_cache_size = '12GB';
ALTER SYSTEM SET maintenance_work_mem = '1GB';
ALTER SYSTEM SET work_mem = '256MB';

-- Optimize for vector search
ALTER SYSTEM SET max_parallel_workers_per_gather = 4;
ALTER SYSTEM SET random_page_cost = 1.1;
```
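Note that `ALTER SYSTEM` only writes the settings; they do not apply until the configuration is reloaded, and `shared_buffers` in particular only changes on a restart:

```sql
-- Most of the settings above take effect after a configuration reload:
SELECT pg_reload_conf();

-- shared_buffers requires a full server restart
-- (restart the database container to apply it).
```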

Embedding Batch Size

Optimize embedding generation:

```bash
# For fast embedding providers (OpenAI)
EMBEDDING_BATCH_SIZE=100

# For slower providers or rate limits
EMBEDDING_BATCH_SIZE=25

# Strip newlines to reduce token count
EMBEDDING_STRIP_NEW_LINES=true
```
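The batch size trades throughput against rate-limit pressure: larger batches mean fewer API round trips. A minimal sketch of how client-side batching of this kind typically works; the helper below is illustrative, not PentAGI's actual implementation:

```python
# Illustrative client-side batching: split documents into chunks of
# EMBEDDING_BATCH_SIZE and optionally strip newlines before embedding.
import os

BATCH_SIZE = int(os.environ.get("EMBEDDING_BATCH_SIZE", "100"))
STRIP_NEWLINES = os.environ.get("EMBEDDING_STRIP_NEW_LINES", "true") == "true"

def make_batches(texts: list[str], size: int = BATCH_SIZE) -> list[list[str]]:
    """Group texts into consecutive batches of at most `size` items."""
    if STRIP_NEWLINES:
        texts = [t.replace("\n", " ") for t in texts]
    return [texts[i:i + size] for i in range(0, len(texts), size)]

# 250 documents at batch size 100 -> 3 API calls instead of 250.
batches = make_batches(["doc\ntext"] * 250, size=100)
print([len(b) for b in batches])  # [100, 100, 50]
```

With a slow provider or tight rate limits, dropping the size to 25 quarters the per-request payload at the cost of more round trips.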

Summarization Performance

Optimize Summarizer Settings

```bash
# Faster summarization with smaller sections
SUMMARIZER_LAST_SEC_BYTES=40960  # 40KB (faster)
SUMMARIZER_MAX_BP_BYTES=8192     # 8KB (more frequent)

# Slower but more context-aware
SUMMARIZER_LAST_SEC_BYTES=102400 # 100KB (slower)
SUMMARIZER_MAX_BP_BYTES=32768    # 32KB (less frequent)
```
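Rough intuition for the trade-off, assuming a body part is summarized roughly each time it grows past SUMMARIZER_MAX_BP_BYTES (a deliberate simplification of the real trigger logic):

```python
# Rough estimate of how many summarization passes a transcript triggers
# under each threshold (simplified; not the actual trigger algorithm).
def est_passes(transcript_bytes: int, max_bp_bytes: int) -> int:
    return transcript_bytes // max_bp_bytes

transcript = 100 * 1024  # a 100KB conversation

print(est_passes(transcript, 8192))   # 12 passes: smaller, cheaper calls, more often
print(est_passes(transcript, 32768))  # 3 passes: fewer calls, more context per pass
```

Smaller thresholds keep each summarization call fast but invoke the LLM more often; larger thresholds do the opposite.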

Concurrent Summarization

The summarizer uses goroutines for parallel processing. Increase system resources for better performance:

```yaml
services:
  pentagi:
    environment:
      GOMAXPROCS: 4  # Match CPU cores
```

Monitoring and Observability

Langfuse Performance Tracking

Enable detailed performance metrics:

```bash
LANGFUSE_BASE_URL=http://langfuse-web:3000
LANGFUSE_PUBLIC_KEY=your_public_key
LANGFUSE_SECRET_KEY=your_secret_key
```

Monitor in the Langfuse UI:
  • Token usage per agent type
  • Latency distribution
  • Error rates
  • Cost per operation

OpenTelemetry Integration

Track system-level metrics:

```bash
OTEL_HOST=otelcol:8148
```

Access Grafana dashboards at http://localhost:3000 for:
  • Request throughput
  • Response times
  • Resource utilization
  • Error tracking

Scaling Strategies

Vertical Scaling

1. Identify the bottleneck: use monitoring to determine whether CPU, RAM, or network is the constraint.

2. Increase resources: adjust Docker resource limits or upgrade the VM instance.

3. Validate the improvement: measure latency and throughput after each change.

Horizontal Scaling

For production workloads, use a distributed architecture.

Worker Node Isolation

For security-sensitive deployments, isolate worker operations:
```bash
# Main node
DOCKER_HOST=tcp://worker-node:2376
DOCKER_TLS_VERIFY=1
DOCKER_CERT_PATH=/path/to/certs

# Worker node
DOCKER_INSIDE=true
DOCKER_NET_ADMIN=true
```
See Worker Node Setup for detailed configuration.

Network Performance

Proxy Configuration

For isolated environments:

```bash
# Global proxy for all LLM providers
PROXY_URL=http://proxy.internal:8080

# SSL configuration
EXTERNAL_SSL_CA_PATH=/opt/pentagi/ssl/ca-bundle.pem
EXTERNAL_SSL_INSECURE=false
```

Scraper Optimization

```bash
# Public scraper for external URLs
SCRAPER_PUBLIC_URL=https://public-scraper.example.com

# Private scraper for internal targets
SCRAPER_PRIVATE_URL=https://user:pass@scraper-internal/
LOCAL_SCRAPER_MAX_CONCURRENT_SESSIONS=10
```

Provider-Specific Tuning

OpenAI

```bash
# Use streaming for faster perceived response
OPENAI_STREAM=true

# Adjust timeout for large responses
OPENAI_TIMEOUT=60
```

```yaml
# Use GPT-4o-mini for cost savings
simple:
  model: "gpt-4o-mini"
```

Anthropic

```yaml
# Claude 3.5 Haiku for speed
simple:
  model: "claude-3-5-haiku-20241022"

# Claude 4 Opus for complex tasks
pentester:
  model: "claude-4-opus-20250514"
```

AWS Bedrock

```bash
# Use provisioned throughput for consistent latency
BEDROCK_REGION=us-east-1

# Request quota increases for high throughput
# Default: 2 req/min for Claude Sonnet
# Recommended: 50+ req/min
```

Ollama

```bash
# Disable model loading for faster startup
OLLAMA_SERVER_LOAD_MODELS_ENABLED=false

# Increase pull timeout for large models
OLLAMA_SERVER_PULL_MODELS_TIMEOUT=900

# Use quantized models for speed
OLLAMA_SERVER_MODEL=llama3.1:8b-instruct-q8_0
```

Best Practices

Monitor First

Always enable monitoring before optimization to establish baseline metrics

Incremental Changes

Make one change at a time and measure impact before proceeding

Cost vs Performance

Balance response time with API costs based on use case requirements

Test Under Load

Simulate production workloads during testing to validate optimizations

Performance Benchmarks

Typical Workflow Metrics

| Scenario                    | Duration    | Token Usage     | Recommended Config   |
|-----------------------------|-------------|-----------------|----------------------|
| Simple port scan            | 2-5 min     | 10-20K tokens   | GPT-4o-mini, Haiku   |
| Web vulnerability scan      | 10-20 min   | 40-80K tokens   | GPT-4 Turbo, Sonnet  |
| Network penetration test    | 30-60 min   | 100-150K tokens | Claude 4, Gemini 2.5 |
| Complex exploit development | 60-120 min  | 150-250K tokens | Opus, Gemini 2.5 Pro |

Troubleshooting Performance Issues

Slow Responses

  1. Check network connectivity to the LLM provider
  2. Verify Docker resource limits aren't constraining CPU/RAM
  3. Review Langfuse metrics for slow operations
  4. Consider switching to a faster model (e.g., Haiku, Flash)

High Token Costs

  1. Use smaller models for simple tasks
  2. Enable aggressive summarization
  3. Implement caching for repeated queries
  4. Consider a hybrid approach with local Ollama

High Memory Usage

  1. Increase Docker memory limits
  2. Reduce EMBEDDING_BATCH_SIZE
  3. Decrease SUMMARIZER_KEEP_QA_SECTIONS
  4. Clear old vector store data with etester flush

Rate Limiting

  1. Increase provider quotas (AWS Bedrock)
  2. Add retry logic with exponential backoff
  3. Distribute load across multiple API keys
  4. Switch to a provider with higher limits
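The retry-with-backoff suggestion above can be sketched as follows; the RateLimitError class is a placeholder for whatever exception your client raises on HTTP 429:

```python
import random
import time

class RateLimitError(Exception):
    """Placeholder for your client's rate-limit (HTTP 429) exception."""

def call_with_backoff(fn, max_attempts=5, base=0.5, cap=30.0, sleep=time.sleep):
    """Retry fn on rate-limit errors with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
            # Exponential delay (0.5s, 1s, 2s, ...) capped at `cap`,
            # with jitter so concurrent clients don't retry in lockstep.
            delay = min(cap, base * 2 ** attempt) * random.uniform(0.5, 1.0)
            sleep(delay)
```

Wrap each provider call, e.g. `call_with_backoff(lambda: client.chat(...))`; combine it with per-key load distribution for the best effect.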

Related Guides

  • Context Management: Optimize token usage and memory
  • Custom Models: Build Ollama models for local inference
  • Chain Summarization: Reduce context size efficiently
