Overview
Optimizing PentAGI performance requires balancing LLM provider costs, local resource usage, and response latency. This guide covers practical strategies for different deployment scenarios.

System Requirements
Minimum Requirements
- CPU: 2 vCPU
- RAM: 4GB
- Storage: 20GB free disk space
- Network: Internet access for LLM APIs
Recommended for Production
- CPU: 4-8 vCPU
- RAM: 16GB+
- Storage: 100GB SSD
- GPU: Optional; required only when running local Ollama models
- Network: Low-latency connection to LLM provider
LLM Provider Performance
Response Time by Provider
| Provider | Model | Avg Latency | Use Case |
|---|---|---|---|
| OpenAI | GPT-4 Turbo | 2-4s | Balanced performance |
| OpenAI | GPT-4o-mini | 1-2s | Fast iteration |
| Anthropic | Claude 3.5 Sonnet | 2-5s | Complex reasoning |
| Anthropic | Claude 3.5 Haiku | 1-2s | Quick responses |
| Google | Gemini 2.5 Flash | 1-3s | Cost-effective speed |
| AWS Bedrock | Claude Sonnet | 3-6s | Enterprise deployment |
| Ollama | Llama 3.1 8B | 0.5-2s | Local, GPU-dependent |
Cost Optimization
- Cloud Providers
- Ollama Local
- Hybrid Approach
Strategy: Use cheaper models for simple tasks.

Expected Savings: 60-80% cost reduction compared to using GPT-4 for all tasks.
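The hybrid split can be sketched as environment configuration. Every variable name below is an illustrative assumption, not a confirmed PentAGI setting; check your deployment's .env for the real keys:

```shell
# Illustrative only: these variable names are assumptions, not PentAGI's actual keys.
export FAST_MODEL="gpt-4o-mini"                   # routine enumeration and parsing
export REASONING_MODEL="claude-3-5-sonnet-latest" # planning and complex analysis
export OLLAMA_URL="http://localhost:11434"        # default Ollama port, for bulk local calls
```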
Docker Configuration
Container Resource Limits
Optimize Docker resource allocation in docker-compose.yml:
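A minimal sketch of resource limits, assuming illustrative service names pentagi and pgvector; adjust both the names and the numbers to your stack:

```yaml
services:
  pentagi:
    deploy:
      resources:
        limits:
          cpus: "4.0"     # cap the agent runtime
          memory: 8G
        reservations:
          cpus: "2.0"     # guarantee a baseline
          memory: 4G
  pgvector:
    deploy:
      resources:
        limits:
          cpus: "2.0"
          memory: 4G
```

Recent Compose releases apply deploy.resources.limits outside Swarm; legacy docker-compose may need the --compatibility flag.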
Network Optimization
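A sketch of one possible split, with illustrative network and service names; marking the worker network internal also blocks direct internet egress from workers:

```yaml
networks:
  core: {}          # UI, database, observability
  tasks:
    internal: true  # worker containers reach the internet only via a proxy
services:
  pentagi:
    networks: [core, tasks]
  pgvector:
    networks: [core]
```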
For production deployments, use separate networks.

Volume Performance
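Bind mounts pass through extra filesystem layers (notably on macOS and Windows); a named volume for the database data directory avoids that overhead. A sketch, assuming a pgvector service:

```yaml
services:
  pgvector:
    volumes:
      - pgdata:/var/lib/postgresql/data  # standard PostgreSQL data directory
volumes:
  pgdata: {}
```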
Use named volumes for better I/O performance.

Database Optimization
PostgreSQL with pgvector
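pgvector index builds are bounded by maintenance_work_mem, and similarity scans benefit from larger shared buffers and parallel workers. A sketch of settings to experiment with, assuming roughly 8GB of RAM available to the database:

```yaml
services:
  pgvector:
    command: >
      postgres
      -c shared_buffers=2GB
      -c effective_cache_size=6GB
      -c maintenance_work_mem=512MB
      -c max_parallel_workers_per_gather=4
```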
Tune PostgreSQL for vector operations.

Embedding Batch Size
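EMBEDDING_BATCH_SIZE (also referenced in the troubleshooting section below) controls how many documents are embedded per request; the value here is only a starting point:

```shell
# Larger batches mean fewer API round-trips; smaller batches ease RAM and rate limits.
export EMBEDDING_BATCH_SIZE=50
```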
Optimize embedding generation.

Summarization Performance
Optimize Summarizer Settings
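SUMMARIZER_KEEP_QA_SECTIONS (also mentioned under troubleshooting) bounds how much recent Q&A survives summarization; a smaller value shrinks context and cost at the expense of recall. The value below is illustrative, not a recommended default:

```shell
# Fewer retained sections = smaller prompts and lower per-call cost.
export SUMMARIZER_KEEP_QA_SECTIONS=10
```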
Concurrent Summarization
The summarizer uses goroutines for parallel processing. Increase system resources for better performance.

Monitoring and Observability
Langfuse Performance Tracking
Enable detailed performance metrics:

- Token usage per agent type
- Latency distribution
- Error rates
- Cost per operation
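Wiring PentAGI to Langfuse is done through environment variables. The names below follow common Langfuse client conventions and may differ from your .env, so treat them as assumptions:

```shell
# Illustrative Langfuse wiring; confirm exact key names in your deployment.
export LANGFUSE_BASE_URL="http://langfuse-web:3000"  # self-hosted instance
export LANGFUSE_PUBLIC_KEY="pk-lf-example"
export LANGFUSE_SECRET_KEY="sk-lf-example"
```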
OpenTelemetry Integration
Track system-level metrics with OpenTelemetry, then open the dashboard at http://localhost:3000 for:
- Request throughput
- Response times
- Resource utilization
- Error tracking
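The exporter endpoint uses the standard OpenTelemetry SDK variables, so these names are not PentAGI-specific; the collector hostname is a placeholder:

```shell
export OTEL_EXPORTER_OTLP_ENDPOINT="http://otelcol:4317"  # OTLP gRPC port
export OTEL_SERVICE_NAME="pentagi"
```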
Scaling Strategies
Vertical Scaling
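Vertical scaling means raising the limits on the existing services in docker-compose.yml rather than adding nodes; a sketch, assuming an illustrative pentagi service:

```yaml
services:
  pentagi:
    deploy:
      resources:
        limits:
          cpus: "8.0"   # double CPU and RAM on the same host
          memory: 16G
```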
Horizontal Scaling
For production workloads, use a distributed architecture.

Worker Node Isolation
For security-sensitive deployments, isolate worker operations.

Network Performance
Proxy Configuration
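Most Go HTTP clients and common CLI tools honor the standard proxy variables; the proxy host and port here are placeholders:

```shell
export HTTP_PROXY="http://proxy.internal:3128"
export HTTPS_PROXY="http://proxy.internal:3128"
export NO_PROXY="localhost,127.0.0.1,pgvector"  # keep in-cluster traffic direct
```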
For isolated environments, route outbound traffic through a proxy.

Scraper Optimization
Provider-Specific Tuning
OpenAI
Anthropic
AWS Bedrock
Ollama
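Per-provider credentials and endpoints are set via environment variables. Exact key names vary between PentAGI versions, so verify against .env.example in your checkout; everything below is illustrative:

```shell
export OPEN_AI_KEY="sk-example"                 # OpenAI (sometimes OPENAI_API_KEY)
export ANTHROPIC_API_KEY="sk-ant-example"       # Anthropic
export AWS_REGION="us-east-1"                   # Bedrock: pick a nearby region for latency
export OLLAMA_SERVER_URL="http://ollama:11434"  # local Ollama endpoint
```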
Best Practices
Monitor First
Always enable monitoring before optimization to establish baseline metrics
Incremental Changes
Make one change at a time and measure impact before proceeding
Cost vs Performance
Balance response time with API costs based on use case requirements
Test Under Load
Simulate production workloads during testing to validate optimizations
Performance Benchmarks
Typical Workflow Metrics
| Scenario | Duration | Token Usage | Recommended Config |
|---|---|---|---|
| Simple port scan | 2-5 min | 10-20K tokens | GPT-4o-mini, Haiku |
| Web vulnerability scan | 10-20 min | 40-80K tokens | GPT-4 Turbo, Sonnet |
| Network penetration test | 30-60 min | 100-150K tokens | Claude 4, Gemini 2.5 |
| Complex exploit development | 60-120 min | 150-250K tokens | Opus, Gemini 2.5 Pro |
Troubleshooting Performance Issues
High Latency
- Check network connectivity to LLM provider
- Verify Docker resource limits aren’t constraining CPU/RAM
- Review Langfuse metrics for slow operations
- Consider switching to faster model (e.g., Haiku, Flash)
High API Costs
- Use smaller models for simple tasks
- Enable aggressive summarization
- Implement caching for repeated queries
- Consider hybrid approach with local Ollama
Memory Issues
- Increase Docker memory limits
- Reduce EMBEDDING_BATCH_SIZE
- Decrease SUMMARIZER_KEEP_QA_SECTIONS
- Clear old vector store data with etester flush
Rate Limiting
- Increase provider quotas (AWS Bedrock)
- Add retry logic with exponential backoff
- Distribute load across multiple API keys
- Switch to provider with higher limits
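The retry-with-backoff advice can be sketched in shell. Here api_call is a stand-in for a real provider request (for example, a curl call) that simulates two rate-limited attempts before succeeding:

```shell
#!/bin/sh
# Exponential backoff sketch: double the wait after each failed attempt.
attempt=0
api_call() {
  # Stand-in for a real provider request; fails (rate limited) twice, then succeeds.
  attempt=$((attempt + 1))
  [ "$attempt" -ge 3 ]
}

delay=1
tries=1
max_tries=5
until api_call; do
  if [ "$tries" -ge "$max_tries" ]; then
    echo "giving up after $tries attempts"
    exit 1
  fi
  echo "rate limited; retrying in ${delay}s"
  sleep "$delay"
  delay=$((delay * 2))
  tries=$((tries + 1))
done
echo "succeeded on attempt $tries"
```

In production, add jitter to the delay so that parallel workers do not retry in lockstep.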
Related Resources
Context Management
Optimize token usage and memory
Custom Models
Build Ollama models for local inference
Chain Summarization
Reduce context size efficiently