PentAGI supports Ollama for local LLM inference, providing zero-cost operation, enhanced privacy, and complete control over model deployment.

Environment Variables

OLLAMA_SERVER_URL
string
required
URL of your Ollama server (e.g., http://localhost:11434 or http://ollama-server:11434).
OLLAMA_SERVER_MODEL
string
default:"llama3.1:8b-instruct-q8_0"
Default model for inference. Must be available on your Ollama server.
OLLAMA_SERVER_CONFIG_PATH
string
Path to custom agent configuration YAML file for per-agent model selection.
OLLAMA_SERVER_PULL_MODELS_TIMEOUT
integer
default:"600"
Timeout for model downloads in seconds (default: 10 minutes).
OLLAMA_SERVER_PULL_MODELS_ENABLED
boolean
default:"false"
Automatically download missing models on startup.
OLLAMA_SERVER_LOAD_MODELS_ENABLED
boolean
default:"false"
Query Ollama server for available models on startup.
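Before starting PentAGI, it can help to confirm that the Ollama server is reachable and that the configured model is actually present. A minimal sketch using Ollama's /api/tags endpoint (the endpoint is part of Ollama's REST API; the helper names here are illustrative, not part of PentAGI):

```python
import json
from urllib.request import urlopen

def model_available(tags_json: str, model: str) -> bool:
    """Check whether `model` appears in an Ollama /api/tags response."""
    models = json.loads(tags_json).get("models", [])
    return any(m.get("name") == model for m in models)

def check_ollama(server_url: str, model: str) -> bool:
    """Query the server's /api/tags endpoint and verify the model is present."""
    with urlopen(f"{server_url}/api/tags", timeout=5) as resp:
        return model_available(resp.read().decode(), model)

# Example /api/tags payload (abbreviated):
sample = '{"models": [{"name": "llama3.1:8b-instruct-q8_0"}]}'
print(model_available(sample, "llama3.1:8b-instruct-q8_0"))  # True
```

If the check fails, either pull the model manually or enable OLLAMA_SERVER_PULL_MODELS_ENABLED.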

Configuration Examples

Basic Setup

# Basic Ollama setup with default model
OLLAMA_SERVER_URL=http://localhost:11434
OLLAMA_SERVER_MODEL=llama3.1:8b-instruct-q8_0

Production Setup

# Production setup with auto-pull and model discovery
OLLAMA_SERVER_URL=http://ollama-server:11434
OLLAMA_SERVER_PULL_MODELS_ENABLED=true
OLLAMA_SERVER_PULL_MODELS_TIMEOUT=900
OLLAMA_SERVER_LOAD_MODELS_ENABLED=true
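A reasonable way to size OLLAMA_SERVER_PULL_MODELS_TIMEOUT is download size divided by link bandwidth, padded with a safety margin. A rough sketch (the model size and bandwidth figures are illustrative assumptions):

```python
import math

def pull_timeout_seconds(model_gb: float, bandwidth_mbps: float, margin: float = 2.0) -> int:
    """Rough download time: size / bandwidth, padded by a safety margin."""
    seconds = model_gb * 8 * 1000 / bandwidth_mbps  # GB -> megabits, / (Mbit/s)
    return math.ceil(seconds * margin)

# A ~9 GB q8_0 8B model over a 200 Mbit/s link:
print(pull_timeout_seconds(9, 200))  # 720 -> round up to OLLAMA_SERVER_PULL_MODELS_TIMEOUT=900
```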

Custom Configuration

# Custom configuration with agent-specific models
OLLAMA_SERVER_CONFIG_PATH=/path/to/ollama-config.yml

# Default configuration file inside docker container
OLLAMA_SERVER_CONFIG_PATH=/opt/pentagi/conf/ollama-llama318b.provider.yml

Performance Considerations

| Feature | Performance Impact |
| --- | --- |
| Model discovery (LOAD_MODELS_ENABLED=true) | +1-2 s startup latency |
| Auto-pull (PULL_MODELS_ENABLED=true) | +several minutes on first startup |
| Static config (both disabled) | Fastest startup |
For fastest startup, disable both discovery and auto-pull flags and specify models directly in your configuration file.

Supported Models

Ollama supports a wide range of open-source models:

Meta Llama

  • llama3.1:8b - Efficient 8B parameter model
  • llama3.1:70b - Powerful 70B parameter model
  • llama3.1:405b - Largest Llama model (requires significant VRAM)
  • llama3.2:1b - Ultra-lightweight model
  • llama3.2:3b - Small efficient model

Alibaba Qwen

  • qwen3:32b-fp16 - High-quality 32B model in FP16
  • qwen3:72b - Advanced reasoning capabilities
  • qwq:32b-fp16 - Question-answering optimized (71.3 GB VRAM)

Other Models

  • mistral:7b - Fast and capable
  • mixtral:8x7b - Mixture of experts model
  • phi3:mini - Microsoft’s compact model
  • codellama:13b - Code-specialized model
See Ollama Model Library for the complete list.

Creating Custom Models with Extended Context

PentAGI requires models with larger context windows than Ollama's defaults, so you must create custom models with an increased num_ctx parameter via a Modelfile. The num_ctx parameter:
  • Can only be set during model creation via Modelfile
  • Cannot be changed after model creation
  • Cannot be overridden at runtime
While typical agent workflows consume around 64K tokens, PentAGI uses a 110K context size to provide a safety margin and handle complex penetration testing scenarios.
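Extended context is expensive because the KV cache grows linearly with num_ctx. A back-of-envelope estimate, assuming Llama-3.1-8B-class dimensions (32 layers, 8 KV heads via grouped-query attention, head_dim 128, fp16 cache — check your model's card for exact values):

```python
def kv_cache_bytes(num_ctx: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV-cache size: keys + values, across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * num_ctx

# Assumed 8B-class dims: 32 layers, 8 KV heads, head_dim 128, fp16
gib = kv_cache_bytes(110_000, 32, 8, 128) / 2**30
print(f"{gib:.1f} GiB")  # 13.4 GiB of cache on top of the model weights
```

This is why a 110K-context model needs noticeably more VRAM than the same weights at the default context size.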

Example: Qwen3 32B FP16 with Extended Context

Create a Modelfile named Modelfile_qwen3_32b_fp16_tc:
FROM qwen3:32b-fp16
PARAMETER num_ctx 110000
PARAMETER temperature 0.3
PARAMETER top_p 0.8
PARAMETER min_p 0.0
PARAMETER top_k 20
PARAMETER repeat_penalty 1.1
Build the custom model:
ollama create qwen3:32b-fp16-tc -f Modelfile_qwen3_32b_fp16_tc

Example: QwQ 32B FP16 with Extended Context

Create a Modelfile named Modelfile_qwq_32b_fp16_tc:
FROM qwq:32b-fp16
PARAMETER num_ctx 110000
PARAMETER temperature 0.2
PARAMETER top_p 0.7
PARAMETER min_p 0.0
PARAMETER top_k 40
PARAMETER repeat_penalty 1.2
Build the custom model:
ollama create qwq:32b-fp16-tc -f Modelfile_qwq_32b_fp16_tc
The QwQ 32B FP16 model requires approximately 71.3 GB VRAM for inference. Ensure your system has sufficient GPU memory before attempting to use this model.
These custom models are referenced in pre-built provider configuration files:
  • ollama-qwen332b-fp16-tc.provider.yml
  • ollama-qwq32b-fp16-tc.provider.yml
These files are included in the Docker image at /opt/pentagi/conf/.

Model Configuration Example

Example ollama-llama318b.provider.yml configuration:
simple:
  model: "llama3.1:8b"
  temperature: 0.2
  top_p: 0.3
  n: 1
  max_tokens: 4000

simple_json:
  model: "llama3.1:8b"
  temperature: 0.1
  top_p: 0.2
  n: 1
  max_tokens: 4000

primary_agent:
  model: "llama3.1:8b"
  temperature: 0.2
  top_p: 0.3
  n: 1
  max_tokens: 4000

coder:
  model: "llama3.1:8b"
  temperature: 0.1
  top_p: 0.2
  n: 1
  max_tokens: 6000

pentester:
  model: "llama3.1:8b"
  temperature: 0.3
  top_p: 0.4
  n: 1
  max_tokens: 8000
See Custom Providers for full YAML structure documentation.
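Mistakes in a provider file (a missing key, an out-of-range temperature) only surface at runtime, so a quick sanity check can save a restart cycle. A sketch that validates a config after YAML parsing — the section and key names mirror the example above, and the helper is illustrative, not part of PentAGI:

```python
# Assumes the provider YAML has already been parsed into a dict (e.g. with PyYAML).
REQUIRED_KEYS = {"model", "temperature", "top_p", "n", "max_tokens"}

def validate_provider_config(config: dict) -> list[str]:
    """Return a list of problems found in a per-agent provider config."""
    problems = []
    for agent, opts in config.items():
        missing = REQUIRED_KEYS - set(opts)
        if missing:
            problems.append(f"{agent}: missing {sorted(missing)}")
        elif not 0.0 <= opts["temperature"] <= 2.0:
            problems.append(f"{agent}: temperature out of range")
    return problems

cfg = {"simple": {"model": "llama3.1:8b", "temperature": 0.2,
                  "top_p": 0.3, "n": 1, "max_tokens": 4000}}
print(validate_provider_config(cfg))  # []
```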

Hardware Requirements

Minimum requirements vary by model size:
| Model Size | VRAM Required | RAM Required | Use Case |
| --- | --- | --- | --- |
| 1B-3B params | 2-4 GB | 8 GB | Lightweight tasks |
| 7B-8B params | 6-8 GB | 16 GB | General purpose |
| 13B-14B params | 12-16 GB | 32 GB | Advanced tasks |
| 32B params (FP16) | 64+ GB | 64 GB | High-quality inference |
| 70B params | 48-80 GB | 128 GB | Enterprise workloads |
| 405B params | 200+ GB | 256+ GB | Research/benchmarking |
For best performance:
  • Use GPU acceleration (NVIDIA CUDA, AMD ROCm, or Apple Metal)
  • Enable memory mapping for models larger than VRAM
  • Use quantized models (Q4, Q5, Q8) to reduce memory footprint
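Weight memory scales with parameter count times bits per weight, which is why quantization helps so much. A rough estimate (the bits-per-weight figures are approximate GGUF averages, not exact):

```python
BITS_PER_WEIGHT = {"fp16": 16, "q8": 8.5, "q5": 5.5, "q4": 4.5}  # rough GGUF averages

def weight_memory_gb(params_billion: float, quant: str) -> float:
    """Approximate size of the model weights alone (excludes KV cache)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8

print(f"{weight_memory_gb(8, 'q4'):.1f} GB")    # 4.5 GB
print(f"{weight_memory_gb(32, 'fp16'):.1f} GB")  # 64.0 GB, in line with the table above
```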

Model Management

List Available Models

ollama list

Pull a Model

ollama pull llama3.1:8b

Remove a Model

ollama rm llama3.1:8b

Show Model Info

ollama show llama3.1:8b

Deployment Options

Local Deployment

Run Ollama on the same machine as PentAGI:
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Start Ollama service
ollama serve

# Configure PentAGI
OLLAMA_SERVER_URL=http://localhost:11434

Docker Deployment

Run Ollama in a container:
docker run -d \
  --name ollama \
  --gpus all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  ollama/ollama

# Configure PentAGI
OLLAMA_SERVER_URL=http://ollama:11434

Remote Deployment

Run Ollama on a dedicated GPU server:
# On GPU server
OLLAMA_HOST=0.0.0.0 ollama serve

# Configure PentAGI
OLLAMA_SERVER_URL=http://gpu-server:11434

Performance Optimization

GPU Acceleration

Ollama automatically uses any available GPU:
  • NVIDIA: CUDA acceleration
  • AMD: ROCm support
  • Apple Silicon: Metal acceleration

Parallel Requests

Configure how many requests each loaded model can serve concurrently:
# Set environment variable for Ollama
OLLAMA_NUM_PARALLEL=4

Context Size

Balance context size and memory usage:
FROM llama3.1:8b
PARAMETER num_ctx 32768  # Reduce if memory constrained

Troubleshooting

Connection Errors

If PentAGI cannot connect to Ollama:
  1. Verify Ollama is running: curl http://localhost:11434/api/tags
  2. Check firewall settings allow port 11434
  3. Ensure OLLAMA_SERVER_URL is correct in .env

Model Not Found

If a model is not available:
  1. List models: ollama list
  2. Pull model: ollama pull model-name
  3. Enable auto-pull: OLLAMA_SERVER_PULL_MODELS_ENABLED=true
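Pulls can also be triggered over Ollama's REST API (POST /api/pull, which streams progress as JSON lines) — useful when there is no shell access to the Ollama host. A sketch that builds such a request; the "model" body field follows current Ollama API docs (older servers used "name"):

```python
import json
from urllib.request import Request, urlopen

def build_pull_request(server_url: str, model: str) -> Request:
    """Build a POST to Ollama's /api/pull endpoint (response streams progress JSON)."""
    body = json.dumps({"model": model}).encode()
    return Request(f"{server_url}/api/pull", data=body,
                   headers={"Content-Type": "application/json"})

req = build_pull_request("http://localhost:11434", "llama3.1:8b")
print(req.full_url)  # http://localhost:11434/api/pull
# To actually pull: with urlopen(req) as resp: iterate over resp line by line.
```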

Out of Memory

If running out of VRAM:
  1. Use smaller quantized model (Q4, Q5 instead of FP16)
  2. Reduce num_ctx in Modelfile
  3. Force CPU-only inference by setting PARAMETER num_gpu 0 in the Modelfile
  4. Upgrade GPU or use remote GPU server

Slow Inference

If inference is slow:
  1. Verify GPU acceleration is enabled: ollama ps
  2. Use appropriate model size for your hardware
  3. Enable concurrent requests: OLLAMA_NUM_PARALLEL=4
  4. Consider using smaller model or quantization

Benefits of Local Deployment

  • Zero Cost: No API fees or token costs
  • Privacy: All data stays on your infrastructure
  • Offline Operation: Works without internet connectivity
  • Customization: Full control over models and parameters
  • No Rate Limits: Limited only by your hardware
  • Compliance: Meet data residency requirements
