PentAGI supports Ollama for local LLM inference, providing zero-cost operation, enhanced privacy, and complete control over model deployment.

Environment Variables

OLLAMA_SERVER_URL
string
required
URL of your Ollama server (e.g., http://localhost:11434 or http://ollama-server:11434).
OLLAMA_SERVER_MODEL
string
default:"llama3.1:8b-instruct-q8_0"
Default model for inference. Must be available on your Ollama server.
OLLAMA_SERVER_CONFIG_PATH
string
Path to custom agent configuration YAML file for per-agent model selection.
OLLAMA_SERVER_PULL_MODELS_TIMEOUT
integer
default:"600"
Timeout for model downloads in seconds (default: 10 minutes).
OLLAMA_SERVER_PULL_MODELS_ENABLED
boolean
default:"false"
Automatically download missing models on startup.
OLLAMA_SERVER_LOAD_MODELS_ENABLED
boolean
default:"false"
Query Ollama server for available models on startup.
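Before starting PentAGI, it can help to confirm that the Ollama server is reachable and that the configured model is actually present. A minimal sketch using Ollama's /api/tags endpoint (the endpoint is part of Ollama's REST API; the helper names here are illustrative, not part of PentAGI):

```python
import json
from urllib.request import urlopen

def model_available(tags_json: str, model: str) -> bool:
    """Check whether `model` appears in an Ollama /api/tags response."""
    models = json.loads(tags_json).get("models", [])
    return any(m.get("name") == model for m in models)

def check_ollama(server_url: str, model: str) -> bool:
    """Query the server's /api/tags endpoint and verify the model is present."""
    with urlopen(f"{server_url}/api/tags", timeout=5) as resp:
        return model_available(resp.read().decode(), model)

# Example /api/tags payload (abbreviated):
sample = '{"models": [{"name": "llama3.1:8b-instruct-q8_0"}]}'
print(model_available(sample, "llama3.1:8b-instruct-q8_0"))  # True
```

If the check fails, either pull the model manually or enable OLLAMA_SERVER_PULL_MODELS_ENABLED.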

Configuration Examples

Basic Setup

# Basic Ollama setup with default model
OLLAMA_SERVER_URL=http://localhost:11434
OLLAMA_SERVER_MODEL=llama3.1:8b-instruct-q8_0

Production Setup

# Production setup with auto-pull and model discovery
OLLAMA_SERVER_URL=http://ollama-server:11434
OLLAMA_SERVER_PULL_MODELS_ENABLED=true
OLLAMA_SERVER_PULL_MODELS_TIMEOUT=900
OLLAMA_SERVER_LOAD_MODELS_ENABLED=true
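A reasonable way to size OLLAMA_SERVER_PULL_MODELS_TIMEOUT is download size divided by link bandwidth, padded with a safety margin. A rough sketch (the model size and bandwidth figures are illustrative assumptions):

```python
import math

def pull_timeout_seconds(model_gb: float, bandwidth_mbps: float, margin: float = 2.0) -> int:
    """Rough download time: size / bandwidth, padded by a safety margin."""
    seconds = model_gb * 8 * 1000 / bandwidth_mbps  # GB -> megabits, / (Mbit/s)
    return math.ceil(seconds * margin)

# A ~9 GB q8_0 8B model over a 200 Mbit/s link:
print(pull_timeout_seconds(9, 200))  # 720 -> round up to OLLAMA_SERVER_PULL_MODELS_TIMEOUT=900
```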

Custom Configuration

# Custom configuration with agent-specific models
OLLAMA_SERVER_CONFIG_PATH=/path/to/ollama-config.yml

# Default configuration file inside docker container
OLLAMA_SERVER_CONFIG_PATH=/opt/pentagi/conf/ollama-llama318b.provider.yml

Performance Considerations

| Feature | Performance Impact |
| --- | --- |
| Model discovery (LOAD_MODELS_ENABLED=true) | +1-2 s startup latency |
| Auto-pull (PULL_MODELS_ENABLED=true) | +several minutes on first startup |
| Static config (both disabled) | Fastest startup |
For fastest startup, disable both discovery and auto-pull flags and specify models directly in your configuration file.

Supported Models

Ollama supports a wide range of open-source models:

Meta Llama

  • llama3.1:8b - Efficient 8B parameter model
  • llama3.1:70b - Powerful 70B parameter model
  • llama3.1:405b - Largest Llama model (requires significant VRAM)
  • llama3.2:1b - Ultra-lightweight model
  • llama3.2:3b - Small efficient model

Alibaba Qwen

  • qwen3:32b-fp16 - High-quality 32B model in FP16
  • qwen3:72b - Advanced reasoning capabilities
  • qwq:32b-fp16 - Question-answering optimized (71.3 GB VRAM)

Other Models

  • mistral:7b - Fast and capable
  • mixtral:8x7b - Mixture of experts model
  • phi3:mini - Microsoft’s compact model
  • codellama:13b - Code-specialized model
See Ollama Model Library for the complete list.

Creating Custom Models with Extended Context

PentAGI requires models with larger context windows than Ollama's defaults, so you must create custom models with an increased num_ctx parameter via a Modelfile. The num_ctx parameter:
  • Can only be set during model creation via Modelfile
  • Cannot be changed after model creation
  • Cannot be overridden at runtime
While typical agent workflows consume around 64K tokens, PentAGI uses a 110K context size to provide a safety margin and handle complex penetration testing scenarios.
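Extended context is expensive because the KV cache grows linearly with num_ctx. A back-of-envelope estimate, assuming Llama-3.1-8B-class dimensions (32 layers, 8 KV heads via grouped-query attention, head_dim 128, fp16 cache — check your model's card for exact values):

```python
def kv_cache_bytes(num_ctx: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV-cache size: keys + values, across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * num_ctx

# Assumed 8B-class dims: 32 layers, 8 KV heads, head_dim 128, fp16
gib = kv_cache_bytes(110_000, 32, 8, 128) / 2**30
print(f"{gib:.1f} GiB")  # 13.4 GiB of cache on top of the model weights
```

This is why a 110K-context model needs noticeably more VRAM than the same weights at the default context size.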

Example: Qwen3 32B FP16 with Extended Context

Create a Modelfile named Modelfile_qwen3_32b_fp16_tc:
FROM qwen3:32b-fp16
PARAMETER num_ctx 110000
PARAMETER temperature 0.3
PARAMETER top_p 0.8
PARAMETER min_p 0.0
PARAMETER top_k 20
PARAMETER repeat_penalty 1.1
Build the custom model:
ollama create qwen3:32b-fp16-tc -f Modelfile_qwen3_32b_fp16_tc

Example: QwQ 32B FP16 with Extended Context

Create a Modelfile named Modelfile_qwq_32b_fp16_tc:
FROM qwq:32b-fp16
PARAMETER num_ctx 110000
PARAMETER temperature 0.2
PARAMETER top_p 0.7
PARAMETER min_p 0.0
PARAMETER top_k 40
PARAMETER repeat_penalty 1.2
Build the custom model:
ollama create qwq:32b-fp16-tc -f Modelfile_qwq_32b_fp16_tc
The QwQ 32B FP16 model requires approximately 71.3 GB VRAM for inference. Ensure your system has sufficient GPU memory before attempting to use this model.
These custom models are referenced in pre-built provider configuration files:
  • ollama-qwen332b-fp16-tc.provider.yml
  • ollama-qwq32b-fp16-tc.provider.yml
These files are included in the Docker image at /opt/pentagi/conf/.

Model Configuration Example

Example ollama-llama318b.provider.yml configuration:
simple:
  model: "llama3.1:8b"
  temperature: 0.2
  top_p: 0.3
  n: 1
  max_tokens: 4000

simple_json:
  model: "llama3.1:8b"
  temperature: 0.1
  top_p: 0.2
  n: 1
  max_tokens: 4000

primary_agent:
  model: "llama3.1:8b"
  temperature: 0.2
  top_p: 0.3
  n: 1
  max_tokens: 4000

coder:
  model: "llama3.1:8b"
  temperature: 0.1
  top_p: 0.2
  n: 1
  max_tokens: 6000

pentester:
  model: "llama3.1:8b"
  temperature: 0.3
  top_p: 0.4
  n: 1
  max_tokens: 8000
See Custom Providers for full YAML structure documentation.
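Mistakes in a provider file (a missing key, an out-of-range temperature) only surface at runtime, so a quick sanity check can save a restart cycle. A sketch that validates a config after YAML parsing — the section and key names mirror the example above, and the helper is illustrative, not part of PentAGI:

```python
# Assumes the provider YAML has already been parsed into a dict (e.g. with PyYAML).
REQUIRED_KEYS = {"model", "temperature", "top_p", "n", "max_tokens"}

def validate_provider_config(config: dict) -> list[str]:
    """Return a list of problems found in a per-agent provider config."""
    problems = []
    for agent, opts in config.items():
        missing = REQUIRED_KEYS - set(opts)
        if missing:
            problems.append(f"{agent}: missing {sorted(missing)}")
        elif not 0.0 <= opts["temperature"] <= 2.0:
            problems.append(f"{agent}: temperature out of range")
    return problems

cfg = {"simple": {"model": "llama3.1:8b", "temperature": 0.2,
                  "top_p": 0.3, "n": 1, "max_tokens": 4000}}
print(validate_provider_config(cfg))  # []
```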

Hardware Requirements

Minimum requirements vary by model size:
| Model Size | VRAM Required | RAM Required | Use Case |
| --- | --- | --- | --- |
| 1B-3B params | 2-4 GB | 8 GB | Lightweight tasks |
| 7B-8B params | 6-8 GB | 16 GB | General purpose |
| 13B-14B params | 12-16 GB | 32 GB | Advanced tasks |
| 32B params (FP16) | 64+ GB | 64 GB | High-quality inference |
| 70B params | 48-80 GB | 128 GB | Enterprise workloads |
| 405B params | 200+ GB | 256+ GB | Research/benchmarking |
For best performance:
  • Use GPU acceleration (NVIDIA CUDA, AMD ROCm, or Apple Metal)
  • Enable memory mapping for models larger than VRAM
  • Use quantized models (Q4, Q5, Q8) to reduce memory footprint
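Weight memory scales with parameter count times bits per weight, which is why quantization helps so much. A rough estimate (the bits-per-weight figures are approximate GGUF averages, not exact):

```python
BITS_PER_WEIGHT = {"fp16": 16, "q8": 8.5, "q5": 5.5, "q4": 4.5}  # rough GGUF averages

def weight_memory_gb(params_billion: float, quant: str) -> float:
    """Approximate size of the model weights alone (excludes KV cache)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8

print(f"{weight_memory_gb(8, 'q4'):.1f} GB")    # 4.5 GB
print(f"{weight_memory_gb(32, 'fp16'):.1f} GB")  # 64.0 GB, in line with the table above
```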

Model Management

List Available Models

ollama list

Pull a Model

ollama pull llama3.1:8b

Remove a Model

ollama rm llama3.1:8b

Show Model Info

ollama show llama3.1:8b

Deployment Options

Local Deployment

Run Ollama on the same machine as PentAGI:
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Start Ollama service
ollama serve

# Configure PentAGI
OLLAMA_SERVER_URL=http://localhost:11434

Docker Deployment

Run Ollama in a container:
docker run -d \
  --name ollama \
  --gpus all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  ollama/ollama

# Configure PentAGI
OLLAMA_SERVER_URL=http://ollama:11434

Remote Deployment

Run Ollama on a dedicated GPU server:
# On GPU server
OLLAMA_HOST=0.0.0.0 ollama serve

# Configure PentAGI
OLLAMA_SERVER_URL=http://gpu-server:11434

Performance Optimization

GPU Acceleration

Ollama automatically uses any available GPU:
  • NVIDIA: CUDA acceleration
  • AMD: ROCm support
  • Apple Silicon: Metal acceleration

Parallel Requests

Configure how many requests each loaded model can serve concurrently:
# Set environment variable for Ollama
OLLAMA_NUM_PARALLEL=4

Context Size

Balance context size and memory usage:
FROM llama3.1:8b
PARAMETER num_ctx 32768  # Reduce if memory constrained

Troubleshooting

Connection Errors

If PentAGI cannot connect to Ollama:
  1. Verify Ollama is running: curl http://localhost:11434/api/tags
  2. Check firewall settings allow port 11434
  3. Ensure OLLAMA_SERVER_URL is correct in .env

Model Not Found

If a model is not available:
  1. List models: ollama list
  2. Pull model: ollama pull model-name
  3. Enable auto-pull: OLLAMA_SERVER_PULL_MODELS_ENABLED=true
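Pulls can also be triggered over Ollama's REST API (POST /api/pull, which streams progress as JSON lines) — useful when there is no shell access to the Ollama host. A sketch that builds such a request; the "model" body field follows current Ollama API docs (older servers used "name"):

```python
import json
from urllib.request import Request, urlopen

def build_pull_request(server_url: str, model: str) -> Request:
    """Build a POST to Ollama's /api/pull endpoint (response streams progress JSON)."""
    body = json.dumps({"model": model}).encode()
    return Request(f"{server_url}/api/pull", data=body,
                   headers={"Content-Type": "application/json"})

req = build_pull_request("http://localhost:11434", "llama3.1:8b")
print(req.full_url)  # http://localhost:11434/api/pull
# To actually pull: with urlopen(req) as resp: iterate over resp line by line.
```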

Out of Memory

If running out of VRAM:
  1. Use smaller quantized model (Q4, Q5 instead of FP16)
  2. Reduce num_ctx in Modelfile
  3. Force CPU-only inference by setting PARAMETER num_gpu 0 in the Modelfile
  4. Upgrade GPU or use remote GPU server

Slow Inference

If inference is slow:
  1. Verify GPU acceleration is enabled: ollama ps
  2. Use appropriate model size for your hardware
  3. Enable concurrent requests: OLLAMA_NUM_PARALLEL=4
  4. Consider using smaller model or quantization

Benefits of Local Deployment

  • Zero Cost: No API fees or token costs
  • Privacy: All data stays on your infrastructure
  • Offline Operation: Works without internet connectivity
  • Customization: Full control over models and parameters
  • No Rate Limits: Limited only by your hardware
  • Compliance: Meet data residency requirements
