
Overview

PentAGI requires models with larger context windows than Ollama's default configuration provides. You can create custom models with an increased context size through Modelfiles to handle complex penetration testing scenarios.
The num_ctx parameter can only be set during model creation via a Modelfile. It cannot be changed after model creation or overridden at runtime.

Why Extended Context?

While typical agent workflows consume around 64K tokens, PentAGI uses 110K context size for:
  • Safety margin: Handle unexpected context growth during long sessions
  • Complex scenarios: Support multi-step penetration testing workflows
  • Tool call preservation: Maintain full tool execution history
  • Reasoning chains: Preserve extended thinking content from providers

Modelfile Basics

A Modelfile defines model configuration using a simple syntax:
FROM base-model-name
PARAMETER parameter_name value

Key Parameters

| Parameter | Description | Recommended Value |
|---|---|---|
| num_ctx | Context window size | 110000 (110K tokens) |
| temperature | Randomness in output | 0.2-0.3 for pentesting |
| top_p | Nucleus sampling | 0.7-0.8 |
| top_k | Top-k sampling | 20-40 |
| repeat_penalty | Penalize repetition | 1.1-1.2 |
| min_p | Minimum probability | 0.0 |

Example: Qwen3 32B with Extended Context

Qwen3 is a powerful model for security analysis and code generation.

Create the Modelfile

Create a file named Modelfile_qwen3_32b_fp16_tc:
FROM qwen3:32b-fp16
PARAMETER num_ctx 110000
PARAMETER temperature 0.3
PARAMETER top_p 0.8
PARAMETER min_p 0.0
PARAMETER top_k 20
PARAMETER repeat_penalty 1.1
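If you prefer to script the setup, the same Modelfile can be generated from the shell with a heredoc and sanity-checked before building (the file name matches the step above):

```shell
# Write the Modelfile non-interactively (equivalent to editing it by hand)
cat > Modelfile_qwen3_32b_fp16_tc <<'EOF'
FROM qwen3:32b-fp16
PARAMETER num_ctx 110000
PARAMETER temperature 0.3
PARAMETER top_p 0.8
PARAMETER min_p 0.0
PARAMETER top_k 20
PARAMETER repeat_penalty 1.1
EOF

# Sanity-check the two lines PentAGI cares about most before running ollama create
grep -E '^(FROM|PARAMETER num_ctx)' Modelfile_qwen3_32b_fp16_tc
```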

Build the Model

# First, pull the base model
ollama pull qwen3:32b-fp16

# Build custom model with extended context
ollama create qwen3:32b-fp16-tc -f Modelfile_qwen3_32b_fp16_tc

# Verify the model was created
ollama list | grep qwen3

Configure PentAGI

Add to your .env file:
OLLAMA_SERVER_URL=http://localhost:11434
OLLAMA_SERVER_MODEL=qwen3:32b-fp16-tc
OLLAMA_SERVER_CONFIG_PATH=/opt/pentagi/conf/ollama-qwen332b-fp16-tc.provider.yml

Example: QwQ 32B with Extended Context

QwQ is optimized for reasoning and complex problem-solving tasks.

Create the Modelfile

Create a file named Modelfile_qwq_32b_fp16_tc:
FROM qwq:32b-fp16
PARAMETER num_ctx 110000
PARAMETER temperature 0.2
PARAMETER top_p 0.7
PARAMETER min_p 0.0
PARAMETER top_k 40
PARAMETER repeat_penalty 1.2

Build the Model

# First, pull the base model
ollama pull qwq:32b-fp16

# Build custom model with extended context
ollama create qwq:32b-fp16-tc -f Modelfile_qwq_32b_fp16_tc

# Verify the model was created
ollama list | grep qwq

Hardware Requirements: The QwQ 32B FP16 model requires approximately 71.3 GB of VRAM for inference. Ensure your system has sufficient GPU memory before attempting to use this model.

Configure PentAGI

OLLAMA_SERVER_URL=http://localhost:11434
OLLAMA_SERVER_MODEL=qwq:32b-fp16-tc
OLLAMA_SERVER_CONFIG_PATH=/opt/pentagi/conf/ollama-qwq32b-fp16-tc.provider.yml

Example: Llama 3.1 8B with Extended Context

A more resource-friendly option for smaller systems.

Create the Modelfile

Create a file named Modelfile_llama31_8b_instruct_tc:
FROM llama3.1:8b-instruct-q8_0
PARAMETER num_ctx 110000
PARAMETER temperature 0.3
PARAMETER top_p 0.8
PARAMETER min_p 0.0
PARAMETER top_k 20
PARAMETER repeat_penalty 1.1

Build the Model

ollama pull llama3.1:8b-instruct-q8_0
ollama create llama3.1:8b-instruct-tc -f Modelfile_llama31_8b_instruct_tc

Configure PentAGI

OLLAMA_SERVER_URL=http://localhost:11434
OLLAMA_SERVER_MODEL=llama3.1:8b-instruct-tc
OLLAMA_SERVER_CONFIG_PATH=/opt/pentagi/conf/ollama-llama318b.provider.yml

Provider Configuration Files

PentAGI includes pre-built provider configuration files for custom Ollama models:
  • /opt/pentagi/conf/ollama-llama318b.provider.yml
  • /opt/pentagi/conf/ollama-qwen332b-fp16-tc.provider.yml
  • /opt/pentagi/conf/ollama-qwq32b-fp16-tc.provider.yml
These files map agent types to specific models with optimized settings.

Example Provider Configuration

simple:
  model: "qwen3:32b-fp16-tc"
  temperature: 0.2
  top_p: 0.3
  n: 1
  max_tokens: 4000

simple_json:
  model: "qwen3:32b-fp16-tc"
  temperature: 0.7
  top_p: 1.0
  n: 1
  max_tokens: 4000
  json: true

primary_agent:
  model: "qwen3:32b-fp16-tc"
  temperature: 0.3
  top_p: 0.95
  n: 1
  max_tokens: 12000

pentester:
  model: "qwen3:32b-fp16-tc"
  temperature: 0.3
  top_p: 0.8
  n: 1
  max_tokens: 8000
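The same pattern extends to the other agent types, such as coder. The entry below is an illustrative sketch following the schema of the entries above, not a shipped default; check the bundled provider files for the actual values:

```yaml
coder:
  model: "qwen3:32b-fp16-tc"
  temperature: 0.2
  top_p: 0.8
  n: 1
  max_tokens: 8000
```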

Testing Your Custom Model

Use the ctester utility to validate your custom model:
# Test with custom Ollama configuration
docker run --rm \
  -v $(pwd)/.env:/opt/pentagi/.env \
  vxcontrol/pentagi /opt/pentagi/bin/ctester -config /opt/pentagi/conf/ollama-qwen332b-fp16-tc.provider.yml

# Test specific agents
docker exec -it pentagi /opt/pentagi/bin/ctester \
  -agents pentester,coder,primary_agent \
  -config /opt/pentagi/conf/ollama-qwq32b-fp16-tc.provider.yml

Model Management

Auto-Pull Configuration

Configure automatic model downloads:
OLLAMA_SERVER_PULL_MODELS_ENABLED=true
OLLAMA_SERVER_PULL_MODELS_TIMEOUT=900  # 15 minutes
OLLAMA_SERVER_LOAD_MODELS_ENABLED=true

Performance Consideration: Model discovery adds 1-2 s of startup latency. For the fastest startup, disable both flags and specify models explicitly in the provider configuration file.

List Available Models

# List all Ollama models
ollama list

# Remove unused models
ollama rm model-name

# Show model information
ollama show model-name

Hardware Requirements

| Model | Quantization | VRAM Required | Recommended GPU |
|---|---|---|---|
| Llama 3.1 8B | Q8_0 | ~9 GB | RTX 3090, RTX 4080 |
| Llama 3.1 8B | FP16 | ~18 GB | RTX 3090, A5000 |
| Qwen3 32B | Q4_0 | ~20 GB | RTX 4090, A5000 |
| Qwen3 32B | FP16 | ~70 GB | A100 40GB (x2), H100 |
| QwQ 32B | FP16 | ~71 GB | A100 40GB (x2), H100 |
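A quick way to sanity-check these figures: FP16 stores two bytes per parameter, Q8_0 roughly one, and Q4_0 roughly half a byte, so the weights alone account for most of each number; the remainder is KV cache and runtime overhead. A back-of-envelope sketch (the helper name is ours, not an Ollama tool):

```shell
# Rough VRAM estimate for model weights only; KV cache and runtime overhead add more.
# bytes per parameter: FP16 = 2, Q8_0 ~ 1, Q4_0 ~ 0.5
estimate_gb() {  # usage: estimate_gb <params_in_billions> <bytes_per_param>
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.0f GB\n", p * b }'
}
estimate_gb 32 2    # Qwen3/QwQ 32B FP16 -> 64 GB for weights alone
estimate_gb 8 1     # Llama 3.1 8B Q8_0  -> 8 GB for weights alone
```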

Best Practices

Match Context to Provider

Set num_ctx to 110000 for consistency with PentAGI’s context management

Start Small

Begin with 8B models and scale up as needed for your hardware

Test Before Production

Use ctester to validate model performance before deployment

Monitor Resource Usage

Watch GPU memory and adjust batch sizes if needed

Troubleshooting

Model creation fails:
  • Verify the base model is pulled: ollama list
  • Check the Modelfile syntax for typos
  • Ensure sufficient disk space for model storage

Out of GPU memory:
  • Use quantized models (Q4_0, Q8_0) instead of FP16
  • Reduce batch size in the provider config
  • Close other GPU-intensive applications

Context size not applied:
  • Verify the model was created with num_ctx=110000
  • Check ollama show model-name for the actual context size
  • Rebuild the model if num_ctx is incorrect

Related Documentation

  • Context Management: optimize token usage and memory
  • Performance Tuning: resource management and scaling
  • Chain Summarization: advanced context compression
