Environment Variables
The following settings control the Ollama integration (variable names are shown where this page states them):
- OLLAMA_SERVER_URL: URL of your Ollama server (e.g., http://localhost:11434 or http://ollama-server:11434).
- Default model for inference; must be available on your Ollama server.
- Path to a custom agent configuration YAML file for per-agent model selection.
- Timeout for model downloads, in seconds (default: 10 minutes, i.e., 600 seconds).
- OLLAMA_SERVER_PULL_MODELS_ENABLED: automatically download missing models on startup.
- LOAD_MODELS_ENABLED: query the Ollama server for available models on startup.
Configuration Examples
Basic Setup
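A minimal `.env` sketch for a local server (OLLAMA_SERVER_URL and OLLAMA_SERVER_PULL_MODELS_ENABLED are the variable names used elsewhere on this page; everything else is left at defaults):

```shell
# Local Ollama on the same machine; auto-pull missing models on startup
OLLAMA_SERVER_URL=http://localhost:11434
OLLAMA_SERVER_PULL_MODELS_ENABLED=true
```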
Production Setup
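A production-oriented sketch: point PentAGI at a dedicated GPU host and disable auto-pull so startup stays fast and deterministic, with models managed out-of-band on the server:

```shell
# Remote GPU server; pull models manually on the server with `ollama pull`
OLLAMA_SERVER_URL=http://ollama-server:11434
OLLAMA_SERVER_PULL_MODELS_ENABLED=false
```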
Custom Configuration
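The per-agent configuration schema is not reproduced on this page; a hypothetical sketch of the idea (the file name and all keys are assumptions, not the documented format):

```yaml
# agents.yml — hypothetical per-agent model selection (keys assumed)
agents:
  researcher:
    model: llama3.1:70b   # stronger reasoning
  coder:
    model: codellama:13b  # code-specialized
```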
Performance Considerations
| Feature | Performance Impact |
|---|---|
| Model Discovery (LOAD_MODELS_ENABLED=true) | +1-2s startup latency |
| Auto-pull (PULL_MODELS_ENABLED=true) | +several minutes on first startup |
| Static Config (both disabled) | Fastest startup |
For fastest startup, disable both discovery and auto-pull flags and specify models directly in your configuration file.
Supported Models
Ollama supports a wide range of open-source models:
Meta Llama
- llama3.1:8b - Efficient 8B parameter model
- llama3.1:70b - Powerful 70B parameter model
- llama3.1:405b - Largest Llama model (requires significant VRAM)
- llama3.2:1b - Ultra-lightweight model
- llama3.2:3b - Small efficient model
Alibaba Qwen
- qwen3:32b-fp16 - High-quality 32B model in FP16
- qwen3:72b - Advanced reasoning capabilities
- qwq:32b-fp16 - Question-answering optimized (71.3 GB VRAM)
Other Models
- mistral:7b - Fast and capable
- mixtral:8x7b - Mixture of experts model
- phi3:mini - Microsoft’s compact model
- codellama:13b - Code-specialized model
Creating Custom Models with Extended Context
Example: Qwen3 32B FP16 with Extended Context
Create a Modelfile named Modelfile_qwen3_32b_fp16_tc:
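A plausible Modelfile for this purpose, using Ollama's standard FROM/PARAMETER syntax (the num_ctx value is an assumption; adjust it to your VRAM budget):

```
# Modelfile_qwen3_32b_fp16_tc — extend the context window (num_ctx value assumed)
FROM qwen3:32b-fp16
PARAMETER num_ctx 32768
```

Register it with `ollama create qwen3-32b-fp16-tc -f Modelfile_qwen3_32b_fp16_tc` (the target tag name is an assumption).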
Example: QwQ 32B FP16 with Extended Context
Create a Modelfile named Modelfile_qwq_32b_fp16_tc:
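Analogously, a plausible Modelfile based on the qwq:32b-fp16 model listed above (num_ctx value assumed):

```
# Modelfile_qwq_32b_fp16_tc — extend the context window (num_ctx value assumed)
FROM qwq:32b-fp16
PARAMETER num_ctx 32768
```

Register it with `ollama create qwq-32b-fp16-tc -f Modelfile_qwq_32b_fp16_tc` (the target tag name is an assumption).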
The QwQ 32B FP16 model requires approximately 71.3 GB VRAM for inference. Ensure your system has sufficient GPU memory before attempting to use this model.
Create the provider configuration files ollama-qwen332b-fp16-tc.provider.yml and ollama-qwq32b-fp16-tc.provider.yml in /opt/pentagi/conf/.
Model Configuration Example
Example ollama-llama318b.provider.yml configuration:
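The provider file schema is not reproduced on this page; a hypothetical sketch of what such a file might contain (all keys are assumptions, not the documented format):

```yaml
# ollama-llama318b.provider.yml — hypothetical sketch; key names are assumptions
name: ollama-llama318b
model: llama3.1:8b
url: http://localhost:11434
```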
Hardware Requirements
Minimum requirements vary by model size:
| Model Size | VRAM Required | RAM Required | Use Case |
|---|---|---|---|
| 1B-3B params | 2-4 GB | 8 GB | Lightweight tasks |
| 7B-8B params | 6-8 GB | 16 GB | General purpose |
| 13B-14B params | 12-16 GB | 32 GB | Advanced tasks |
| 32B params (FP16) | 64+ GB | 64 GB | High-quality inference |
| 70B params | 48-80 GB | 128 GB | Enterprise workloads |
| 405B params | 200+ GB | 256+ GB | Research/benchmarking |
For best performance:
- Use GPU acceleration (NVIDIA CUDA, AMD ROCm, or Apple Metal)
- Enable memory mapping for models larger than VRAM
- Use quantized models (Q4, Q5, Q8) to reduce memory footprint
Model Management
List Available Models
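The standard Ollama CLI command for this is:

```shell
ollama list
```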
Pull a Model
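For example, to pull one of the models listed above:

```shell
ollama pull llama3.1:8b
```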
Remove a Model
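Using the standard Ollama CLI:

```shell
ollama rm llama3.1:8b
```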
Show Model Info
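Using the standard Ollama CLI:

```shell
ollama show llama3.1:8b
```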
Deployment Options
Local Deployment
Run Ollama on the same machine as PentAGI.
Docker Deployment
Run Ollama in a container.
Remote Deployment
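The deployment options described here can be sketched with standard commands (the install script URL and Docker image are Ollama's published ones; host names are placeholders):

```shell
# Local: install and run Ollama on the PentAGI host
curl -fsSL https://ollama.com/install.sh | sh
ollama serve

# Docker: official image with GPU passthrough
docker run -d --gpus=all -v ollama:/root/.ollama \
  -p 11434:11434 --name ollama ollama/ollama

# Remote: expose Ollama on a GPU server, then point PentAGI at it
OLLAMA_HOST=0.0.0.0 ollama serve            # on the GPU server
# OLLAMA_SERVER_URL=http://gpu-server:11434 # in PentAGI's .env
```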
Run Ollama on a dedicated GPU server.
Performance Optimization
GPU Acceleration
Ollama automatically uses any available GPU:
- NVIDIA: CUDA acceleration
- AMD: ROCm support
- Apple Silicon: Metal acceleration
Parallel Requests
Configure concurrent model loading, e.g. via OLLAMA_NUM_PARALLEL.
Context Size
Balance context size (num_ctx) against memory usage.
Troubleshooting
Connection Errors
If PentAGI cannot connect to Ollama:
- Verify Ollama is running: curl http://localhost:11434/api/tags
- Check that firewall settings allow port 11434
- Ensure OLLAMA_SERVER_URL is correct in .env
Model Not Found
If a model is not available:
- List models: ollama list
- Pull the model: ollama pull model-name
- Enable auto-pull: OLLAMA_SERVER_PULL_MODELS_ENABLED=true
Out of Memory
If running out of VRAM:
- Use a smaller quantized model (Q4 or Q5 instead of FP16)
- Reduce num_ctx in the Modelfile
- Use CPU inference with OLLAMA_CPU_ONLY=1
- Upgrade the GPU or use a remote GPU server
Slow Inference
If inference is slow:
- Verify GPU acceleration is enabled: ollama ps
- Use an appropriate model size for your hardware
- Enable concurrent requests: OLLAMA_NUM_PARALLEL=4
- Consider a smaller model or quantization
Benefits of Local Deployment
- Zero Cost: No API fees or token costs
- Privacy: All data stays on your infrastructure
- Offline Operation: Works without internet connectivity
- Customization: Full control over models and parameters
- No Rate Limits: Limited only by your hardware
- Compliance: Meet data residency requirements