Ollama enables running large language models locally on your machine.

Overview

Ollama provides:
  • Local model execution (no API costs)
  • Privacy (data stays on your machine)
  • Offline operation
  • Fast inference on local hardware
Supported models:
  • Llama 2/3
  • Mistral
  • Mixtral
  • Phi
  • Gemma
  • And more

Prerequisites

Install Ollama

brew install ollama

Start Ollama Server

ollama serve
Default endpoint: http://localhost:11434
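To confirm the server is reachable before pointing clients at it, you can query the /api/tags endpoint, which returns the installed model list as JSON. A minimal Python sketch using only the standard library (the function name is illustrative, not part of Ollama or ZeroClaw):

```python
import json
import urllib.error
import urllib.request

def ollama_is_up(base_url: str = "http://localhost:11434") -> bool:
    """Return True if an Ollama server answers on base_url."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=2) as resp:
            json.loads(resp.read())  # the model list should parse as JSON
        return True
    except (urllib.error.URLError, ValueError, OSError):
        return False
```

If this returns False, start the server with `ollama serve` and retry.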

Pull a Model

# Llama 3 8B (recommended for most use cases)
ollama pull llama3

# Smaller models (faster, less capable)
ollama pull phi3
ollama pull gemma:2b

# Larger models (more capable, slower)
ollama pull llama3:70b
ollama pull mixtral:8x7b
List installed models:
ollama list

Configuration

Config File

[agent]
provider = "ollama"
model = "llama3"  # Model name from 'ollama list'

[providers.ollama]
base_url = "http://localhost:11434"  # Ollama server URL

CLI Usage

zeroclaw agent --provider ollama --model llama3

Features

Tool Calling

Ollama supports tool calling for compatible models:
[providers.ollama]
tool_calling = "native"  # Use model's native function calling
Models with tool support:
  • llama3:70b
  • mixtral:8x7b
  • mistral
Smaller models may have limited tool calling capabilities.
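When tool calling is enabled, tool definitions are sent to Ollama's /api/chat endpoint as an OpenAI-style `tools` array alongside the messages. A minimal Python sketch of such a payload; the `get_weather` tool is a hypothetical example for illustration, not part of Ollama or ZeroClaw:

```python
def build_tool_request(model: str, prompt: str) -> dict:
    """Sketch of an Ollama /api/chat payload carrying one tool definition."""
    weather_tool = {  # hypothetical example tool schema
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [weather_tool],
        "stream": False,
    }
```

A tool-capable model may then reply with a tool call (the arguments to pass to `get_weather`) instead of plain text.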

Streaming

Real-time response streaming:
[providers.ollama]
stream = true

Custom Parameters

[providers.ollama]
temperature = 0.7
top_p = 0.9
top_k = 40
repeat_penalty = 1.1

Model Selection Guide

For General Use

# Best balance (8GB RAM minimum)
ollama pull llama3

# Faster, less capable (4GB RAM)
ollama pull phi3

For Coding

ollama pull codellama
ollama pull deepseek-coder

For Maximum Quality

# Requires 40GB+ RAM
ollama pull llama3:70b
ollama pull mixtral:8x7b

Performance Tuning

GPU Acceleration

Ollama automatically uses GPU if available (CUDA, Metal, ROCm). Check GPU usage:
ollama ps

Context Window

Adjust context size:
[providers.ollama]
num_ctx = 4096  # Default: 2048
Larger contexts use more memory but allow longer conversations.

Batch Size

[providers.ollama]
num_batch = 512  # Default: 512

Request Format

Ollama uses a simple JSON format:
{
  "model": "llama3",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ],
  "stream": false,
  "options": {
    "temperature": 0.7,
    "top_p": 0.9
  }
}
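This format maps onto Ollama's /api/chat endpoint. A minimal Python sketch of a non-streaming client, assuming a local server on the default port; the function names here are illustrative:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # default local endpoint

def build_chat_request(model: str, prompt: str, temperature: float = 0.7,
                       top_p: float = 0.9) -> dict:
    """Build a payload matching Ollama's /api/chat request format."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": {"temperature": temperature, "top_p": top_p},
    }

def chat(model: str, prompt: str) -> str:
    """Send a non-streaming chat request and return the reply text."""
    payload = json.dumps(build_chat_request(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]
```

With the server running and llama3 pulled, `chat("llama3", "Hello!")` returns the model's reply as a string.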

Troubleshooting

Server Not Running

Solution: Start the Ollama server:
ollama serve
Verify it's running:
curl http://localhost:11434/api/tags

Model Not Found

Solution: Pull the model first:
ollama pull llama3
ollama list

Out of Memory

Solutions:
  1. Use a smaller model:
ollama pull phi3
  2. Reduce the context window:
[providers.ollama]
num_ctx = 2048
  3. Enable GPU acceleration (requires compatible hardware)

Slow Responses

Solutions:
  1. Use a smaller model
  2. Reduce the context window
  3. Close other applications
  4. Reduce num_batch

Example: Complete Setup

# Install Ollama
brew install ollama

# Start server in the background (or run `ollama serve` in a separate terminal)
ollama serve &

# Pull model
ollama pull llama3

# Configure ZeroClaw
zeroclaw config set agent.provider ollama
zeroclaw config set agent.model llama3

# Test
zeroclaw agent -m "Hello!"

Remote Ollama

Connect to Ollama running on another machine:
[providers.ollama]
base_url = "http://192.168.1.100:11434"
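The same HTTP API works over the network, so you can check what a remote host has installed via its /api/tags endpoint. A small Python sketch (function names are illustrative; the parsing is split out so it can be exercised without a live server):

```python
import json
import urllib.request

def parse_model_names(tags: dict) -> list:
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in tags.get("models", [])]

def list_remote_models(base_url: str) -> list:
    """List model names available on a (possibly remote) Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return parse_model_names(json.loads(resp.read()))
```

For example, `list_remote_models("http://192.168.1.100:11434")` would return the names shown by `ollama list` on that machine.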

Docker Deployment

Run Ollama in Docker:
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
docker exec ollama ollama pull llama3
