
Overview

The Ollama provider connects to locally-running Ollama instances to use open-source models like Qwen, Llama, and others. It supports the full OpenAI-compatible API with tool calling. Source: crates/goose/src/providers/ollama.rs

Configuration

Environment Variables

  • OLLAMA_HOST (string, default "localhost"): Ollama server host. Port 11434 is appended automatically for localhost.
  • OLLAMA_TIMEOUT (number, default 600): request timeout in seconds.
  • GOOSE_INPUT_LIMIT (number): overrides the context window size (sets the num_ctx option).
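As an illustration of how these variables resolve to their defaults (a sketch, not the provider's actual code), the same lookup can be expressed with std::env:

```rust
use std::env;

// Illustrative only: resolve OLLAMA_HOST and OLLAMA_TIMEOUT with the
// defaults documented above.
fn resolve_host() -> String {
    env::var("OLLAMA_HOST").unwrap_or_else(|_| "localhost".to_string())
}

fn resolve_timeout_secs() -> u64 {
    env::var("OLLAMA_TIMEOUT")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(600)
}

fn main() {
    println!("host = {}", resolve_host());
    println!("timeout = {}s", resolve_timeout_secs());
}
```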

Setup

# Install Ollama first
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model
ollama pull qwen3

# Start Ollama (usually runs automatically)
ollama serve

# Configure Goose
export OLLAMA_HOST="localhost"  # default
# or for remote Ollama:
export OLLAMA_HOST="http://192.168.1.100:11434"

Supported Models

  • qwen3 (default) - Qwen 3 base model
  • qwen3-vl - Qwen 3 with vision capabilities
  • qwen3-coder:30b - 30B parameter coding model
  • qwen3-coder:480b-cloud - Large cloud-based coding model
  • llama3.3 - Meta’s Llama 3.3
  • codellama - Code-specialized Llama
  • mistral - Mistral models
  • gemma - Google’s Gemma
  • phi - Microsoft’s Phi models
Model Library: https://ollama.com/library

Usage

Basic Usage

use goose::providers::create;
use goose::model::ModelConfig;
use goose::message::Message;

// Create with default model (qwen3)
let model_config = ModelConfig::new("qwen3")?;
let provider = create("ollama", model_config, vec![]).await?;

// Stream a response
let messages = vec![Message::user().with_text("Hello!")];
let stream = provider.stream(
    &provider.get_model_config(),
    "session-123",
    "You are a helpful assistant.",
    &messages,
    &[],
).await?;

Custom Configuration

let model_config = ModelConfig::new("llama3.3")?
    .with_temperature(0.8)
    .with_max_tokens(4096)
    .with_context_limit(Some(8192));

let provider = create("ollama", model_config, vec![]).await?;

Setting Context Window

Ollama allows configuring the context window size:
# Set via environment variable (applies to all requests)
export GOOSE_INPUT_LIMIT=16384

# Or via model config
let model_config = ModelConfig::new("qwen3")?
    .with_context_limit(Some(16384));
This sets the num_ctx parameter in Ollama options.

Advanced Features

Tool Calling

Ollama supports native tool calling for compatible models:
use rmcp::model::Tool;

let tools = vec![
    Tool {
        name: "calculator".into(),
        description: Some("Perform calculations".into()),
        input_schema: serde_json::json!({
            "type": "object",
            "properties": {
                "expression": {"type": "string"},
            },
            "required": ["expression"],
        }),
    },
];

let stream = provider.stream(
    &model_config,
    "session-123",
    "You are a helpful assistant.",
    &messages,
    &tools,
).await?;

XML Tool Call Fallback

For models without native tool support, the provider automatically falls back to XML-based tool calls:
<tool_call>
<name>calculator</name>
<arguments>{"expression": "2+2"}</arguments>
</tool_call>
The provider parses these and converts them to proper tool calls.
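As a rough sketch of that fallback (a hypothetical helper, not the provider's actual parser), the tool name and raw JSON arguments can be extracted from such a block with plain string splitting:

```rust
// Hypothetical sketch: extract the tool name and raw JSON arguments from an
// XML-style <tool_call> block like the one shown above.
fn parse_xml_tool_call(text: &str) -> Option<(String, String)> {
    // Isolate the content between <tool_call> and </tool_call>.
    let block = text.split("<tool_call>").nth(1)?.split("</tool_call>").next()?;
    // Pull out the <name> and <arguments> children.
    let name = block.split("<name>").nth(1)?.split("</name>").next()?.trim();
    let args = block
        .split("<arguments>")
        .nth(1)?
        .split("</arguments>")
        .next()?
        .trim();
    Some((name.to_string(), args.to_string()))
}

fn main() {
    let text = "<tool_call>\n<name>calculator</name>\n<arguments>{\"expression\": \"2+2\"}</arguments>\n</tool_call>";
    let (name, args) = parse_xml_tool_call(text).unwrap();
    println!("{name}: {args}");
}
```

In the real provider the extracted arguments would then be JSON-parsed and converted into a proper tool call message.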

Chat Mode

Disable tools in chat-only mode:
export GOOSE_MODE="chat"
In chat mode, tools are filtered out before sending to Ollama.

Vision Models

Use vision-capable models for image understanding:
let model_config = ModelConfig::new("qwen3-vl")?;
let provider = create("ollama", model_config, vec![]).await?;

let messages = vec![
    Message::user()
        .with_image("https://example.com/image.jpg", "image/jpeg")
        .with_text("What's in this image?"),
];

Remote Ollama

Connect to Ollama running on another machine:
# Explicit HTTP URL
export OLLAMA_HOST="http://192.168.1.100:11434"

# HTTPS with custom port
export OLLAMA_HOST="https://ollama.example.com:443"

# Domain without protocol (assumes HTTP)
export OLLAMA_HOST="ollama.local"

Port Handling

  • localhost defaults to port 11434
  • Remote hosts without explicit port: no default port
  • Explicit ports in URL are respected: http://host:8080
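The rules above can be sketched as a small normalization function (illustrative only; the real logic lives in crates/goose/src/providers/ollama.rs):

```rust
// Illustrative sketch of the port-handling rules listed above.
fn normalize_host(host: &str) -> String {
    // Explicit scheme: respect the URL (and any explicit port) as-is.
    if host.starts_with("http://") || host.starts_with("https://") {
        return host.to_string();
    }
    // Bare localhost gets the default Ollama port 11434.
    if host == "localhost" {
        return format!("http://{host}:11434");
    }
    // Other bare hosts: assume HTTP and add no default port.
    format!("http://{host}")
}

fn main() {
    println!("{}", normalize_host("localhost"));        // http://localhost:11434
    println!("{}", normalize_host("ollama.local"));     // http://ollama.local
    println!("{}", normalize_host("http://host:8080")); // http://host:8080
}
```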

Implementation Details

Provider Metadata

impl ProviderDef for OllamaProvider {
    fn metadata() -> ProviderMetadata {
        ProviderMetadata::new(
            "ollama",
            "Ollama",
            "Local open source models",
            "qwen3",
            OLLAMA_KNOWN_MODELS.to_vec(),
            "https://ollama.com/library",
            vec![
                ConfigKey::new("OLLAMA_HOST", true, false, Some("localhost"), true),
                ConfigKey::new("OLLAMA_TIMEOUT", false, false, Some("600"), false),
            ],
        )
    }
}

API Format

Uses OpenAI-compatible chat completions format:
{
  "model": "qwen3",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "Hello!"
    }
  ],
  "stream": true,
  "options": {
    "num_ctx": 8192
  }
}
Endpoint: POST /v1/chat/completions

Context Window Configuration

fn resolve_ollama_num_ctx(model_config: &ModelConfig) -> Option<usize> {
    let config = Config::global();
    
    // Priority: GOOSE_INPUT_LIMIT > model_config.context_limit
    config.get_param::<usize>("GOOSE_INPUT_LIMIT")
        .ok()
        .or(model_config.context_limit)
}

fn apply_ollama_options(payload: &mut Value, model_config: &ModelConfig) {
    if let Some(limit) = resolve_ollama_num_ctx(model_config) {
        payload["options"]["num_ctx"] = json!(limit);
    }
}

No Authentication

Ollama doesn’t require authentication:
let api_client = ApiClient::with_timeout(
    base_url,
    AuthMethod::NoAuth,
    timeout,
)?;

Streaming

Ollama supports standard SSE streaming with a custom parser that handles XML tool calls:
fn stream_ollama(response: Response, log: RequestLog) -> Result<MessageStream> {
    let message_stream = response_to_streaming_message_ollama(framed);
    // Custom parser that:
    // 1. Buffers text when XML tool tags detected
    // 2. Parses complete <tool_call> blocks
    // 3. Converts to proper tool call messages
    // 4. Prevents duplicate content emission
}

Fetching Available Models

// Get all locally pulled models
let models = provider.fetch_supported_models().await?;

// Queries: GET /api/tags
Example response:
{
  "models": [
    {
      "name": "qwen3:latest",
      "modified_at": "2024-03-04T12:00:00Z",
      "size": 4661211648
    },
    {
      "name": "llama3.3:latest",
      "modified_at": "2024-03-03T10:00:00Z",
      "size": 8661211648
    }
  ]
}

Error Handling

match provider.stream(...).await {
    Ok(stream) => { /* handle stream */ },
    Err(ProviderError::RequestFailed(msg)) => {
        if msg.contains("connection refused") {
            eprintln!("Ollama not running. Start with: ollama serve");
        } else {
            eprintln!("Request failed: {}", msg);
        }
    },
    Err(e) => eprintln!("Error: {}", e),
}

Custom Provider Configuration

Create from declarative config:
use goose::config::declarative_providers::DeclarativeProviderConfig;
use goose::providers::ollama::OllamaProvider;

let config = DeclarativeProviderConfig {
    name: "my-ollama".to_string(),
    engine: ProviderEngine::Ollama,
    base_url: "http://192.168.1.100:11434".to_string(),
    supports_streaming: Some(true),
    // ... other fields
};

let provider = OllamaProvider::from_custom_config(model_config, config)?;

Performance Tips

1. Adjust Context Window

Smaller context = faster inference:
export GOOSE_INPUT_LIMIT=4096  # Faster
# vs default 8192 or higher

2. Use Quantized Models

Smaller quantized models are faster:
ollama pull qwen3:q4_0  # 4-bit quantization
ollama pull qwen3:q8_0  # 8-bit (more accurate, slower)

3. GPU Acceleration

Ollama automatically uses the GPU when available. Check which models are currently loaded and whether they are running on GPU with:
ollama ps
# Shows running models and CPU/GPU usage

4. Keep Models Loaded

Models stay in memory after first use. For faster subsequent requests:
ollama run qwen3
# Keep this running in background

Troubleshooting

Ollama Not Running

# Check if running
curl http://localhost:11434/api/tags

# Start Ollama
ollama serve

Model Not Found

# List available models
ollama list

# Pull missing model
ollama pull qwen3

Out of Memory

Reduce context window or use smaller model:
export GOOSE_INPUT_LIMIT=2048
# or
ollama pull qwen3:8b  # instead of qwen3:30b
