
Overview

The Ollama provider connects to locally-running Ollama instances to use open-source models like Qwen, Llama, and others. It supports the full OpenAI-compatible API with tool calling. Source: crates/goose/src/providers/ollama.rs

Configuration

Environment Variables

  • OLLAMA_HOST (string, default "localhost"): Ollama server host. Port 11434 is appended automatically for localhost.
  • OLLAMA_TIMEOUT (number, default 600): request timeout in seconds.
  • GOOSE_INPUT_LIMIT (number): overrides the context window size (sets the num_ctx option).
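As an illustration of how these variables resolve to their defaults (a sketch, not the provider's actual code), the same lookup can be expressed with std::env:

```rust
use std::env;

// Illustrative only: resolve OLLAMA_HOST and OLLAMA_TIMEOUT with the
// defaults documented above.
fn resolve_host() -> String {
    env::var("OLLAMA_HOST").unwrap_or_else(|_| "localhost".to_string())
}

fn resolve_timeout_secs() -> u64 {
    env::var("OLLAMA_TIMEOUT")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(600)
}

fn main() {
    println!("host = {}", resolve_host());
    println!("timeout = {}s", resolve_timeout_secs());
}
```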

Setup

# Install Ollama first
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model
ollama pull qwen3

# Start Ollama (usually runs automatically)
ollama serve

# Configure Goose
export OLLAMA_HOST="localhost"  # default
# or for remote Ollama:
export OLLAMA_HOST="http://192.168.1.100:11434"

Supported Models

  • qwen3 (default) - Qwen 3 base model
  • qwen3-vl - Qwen 3 with vision capabilities
  • qwen3-coder:30b - 30B parameter coding model
  • qwen3-coder:480b-cloud - Large cloud-based coding model
  • llama3.3 - Meta’s Llama 3.3
  • codellama - Code-specialized Llama
  • mistral - Mistral models
  • gemma - Google’s Gemma
  • phi - Microsoft’s Phi models
Model Library: https://ollama.com/library

Usage

Basic Usage

use goose::providers::create;
use goose::model::ModelConfig;
use goose::message::Message;

// Create with default model (qwen3)
let model_config = ModelConfig::new("qwen3")?;
let provider = create("ollama", model_config, vec![]).await?;

// Stream a response
let messages = vec![Message::user().with_text("Hello!")];
let stream = provider.stream(
    &provider.get_model_config(),
    "session-123",
    "You are a helpful assistant.",
    &messages,
    &[],
).await?;

Custom Configuration

let model_config = ModelConfig::new("llama3.3")?
    .with_temperature(0.8)
    .with_max_tokens(4096)
    .with_context_limit(Some(8192));

let provider = create("ollama", model_config, vec![]).await?;

Setting Context Window

Ollama allows configuring the context window size:
# Set via environment variable (applies to all requests)
export GOOSE_INPUT_LIMIT=16384

# Or via model config
let model_config = ModelConfig::new("qwen3")?
    .with_context_limit(Some(16384));
This sets the num_ctx parameter in Ollama options.

Advanced Features

Tool Calling

Ollama supports native tool calling for compatible models:
use rmcp::model::Tool;

let tools = vec![
    Tool {
        name: "calculator".into(),
        description: Some("Perform calculations".into()),
        input_schema: serde_json::json!({
            "type": "object",
            "properties": {
                "expression": {"type": "string"},
            },
            "required": ["expression"],
        }),
    },
];

let stream = provider.stream(
    &model_config,
    "session-123",
    "You are a helpful assistant.",
    &messages,
    &tools,
).await?;

XML Tool Call Fallback

For models without native tool support, the provider automatically falls back to XML-based tool calls:
<tool_call>
<name>calculator</name>
<arguments>{"expression": "2+2"}</arguments>
</tool_call>
The provider parses these and converts them to proper tool calls.
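As a rough sketch of that fallback (a hypothetical helper, not the provider's actual parser), the tool name and raw JSON arguments can be extracted from such a block with plain string splitting:

```rust
// Hypothetical sketch: extract the tool name and raw JSON arguments from an
// XML-style <tool_call> block like the one shown above.
fn parse_xml_tool_call(text: &str) -> Option<(String, String)> {
    // Isolate the content between <tool_call> and </tool_call>.
    let block = text.split("<tool_call>").nth(1)?.split("</tool_call>").next()?;
    // Pull out the <name> and <arguments> children.
    let name = block.split("<name>").nth(1)?.split("</name>").next()?.trim();
    let args = block
        .split("<arguments>")
        .nth(1)?
        .split("</arguments>")
        .next()?
        .trim();
    Some((name.to_string(), args.to_string()))
}

fn main() {
    let text = "<tool_call>\n<name>calculator</name>\n<arguments>{\"expression\": \"2+2\"}</arguments>\n</tool_call>";
    let (name, args) = parse_xml_tool_call(text).unwrap();
    println!("{name}: {args}");
}
```

In the real provider the extracted arguments would then be JSON-parsed and converted into a proper tool call message.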

Chat Mode

Disable tools in chat-only mode:
export GOOSE_MODE="chat"
In chat mode, tools are filtered out before sending to Ollama.

Vision Models

Use vision-capable models for image understanding:
let model_config = ModelConfig::new("qwen3-vl")?;
let provider = create("ollama", model_config, vec![]).await?;

let messages = vec![
    Message::user()
        .with_image("https://example.com/image.jpg", "image/jpeg")
        .with_text("What's in this image?"),
];

Remote Ollama

Connect to Ollama running on another machine:
# Explicit HTTP URL
export OLLAMA_HOST="http://192.168.1.100:11434"

# HTTPS with custom port
export OLLAMA_HOST="https://ollama.example.com:443"

# Domain without protocol (assumes HTTP)
export OLLAMA_HOST="ollama.local"

Port Handling

  • localhost defaults to port 11434
  • Remote hosts without explicit port: no default port
  • Explicit ports in URL are respected: http://host:8080
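The rules above can be sketched as a small normalization function (illustrative only; the real logic lives in crates/goose/src/providers/ollama.rs):

```rust
// Illustrative sketch of the port-handling rules listed above.
fn normalize_host(host: &str) -> String {
    // Explicit scheme: respect the URL (and any explicit port) as-is.
    if host.starts_with("http://") || host.starts_with("https://") {
        return host.to_string();
    }
    // Bare localhost gets the default Ollama port 11434.
    if host == "localhost" {
        return format!("http://{host}:11434");
    }
    // Other bare hosts: assume HTTP and add no default port.
    format!("http://{host}")
}

fn main() {
    println!("{}", normalize_host("localhost"));        // http://localhost:11434
    println!("{}", normalize_host("ollama.local"));     // http://ollama.local
    println!("{}", normalize_host("http://host:8080")); // http://host:8080
}
```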

Implementation Details

Provider Metadata

impl ProviderDef for OllamaProvider {
    fn metadata() -> ProviderMetadata {
        ProviderMetadata::new(
            "ollama",
            "Ollama",
            "Local open source models",
            "qwen3",
            OLLAMA_KNOWN_MODELS.to_vec(),
            "https://ollama.com/library",
            vec![
                ConfigKey::new("OLLAMA_HOST", true, false, Some("localhost"), true),
                ConfigKey::new("OLLAMA_TIMEOUT", false, false, Some("600"), false),
            ],
        )
    }
}

API Format

Uses OpenAI-compatible chat completions format:
{
  "model": "qwen3",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "Hello!"
    }
  ],
  "stream": true,
  "options": {
    "num_ctx": 8192
  }
}
Endpoint: POST /v1/chat/completions

Context Window Configuration

fn resolve_ollama_num_ctx(model_config: &ModelConfig) -> Option<usize> {
    let config = Config::global();
    
    // Priority: GOOSE_INPUT_LIMIT > model_config.context_limit
    config.get_param::<usize>("GOOSE_INPUT_LIMIT")
        .ok()
        .or(model_config.context_limit)
}

fn apply_ollama_options(payload: &mut Value, model_config: &ModelConfig) {
    if let Some(limit) = resolve_ollama_num_ctx(model_config) {
        payload["options"]["num_ctx"] = json!(limit);
    }
}

No Authentication

Ollama doesn’t require authentication:
let api_client = ApiClient::with_timeout(
    base_url,
    AuthMethod::NoAuth,
    timeout,
)?;

Streaming

Ollama supports standard SSE streaming with a custom parser that handles XML tool calls:
fn stream_ollama(response: Response, log: RequestLog) -> Result<MessageStream> {
    let message_stream = response_to_streaming_message_ollama(framed);
    // Custom parser that:
    // 1. Buffers text when XML tool tags detected
    // 2. Parses complete <tool_call> blocks
    // 3. Converts to proper tool call messages
    // 4. Prevents duplicate content emission
}

Fetching Available Models

// Get all locally pulled models
let models = provider.fetch_supported_models().await?;

// Queries: GET /api/tags
Example response:
{
  "models": [
    {
      "name": "qwen3:latest",
      "modified_at": "2024-03-04T12:00:00Z",
      "size": 4661211648
    },
    {
      "name": "llama3.3:latest",
      "modified_at": "2024-03-03T10:00:00Z",
      "size": 8661211648
    }
  ]
}

Error Handling

match provider.stream(...).await {
    Ok(stream) => { /* handle stream */ },
    Err(ProviderError::RequestFailed(msg)) => {
        if msg.contains("connection refused") {
            eprintln!("Ollama not running. Start with: ollama serve");
        } else {
            eprintln!("Request failed: {}", msg);
        }
    },
    Err(e) => eprintln!("Error: {}", e),
}

Custom Provider Configuration

Create from declarative config:
use goose::config::declarative_providers::DeclarativeProviderConfig;
use goose::providers::ollama::OllamaProvider;

let config = DeclarativeProviderConfig {
    name: "my-ollama".to_string(),
    engine: ProviderEngine::Ollama,
    base_url: "http://192.168.1.100:11434".to_string(),
    supports_streaming: Some(true),
    // ... other fields
};

let provider = OllamaProvider::from_custom_config(model_config, config)?;

Performance Tips

1. Adjust Context Window

Smaller context = faster inference:
export GOOSE_INPUT_LIMIT=4096  # Faster
# vs default 8192 or higher

2. Use Quantized Models

Smaller quantized models are faster:
ollama pull qwen3:q4_0  # 4-bit quantization
ollama pull qwen3:q8_0  # 8-bit (more accurate, slower)

3. GPU Acceleration

Ollama automatically uses the GPU when available. Check which models are currently loaded and whether they are running on GPU with:
ollama ps
# Shows running models and CPU/GPU usage

4. Keep Models Loaded

Models stay in memory after first use. For faster subsequent requests:
ollama run qwen3
# Keep this running in background

Troubleshooting

Ollama Not Running

# Check if running
curl http://localhost:11434/api/tags

# Start Ollama
ollama serve

Model Not Found

# List available models
ollama list

# Pull missing model
ollama pull qwen3

Out of Memory

Reduce context window or use smaller model:
export GOOSE_INPUT_LIMIT=2048
# or
ollama pull qwen3:8b  # instead of qwen3:30b
