OpenCode supports using self-hosted model providers that implement an OpenAI-compatible API. This allows you to run models locally or connect to custom inference servers.

Overview

The local provider enables you to:
  • Run models on your own hardware
  • Use custom inference servers (Ollama, LM Studio, vLLM, etc.)
  • Experiment with fine-tuned or custom models
  • Maintain full control over your data and privacy
Your self-hosted provider must implement an OpenAI-compatible API for tool calling and streaming.

Configuration

Setting the endpoint

Configure your self-hosted provider endpoint using the LOCAL_ENDPOINT environment variable:
export LOCAL_ENDPOINT="http://localhost:1234/v1"
The value should be the full base URL, including the API version path (usually /v1).

Configuring models

Once the endpoint is set, configure your agents to use local models in your .opencode.json:
.opencode.json
{
  "agents": {
    "coder": {
      "model": "local.granite-3.3-2b-instruct@q8_0",
      "maxTokens": 5000
    },
    "task": {
      "model": "local.llama-3.3-70b-instruct",
      "maxTokens": 4096
    },
    "title": {
      "model": "local.granite-3.3-2b-instruct@q8_0",
      "maxTokens": 80
    }
  }
}
Model names use the format local.<model-name> where <model-name> matches the model identifier in your inference server.
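As an illustration of this naming convention, the provider prefix can be split from the server-side model identifier as below. The helper is hypothetical, shown only to make the mapping concrete; OpenCode performs this resolution internally.

```python
def parse_local_model(model: str) -> str:
    """Strip the "local." provider prefix from an OpenCode model name,
    returning the identifier your inference server expects.

    Illustrative helper only, not part of OpenCode's API.
    """
    prefix = "local."
    if not model.startswith(prefix):
        raise ValueError(f"not a local model: {model!r}")
    return model[len(prefix):]

# "local.llama3.3:70b" maps to the Ollama model id "llama3.3:70b"
print(parse_local_model("local.llama3.3:70b"))  # → llama3.3:70b
```

Whatever follows the `local.` prefix must match, character for character, a model id your server reports from its /v1/models endpoint.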

Ollama

Ollama provides an easy way to run models locally.

1. Install Ollama

Download and install Ollama from ollama.ai

2. Pull a model

ollama pull llama3.3:70b

3. Configure OpenCode

Set the Ollama API endpoint:
export LOCAL_ENDPOINT="http://localhost:11434/v1"
Update your configuration:
.opencode.json
{
  "agents": {
    "coder": {
      "model": "local.llama3.3:70b",
      "maxTokens": 5000
    }
  }
}

LM Studio

LM Studio provides a desktop application for running models.

1. Install LM Studio

Download and install LM Studio from lmstudio.ai

2. Load a model

Use LM Studio’s interface to download and load a model

3. Start the server

Enable the local server in LM Studio (usually runs on port 1234)

4. Configure OpenCode

export LOCAL_ENDPOINT="http://localhost:1234/v1"
.opencode.json
{
  "agents": {
    "coder": {
      "model": "local.your-model-name",
      "maxTokens": 5000
    }
  }
}

vLLM

vLLM is a high-throughput inference server.

1. Install vLLM

pip install vllm

2. Start the server

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --port 8000

3. Configure OpenCode

export LOCAL_ENDPOINT="http://localhost:8000/v1"
.opencode.json
{
  "agents": {
    "coder": {
      "model": "local.meta-llama/Llama-3.3-70B-Instruct",
      "maxTokens": 5000
    }
  }
}

Text Generation WebUI

Text Generation WebUI provides an OpenAI-compatible API extension.

1. Enable OpenAI extension

Enable the openai extension in the WebUI settings

2. Configure OpenCode

export LOCAL_ENDPOINT="http://localhost:5000/v1"
.opencode.json
{
  "agents": {
    "coder": {
      "model": "local.your-model-name",
      "maxTokens": 5000
    }
  }
}

Advanced configuration

Reasoning effort for reasoning models

Some self-hosted models support reasoning modes. Configure reasoning effort if your model supports it:
.opencode.json
{
  "agents": {
    "coder": {
      "model": "local.deepseek-r1-70b",
      "maxTokens": 5000,
      "reasoningEffort": "high"
    }
  }
}
Valid values:
  • low - Faster responses with less reasoning
  • medium - Balanced approach (default)
  • high - More thorough reasoning
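Under the hood, an OpenAI-compatible server typically receives this setting as a `reasoning_effort` field on the chat completion request. The sketch below shows the request body shape only; whether your backend honors the field depends on the server and model, and unsupported fields may be ignored or rejected.

```python
import json

def build_chat_request(model: str, prompt: str, reasoning_effort: str = "medium") -> str:
    """Sketch of an OpenAI-style chat completion body carrying a
    reasoning-effort hint. Illustrative only; field support varies
    across self-hosted servers."""
    if reasoning_effort not in ("low", "medium", "high"):
        raise ValueError(f"invalid reasoning effort: {reasoning_effort!r}")
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": reasoning_effort,
    }
    return json.dumps(body)

print(build_chat_request("deepseek-r1-70b", "Summarize this diff", "high"))
```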

Custom headers

If your inference server requires custom headers, you can pass them through OpenCode’s provider options (requires code modification).

Authentication

For self-hosted endpoints that require authentication, you can set an API key:
.opencode.json
{
  "providers": {
    "local": {
      "apiKey": "your-api-key-here"
    }
  }
}
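OpenAI-compatible endpoints normally expect the API key as a Bearer token in the Authorization header. A minimal stdlib sketch of such a request, built but deliberately not sent here (the endpoint and key are placeholders):

```python
import urllib.request

endpoint = "http://localhost:1234/v1"  # your LOCAL_ENDPOINT (placeholder)
api_key = "your-api-key-here"          # placeholder key

req = urllib.request.Request(
    f"{endpoint}/models",
    headers={"Authorization": f"Bearer {api_key}"},
)
# urllib.request.urlopen(req) would perform the call; here we only
# show the header an OpenAI-compatible server expects to receive.
print(req.get_header("Authorization"))  # → Bearer your-api-key-here
```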

Model requirements

For the best experience, your self-hosted model should support the following.

Tool calling: OpenCode relies heavily on tool/function calling for file operations, code execution, and more. Your model must support the OpenAI tools API format. Models known to work well:
  • Llama 3.3 70B Instruct
  • Qwen 2.5 Coder
  • Granite 3.1 (IBM)
  • Mistral Large

Streaming: Streaming responses provide a better user experience. Your inference server should support server-sent events (SSE) streaming.

Context window: A larger context window (32K+ tokens) is recommended for complex coding tasks. Configure maxTokens according to your model’s capabilities.
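To make the tools requirement concrete, here is the shape of a function definition in an OpenAI-style request that your server must accept. The `read_file` tool is a made-up example for illustration, not OpenCode's actual tool schema.

```python
import json

# Example function definition in the OpenAI tools format. The read_file
# tool is illustrative only; OpenCode defines its own tools.
tools = [
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read a file from the workspace",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "File path to read"},
                },
                "required": ["path"],
            },
        },
    }
]

request_body = {
    "model": "llama3.3:70b",
    "messages": [{"role": "user", "content": "Show me main.py"}],
    "tools": tools,
    "stream": True,  # SSE streaming, also expected by OpenCode
}
print(json.dumps(request_body, indent=2))
```

A model that supports tool calling responds with `tool_calls` entries naming one of these functions plus JSON arguments, rather than plain text.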

Troubleshooting

Symptoms: Cannot connect to the local endpoint.
Solution:
  • Verify your inference server is running
  • Check the endpoint URL is correct (including port and /v1 path)
  • Ensure there are no firewall rules blocking the connection
# Test the endpoint
curl http://localhost:1234/v1/models
Symptoms: Error indicating the model doesn’t exist.
Solution:
  • Check your model name matches exactly what’s loaded in your inference server
  • List available models:
curl http://localhost:1234/v1/models
Symptoms: The model doesn’t use tools or returns invalid tool calls.
Solution:
  • Ensure your model supports function/tool calling
  • Check that your inference server properly implements the OpenAI tools API
  • Try a different model known to support tool calling (e.g., Llama 3.3)
Symptoms: Responses take a long time to generate.
Solution:
  • Use a smaller, faster model for the task and title agents
  • Enable GPU acceleration in your inference server
  • Reduce maxTokens for faster responses
  • Consider quantized models (Q4, Q8) for better performance
Symptoms: Inference server crashes or returns memory errors.
Solution:
  • Use a smaller model or quantized version
  • Reduce the context window size
  • Allocate more RAM/VRAM to your inference server
  • Enable CPU offloading if using GPU inference

Performance tips

Use appropriate model sizes

  • Large models (70B+) for complex tasks (coder agent)
  • Small models (7B-13B) for simple tasks (title agent)

Enable GPU acceleration

Most inference servers support GPU acceleration for significant speed improvements.

Quantization

Use quantized models (Q4_K_M, Q8_0) to reduce memory usage while maintaining quality.
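As rough arithmetic, weight memory scales with parameter count times bits per weight; the sketch below ignores KV cache, activations, and quantization overhead, so treat the numbers as lower bounds.

```python
def approx_model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough weight-memory estimate in GB: params * bits / 8.
    Ignores KV cache, activations, and quantization overhead."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# A 70B model: ~140 GB at FP16, ~70 GB at 8-bit (Q8_0), ~35 GB at 4-bit
print(round(approx_model_size_gb(70, 16)))  # → 140
print(round(approx_model_size_gb(70, 8)))   # → 70
print(round(approx_model_size_gb(70, 4)))   # → 35
```

This is why a Q4 quantization of a 70B model fits on hardware that could never hold the FP16 weights.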

Batch processing

Tune your inference server’s batch size and maximum number of parallel requests for your hardware.

Example configurations

Development setup (fast iteration)

.opencode.json
{
  "agents": {
    "coder": {
      "model": "local.qwen2.5-coder:7b-instruct-q4_K_M",
      "maxTokens": 4096
    },
    "task": {
      "model": "local.qwen2.5-coder:7b-instruct-q4_K_M",
      "maxTokens": 2048
    },
    "title": {
      "model": "local.granite-3.3-2b-instruct@q8_0",
      "maxTokens": 80
    }
  }
}

Production setup (high quality)

.opencode.json
{
  "agents": {
    "coder": {
      "model": "local.llama-3.3-70b-instruct",
      "maxTokens": 8192
    },
    "task": {
      "model": "local.qwen2.5-coder:32b-instruct",
      "maxTokens": 4096
    },
    "title": {
      "model": "local.llama-3.3-70b-instruct",
      "maxTokens": 80
    }
  }
}
