OpenCode supports using self-hosted model providers that implement an OpenAI-compatible API. This allows you to run models locally or connect to custom inference servers.

Overview

The local provider enables you to:
  • Run models on your own hardware
  • Use custom inference servers (Ollama, LM Studio, vLLM, etc.)
  • Experiment with fine-tuned or custom models
  • Maintain full control over your data and privacy
Your self-hosted provider must implement an OpenAI-compatible API for tool calling and streaming.

Configuration

Setting the endpoint

Configure your self-hosted provider endpoint using the LOCAL_ENDPOINT environment variable:
export LOCAL_ENDPOINT="http://localhost:1234/v1"
The value should be the full base URL, including the API version path (usually /v1).

Configuring models

Once the endpoint is set, configure your agents to use local models in your .opencode.json:
.opencode.json
{
  "agents": {
    "coder": {
      "model": "local.granite-3.3-2b-instruct@q8_0",
      "maxTokens": 5000
    },
    "task": {
      "model": "local.llama-3.3-70b-instruct",
      "maxTokens": 4096
    },
    "title": {
      "model": "local.granite-3.3-2b-instruct@q8_0",
      "maxTokens": 80
    }
  }
}
Model names use the format local.<model-name> where <model-name> matches the model identifier in your inference server.
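As an illustration of this naming convention, the provider prefix can be split from the server-side model identifier as below. The helper is hypothetical, shown only to make the mapping concrete; OpenCode performs this resolution internally.

```python
def parse_local_model(model: str) -> str:
    """Strip the "local." provider prefix from an OpenCode model name,
    returning the identifier your inference server expects.

    Illustrative helper only, not part of OpenCode's API.
    """
    prefix = "local."
    if not model.startswith(prefix):
        raise ValueError(f"not a local model: {model!r}")
    return model[len(prefix):]

# "local.llama3.3:70b" maps to the Ollama model id "llama3.3:70b"
print(parse_local_model("local.llama3.3:70b"))  # → llama3.3:70b
```

Whatever follows the `local.` prefix must match, character for character, a model id your server reports from its /v1/models endpoint.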

Ollama

Ollama provides an easy way to run models locally.

1. Install Ollama

Download and install Ollama from ollama.ai

2. Pull a model

ollama pull llama3.3:70b

3. Configure OpenCode

Set the Ollama API endpoint:
export LOCAL_ENDPOINT="http://localhost:11434/v1"
Update your configuration:
.opencode.json
{
  "agents": {
    "coder": {
      "model": "local.llama3.3:70b",
      "maxTokens": 5000
    }
  }
}

LM Studio

LM Studio provides a desktop application for running models.

1. Install LM Studio

Download and install LM Studio from lmstudio.ai

2. Load a model

Use LM Studio’s interface to download and load a model

3. Start the server

Enable the local server in LM Studio (usually runs on port 1234)

4. Configure OpenCode

export LOCAL_ENDPOINT="http://localhost:1234/v1"
.opencode.json
{
  "agents": {
    "coder": {
      "model": "local.your-model-name",
      "maxTokens": 5000
    }
  }
}

vLLM

vLLM is a high-throughput inference server.

1. Install vLLM

pip install vllm

2. Start the server

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --port 8000

3. Configure OpenCode

export LOCAL_ENDPOINT="http://localhost:8000/v1"
.opencode.json
{
  "agents": {
    "coder": {
      "model": "local.meta-llama/Llama-3.3-70B-Instruct",
      "maxTokens": 5000
    }
  }
}

Text Generation WebUI

Text Generation WebUI provides an OpenAI-compatible API extension.

1. Enable OpenAI extension

Enable the openai extension in the WebUI settings

2. Configure OpenCode

export LOCAL_ENDPOINT="http://localhost:5000/v1"
.opencode.json
{
  "agents": {
    "coder": {
      "model": "local.your-model-name",
      "maxTokens": 5000
    }
  }
}

Advanced configuration

Reasoning effort for reasoning models

Some self-hosted models support reasoning modes. Configure reasoning effort if your model supports it:
.opencode.json
{
  "agents": {
    "coder": {
      "model": "local.deepseek-r1-70b",
      "maxTokens": 5000,
      "reasoningEffort": "high"
    }
  }
}
Valid values:
  • low - Faster responses with less reasoning
  • medium - Balanced approach (default)
  • high - More thorough reasoning
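Under the hood, an OpenAI-compatible server typically receives this setting as a `reasoning_effort` field on the chat completion request. The sketch below shows the request body shape only; whether your backend honors the field depends on the server and model, and unsupported fields may be ignored or rejected.

```python
import json

def build_chat_request(model: str, prompt: str, reasoning_effort: str = "medium") -> str:
    """Sketch of an OpenAI-style chat completion body carrying a
    reasoning-effort hint. Illustrative only; field support varies
    across self-hosted servers."""
    if reasoning_effort not in ("low", "medium", "high"):
        raise ValueError(f"invalid reasoning effort: {reasoning_effort!r}")
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": reasoning_effort,
    }
    return json.dumps(body)

print(build_chat_request("deepseek-r1-70b", "Summarize this diff", "high"))
```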

Custom headers

If your inference server requires custom headers, you can pass them through OpenCode’s provider options (requires code modification).

Authentication

For self-hosted endpoints that require authentication, you can set an API key:
.opencode.json
{
  "providers": {
    "local": {
      "apiKey": "your-api-key-here"
    }
  }
}
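OpenAI-compatible endpoints normally expect the API key as a Bearer token in the Authorization header. A minimal stdlib sketch of such a request, built but deliberately not sent here (the endpoint and key are placeholders):

```python
import urllib.request

endpoint = "http://localhost:1234/v1"  # your LOCAL_ENDPOINT (placeholder)
api_key = "your-api-key-here"          # placeholder key

req = urllib.request.Request(
    f"{endpoint}/models",
    headers={"Authorization": f"Bearer {api_key}"},
)
# urllib.request.urlopen(req) would perform the call; here we only
# show the header an OpenAI-compatible server expects to receive.
print(req.get_header("Authorization"))  # → Bearer your-api-key-here
```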

Model requirements

For the best experience, your self-hosted model should support the following.

Tool calling: OpenCode relies heavily on tool/function calling for file operations, code execution, and more. Your model must support the OpenAI tools API format. Models known to work well:
  • Llama 3.3 70B Instruct
  • Qwen 2.5 Coder
  • Granite 3.1 (IBM)
  • Mistral Large

Streaming: Streaming responses provide a better user experience. Your inference server should support server-sent events (SSE) streaming.

Context window: A larger context window (32K+ tokens) is recommended for complex coding tasks. Configure maxTokens according to your model’s capabilities.
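To make the tools requirement concrete, here is the shape of a function definition in an OpenAI-style request that your server must accept. The `read_file` tool is a made-up example for illustration, not OpenCode's actual tool schema.

```python
import json

# Example function definition in the OpenAI tools format. The read_file
# tool is illustrative only; OpenCode defines its own tools.
tools = [
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read a file from the workspace",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "File path to read"},
                },
                "required": ["path"],
            },
        },
    }
]

request_body = {
    "model": "llama3.3:70b",
    "messages": [{"role": "user", "content": "Show me main.py"}],
    "tools": tools,
    "stream": True,  # SSE streaming, also expected by OpenCode
}
print(json.dumps(request_body, indent=2))
```

A model that supports tool calling responds with `tool_calls` entries naming one of these functions plus JSON arguments, rather than plain text.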

Troubleshooting

Symptoms: Cannot connect to the local endpoint.
Solution:
  • Verify your inference server is running
  • Check the endpoint URL is correct (including port and /v1 path)
  • Ensure there are no firewall rules blocking the connection
# Test the endpoint
curl http://localhost:1234/v1/models
Symptoms: Error indicating the model doesn’t exist.
Solution:
  • Check your model name matches exactly what’s loaded in your inference server
  • List available models:
curl http://localhost:1234/v1/models
Symptoms: The model doesn’t use tools or returns invalid tool calls.
Solution:
  • Ensure your model supports function/tool calling
  • Check that your inference server properly implements the OpenAI tools API
  • Try a different model known to support tool calling (e.g., Llama 3.3)
Symptoms: Responses take a long time to generate.
Solution:
  • Use a smaller, faster model for the task and title agents
  • Enable GPU acceleration in your inference server
  • Reduce maxTokens for faster responses
  • Consider quantized models (Q4, Q8) for better performance
Symptoms: Inference server crashes or returns memory errors.
Solution:
  • Use a smaller model or quantized version
  • Reduce the context window size
  • Allocate more RAM/VRAM to your inference server
  • Enable CPU offloading if using GPU inference

Performance tips

Use appropriate model sizes

  • Large models (70B+) for complex tasks (coder agent)
  • Small models (7B-13B) for simple tasks (title agent)

Enable GPU acceleration

Most inference servers support GPU acceleration for significant speed improvements.

Quantization

Use quantized models (Q4_K_M, Q8_0) to reduce memory usage while maintaining quality.
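As rough arithmetic, weight memory scales with parameter count times bits per weight; the sketch below ignores KV cache, activations, and quantization overhead, so treat the numbers as lower bounds.

```python
def approx_model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough weight-memory estimate in GB: params * bits / 8.
    Ignores KV cache, activations, and quantization overhead."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# A 70B model: ~140 GB at FP16, ~70 GB at 8-bit (Q8_0), ~35 GB at 4-bit
print(round(approx_model_size_gb(70, 16)))  # → 140
print(round(approx_model_size_gb(70, 8)))   # → 70
print(round(approx_model_size_gb(70, 4)))   # → 35
```

This is why a Q4 quantization of a 70B model fits on hardware that could never hold the FP16 weights.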

Batch processing

Tune your inference server’s batch size and maximum number of parallel requests for your hardware.

Example configurations

Development setup (fast iteration)

.opencode.json
{
  "agents": {
    "coder": {
      "model": "local.qwen2.5-coder:7b-instruct-q4_K_M",
      "maxTokens": 4096
    },
    "task": {
      "model": "local.qwen2.5-coder:7b-instruct-q4_K_M",
      "maxTokens": 2048
    },
    "title": {
      "model": "local.granite-3.3-2b-instruct@q8_0",
      "maxTokens": 80
    }
  }
}

Production setup (high quality)

.opencode.json
{
  "agents": {
    "coder": {
      "model": "local.llama-3.3-70b-instruct",
      "maxTokens": 8192
    },
    "task": {
      "model": "local.qwen2.5-coder:32b-instruct",
      "maxTokens": 4096
    },
    "title": {
      "model": "local.llama-3.3-70b-instruct",
      "maxTokens": 80
    }
  }
}
