Overview
The local provider enables you to:
- Run models on your own hardware
- Use custom inference servers (Ollama, LM Studio, vLLM, etc.)
- Experiment with fine-tuned or custom models
- Maintain full control over your data and privacy
Your self-hosted provider must implement an OpenAI-compatible API for tool calling and streaming.
Configuration
Setting the endpoint
Configure your self-hosted provider endpoint using the LOCAL_ENDPOINT environment variable. The URL should point to your server's OpenAI-compatible base path (typically ending in /v1).
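For example, pointing at a local Ollama server (Ollama's OpenAI-compatible API lives under /v1 on port 11434 by default; adjust host and port for your server):

```shell
# Ollama's default OpenAI-compatible base URL;
# change the port for LM Studio (1234), vLLM (8000), etc.
export LOCAL_ENDPOINT=http://localhost:11434/v1
```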
Configuring models
Once the endpoint is set, configure your agents to use local models in your .opencode.json:
.opencode.json
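The exact schema depends on your OpenCode version; a sketch, assuming agents are mapped to models by name (model names are illustrative):

```json
{
  "agents": {
    "coder": { "model": "local.llama3.3" },
    "task": { "model": "local.qwen2.5-coder" },
    "title": { "model": "local.llama3.2" }
  }
}
```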
Model names use the format local.<model-name>, where <model-name> matches the model identifier in your inference server.
Popular inference servers
Ollama
Ollama provides an easy way to run models locally.
Install Ollama
Download and install Ollama from ollama.ai
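After installing, a typical workflow is to pull a model and make sure the server is running (the model name is an example; Ollama's API listens on port 11434 by default):

```shell
# Download a model to run locally
ollama pull llama3.3

# Start the server if it isn't already running as a background service
ollama serve
```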
LM Studio
LM Studio provides a desktop application for running models.
Install LM Studio
Download and install LM Studio from lmstudio.ai
vLLM
vLLM is a high-throughput inference server.
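vLLM exposes an OpenAI-compatible server directly; a typical launch might look like this (the model name is an example, and the default port is 8000 — older vLLM versions use `python -m vllm.entrypoints.openai.api_server` instead):

```shell
vllm serve meta-llama/Llama-3.3-70B-Instruct --port 8000
```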
Text Generation WebUI
Text Generation WebUI provides an OpenAI-compatible API extension.
Advanced configuration
Reasoning effort for reasoning models
Some self-hosted models support reasoning modes. Configure reasoning effort if your model supports it:
.opencode.json
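A sketch, assuming a reasoningEffort field on the agent's model config (the key name and model are illustrative and may differ in your OpenCode version):

```json
{
  "agents": {
    "coder": {
      "model": "local.deepseek-r1",
      "reasoningEffort": "medium"
    }
  }
}
```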
- low - Faster responses with less reasoning
- medium - Balanced approach (default)
- high - More thorough reasoning
Custom headers
If your inference server requires custom headers, you can pass them through OpenCode’s provider options (requires code modification).
Authentication
For self-hosted endpoints that require authentication, you can set an API key:
.opencode.json
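A sketch, assuming the provider accepts an apiKey option (key names, endpoint, and the placeholder key are illustrative):

```json
{
  "providers": {
    "local": {
      "endpoint": "https://my-inference-server.example.com/v1",
      "apiKey": "your-api-key"
    }
  }
}
```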
Model requirements
For the best experience with OpenCode, your self-hosted model should support:
Tool calling
OpenCode relies heavily on tool/function calling for file operations, code execution, and more. Your model must support the OpenAI tools API format.
Models known to work well:
- Llama 3.3 70B Instruct
- Qwen 2.5 Coder
- Granite 3.1 (IBM)
- Mistral Large
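A quick sanity check is to send a minimal tools request to your OpenAI-compatible endpoint and confirm the response contains a tool_calls entry (host, port, model, and the function definition are examples):

```shell
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.3",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
```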
Streaming
Streaming responses provide a better user experience. Your inference server should support server-sent events (SSE) streaming.
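You can verify SSE support by requesting a completion with "stream": true; a conforming server responds with a series of data: chunks terminated by data: [DONE] (host and model are examples):

```shell
curl -N http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.3", "messages": [{"role": "user", "content": "Hi"}], "stream": true}'
```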
Context window
A larger context window (32K+ tokens) is recommended for complex coding tasks. Configure maxTokens according to your model’s capabilities.
Troubleshooting
Connection refused error
Symptoms: Cannot connect to local endpoint
Solution:
- Verify your inference server is running
- Check the endpoint URL is correct (including port and /v1 path)
- Ensure there are no firewall rules blocking the connection
Model not found
Symptoms: Error indicating the model doesn’t exist
Solution:
- Check your model name matches exactly what’s loaded in your inference server
- List available models:
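For an OpenAI-compatible server, the models endpoint lists what is loaded; with Ollama you can also use its CLI (host and port are examples):

```shell
curl -s http://localhost:11434/v1/models

# or, for Ollama specifically:
ollama list
```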
Tool calling not working
Symptoms: The model doesn’t use tools or returns invalid tool calls
Solution:
- Ensure your model supports function/tool calling
- Check that your inference server properly implements the OpenAI tools API
- Try a different model known to support tool calling (e.g., Llama 3.3)
Slow response times
Symptoms: Responses take a long time to generate
Solution:
- Use a smaller, faster model for the task and title agents
- Enable GPU acceleration in your inference server
- Reduce maxTokens for faster responses
- Consider quantized models (Q4, Q8) for better performance
Out of memory errors
Symptoms: Inference server crashes or returns memory errors
Solution:
- Use a smaller model or quantized version
- Reduce the context window size
- Allocate more RAM/VRAM to your inference server
- Enable CPU offloading if using GPU inference
Performance tips
Use appropriate model sizes
- Large models (70B+) for complex tasks (coder agent)
- Small models (7B-13B) for simple tasks (title agent)
Enable GPU acceleration
Most inference servers support GPU acceleration for significant speed improvements.
Quantization
Use quantized models (Q4_K_M, Q8_0) to reduce memory usage while maintaining quality.
Batch processing
Configure your inference server for optimal batch size and parallel requests.
Example configurations
Development setup (fast iteration)
.opencode.json
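A sketch of a fast-iteration setup using small models (model names, keys, and token limits are illustrative; the schema may differ in your OpenCode version):

```json
{
  "agents": {
    "coder": { "model": "local.qwen2.5-coder", "maxTokens": 4096 },
    "task": { "model": "local.llama3.2", "maxTokens": 2048 },
    "title": { "model": "local.llama3.2", "maxTokens": 256 }
  }
}
```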
Production setup (high quality)
.opencode.json
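A sketch of a quality-focused setup using larger models (model names, keys, and token limits are illustrative; the schema may differ in your OpenCode version):

```json
{
  "agents": {
    "coder": { "model": "local.llama3.3-70b-instruct", "maxTokens": 8192 },
    "task": { "model": "local.qwen2.5-coder-32b", "maxTokens": 4096 },
    "title": { "model": "local.llama3.2", "maxTokens": 256 }
  }
}
```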