## Ollama

Ollama is a popular tool for running LLMs locally. docker-agent includes a built-in `ollama` alias for easy configuration.
### Setup

- Install Ollama from [ollama.ai](https://ollama.ai).
- Pull a model.
- Start the Ollama server (runs automatically after install on most platforms).
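The steps above can be run from a terminal. The model name below (`llama3.2`) is just an example; pull whichever model you plan to use:

```shell
# Download a model from the Ollama registry
ollama pull llama3.2

# Start the server manually if it is not already running
ollama serve
```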
### Configuration

Use the built-in `ollama` alias; no API key is required.

The `ollama` alias uses:

- Base URL: `http://localhost:11434/v1`
- API type: OpenAI-compatible
- Auth: none required
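Ollama serves an OpenAI-compatible API under `/v1`. A quick sanity check, assuming the default port, is to list the locally available models:

```shell
# Returns a JSON list of pulled models if the server is up
curl http://localhost:11434/v1/models
```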
### Custom host or port

If Ollama runs on a different host or port, define a named model with an explicit `base_url`:
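docker-agent's exact config schema is not reproduced in this section, so treat the snippet below as a hypothetical sketch: it assumes a YAML file with a `models` map whose entries accept the `base_url` field described above. The host address is a placeholder.

```yaml
# Hypothetical sketch; field names are illustrative, not a verified schema
models:
  remote-ollama:
    base_url: http://192.168.1.50:11434/v1
    model: llama3.2
```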
### Popular Ollama models

| Model | Size | Best for |
|---|---|---|
| `llama3.2` | 3B | General purpose, fast |
| `llama3.1` | 8B | Better reasoning |
| `qwen2.5-coder` | 7B | Code generation |
| `mistral` | 7B | General purpose |
| `codellama` | 7B | Code tasks |
| `deepseek-coder` | 6.7B | Code generation |
## vLLM

vLLM is a high-performance inference server optimized for throughput.

### Setup
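A typical install-and-serve sequence looks like the following. The model name is only an example; recent vLLM releases provide the `vllm serve` entry point, which starts an OpenAI-compatible server on port 8000 by default:

```shell
pip install vllm

# Start the OpenAI-compatible server (default: http://localhost:8000/v1)
vllm serve Qwen/Qwen2.5-7B-Instruct
```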
### Configuration
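As with Ollama, a named model with an explicit `base_url` should point at the server. The YAML below is a hypothetical sketch of such an entry, assuming vLLM's default port; adapt the field names to docker-agent's actual schema:

```yaml
# Hypothetical sketch; not a verified schema
models:
  vllm-local:
    base_url: http://localhost:8000/v1
    model: Qwen/Qwen2.5-7B-Instruct
```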
## LocalAI

LocalAI provides an OpenAI-compatible API that works with various model backends.

### Setup
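LocalAI is commonly run as a container. The image and port below match the project's published defaults, but check the LocalAI docs for your platform, since CPU and GPU images differ:

```shell
docker run -p 8080:8080 localai/localai:latest
```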
### Configuration
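LocalAI serves its OpenAI-compatible API on port 8080 by default, so a named model's `base_url` would point at `http://localhost:8080/v1`. Verify the server is reachable first:

```shell
# Lists the models LocalAI has loaded; confirms the endpoint is reachable
curl http://localhost:8080/v1/models
```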
## Any OpenAI-compatible server

For any server that implements the `/v1/chat/completions` endpoint:
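The request shape is the same regardless of backend. The Python sketch below builds a standard chat-completions request using only the standard library; the base URL and model name in the comments are placeholders for whatever your server exposes:

```python
import json
import urllib.request

def chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a request for the standard /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# Sending the request requires a running server, for example:
#   req = chat_request("http://localhost:11434/v1", "llama3.2", "Say hello")
#   with urllib.request.urlopen(req) as resp:
#       print(json.loads(resp.read())["choices"][0]["message"]["content"])
```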
## Performance considerations
- Memory: Larger models need more RAM or VRAM. A 7B model typically requires 8–16 GB RAM.
- GPU: GPU acceleration dramatically improves inference speed. Check your server’s GPU support.
- Context length: Local models often have smaller context windows than cloud models.
- Tool calling: Not all local models support function/tool calling. Test your model’s capabilities before deploying.
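One way to probe tool-calling support is to send a request with a `tools` array in the standard OpenAI function-calling shape and check whether the model answers with a `tool_calls` entry. The endpoint, model, and function below are placeholders:

```shell
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "What time is it in Tokyo?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_time",
        "description": "Get the current time in a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
```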
## Example: offline development agent
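The config below is a hypothetical sketch of what such an agent might look like, assuming a local code model served by Ollama; the field names are illustrative, not docker-agent's verified schema:

```yaml
# Hypothetical sketch: a fully local agent with no cloud dependency
models:
  offline-coder:
    base_url: http://localhost:11434/v1
    model: qwen2.5-coder
    max_tokens: 2048
```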
## Troubleshooting
Connection refused: Ensure your model server is running and accessible.

Responses cut off: Increase `max_tokens` in your config.
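A quick reachability check; adjust the host and port to match your config (`11434` is Ollama's default):

```shell
curl -sf http://localhost:11434/v1/models > /dev/null \
  && echo "server reachable" \
  || echo "server unreachable"
```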