## Supported Local Providers
- Ollama: Easy local model deployment
- vLLM: High-performance inference server
- Custom Endpoints: Any OpenAI-compatible API
## Ollama
Ollama provides the easiest way to run models locally.

### 1. Install Ollama
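A typical install sketch (the official script covers Linux and macOS; Windows users can download the installer from ollama.com):

```shell
# Official install script (Linux/macOS)
curl -fsSL https://ollama.com/install.sh | sh

# Verify the daemon is reachable (default port 11434)
curl http://localhost:11434/api/tags
```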
### 2. Pull Models
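For example (the model names below are illustrative; any model from the Ollama library works):

```shell
# Download model weights on first use
ollama pull llama3.1
ollama pull qwen2.5:14b

# Confirm what is installed
ollama list
```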
### 3. Configure Weaver
Add to `~/.weaver/config.json`:
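A minimal sketch of the provider entry; the exact field names (`provider`, `endpoint`) are assumptions, so check them against your Weaver version's configuration reference:

```json
{
  "provider": "ollama",
  "endpoint": "http://localhost:11434/v1",
  "api_key": ""
}
```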
An API key is not required for Ollama; the `api_key` field can be empty or omitted.

### 4. Usage
### Model Name Format
Ollama models use the format `ollama/model-name:tag`:

- `ollama/llama3.1` - Latest Llama 3.1
- `ollama/qwen2.5:14b` - Qwen 2.5 14B parameter version
- `ollama/mistral:7b-instruct` - Mistral 7B Instruct

Weaver strips the `ollama/` prefix when sending requests to the Ollama API.
Source: `pkg/providers/http_provider.go:55-62`
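The prefix handling can be sketched in Python (illustrative only; the actual implementation is the Go code referenced above):

```python
def to_ollama_model(name: str) -> str:
    """Strip the 'ollama/' namespace prefix before calling the Ollama API."""
    prefix = "ollama/"
    return name[len(prefix):] if name.startswith(prefix) else name

print(to_ollama_model("ollama/qwen2.5:14b"))  # qwen2.5:14b
```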
## vLLM
vLLM is a high-performance inference server for LLMs.

### 1. Install vLLM
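vLLM is distributed as a Python package (GPU inference additionally needs a recent CUDA-capable environment):

```shell
pip install vllm
```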
### 2. Start Server
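A sketch of starting the OpenAI-compatible server (the model name is an example; any HuggingFace model vLLM supports works):

```shell
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-14B-Instruct \
  --port 8000
```

Recent vLLM releases also provide the shorter `vllm serve <model>` entry point.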
### 3. Configure Weaver
Add to `~/.weaver/config.json`:
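A sketch pointing Weaver at the vLLM server started above; as with the Ollama example, the field names other than `api_key` are assumptions:

```json
{
  "provider": "vllm",
  "endpoint": "http://localhost:8000/v1",
  "api_key": ""
}
```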
### 4. Usage
## Custom OpenAI-Compatible Endpoints
Weaver works with any OpenAI-compatible API endpoint.

### Local Server Examples
#### LocalAI

#### LM Studio

#### Jan
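Each of these servers exposes an OpenAI-compatible endpoint on its own port. The defaults below are assumptions based on each app's usual configuration (LocalAI 8080, LM Studio 1234, Jan 1337), so verify the port in the app's settings:

```shell
# Probe a local OpenAI-compatible server, e.g. LM Studio on its default port
curl http://localhost:1234/v1/models
```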
## Configuration Options
- `api_key`: API key for authentication (optional for local servers)
- Endpoint: local server endpoint URL (e.g., `http://localhost:11434/v1`)
- Proxy: HTTP/HTTPS proxy URL (optional)
### Model Parameters
Configure model behavior:

- `max_tokens`: maximum tokens in the response
- `temperature`: controls randomness (0.0 = deterministic, 2.0 = very random)
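A config fragment showing the two parameters together (the values are illustrative defaults, not recommendations):

```json
{
  "max_tokens": 1024,
  "temperature": 0.7
}
```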
## Popular Local Models
### Llama Family

### Qwen Family

### Other Models
## Implementation Details
Weaver uses the HTTPProvider for all local model providers:

- OpenAI-compatible API format
- Standard `/chat/completions` endpoint
- Automatic model namespace handling
- Tool calling support (if supported by the server)

Source: `pkg/providers/http_provider.go`
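The request format is the standard OpenAI chat-completions shape; a minimal Python sketch of building such a request body (POST it to `<endpoint>/chat/completions` with `Content-Type: application/json`):

```python
import json

def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> str:
    """Serialize an OpenAI-format /chat/completions request body."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return json.dumps(body)

print(build_chat_request("ollama/llama3.1", "Hello"))
```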
## Automatic Provider Detection
Weaver automatically uses Ollama when the model name carries the `ollama/` prefix.

### Model Name Stripping

The `ollama/` prefix is removed before the request is sent to the Ollama API.
## Hardware Requirements

### Model Size Guidelines
| Model Size | RAM Required | GPU VRAM | Example Models |
|---|---|---|---|
| 7B params | 8GB | 6GB | Llama 3.1 8B, Mistral 7B |
| 13-14B params | 16GB | 12GB | Qwen 2.5 14B |
| 30-34B params | 32GB | 24GB | Mixtral 8x7B |
| 70B params | 64GB | 48GB | Llama 3.1 70B |
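The RAM/VRAM figures above roughly correspond to quantized weights plus runtime overhead. As a sanity check, weight memory alone can be estimated with a simple rule of thumb (this ignores KV cache and framework overhead, so treat it as a lower bound):

```python
def est_weight_memory_gb(params_billion: float, bits_per_weight: int = 16) -> float:
    """Rough estimate of weight memory only: params * bytes per weight."""
    return params_billion * bits_per_weight / 8

print(est_weight_memory_gb(7, 4))   # a 4-bit quantized 7B model -> 3.5 GB of weights
print(est_weight_memory_gb(70, 4))  # a 4-bit quantized 70B model -> 35.0 GB of weights
```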
## Performance Tips
- Use GPU acceleration for faster inference
- Quantize models (4-bit, 8-bit) to reduce memory usage
- Use smaller models for development and testing
- Batch requests when processing multiple prompts
## Troubleshooting

### Ollama Issues

### vLLM Issues

### Common Errors
#### Connection Refused
- Verify the server is running
- Check the port number is correct
- Ensure no firewall is blocking the connection
- Try `curl http://localhost:PORT/v1/models`
#### Out of Memory
- Use a smaller model
- Enable quantization (4-bit or 8-bit)
- Reduce `max_tokens` in the configuration
- Close other applications
#### Slow Inference
- Use GPU acceleration if available
- Try a smaller model
- Increase vLLM tensor parallel size
- Reduce context window size
#### Model Not Found
- For Ollama: run `ollama pull model-name`
- For vLLM: verify the model name matches the HuggingFace repository
- Check `ollama list` to see installed models
## Privacy Benefits
Running models locally provides:

- Complete privacy: data never leaves your machine
- No API costs: No per-token pricing
- Offline operation: Works without internet
- Full control: Customize models and parameters
- No rate limits: Process as many requests as your hardware allows
## Next Steps
- Provider Overview: back to all providers
- Model Selection: choose the right model