Weaver supports running AI models locally for privacy, cost savings, and offline operation. This guide covers Ollama, vLLM, and custom OpenAI-compatible endpoints.

Supported Local Providers

  • Ollama: Easy local model deployment
  • vLLM: High-performance inference server
  • Custom Endpoints: Any OpenAI-compatible API

Ollama

Ollama provides the easiest way to run models locally.

1. Install Ollama

# macOS / Linux
curl -fsSL https://ollama.ai/install.sh | sh

# Windows
# Download from https://ollama.ai/download

2. Pull Models

# Pull Llama 3.1
ollama pull llama3.1

# Pull Qwen 2.5
ollama pull qwen2.5:14b

# Pull Mistral
ollama pull mistral

3. Configure Weaver

Add to ~/.weaver/config.json:
{
  "providers": {
    "ollama": {
      "api_key": "",
      "api_base": "http://localhost:11434/v1"
    }
  },
  "agents": {
    "defaults": {
      "provider": "ollama",
      "model": "ollama/llama3.1"
    }
  }
}
An API key is not required for Ollama; the api_key field can be empty or omitted.

4. Usage

# Use Llama 3.1
weaver chat --model ollama/llama3.1

# Use Qwen 2.5
weaver chat --model ollama/qwen2.5:14b

# Use Mistral
weaver chat --model ollama/mistral

Model Name Format

Ollama models use the format ollama/model-name:tag:
  • ollama/llama3.1 - Latest Llama 3.1
  • ollama/qwen2.5:14b - Qwen 2.5 14B parameter version
  • ollama/mistral:7b-instruct - Mistral 7B Instruct
Weaver automatically strips the ollama/ prefix before sending requests to the Ollama API. Source: pkg/providers/http_provider.go:55-62

vLLM

vLLM is a high-performance inference server for LLMs.

1. Install vLLM

pip install vllm

2. Start Server

# Start with Llama 3.1 8B
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --port 8000

# With GPU acceleration
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 2 \
  --port 8000

3. Configure Weaver

Add to ~/.weaver/config.json:
{
  "providers": {
    "vllm": {
      "api_key": "",
      "api_base": "http://localhost:8000/v1"
    }
  },
  "agents": {
    "defaults": {
      "provider": "vllm",
      "model": "meta-llama/Llama-3.1-8B-Instruct"
    }
  }
}

4. Usage

weaver chat --model meta-llama/Llama-3.1-8B-Instruct

Custom OpenAI-Compatible Endpoints

Weaver works with any OpenAI-compatible API endpoint. The examples below reuse the vllm provider entry, since any server that speaks the OpenAI API can be configured through it.

Local Server Examples

LocalAI

{
  "providers": {
    "vllm": {
      "api_key": "",
      "api_base": "http://localhost:8080/v1"
    }
  }
}

LM Studio

{
  "providers": {
    "vllm": {
      "api_key": "",
      "api_base": "http://localhost:1234/v1"
    }
  }
}

Jan

{
  "providers": {
    "vllm": {
      "api_key": "",
      "api_base": "http://localhost:1337/v1"
    }
  }
}

Configuration Options

  • api_key (string): API key for authentication; optional for local servers
  • api_base (string): Local server endpoint URL (e.g., http://localhost:11434/v1)
  • proxy (string): HTTP/HTTPS proxy URL (optional)
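Putting the three options together, a provider entry that routes traffic through a proxy might look like this (the proxy URL is purely illustrative):

```json
{
  "providers": {
    "ollama": {
      "api_key": "",
      "api_base": "http://localhost:11434/v1",
      "proxy": "http://proxy.internal:3128"
    }
  }
}
```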

Model Parameters

Configure model behavior:
{
  "agents": {
    "defaults": {
      "model": "ollama/llama3.1",
      "max_tokens": 4096,
      "temperature": 0.7
    }
  }
}
  • max_tokens (integer, default 4096): Maximum number of tokens in the response
  • temperature (float, default 0.7): Controls randomness (0.0 = deterministic, 2.0 = very random)

Recommended Models

Llama Family

# Llama 3.1 8B (Recommended)
ollama pull llama3.1

# Llama 3.1 70B (High capability)
ollama pull llama3.1:70b

# Llama 3.1 405B (Largest)
ollama pull llama3.1:405b

Qwen Family

# Qwen 2.5 7B
ollama pull qwen2.5

# Qwen 2.5 14B
ollama pull qwen2.5:14b

# Qwen 2.5 32B
ollama pull qwen2.5:32b

Other Models

# Mistral 7B
ollama pull mistral

# Mixtral 8x7B
ollama pull mixtral

# DeepSeek Coder
ollama pull deepseek-coder

# Phi-3
ollama pull phi3

Implementation Details

Weaver uses the HTTPProvider for all local model providers:
  • OpenAI-compatible API format
  • Standard /chat/completions endpoint
  • Automatic model namespace handling
  • Tool calling support (if supported by server)
Source: pkg/providers/http_provider.go

Automatic Provider Detection

Weaver automatically uses Ollama when:
// From http_provider.go:442-450
case (strings.Contains(lowerModel, "ollama") || strings.HasPrefix(model, "ollama/")) && cfg.Providers.Ollama.APIKey != "":
  apiKey = cfg.Providers.Ollama.APIKey
  apiBase = cfg.Providers.Ollama.APIBase
  proxy = cfg.Providers.Ollama.Proxy
  if apiBase == "" {
    apiBase = "http://localhost:11434/v1"
  }

Model Name Stripping

// From http_provider.go:55-62
if idx := strings.Index(model, "/"); idx != -1 {
  prefix := model[:idx]
  if prefix == "moonshot" || prefix == "nvidia" || prefix == "groq" || prefix == "ollama" {
    model = model[idx+1:]
  }
}
This strips known provider prefixes from the model name before sending it to the API; unrecognized prefixes are left untouched.

Hardware Requirements

Model Size Guidelines

| Model Size | RAM Required | GPU VRAM | Example Models |
|---|---|---|---|
| 7B params | 8GB | 6GB | Llama 3.1 8B, Mistral 7B |
| 13-14B params | 16GB | 12GB | Qwen 2.5 14B |
| 30-34B params | 32GB | 24GB | Mixtral 8x7B |
| 70B params | 64GB | 48GB | Llama 3.1 70B |

Performance Tips

  1. Use GPU acceleration for faster inference
  2. Quantize models (4-bit, 8-bit) to reduce memory usage
  3. Use smaller models for development and testing
  4. Batch requests when processing multiple prompts

Troubleshooting

Ollama Issues

# Check if Ollama is running
curl http://localhost:11434/v1/models

# Start (or restart) the Ollama server
ollama serve

# List installed models
ollama list

vLLM Issues

# Check server status
curl http://localhost:8000/v1/models

# List the server's command-line options
python -m vllm.entrypoints.openai.api_server --help

# Increase timeout for large models
export VLLM_TIMEOUT=600

Common Errors

Connection refused
  • Verify the server is running
  • Check the port number is correct
  • Ensure no firewall is blocking the connection
  • Try curl http://localhost:PORT/v1/models

Out of memory
  • Use a smaller model
  • Enable quantization (4-bit or 8-bit)
  • Reduce max_tokens in configuration
  • Close other applications

Slow inference
  • Use GPU acceleration if available
  • Try a smaller model
  • Increase vLLM tensor parallel size
  • Reduce context window size

Model not found
  • For Ollama: run ollama pull model-name
  • For vLLM: verify the model name matches its HuggingFace identifier
  • Check ollama list to see installed models

Privacy Benefits

Running models locally provides:
  • Complete privacy: Data never leaves your machine
  • No API costs: No per-token pricing
  • Offline operation: Works without internet
  • Full control: Customize models and parameters
  • No rate limits: Process as many requests as your hardware allows

Next Steps

Provider Overview

Back to all providers

Model Selection

Choose the right model
