Goose supports local inference using llama.cpp, enabling offline usage, data privacy, and cost savings. This guide covers setup, configuration, and optimization.
Overview
Local inference allows you to:
- Run offline: No internet required once models are downloaded
- Preserve privacy: Data never leaves your machine
- Eliminate API costs: No per-token charges
- Customize models: Use fine-tuned or specialized models
- Control resources: Manage GPU/CPU usage precisely
Quick Start
1. Install Prerequisites
macOS/Linux:
# llama.cpp is bundled with Goose
# Just ensure you have enough disk space for models
GPU Acceleration (Optional):
# CUDA (NVIDIA)
sudo apt install nvidia-cuda-toolkit
# ROCm (AMD)
sudo apt install rocm-hip-runtime
# Metal (macOS)
# Built-in, no installation needed
2. Download a Model
Goose supports GGUF format models from Hugging Face:
# Download a recommended model (Qwen 2.5 Coder)
goose download-model qwen2.5-coder-7b-instruct
# Or specify a Hugging Face repo
goose download-model Qwen/Qwen2.5-Coder-7B-Instruct-GGUF
Models are stored in ~/.cache/goose/models/.
3. Configure Goose
# Set provider to local
goose configure set GOOSE_PROVIDER local
# Specify model
goose configure set GOOSE_MODEL qwen2.5-coder-7b-instruct
4. Run Goose
Goose will automatically:
- Load the model into memory
- Allocate GPU/CPU resources
- Initialize the inference engine
Supported Models
Goose works with any GGUF model, but these are recommended:
Coding Models
| Model | Size | Context | Best For |
|---|---|---|---|
| Qwen 2.5 Coder 7B | 7B | 32K | General coding, fast |
| Qwen 2.5 Coder 14B | 14B | 32K | Complex tasks, accurate |
| DeepSeek Coder V2 16B | 16B | 128K | Long context, architecture |
| CodeLlama 13B | 13B | 16K | Python, legacy |
General Purpose
| Model | Size | Context | Best For |
|---|---|---|---|
| Llama 3.1 8B | 8B | 128K | Balanced performance |
| Mistral 7B | 7B | 32K | Fast, efficient |
| Phi-3 Mini | 3.8B | 128K | Low memory |
Model sizes are approximate. Quantized versions (Q4, Q5, Q6) reduce memory usage at the cost of slight accuracy loss.
Quantization Levels
GGUF models come in various quantization levels:
| Quantization | Size | Quality | Speed | Recommended For |
|---|---|---|---|---|
| Q8_0 | 100% | Best | Slow | High-VRAM setups |
| Q6_K | 75% | Excellent | Medium | Balanced setups |
| Q5_K_M | 65% | Very good | Fast | Most users |
| Q4_K_M | 50% | Good | Faster | Low-VRAM setups |
| Q3_K_M | 40% | Acceptable | Fastest | Memory-constrained setups |
Example: Download specific quantization:
goose download-model Qwen/Qwen2.5-Coder-7B-Instruct-GGUF --quant Q5_K_M
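As a rough aid, the size column above can be turned into an expected file size. The sketch below is illustrative Python, not part of Goose; the 1-byte-per-parameter Q8_0 baseline is an assumption, and real GGUF files carry some metadata overhead.

```python
# Back-of-envelope GGUF file-size estimate based on the table above.
# The ~1 byte/parameter figure for the Q8_0 baseline is an assumption,
# so treat the results as rough.
QUANT_FRACTION = {"Q8_0": 1.00, "Q6_K": 0.75, "Q5_K_M": 0.65,
                  "Q4_K_M": 0.50, "Q3_K_M": 0.40}

def estimated_file_gb(params_billion: float, quant: str) -> float:
    """Estimate model file size in GB for a parameter count and quant level."""
    base_gb = params_billion * 1.0  # ~1 byte per parameter at Q8_0 (assumed)
    return base_gb * QUANT_FRACTION[quant]

# A 7B model at Q5_K_M comes out around 4.5 GB
print(f"{estimated_file_gb(7, 'Q5_K_M'):.1f} GB")
```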
Configuration
Model Settings
Configure in ~/.config/goose/config.yaml:
GOOSE_PROVIDER: local
GOOSE_MODEL: qwen2.5-coder-7b-instruct

# Local inference settings
local_inference:
  # Context size (tokens)
  context_size: 8192

  # Generation settings
  temperature: 0.7
  top_p: 0.95
  top_k: 40
  repeat_penalty: 1.1

  # Performance
  n_batch: 512
  n_threads: 8
  flash_attention: true

  # GPU offload (0 = CPU only, -1 = all layers)
  n_gpu_layers: -1
Environment Variables
# Force CPU inference
export GOOSE_LOCAL_CPU_ONLY=1
# Set GPU layers manually
export GOOSE_N_GPU_LAYERS=32
# Model cache directory
export GOOSE_MODEL_CACHE=~/.cache/goose/models
Memory Management
Goose automatically estimates memory requirements and adjusts context size.
Memory Estimation
Abridged from crates/goose/src/providers/local_inference/inference_engine.rs:
pub fn estimate_max_context_for_memory(
    model: &LlamaModel,
    runtime: &InferenceRuntime,
) -> Option<usize> {
    let available = available_inference_memory_bytes(runtime);
    // Reserve 50% for computation buffers
    let usable = (available as f64 * 0.5) as u64;
    // Calculate KV cache size per token (per-head K/V sizes and layer
    // counts come from the model metadata; that derivation is elided here)
    let bytes_per_token = (k_per_head + v_per_head) * n_head_kv * n_layer * 2;
    Some((usable / bytes_per_token) as usize)
}
Memory Requirements
Typical requirements for Q5_K_M quantization:
| Model Size | Model RAM | Context (8K) | Context (32K) | Total |
|---|---|---|---|---|
| 3B | 2.5 GB | 0.5 GB | 2 GB | 3-5 GB |
| 7B | 5 GB | 1 GB | 4 GB | 6-9 GB |
| 13B | 9 GB | 2 GB | 8 GB | 11-17 GB |
| 34B | 22 GB | 5 GB | 20 GB | 27-42 GB |
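The estimation logic can be sketched numerically. The Python below is an illustrative model of the same arithmetic, not Goose's code; the head dimension, KV-head count, and layer count are example values assumed for a Llama 3.1 8B-class model.

```python
# Illustrative model of the KV-cache sizing arithmetic used for context
# estimation (not Goose's code). Shape values in the example are assumed.

def kv_cache_bytes_per_token(head_dim: int, n_head_kv: int, n_layer: int,
                             bytes_per_elem: int = 2) -> int:
    """One token's KV cost: a K and a V vector per KV head, per layer (fp16)."""
    return 2 * head_dim * bytes_per_elem * n_head_kv * n_layer

def max_context_for_memory(available_bytes: int, head_dim: int,
                           n_head_kv: int, n_layer: int) -> int:
    """Reserve 50% of free memory for compute buffers, spend the rest on KV."""
    usable = available_bytes // 2
    return usable // kv_cache_bytes_per_token(head_dim, n_head_kv, n_layer)

# 8 GB free, 128-dim heads, 8 KV heads, 32 layers -> 32768-token budget
print(max_context_for_memory(8 * 1024**3, 128, 8, 32))
```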
If your prompt exceeds available memory, Goose will return an error: "Prompt exceeds estimated memory capacity". Reduce context size or use a smaller model.
GPU Acceleration
Check GPU usage:
# NVIDIA
nvidia-smi
# AMD
rocm-smi
# macOS
sudo powermetrics --samplers gpu_power
Optimize GPU layers:
# Start with all layers
n_gpu_layers: -1
# If OOM, reduce gradually
n_gpu_layers: 32 # Offload 32 layers to GPU
CPU Optimization
# Use all CPU cores
n_threads: -1 # Auto-detect
# Or specify manually
n_threads: 16
# Batch size (larger = faster, more memory)
n_batch: 512
Flash Attention
Enable with `flash_attention: true` for 2-3x faster inference on supported hardware.
Requires one of:
- CUDA compute capability ≥ 7.0 (RTX 20 series+)
- Metal (macOS M1+)
- ROCm 5.0+
Tool Calling
Local models support tool calling through two modes:
Native Tool Calling
Models trained with native tool calling (e.g., Qwen 2.5) are detected automatically:
// Automatically detected from model metadata
if model.supports_native_tools() {
    // Use built-in tool calling
}
Emulated Tool Calling
For models without native support, Goose emulates tool calling:
// Inject tool definitions into system prompt
let prompt_with_tools = format!(
    "{}\n\nAvailable tools:\n{}",
    system_prompt,
    tool_definitions
);
Native tool calling is more reliable. Choose models like Qwen 2.5, Mistral, or Llama 3.1 for best results.
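The emulated flow can be sketched end to end: inject the tool schema into the prompt, then parse a structured call out of the model's reply. This Python sketch is illustrative, not Goose's implementation; the JSON call format is an assumption.

```python
# Illustrative sketch of emulated tool calling (not Goose's code): tool
# definitions go into the system prompt as JSON, and a JSON object in the
# reply is parsed back into a structured call. The format is assumed.
import json
import re

def build_prompt(system_prompt: str, tools: list) -> str:
    """Append tool definitions so the model knows what it may call."""
    defs = "\n".join(json.dumps(t) for t in tools)
    return f"{system_prompt}\n\nAvailable tools:\n{defs}"

def parse_tool_call(reply: str):
    """Return a dict if the reply contains a JSON tool call, else None."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if not match:
        return None
    try:
        call = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return call if isinstance(call, dict) and "tool" in call else None

tools = [{"tool": "read_file", "args": {"path": "string"}}]
prompt = build_prompt("You are a coding agent.", tools)
print(parse_tool_call('{"tool": "read_file", "args": {"path": "main.rs"}}'))
```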
Sampling Strategies
Temperature Sampling (Default)
Balanced creativity and coherence:
sampling:
  type: temperature
  temperature: 0.7   # 0 = deterministic, 1 = creative
  top_p: 0.95        # Nucleus sampling
  top_k: 40          # Top-k sampling
  min_p: 0.05        # Minimum probability
Greedy Sampling
Always selects the most likely token, making output deterministic.
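A minimal config sketch; the `greedy` type name here is an assumption about the schema, not confirmed from Goose's source. Setting `temperature: 0` under temperature sampling achieves the same behavior:

```yaml
sampling:
  type: greedy   # assumed type name; temperature: 0 is equivalent
```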
Mirostat v2
Adaptive sampling for consistent perplexity:
sampling:
  type: mirostat_v2
  tau: 5.0   # Target perplexity
  eta: 0.1   # Learning rate
Troubleshooting
Model won't load
# Verify model file exists
ls ~/.cache/goose/models/
# Check format (must be .gguf)
file ~/.cache/goose/models/qwen2.5-coder-7b-instruct.gguf
# Re-download if corrupted
goose download-model qwen2.5-coder-7b-instruct --force
Out of Memory (OOM)
# Reduce context size
context_size: 4096 # Instead of 32768
# Use smaller quantization
# Download Q4_K_M instead of Q5_K_M
# Reduce GPU layers
n_gpu_layers: 24 # Instead of -1
Slow generation
# Increase batch size
n_batch: 1024 # Instead of 512
# Enable flash attention
flash_attention: true
# Use all CPU cores
n_threads: -1
# Offload more to GPU
n_gpu_layers: -1
Poor quality responses
# Use higher quantization (Q6 or Q8)
# Increase temperature
temperature: 0.8
# Adjust penalties
repeat_penalty: 1.15
frequency_penalty: 0.1
Advanced: Custom Model Registry
Define custom models in ~/.config/goose/local_models.yaml:
models:
  my-custom-model:
    path: /path/to/model.gguf
    context_size: 16384
    temperature: 0.7
    n_gpu_layers: -1
    description: "My fine-tuned model"
Use it:
goose configure set GOOSE_MODEL my-custom-model
Implementation Details
Source Code
- Engine:
crates/goose/src/providers/local_inference/inference_engine.rs
- Model registry:
crates/goose/src/providers/local_inference/local_model_registry.rs
- Native tools:
crates/goose/src/providers/local_inference/inference_native_tools.rs
- Emulated tools:
crates/goose/src/providers/local_inference/inference_emulated_tools.rs
- Hugging Face models:
crates/goose/src/providers/local_inference/hf_models.rs
llama.cpp Integration
Goose uses the llama-cpp-2 Rust bindings:
use llama_cpp_2::{
    context::LlamaContext,
    model::LlamaModel,
    sampling::LlamaSampler,
};
// Load model
let model = LlamaModel::load_from_file(path, params)?;
// Create context
let ctx = model.new_context(backend, ctx_params)?;
// Generate
let sampler = LlamaSampler::chain_simple(samplers);
let token = sampler.sample(&ctx, -1);
Resources