Goose supports local inference using llama.cpp, enabling offline usage, data privacy, and cost savings. This guide covers setup, configuration, and optimization.

Overview

Local inference allows you to:
  • Run offline: No internet required once models are downloaded
  • Preserve privacy: Data never leaves your machine
  • Eliminate API costs: No per-token charges
  • Customize models: Use fine-tuned or specialized models
  • Control resources: Manage GPU/CPU usage precisely

Quick Start

1. Install Prerequisites

macOS/Linux:
# llama.cpp is bundled with Goose
# Just ensure you have enough disk space for models
GPU Acceleration (Optional):
# CUDA (NVIDIA)
sudo apt install nvidia-cuda-toolkit

# ROCm (AMD)
sudo apt install rocm-hip-runtime

# Metal (macOS)
# Built-in, no installation needed

2. Download a Model

Goose supports GGUF format models from Hugging Face:
# Download a recommended model (Qwen 2.5 Coder)
goose download-model qwen2.5-coder-7b-instruct

# Or specify a Hugging Face repo
goose download-model Qwen/Qwen2.5-Coder-7B-Instruct-GGUF
Models are stored in ~/.cache/goose/models/.

3. Configure Goose

# Set provider to local
goose configure set GOOSE_PROVIDER local

# Specify model
goose configure set GOOSE_MODEL qwen2.5-coder-7b-instruct

4. Run Goose

goose session start
Goose will automatically:
  1. Load the model into memory
  2. Allocate GPU/CPU resources
  3. Initialize the inference engine

Supported Models

Goose works with any GGUF model, but these are recommended:

Coding Models

| Model | Size | Context | Best For |
|---|---|---|---|
| Qwen 2.5 Coder 7B | 7B | 32K | General coding, fast |
| Qwen 2.5 Coder 14B | 14B | 32K | Complex tasks, accurate |
| DeepSeek Coder V2 16B | 16B | 128K | Long context, architecture |
| CodeLlama 13B | 13B | 16K | Python, legacy |

General Purpose

| Model | Size | Context | Best For |
|---|---|---|---|
| Llama 3.1 8B | 8B | 128K | Balanced performance |
| Mistral 7B | 7B | 32K | Fast, efficient |
| Phi-3 Mini | 3.8B | 128K | Low memory |
Model sizes are approximate. Quantized versions (Q4, Q5, Q6) reduce memory usage at the cost of slight accuracy loss.

Quantization Levels

GGUF models come in various quantization levels:
| Quantization | Size | Quality | Speed | Recommended For |
|---|---|---|---|---|
| Q8_0 | 100% | Best | Slow | High VRAM |
| Q6_K | 75% | Excellent | Medium | Balanced |
| Q5_K_M | 65% | Very Good | Fast | Recommended |
| Q4_K_M | 50% | Good | Faster | Low VRAM |
| Q3_K_M | 40% | Acceptable | Fastest | Constrained |
Example: Download specific quantization:
goose download-model Qwen/Qwen2.5-Coder-7B-Instruct-GGUF --quant Q5_K_M
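As a back-of-envelope check on the size column, file size scales roughly with bits per weight. The bits-per-weight figures in this Python sketch are approximate values assumed for illustration, not exact format specifications:

```python
# Rough model-file size from bits per weight. The bpw figures below are
# assumed approximations for llama.cpp quant formats, not exact specs.
bits_per_weight = {"Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7, "Q4_K_M": 4.8}

def file_size_gb(params: float, quant: str) -> float:
    # bits -> bytes (divide by 8), bytes -> GB (divide by 1e9)
    return params * bits_per_weight[quant] / 8 / 1e9

for quant in bits_per_weight:
    print(f"{quant}: ~{file_size_gb(7e9, quant):.1f} GB")
```

Under these assumptions, a 7B model at Q5_K_M comes out near 5 GB, which lines up with the typical memory figures later in this guide.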

Configuration

Model Settings

Configure in ~/.config/goose/config.yaml:
GOOSE_PROVIDER: local
GOOSE_MODEL: qwen2.5-coder-7b-instruct

# Local inference settings
local_inference:
  # Context size (tokens)
  context_size: 8192
  
  # Generation settings
  temperature: 0.7
  top_p: 0.95
  top_k: 40
  repeat_penalty: 1.1
  
  # Performance
  n_batch: 512
  n_threads: 8
  flash_attention: true
  
  # GPU offload (0 = CPU only, -1 = all layers)
  n_gpu_layers: -1

Environment Variables

# Force CPU inference
export GOOSE_LOCAL_CPU_ONLY=1

# Set GPU layers manually
export GOOSE_N_GPU_LAYERS=32

# Model cache directory
export GOOSE_MODEL_CACHE=~/.cache/goose/models

Memory Management

Goose automatically estimates memory requirements and adjusts context size.

Memory Estimation

From crates/goose/src/providers/local_inference/inference_engine.rs:
pub fn estimate_max_context_for_memory(
    model: &LlamaModel,
    runtime: &InferenceRuntime,
) -> Option<usize> {
    let available = available_inference_memory_bytes(runtime);
    
    // Reserve 50% for computation buffers
    let usable = (available as f64 * 0.5) as u64;
    
    // Calculate KV cache size per token (k_per_head, v_per_head,
    // n_head_kv, and n_layer come from the model; that setup code
    // is elided in this excerpt)
    let bytes_per_token = (k_per_head + v_per_head) * n_head_kv * n_layer * 2;
    
    Some((usable / bytes_per_token) as usize)
}
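To make the arithmetic concrete, here is the same estimate worked through as a standalone Python sketch. The model dimensions are assumptions typical of an 8B-class model with grouped-query attention, not values read from Goose:

```python
# KV-cache sizing sketch. These dimensions are assumed (typical of an
# 8B-class model with grouped-query attention), not read from Goose.
n_layer = 32        # transformer layers
n_head_kv = 8       # KV heads (grouped-query attention)
head_dim = 128      # elements per head
bytes_per_elem = 2  # f16 KV cache entries

# Per token: one key and one value vector per KV head, per layer.
bytes_per_token = 2 * n_head_kv * head_dim * bytes_per_elem * n_layer
print(bytes_per_token)  # 131072 bytes, i.e. 128 KiB per token

# With 8 GiB of free inference memory and 50% reserved for compute buffers:
usable = (8 * 1024**3) // 2
max_context = usable // bytes_per_token
print(max_context)  # 32768 tokens
```

This is why halving the context size or picking a model with fewer layers/KV heads frees memory linearly.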

Memory Requirements

Typical requirements for Q5_K_M quantization:
| Model Size | Model RAM | Context (8K) | Context (32K) | Total |
|---|---|---|---|---|
| 3B | 2.5 GB | 0.5 GB | 2 GB | 3-5 GB |
| 7B | 5 GB | 1 GB | 4 GB | 6-9 GB |
| 13B | 9 GB | 2 GB | 8 GB | 11-17 GB |
| 34B | 22 GB | 5 GB | 20 GB | 27-42 GB |
If your prompt exceeds available memory, Goose will return an error: “Prompt exceeds estimated memory capacity”. Reduce context size or use a smaller model.

Performance Tuning

GPU Acceleration

Check GPU usage:
# NVIDIA
nvidia-smi

# AMD
rocm-smi

# macOS
sudo powermetrics --samplers gpu_power
Optimize GPU layers:
# Start with all layers
n_gpu_layers: -1

# If OOM, reduce gradually
n_gpu_layers: 32  # Offload 32 layers to GPU

CPU Optimization

# Use all CPU cores
n_threads: -1  # Auto-detect

# Or specify manually
n_threads: 16

# Batch size (larger = faster, more memory)
n_batch: 512

Flash Attention

Enable for 2-3x faster inference on supported hardware:
flash_attention: true
Requires:
  • CUDA compute capability ≥ 7.0 (RTX 20 series+)
  • Metal (macOS M1+)
  • ROCm 5.0+

Tool Support

Local models support tool calling through two modes:

Native Tools (Preferred)

Models trained with native tool calling (e.g., Qwen 2.5):
// Automatically detected from model metadata
if model.supports_native_tools() {
    // Use built-in tool calling
}

Emulated Tools

For models without native support, Goose emulates tool calling:
// Inject tool definitions into system prompt
let prompt_with_tools = format!(
    "{}\n\nAvailable tools:\n{}",
    system_prompt,
    tool_definitions
);
Native tool calling is more reliable. Choose models like Qwen 2.5, Mistral, or Llama 3.1 for best results.
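For illustration, the emulated flow can be sketched end to end in Python. The JSON calling convention, function names, and regex here are hypothetical simplifications; Goose's real logic lives in inference_emulated_tools.rs:

```python
import json
import re

def inject_tools(system_prompt: str, tools: list) -> str:
    """Append tool definitions to the system prompt (emulated mode)."""
    defs = "\n".join(json.dumps(t) for t in tools)
    return f"{system_prompt}\n\nAvailable tools:\n{defs}"

def parse_tool_call(response: str):
    """Pull a JSON tool call out of free-form model output, if one exists."""
    match = re.search(r"\{.*\}", response, re.DOTALL)
    if not match:
        return None
    try:
        call = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return call if isinstance(call, dict) and "tool" in call else None

reply = 'Let me check: {"tool": "read_file", "args": {"path": "a.txt"}}'
print(parse_tool_call(reply))
# {'tool': 'read_file', 'args': {'path': 'a.txt'}}
```

The fragility is visible here: the parser depends on the model emitting well-formed JSON in its free text, which is exactly why native tool calling is preferred.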

Sampling Strategies

Temperature Sampling (Default)

Balanced creativity and coherence:
sampling:
  type: temperature
  temperature: 0.7  # 0 = deterministic, 1 = creative
  top_p: 0.95       # Nucleus sampling
  top_k: 40         # Top-k sampling
  min_p: 0.05       # Minimum probability
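A minimal Python sketch of this pipeline over a raw logit vector, applying temperature, then top-k, then top-p (min_p filtering is omitted for brevity); the production implementation is llama.cpp's sampler chain, not this code:

```python
import math
import random

def sample_token(logits, temperature=0.7, top_k=40, top_p=0.95, rng=random):
    """Temperature -> top-k -> top-p (nucleus) sampling sketch."""
    # Temperature: scale logits; lower values sharpen the distribution.
    scaled = sorted(
        ((i, l / temperature) for i, l in enumerate(logits)),
        key=lambda pair: pair[1], reverse=True,
    )
    # Top-k: keep only the k highest-scoring tokens.
    scaled = scaled[:top_k]
    # Softmax over the survivors (shifted by the max for stability).
    top = scaled[0][1]
    weights = [(i, math.exp(l - top)) for i, l in scaled]
    total = sum(w for _, w in weights)
    probs = [(i, w / total) for i, w in weights]
    # Top-p: smallest prefix whose cumulative probability reaches top_p.
    kept, cum = [], 0.0
    for i, p in probs:
        kept.append((i, p))
        cum += p
        if cum >= top_p:
            break
    # Renormalize the nucleus and draw from it.
    z = sum(p for _, p in kept)
    r, acc = rng.random() * z, 0.0
    for i, p in kept:
        acc += p
        if acc >= r:
            return i
    return kept[-1][0]

# Near-zero temperature behaves almost greedily:
print(sample_token([0.1, 5.0, 0.2], temperature=0.01))  # 1
```

Lowering temperature or top_p shrinks the candidate set, which is why low settings give deterministic-feeling output.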

Greedy Sampling

Always select most likely token (deterministic):
sampling:
  type: greedy

Mirostat v2

Adaptive sampling for consistent perplexity:
sampling:
  type: mirostat_v2
  tau: 5.0   # Target perplexity
  eta: 0.1   # Learning rate
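The feedback loop behind these two knobs can be sketched in Python: tokens whose surprise exceeds a running threshold mu are truncated, and after each draw mu is nudged toward the target tau. This is a simplified reading of Mirostat v2, not llama.cpp's exact code:

```python
import math
import random

def mirostat_v2_step(probs, mu, tau=5.0, eta=0.1, rng=random):
    """One Mirostat v2 step (simplified sketch).
    probs: full next-token distribution (all entries > 0);
    mu: running surprise cap, conventionally initialized to 2 * tau."""
    # Truncate: drop tokens whose surprise -log2(p) exceeds mu.
    kept = [(i, p) for i, p in enumerate(probs) if -math.log2(p) < mu]
    if not kept:  # fall back to the single most likely token
        kept = [max(enumerate(probs), key=lambda ip: ip[1])]
    # Sample from the renormalized remainder.
    z = sum(p for _, p in kept)
    r, acc = rng.random() * z, 0.0
    token = kept[-1][0]
    for i, p in kept:
        acc += p
        if acc >= r:
            token = i
            break
    # Feedback: move mu so the observed surprise tracks the target tau.
    surprise = -math.log2(probs[token] / z)
    mu -= eta * (surprise - tau)
    return token, mu
```

A larger eta makes mu react faster to surprising tokens; tau sets the perplexity level the loop converges toward.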

Troubleshooting

Model won’t load

# Verify model file exists
ls ~/.cache/goose/models/

# Check format (must be .gguf)
file ~/.cache/goose/models/qwen2.5-coder-7b-instruct.gguf

# Re-download if corrupted
goose download-model qwen2.5-coder-7b-instruct --force

Out of Memory (OOM)

# Reduce context size
context_size: 4096  # Instead of 32768

# Use smaller quantization
# Download Q4_K_M instead of Q5_K_M

# Reduce GPU layers
n_gpu_layers: 24  # Instead of -1

Slow generation

# Increase batch size
n_batch: 1024  # Instead of 512

# Enable flash attention
flash_attention: true

# Use all CPU cores
n_threads: -1

# Offload more to GPU
n_gpu_layers: -1

Poor quality responses

# Use higher quantization (Q6 or Q8)
# Increase temperature
temperature: 0.8

# Adjust penalties
repeat_penalty: 1.15
frequency_penalty: 0.1

Advanced: Custom Model Registry

Define custom models in ~/.config/goose/local_models.yaml:
models:
  my-custom-model:
    path: /path/to/model.gguf
    context_size: 16384
    temperature: 0.7
    n_gpu_layers: -1
    description: "My fine-tuned model"
Use it:
goose configure set GOOSE_MODEL my-custom-model

Implementation Details

Source Code

  • Engine: crates/goose/src/providers/local_inference/inference_engine.rs
  • Model registry: crates/goose/src/providers/local_inference/local_model_registry.rs
  • Native tools: crates/goose/src/providers/local_inference/inference_native_tools.rs
  • Emulated tools: crates/goose/src/providers/local_inference/inference_emulated_tools.rs
  • Hugging Face models: crates/goose/src/providers/local_inference/hf_models.rs

llama.cpp Integration

Goose uses the llama-cpp-2 Rust bindings:
use llama_cpp_2::{
    context::LlamaContext,
    model::LlamaModel,
    sampling::LlamaSampler,
};

// Load model
let model = LlamaModel::load_from_file(path, params)?;

// Create context
let ctx = model.new_context(backend, ctx_params)?;

// Generate
let sampler = LlamaSampler::chain_simple(samplers);
let token = sampler.sample(&ctx, -1);

Resources
