
Overview

ollama-plan reads your installed Ollama models and hardware profile, then computes safe values for OLLAMA_NUM_CTX, OLLAMA_NUM_PARALLEL, and OLLAMA_MAX_LOADED_MODELS. It prevents out-of-memory crashes by planning memory usage before you start Ollama.
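The planning idea can be sketched roughly as follows: total memory is approximately the sum of the loaded models' base sizes plus a KV-cache cost that grows with context length and parallelism, and the plan is safe when that total fits the budget. This is an illustrative model only, not ollama-plan's actual algorithm; the KV-cache coefficient below is a made-up round number.

```python
# Illustrative sketch of capacity planning (NOT the tool's actual algorithm).
# Assumption: memory ~= sum of base model sizes + per-request KV-cache cost;
# kv_gb_per_1k_tokens is an invented coefficient for demonstration.

def plan_memory_gb(base_sizes_gb, ctx_tokens, parallel, kv_gb_per_1k_tokens=0.2):
    """Estimate memory for loading all models and serving `parallel`
    concurrent requests at context length `ctx_tokens`."""
    weights = sum(base_sizes_gb)  # all models resident at once
    kv_cache = parallel * (ctx_tokens / 1000) * kv_gb_per_1k_tokens
    return weights + kv_cache

def fits(base_sizes_gb, ctx_tokens, parallel, budget_gb):
    return plan_memory_gb(base_sizes_gb, ctx_tokens, parallel) <= budget_gb

# Two models (~9.1GB + ~2.0GB) at ctx=8192, parallel=2, against a 20GB budget:
print(fits([9.1, 2.0], 8192, 2, budget_gb=20.0))  # True under these assumptions
```

Under this toy model, halving the context or dropping to one parallel request is what frees memory on constrained machines, which is why the fallback profile reduces exactly those two knobs.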
llm-checker ollama-plan

Example Output

OLLAMA CAPACITY PLAN
Hardware: Metal (metal)
Memory budget: 20GB usable (reserve 2GB)

Selected models:
  - qwen2.5-coder:14b (14B, ~9.1GB base)
  - llama3.2:3b (3B, ~2.0GB base)

Recommended envelope:
  Context: 8192 (requested 8192)
  Parallel: 2 (requested 2)
  Loaded models: 2 (requested 2)
  Estimated memory: 15.2GB / 20GB (76%)
  Risk: LOW (18/100)

Notes:
  - Running both models simultaneously fits within your memory budget

Recommended env vars:
  export OLLAMA_NUM_CTX=8192
  export OLLAMA_NUM_PARALLEL=2
  export OLLAMA_MAX_LOADED_MODELS=2

Fallback profile:
  OLLAMA_NUM_CTX=4096 OLLAMA_NUM_PARALLEL=1 OLLAMA_MAX_LOADED_MODELS=1

Flags

--models
string[]
Model tags or family names to include in the plan. Matches against installed Ollama models by exact name, prefix, family, or substring. If omitted, all local models are included.
--ctx
number
Target context window in tokens. Default: 8192
--concurrency
number
Target number of parallel requests to support. Default: 2
--objective
string
Optimization objective. Accepted values: latency, balanced, throughput. Default: balanced
--reserve-gb
number
Memory to reserve for the OS and background processes (GB). Default: 2
--json
flag
Output the full capacity plan as JSON.
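The --json flag makes the plan scriptable. As a sketch of post-processing, the snippet below turns a plan into export lines; note that the field names (ctx, parallel, loaded_models) are assumptions about the JSON shape, not the documented schema — inspect the actual --json output for the real keys.

```python
import json

# Hypothetical plan shape — the key names here are assumptions, not the
# documented schema. Run `llm-checker ollama-plan --json` to see real keys.
sample = '{"ctx": 8192, "parallel": 2, "loaded_models": 2}'

def to_exports(plan_json):
    """Render a plan as shell export lines for the Ollama env vars."""
    plan = json.loads(plan_json)
    return "\n".join([
        f"export OLLAMA_NUM_CTX={plan['ctx']}",
        f"export OLLAMA_NUM_PARALLEL={plan['parallel']}",
        f"export OLLAMA_MAX_LOADED_MODELS={plan['loaded_models']}",
    ])

print(to_exports(sample))
```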

Usage Examples

# Plan for all installed models
llm-checker ollama-plan

# Plan for specific models only
llm-checker ollama-plan --models qwen2.5-coder:14b llama3.2:3b

# Optimize for low latency with a larger context window
llm-checker ollama-plan --objective latency --ctx 16384

# Throughput-optimized plan with 4 concurrent requests
llm-checker ollama-plan --objective throughput --concurrency 4

# Reserve 4GB for other workloads
llm-checker ollama-plan --reserve-gb 4

# Machine-readable output
llm-checker ollama-plan --json

Plan Output Fields

Field              Description
Memory budget      Usable RAM after subtracting the OS reserve
Context            Recommended OLLAMA_NUM_CTX value
Parallel           Recommended OLLAMA_NUM_PARALLEL value
Loaded models      Recommended OLLAMA_MAX_LOADED_MODELS value
Estimated memory   Projected memory usage and utilization percentage
Risk               Risk level (LOW, MEDIUM, HIGH) and score (0–100)
Fallback profile   Conservative fallback values for constrained environments
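The risk field can be read as a function of how close projected usage sits to the budget. The mapping below is an illustrative sketch with invented thresholds, not ollama-plan's actual scoring formula.

```python
# Illustrative utilization-to-risk mapping — the thresholds are assumptions,
# not the tool's real scoring formula.

def risk_level(estimated_gb, budget_gb):
    util = estimated_gb / budget_gb
    if util < 0.80:
        return "LOW"
    if util < 0.95:
        return "MEDIUM"
    return "HIGH"

print(risk_level(15.2, 20.0))  # LOW (76% utilization, as in the example plan)
```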

Applying the Plan

Copy the export lines into your shell profile or set them before starting Ollama:
export OLLAMA_NUM_CTX=8192
export OLLAMA_NUM_PARALLEL=2
export OLLAMA_MAX_LOADED_MODELS=2
ollama serve
Or apply them inline for a single session:
OLLAMA_NUM_CTX=8192 OLLAMA_NUM_PARALLEL=2 OLLAMA_MAX_LOADED_MODELS=2 ollama serve
Ollama must be running and you must have at least one model installed before running ollama-plan. Install a model with ollama pull llama3.2:3b if needed.
