Use the --cli flag to get a fit table instead of the TUI.
CLI vs TUI
The --cli flag forces CLI mode when no subcommand is given.
Global Flags
These flags work with all subcommands:

--json
Output results as JSON instead of formatted tables. Useful for parsing with jq, scripts, or agents.

--memory
Override GPU VRAM size. Accepts suffixes: G/GB/GiB, M/MB/MiB, T/TB/TiB (case-insensitive). Examples: --memory 32G, --memory 24000M, --memory 1.5T. Useful when GPU autodetection fails (VMs, passthrough, broken nvidia-smi).

--max-context
Cap the context length used for memory estimation (in tokens). Does not change the model's advertised max context. Example: --max-context 8192. Falls back to the OLLAMA_CONTEXT_LENGTH environment variable if not set.

Subcommands
system
Show detected hardware specifications.
Use the global --memory flag to override detected GPU VRAM.
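As an illustration of the size suffixes --memory accepts, here is a parsing sketch. This is not llmfit's actual code; in particular, treating bare G/M/T as binary (GiB-style) units is an assumption.

```python
import re

# Illustrative parser for --memory size suffixes (G/GB/GiB, M/MB/MiB,
# T/TB/TiB, case-insensitive). Assumption: bare G/M/T and the *iB forms
# are binary (powers of 1024), while GB/MB/TB are decimal.
UNITS = {
    "": 1,
    "m": 1024**2, "mib": 1024**2, "mb": 1000**2,
    "g": 1024**3, "gib": 1024**3, "gb": 1000**3,
    "t": 1024**4, "tib": 1024**4, "tb": 1000**4,
}

def parse_memory(spec: str) -> int:
    """Parse a size like '32G', '24000M', or '1.5T' into a byte count."""
    m = re.fullmatch(r"([0-9]*\.?[0-9]+)([a-zA-Z]*)", spec.strip())
    if m is None or m.group(2).lower() not in UNITS:
        raise ValueError(f"bad memory size: {spec!r}")
    return int(float(m.group(1)) * UNITS[m.group(2).lower()])
```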
list
List all models in the database (no fit analysis).
fit
Find models that fit your system with ranked table output.
Show only models that perfectly match recommended specs (GPU required, meets recommended VRAM/RAM).
Limit number of results returned.
Sort column for fit output. Values: score (default), tps, params, mem, ctx, date, use.
Aliases:
- tps: tokens, toks, throughput
- mem: memory, mem_pct, utilization
- ctx: context
- date: release, released
- use: use_case, usecase
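A sketch of how these aliases could be resolved to canonical column names; this is illustrative, not the CLI's actual implementation.

```python
# Illustrative mapping of sort aliases to canonical fit-table columns.
CANONICAL = {"score", "tps", "params", "mem", "ctx", "date", "use"}
ALIASES = {
    "tokens": "tps", "toks": "tps", "throughput": "tps",
    "memory": "mem", "mem_pct": "mem", "utilization": "mem",
    "context": "ctx",
    "release": "date", "released": "date",
    "use_case": "use", "usecase": "use",
}

def resolve_sort(key: str) -> str:
    """Map an alias (case-insensitive) to its canonical sort column."""
    key = ALIASES.get(key.lower(), key.lower())
    if key not in CANONICAL:
        raise ValueError(f"unknown sort column: {key!r}")
    return key
```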
- Fit: Indicator (●) colored by fit level
- Model: Model name
- Provider: Model provider
- Params: Parameter count
- Score: Composite score (0-100)
- tok/s: Estimated tokens/second
- Quant: Best quantization for your hardware
- Mode: Run mode (GPU, MoE, CPU+GPU, CPU)
- Mem %: Memory utilization percentage
- Ctx: Context window (in thousands of tokens)
- Fit Level: Perfect, Good, Marginal, Too Tight
search
Search for models by name, provider, or size (fuzzy match).
info
Show detailed information about a specific model.
- Full model metadata
- Score breakdown (Quality, Speed, Fit, Context)
- Estimated tokens/second
- Memory requirements (min/recommended VRAM/RAM)
- Fit level and run mode for your system
- MoE architecture details (if applicable)
- GGUF download sources
plan
Estimate required hardware for a specific model configuration. Inverts normal fit analysis: instead of “what fits my hardware?”, asks “what hardware is needed for this model config?”
Model selector (name or unique partial name). Resolves to a single model from the database.
Context length for estimation in tokens (e.g., 4096, 8192, 32768).
Quantization override (e.g., Q4_K_M, Q8_0, mlx-4bit). Omit for auto-selection based on context.
Target decode speed in tokens/second. Used to recommend GPU memory bandwidth.
- Model name and configuration
- Minimum hardware (VRAM, RAM, CPU cores)
- Recommended hardware
- Feasibility of GPU, CPU+GPU offload, CPU-only paths with estimated TPS and fit level
- Upgrade deltas to reach better fit targets
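The kind of arithmetic behind these estimates can be sketched as weights-plus-KV-cache accounting. Everything below (the function, the bits-per-weight figure, the fixed overhead) is an illustrative assumption, not llmfit's actual formula.

```python
def estimate_vram_gb(params_b: float, bits_per_weight: float,
                     context: int, n_layers: int, kv_dim: int,
                     overhead_gb: float = 1.0) -> float:
    """Rough decode-time memory: quantized weights + fp16 KV cache + overhead."""
    weights_gb = params_b * bits_per_weight / 8        # billions of params -> GB
    kv_gb = context * n_layers * kv_dim * 2 * 2 / 1e9  # K and V, 2 bytes each
    return weights_gb + kv_gb + overhead_gb

# e.g. an 8B model at ~4.8 bits/weight (Q4_K_M-ish) with an 8192-token context
print(round(estimate_vram_gb(8.0, 4.8, 8192, 32, 4096), 1))  # → 10.1
```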
recommend
Get top model recommendations for your hardware (JSON-friendly).
Maximum number of recommendations to return.
Filter by use case category. Values: general, coding, reasoning, chat, multimodal, embedding.
Minimum fit level to include. Values: perfect, good, marginal.
Filter by inference runtime. Values: any, mlx, llamacpp.
Output as JSON (default for the recommend command).
download
Download a GGUF model from HuggingFace for use with llama.cpp.
Model to download. Can be:
- HuggingFace repo (e.g., bartowski/Llama-3.1-8B-Instruct-GGUF)
- Search query (e.g., llama 8b)
- Known model name (e.g., llama-3.1-8b-instruct)
Specific GGUF quantization to download (e.g., Q4_K_M, Q8_0). If omitted, selects the best quantization that fits your hardware.
Maximum memory budget in GB for auto-selection. Defaults to detected GPU VRAM or available RAM.
List available GGUF files in the repo without downloading.
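Auto-selection of the kind described above can be sketched as "largest quantization that fits the budget". The size table below is made up for illustration; llmfit's real selection also accounts for context and overhead.

```python
from typing import Optional

# Hypothetical file sizes (GB) for one model's GGUF quantizations.
QUANT_SIZES_GB = {
    "Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7, "Q4_K_M": 4.9, "Q3_K_M": 4.0,
}

def pick_quant(budget_gb: float) -> Optional[str]:
    """Pick the largest quantization whose file fits the memory budget."""
    for name, size in sorted(QUANT_SIZES_GB.items(), key=lambda kv: -kv[1]):
        if size <= budget_gb:
            return name
    return None  # nothing fits
```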
hf-search
Search HuggingFace for GGUF models compatible with llama.cpp.
Search query (model name, architecture, etc.)
Maximum number of results to return.
run
Run a downloaded GGUF model with llama-cli or llama-server.
Model file or name. If a name is given, searches the local llama.cpp cache directory.
Run as an OpenAI-compatible API server instead of interactive chat.
Port for the API server (only with --server).
Number of GPU layers to offload. -1 means all layers (full GPU).
Context size in tokens.
serve
Start llmfit REST API server for cluster/node scheduling workflows. See REST API Guide for full endpoint documentation.
Host interface to bind.
Port to listen on.
- GET /health: Liveness probe
- GET /api/v1/system: Node hardware info
- GET /api/v1/models: Full fit list with filters
- GET /api/v1/models/top: Top runnable models for scheduling
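A minimal client-side sketch for composing requests to these endpoints. The query parameter name shown is an assumption; see the REST API Guide for the filters each endpoint actually accepts.

```python
from urllib.parse import urlencode

def top_models_url(host: str, port: int, **filters: str) -> str:
    """Build the URL for the top-runnable-models endpoint with optional filters."""
    base = f"http://{host}:{port}/api/v1/models/top"
    return base + ("?" + urlencode(filters) if filters else "")

# Fetch with e.g. urllib.request.urlopen(top_models_url(...)) once the
# server is running.
print(top_models_url("127.0.0.1", 8080, limit="5"))
```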
Examples
- Find top 5 perfectly fitting models
- Get JSON recommendations for coding
- Search and get info
- Override GPU memory and cap context
- Plan hardware for a workload
- Download and run a model
JSON Schema Reference
All --json output follows consistent schemas:
System (llmfit --json system):
Fit (llmfit --json fit, recommend, info):
Plan (llmfit --json plan):
The --json flag is global and works with the system, fit, info, plan, and recommend subcommands.
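Since the --json output is meant for scripts and agents, consuming it looks something like the sketch below. The sample payload and its field names are invented for illustration; consult the schemas above for the real shape.

```python
import json

# Made-up sample of recommend-style JSON output; real field names may differ.
payload = (
    '[{"model": "llama-3.1-8b-instruct", "fit": "perfect"},'
    ' {"model": "qwen2.5-coder-7b", "fit": "good"}]'
)
recs = json.loads(payload)

# Keep only the perfectly fitting models.
perfect = [r["model"] for r in recs if r["fit"] == "perfect"]
print(perfect)  # → ['llama-3.1-8b-instruct']
```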