Use the --cli flag to get a fit table instead of the TUI.
CLI vs TUI
The --cli flag forces CLI mode when no subcommand is given.
Global Flags
These flags work with all subcommands:

--json
Output results as JSON instead of formatted tables. Useful for parsing with jq, scripts, or agents.

--memory
Override GPU VRAM size. Accepts suffixes: G/GB/GiB, M/MB/MiB, T/TB/TiB (case-insensitive). Examples: --memory 32G, --memory 24000M, --memory 1.5T. Useful when GPU autodetection fails (VMs, passthrough, broken nvidia-smi).

--max-context
Cap the context length used for memory estimation (in tokens). Does not change the model's advertised max context. Example: --max-context 8192. Falls back to the OLLAMA_CONTEXT_LENGTH environment variable if not set.

Subcommands
system
Show detected hardware specifications.
Use the global --memory flag to override detected GPU VRAM.
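As an illustration of the size suffixes --memory accepts, here is a parsing sketch. This is not llmfit's actual code; in particular, treating bare G/M/T as binary (GiB-style) units is an assumption.

```python
import re

# Illustrative parser for --memory size suffixes (G/GB/GiB, M/MB/MiB,
# T/TB/TiB, case-insensitive). Assumption: bare G/M/T and the *iB forms
# are binary (powers of 1024), while GB/MB/TB are decimal.
UNITS = {
    "": 1,
    "m": 1024**2, "mib": 1024**2, "mb": 1000**2,
    "g": 1024**3, "gib": 1024**3, "gb": 1000**3,
    "t": 1024**4, "tib": 1024**4, "tb": 1000**4,
}

def parse_memory(spec: str) -> int:
    """Parse a size like '32G', '24000M', or '1.5T' into a byte count."""
    m = re.fullmatch(r"([0-9]*\.?[0-9]+)([a-zA-Z]*)", spec.strip())
    if m is None or m.group(2).lower() not in UNITS:
        raise ValueError(f"bad memory size: {spec!r}")
    return int(float(m.group(1)) * UNITS[m.group(2).lower()])
```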
list
List all models in the database (no fit analysis).
fit
Find models that fit your system with ranked table output.
Show only models that perfectly match recommended specs (GPU required, meets recommended VRAM/RAM).
Limit number of results returned.
Sort column for fit output. Values: score (default), tps, params, mem, ctx, date, use.
Aliases:
- tps: tokens, toks, throughput
- mem: memory, mem_pct, utilization
- ctx: context
- date: release, released
- use: use_case, usecase
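A sketch of how these aliases could be resolved to canonical column names; this is illustrative, not the CLI's actual implementation.

```python
# Illustrative mapping of sort aliases to canonical fit-table columns.
CANONICAL = {"score", "tps", "params", "mem", "ctx", "date", "use"}
ALIASES = {
    "tokens": "tps", "toks": "tps", "throughput": "tps",
    "memory": "mem", "mem_pct": "mem", "utilization": "mem",
    "context": "ctx",
    "release": "date", "released": "date",
    "use_case": "use", "usecase": "use",
}

def resolve_sort(key: str) -> str:
    """Map an alias (case-insensitive) to its canonical sort column."""
    key = ALIASES.get(key.lower(), key.lower())
    if key not in CANONICAL:
        raise ValueError(f"unknown sort column: {key!r}")
    return key
```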
- Fit: Indicator (●) colored by fit level
- Model: Model name
- Provider: Model provider
- Params: Parameter count
- Score: Composite score (0-100)
- tok/s: Estimated tokens/second
- Quant: Best quantization for your hardware
- Mode: Run mode (GPU, MoE, CPU+GPU, CPU)
- Mem %: Memory utilization percentage
- Ctx: Context window (in thousands of tokens)
- Fit Level: Perfect, Good, Marginal, Too Tight
search
Search for models by name, provider, or size (fuzzy match).
info
Show detailed information about a specific model.
- Full model metadata
- Score breakdown (Quality, Speed, Fit, Context)
- Estimated tokens/second
- Memory requirements (min/recommended VRAM/RAM)
- Fit level and run mode for your system
- MoE architecture details (if applicable)
- GGUF download sources
plan
Estimate required hardware for a specific model configuration. Inverts normal fit analysis: instead of “what fits my hardware?”, asks “what hardware is needed for this model config?”
Model selector (name or unique partial name). Resolves to a single model from the database.
Context length for estimation in tokens (e.g., 4096, 8192, 32768).
Quantization override (e.g., Q4_K_M, Q8_0, mlx-4bit). Omit for auto-selection based on context.
Target decode speed in tokens/second. Used to recommend GPU memory bandwidth.
- Model name and configuration
- Minimum hardware (VRAM, RAM, CPU cores)
- Recommended hardware
- Feasibility of GPU, CPU+GPU offload, CPU-only paths with estimated TPS and fit level
- Upgrade deltas to reach better fit targets
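The kind of arithmetic behind these estimates can be sketched as weights-plus-KV-cache accounting. Everything below (the function, the bits-per-weight figure, the fixed overhead) is an illustrative assumption, not llmfit's actual formula.

```python
def estimate_vram_gb(params_b: float, bits_per_weight: float,
                     context: int, n_layers: int, kv_dim: int,
                     overhead_gb: float = 1.0) -> float:
    """Rough decode-time memory: quantized weights + fp16 KV cache + overhead."""
    weights_gb = params_b * bits_per_weight / 8        # billions of params -> GB
    kv_gb = context * n_layers * kv_dim * 2 * 2 / 1e9  # K and V, 2 bytes each
    return weights_gb + kv_gb + overhead_gb

# e.g. an 8B model at ~4.8 bits/weight (Q4_K_M-ish) with an 8192-token context
print(round(estimate_vram_gb(8.0, 4.8, 8192, 32, 4096), 1))  # → 10.1
```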
recommend
Get top model recommendations for your hardware (JSON-friendly).
Maximum number of recommendations to return.
Filter by use case category. Values: general, coding, reasoning, chat, multimodal, embedding.
Minimum fit level to include. Values: perfect, good, marginal.
Filter by inference runtime. Values: any, mlx, llamacpp.
Output as JSON (default for the recommend command).
download
Download a GGUF model from HuggingFace for use with llama.cpp.
Model to download. Can be:
- HuggingFace repo (e.g., bartowski/Llama-3.1-8B-Instruct-GGUF)
- Search query (e.g., llama 8b)
- Known model name (e.g., llama-3.1-8b-instruct)
Specific GGUF quantization to download (e.g., Q4_K_M, Q8_0). If omitted, selects the best quantization that fits your hardware.
Maximum memory budget in GB for auto-selection. Defaults to detected GPU VRAM or available RAM.
List available GGUF files in the repo without downloading.
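Auto-selection of the kind described above can be sketched as "largest quantization that fits the budget". The size table below is made up for illustration; llmfit's real selection also accounts for context and overhead.

```python
from typing import Optional

# Hypothetical file sizes (GB) for one model's GGUF quantizations.
QUANT_SIZES_GB = {
    "Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7, "Q4_K_M": 4.9, "Q3_K_M": 4.0,
}

def pick_quant(budget_gb: float) -> Optional[str]:
    """Pick the largest quantization whose file fits the memory budget."""
    for name, size in sorted(QUANT_SIZES_GB.items(), key=lambda kv: -kv[1]):
        if size <= budget_gb:
            return name
    return None  # nothing fits
```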
hf-search
Search HuggingFace for GGUF models compatible with llama.cpp.
Search query (model name, architecture, etc.)
Maximum number of results to return.
run
Run a downloaded GGUF model with llama-cli or llama-server.
Model file or name. If a name is given, searches the local llama.cpp cache directory.
Run as an OpenAI-compatible API server instead of interactive chat.
Port for the API server (only with --server).
Number of GPU layers to offload. -1 means all layers (full GPU).
Context size in tokens.
serve
Start llmfit REST API server for cluster/node scheduling workflows. See REST API Guide for full endpoint documentation.
Host interface to bind.
Port to listen on.
- GET /health: Liveness probe
- GET /api/v1/system: Node hardware info
- GET /api/v1/models: Full fit list with filters
- GET /api/v1/models/top: Top runnable models for scheduling
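A minimal client-side sketch for composing requests to these endpoints. The query parameter name shown is an assumption; see the REST API Guide for the filters each endpoint actually accepts.

```python
from urllib.parse import urlencode

def top_models_url(host: str, port: int, **filters: str) -> str:
    """Build the URL for the top-runnable-models endpoint with optional filters."""
    base = f"http://{host}:{port}/api/v1/models/top"
    return base + ("?" + urlencode(filters) if filters else "")

# Fetch with e.g. urllib.request.urlopen(top_models_url(...)) once the
# server is running.
print(top_models_url("127.0.0.1", 8080, limit="5"))
```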
Examples
- Find top 5 perfectly fitting models
- Get JSON recommendations for coding
- Search and get info
- Override GPU memory and cap context
- Plan hardware for a workload
- Download and run a model
JSON Schema Reference
All --json output follows consistent schemas:
System (llmfit --json system):
Fit (llmfit --json fit, recommend, info):
Plan (llmfit --json plan):
The --json flag is global and works with the system, fit, info, plan, and recommend subcommands.
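Since the --json output is meant for scripts and agents, consuming it looks something like the sketch below. The sample payload and its field names are invented for illustration; consult the schemas above for the real shape.

```python
import json

# Made-up sample of recommend-style JSON output; real field names may differ.
payload = (
    '[{"model": "llama-3.1-8b-instruct", "fit": "perfect"},'
    ' {"model": "qwen2.5-coder-7b", "fit": "good"}]'
)
recs = json.loads(payload)

# Keep only the perfectly fitting models.
perfect = [r["model"] for r in recs if r["fit"] == "perfect"]
print(perfect)  # → ['llama-3.1-8b-instruct']
```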