The llmfit serve command starts an HTTP API that exposes node-local model fit analysis. This is designed for cluster schedulers, aggregators, and remote clients that need to query hardware compatibility across multiple nodes.

Starting the Server

# Default: bind to 0.0.0.0:8787
llmfit serve

# Custom host and port
llmfit serve --host 127.0.0.1 --port 8080

# With global flags (memory override, context cap)
llmfit --memory 24G --max-context 8192 serve --host 0.0.0.0 --port 8787
--host
string
default:"0.0.0.0"
Host interface to bind. Use 0.0.0.0 to accept connections from any interface, or 127.0.0.1 for localhost-only.
--port
integer
default:"8787"
Port to listen on.
Global flags (--memory, --max-context) apply to all API responses, overriding hardware detection and memory estimation.

Base URL

Default local base URL:
http://127.0.0.1:8787
For remote access, replace 127.0.0.1 with the server’s IP or hostname.

Endpoints

GET /health

Liveness probe. Returns a simple status object indicating the server is running. Response:
{
  "status": "ok",
  "node": {
    "name": "worker-1",
    "os": "linux"
  }
}
Status codes:
  • 200 OK - Server is healthy
Example:
curl http://localhost:8787/health

GET /api/v1/system

Node identity and hardware specs. Returns detected CPU, RAM, GPU, and backend info. Response:
{
  "node": {
    "name": "worker-1",
    "os": "linux"
  },
  "system": {
    "total_ram_gb": 62.23,
    "available_ram_gb": 41.08,
    "cpu_cores": 14,
    "cpu_name": "Intel(R) Core(TM) Ultra 7 165U",
    "has_gpu": true,
    "gpu_vram_gb": 24.0,
    "gpu_name": "NVIDIA RTX 4090",
    "gpu_count": 1,
    "unified_memory": false,
    "backend": "CUDA",
    "gpus": [
      {
        "name": "NVIDIA RTX 4090",
        "vram_gb": 24.0,
        "count": 1,
        "unified_memory": false,
        "backend": "CUDA"
      }
    ]
  }
}
Fields:
node.name
string
Node hostname
node.os
string
Operating system (linux, macos, windows)
system.total_ram_gb
number
Total system RAM in gigabytes
system.available_ram_gb
number
Available system RAM in gigabytes
system.cpu_cores
integer
Total CPU core count
system.cpu_name
string
CPU model name
system.has_gpu
boolean
Whether at least one GPU is detected
system.gpu_vram_gb
number | null
Total GPU VRAM across all detected GPUs (null if no GPU)
system.gpu_name
string | null
Primary GPU model name (null if no GPU)
system.gpu_count
integer
Number of detected GPUs
system.unified_memory
boolean
Whether the system uses unified memory (Apple Silicon)
system.backend
string
Acceleration backend: CUDA, Metal, ROCm, SYCL, CPU (x86), CPU (ARM), Ascend
system.gpus
array
Array of detected GPU objects with per-GPU details
Status codes:
  • 200 OK - System info returned
Example:
curl http://localhost:8787/api/v1/system | jq '.system'
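A client can derive a memory budget from this payload. A minimal sketch of how a client might interpret the fields described above (the VRAM-vs-RAM heuristic is an assumption for illustration, and the embedded payload is an abbreviated copy of the sample response, not real output):

```python
import json

# Abbreviated /api/v1/system payload, copied from the sample response above.
SAMPLE = json.loads("""
{
  "node": {"name": "worker-1", "os": "linux"},
  "system": {
    "total_ram_gb": 62.23, "available_ram_gb": 41.08,
    "has_gpu": true, "gpu_vram_gb": 24.0,
    "unified_memory": false, "backend": "CUDA"
  }
}
""")

def memory_budget_gb(payload):
    """Pick the memory pool a model would likely run in: VRAM when a discrete
    GPU is present, otherwise system RAM (unified-memory systems share one pool)."""
    s = payload["system"]
    if s["has_gpu"] and not s["unified_memory"]:
        return s["gpu_vram_gb"]
    return s["available_ram_gb"]

print(memory_budget_gb(SAMPLE))  # 24.0 for the sample node
```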

GET /api/v1/models

Filtered and sorted model fit list. Returns an array of models with fit analysis, scoring, and runtime info. Supports extensive query parameters for filtering. Query Parameters:
limit
integer
Maximum number of models to return. Alias: n
perfect
boolean
When true, only return models with perfect fit level (overrides min_fit)
min_fit
enum
Minimum fit level to include. Values: perfect, good, marginal, too_tight. Default: marginal (includes all but too_tight)
runtime
enum
Filter by inference runtime. Values: any, mlx, llamacpp. Default: any
use_case
enum
Filter by use case category. Values: general, coding, reasoning, chat, multimodal, embedding
provider
string
Provider substring filter (e.g., Meta, Qwen, Mistral)
search
string
Free-text filter across name, provider, parameter count, use case, and category. All space-separated terms must match (AND logic).
sort
enum
Sort column. Values: score, tps, params, mem, ctx, date, use_case. Default: score
include_too_tight
boolean
Include unrunnable models (fit_level = too_tight). Default: true for /models, false for /models/top
max_context
integer
Per-request context cap for memory estimation. Overrides server startup --max-context flag.
Response:
{
  "node": {
    "name": "worker-1",
    "os": "linux"
  },
  "system": { "...": "..." },
  "total_models": 206,
  "returned_models": 20,
  "filters": {
    "limit": 20,
    "min_fit": "marginal",
    "runtime": "any",
    "use_case": null,
    "provider": null,
    "search": null,
    "sort": "score",
    "include_too_tight": true,
    "max_context": null
  },
  "models": [
    {
      "name": "Qwen/Qwen2.5-Coder-7B-Instruct",
      "provider": "Qwen",
      "parameter_count": "7B",
      "params_b": 7.0,
      "context_length": 32768,
      "use_case": "Coding",
      "category": "Coding",
      "release_date": "2025-03-14",
      "is_moe": false,
      "fit_level": "good",
      "fit_label": "Good",
      "run_mode": "gpu",
      "run_mode_label": "GPU",
      "score": 86.5,
      "score_components": {
        "quality": 87.0,
        "speed": 81.2,
        "fit": 90.1,
        "context": 88.0
      },
      "estimated_tps": 42.5,
      "runtime": "llamacpp",
      "runtime_label": "llama.cpp",
      "best_quant": "Q5_K_M",
      "memory_required_gb": 5.8,
      "memory_available_gb": 24.0,
      "utilization_pct": 24.2,
      "notes": [],
      "gguf_sources": [
        {
          "provider": "unsloth",
          "repo": "unsloth/Qwen2.5-Coder-7B-Instruct-GGUF"
        }
      ]
    }
  ]
}
Model Fields:
name
string
Full HuggingFace model name
provider
string
Model provider (Meta, Qwen, Mistral, etc.)
parameter_count
string
Human-readable parameter count (e.g., “7B”, “70B”)
params_b
number
Numeric parameter count in billions
context_length
integer
Maximum context window in tokens
use_case
string
Model-declared use case string
category
string
Inferred category: General, Coding, Reasoning, Chat, Multimodal, Embedding
release_date
string | null
ISO 8601 date string (YYYY-MM-DD) or null
is_moe
boolean
Whether the model uses Mixture-of-Experts architecture
fit_level
enum
Fit level: perfect, good, marginal, too_tight
fit_label
string
Human-readable fit label
run_mode
enum
Execution mode: gpu, moe, cpu_offload, cpu_only
run_mode_label
string
Human-readable run mode label
score
number
Composite score (0-100) combining Quality, Speed, Fit, Context dimensions
score_components
object
Breakdown of the four scoring dimensions (each 0-100)
estimated_tps
number
Estimated tokens per second based on hardware and quantization
runtime
enum
Inference runtime: mlx, llamacpp
runtime_label
string
Human-readable runtime label
best_quant
string
Best quantization that fits this hardware (e.g., Q5_K_M, Q4_K_M)
memory_required_gb
number
Memory required to run the model in GB
memory_available_gb
number
Available memory on this node in GB (VRAM or RAM depending on run mode)
utilization_pct
number
Memory utilization percentage (memory_required / memory_available * 100)
notes
array
Array of human-readable notes (e.g., MoE offloading, CPU-only warnings)
gguf_sources
array
Array of known GGUF download sources with provider and HuggingFace repo
Status codes:
  • 200 OK - Models returned
  • 400 Bad Request - Invalid filter values
  • 500 Internal Server Error - Server error
Examples:
# Top 20 models, sorted by score
curl "http://localhost:8787/api/v1/models?limit=20&min_fit=marginal&sort=score"

# Perfect fits only, coding category
curl "http://localhost:8787/api/v1/models?perfect=true&use_case=coding&limit=10"

# Search for "llama 8b" models
curl "http://localhost:8787/api/v1/models?search=llama%208b&limit=5"

# Filter by provider and runtime
curl "http://localhost:8787/api/v1/models?provider=Qwen&runtime=llamacpp&limit=10"

# Sort by release date (newest first)
curl "http://localhost:8787/api/v1/models?sort=date&limit=10"
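Query strings like the ones above can also be built programmatically. A small sketch using only the Python standard library (`models_url` is a hypothetical helper, not part of any client library; note that `urlencode` encodes spaces as `+`, which is equivalent to `%20` in a query string):

```python
from urllib.parse import urlencode

def models_url(base, **params):
    """Build a /api/v1/models URL, URL-encoding values and dropping None params."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{base}/api/v1/models?{query}" if query else f"{base}/api/v1/models"

url = models_url("http://localhost:8787", search="llama 8b", limit=5)
print(url)  # http://localhost:8787/api/v1/models?search=llama+8b&limit=5
```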

GET /api/v1/models/top

Key scheduling endpoint. Returns top runnable models for this node, with conservative defaults suitable for production placement decisions. Query Parameters: Same as /api/v1/models, with different defaults:
  • limit: Defaults to 5 (vs. no limit on /models)
  • include_too_tight: Defaults to false (vs. true on /models)
All other parameters work identically. Response: Same schema as /api/v1/models, but returns only top N runnable models by default. Status codes:
  • 200 OK - Top models returned
  • 400 Bad Request - Invalid filter values
  • 500 Internal Server Error - Server error
Examples:
# Top 5 runnable models (default)
curl "http://localhost:8787/api/v1/models/top"

# Top 10 with minimum good fit
curl "http://localhost:8787/api/v1/models/top?limit=10&min_fit=good"

# Top 5 coding models
curl "http://localhost:8787/api/v1/models/top?limit=5&use_case=coding"

# Top 3 MLX models (Apple Silicon)
curl "http://localhost:8787/api/v1/models/top?limit=3&runtime=mlx"
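In a scheduler, the /models/top response can be reduced to a single placement decision. A sketch assuming the response schema documented above and the default sort=score ordering (the sample list is abbreviated, hypothetical test data, not real output):

```python
def pick_model(response, runtime=None):
    """Return the name of the highest-scoring model, optionally constrained to
    one runtime. Models arrive sorted by score, so the first match wins."""
    for model in response["models"]:
        if runtime is None or model["runtime"] == runtime:
            return model["name"]
    return None

# Hypothetical, abbreviated /api/v1/models/top response for illustration.
sample = {"models": [
    {"name": "Qwen/Qwen2.5-Coder-7B-Instruct", "score": 86.5, "runtime": "llamacpp"},
    {"name": "mlx-community/SomeModel-4bit", "score": 80.0, "runtime": "mlx"},
]}
print(pick_model(sample, runtime="mlx"))  # mlx-community/SomeModel-4bit
```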

GET /api/v1/models/{name}

Path-constrained search. Equivalent to /api/v1/models?search={name}. Returns models matching the path parameter. Path Parameters:
name
string
required
Model name or search term (URL-encoded)
Query Parameters: All parameters from /api/v1/models are supported. Response: Same schema as /api/v1/models. Status codes:
  • 200 OK - Matching models returned
  • 400 Bad Request - Invalid filter values
  • 500 Internal Server Error - Server error
Examples:
# Search for Mistral models
curl "http://localhost:8787/api/v1/models/Mistral"

# Search with additional filters
curl "http://localhost:8787/api/v1/models/Llama?runtime=llamacpp&limit=5"

Error Handling

Invalid filter values return HTTP 400 with an error message:
{
  "error": "invalid min_fit value: use perfect|good|marginal|too_tight"
}
Server errors return HTTP 500:
{
  "error": "internal server error: <details>"
}
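Since both error shapes carry a single error key, a client can treat any payload containing that key as a failure. A defensive sketch (LlmfitError and parse_response are hypothetical helpers, not part of the API):

```python
class LlmfitError(Exception):
    """Raised when the server returns an error payload."""

def parse_response(status, payload):
    """Return the payload on success; raise with the server's message otherwise."""
    if status != 200 or "error" in payload:
        raise LlmfitError(payload.get("error", f"HTTP {status}"))
    return payload

try:
    parse_response(400, {"error": "invalid min_fit value: use perfect|good|marginal|too_tight"})
except LlmfitError as exc:
    print(exc)  # invalid min_fit value: use perfect|good|marginal|too_tight
```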

Client Integration Patterns

1. Polling Pattern for Schedulers

For each node agent:
  1. Call GET /health to verify liveness
  2. Call GET /api/v1/system to get node identity and hardware
  3. Call GET /api/v1/models/top?limit=K&min_fit=good to get top runnable models
  4. Attach node metadata and forward to your central scheduler
Example aggregator:
import requests
import json

nodes = ["http://node1:8787", "http://node2:8787", "http://node3:8787"]

for node_url in nodes:
    system = requests.get(f"{node_url}/api/v1/system", timeout=5).json()
    top_models = requests.get(
        f"{node_url}/api/v1/models/top?limit=5&min_fit=good", timeout=5
    ).json()

    # Forward to scheduler
    payload = {
        "node": system["node"],
        "system": system["system"],
        "top_models": top_models["models"]
    }
    print(json.dumps(payload, indent=2))

2. Conservative Placement Defaults

For production placement, prefer:
min_fit=good
include_too_tight=false
sort=score
limit=5..20
This ensures only models that fit with headroom are considered.

3. Per-Workload Targeting

Coding workloads:
curl "http://localhost:8787/api/v1/models/top?use_case=coding&limit=5"
Embedding workloads:
curl "http://localhost:8787/api/v1/models/top?use_case=embedding&limit=3"
Runtime-constrained fleet (llama.cpp only):
curl "http://localhost:8787/api/v1/models/top?runtime=llamacpp&limit=10"

4. Stable Parsing

Treat unknown fields as forward-compatible additions:
  • Parse required fields you depend on
  • Ignore unknown fields
  • Validate enums against known values, fall back gracefully
This ensures your client works with future API versions.
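These rules can be encoded in a small tolerant parser. A sketch that keeps only the fields it depends on and validates one enum (the fall-back value on an unknown enum is an assumption; choose whatever fail-safe suits your client):

```python
KNOWN_FIT_LEVELS = {"perfect", "good", "marginal", "too_tight"}

def parse_model(raw):
    """Extract only the fields this client depends on; ignore everything else."""
    fit = raw.get("fit_level")
    if fit not in KNOWN_FIT_LEVELS:
        fit = "too_tight"  # unknown enum value from a newer API: treat as unrunnable
    return {"name": raw["name"], "score": raw.get("score", 0.0), "fit_level": fit}

# Unknown fields and enum values from a future API version are tolerated:
future = {"name": "m", "score": 91.0, "fit_level": "superb", "new_field": 1}
print(parse_model(future))  # {'name': 'm', 'score': 91.0, 'fit_level': 'too_tight'}
```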

Curl Examples

# Health check
curl http://127.0.0.1:8787/health

# System info
curl http://127.0.0.1:8787/api/v1/system

# Top 20 models, sorted by score, minimum marginal fit
curl "http://127.0.0.1:8787/api/v1/models?limit=20&min_fit=marginal&sort=score"

# Top 5 runnable models for scheduling
curl "http://127.0.0.1:8787/api/v1/models/top?limit=5&min_fit=good&use_case=coding"

# Search for Mistral models
curl "http://127.0.0.1:8787/api/v1/models/Mistral?runtime=any"

# Pretty-print with jq
curl -s "http://127.0.0.1:8787/api/v1/models/top?limit=3" | jq '.models[] | {name, score, fit_level}'

Testing

Validate API behavior with the included test script:
# Spawn server automatically and run assertions
python3 scripts/test_api.py --spawn

# Test an already-running server
python3 scripts/test_api.py --base-url http://127.0.0.1:8787
The script validates:
  • Endpoint availability and response schemas
  • Filter behavior (min_fit, runtime, use_case)
  • Sort column correctness
  • Error handling

Versioning

Current API prefix: /api/v1/. If you build long-lived clients, pin to /api/v1/... and validate behavior with scripts/test_api.py.
For cluster scheduling, use /api/v1/models/top with min_fit=good to ensure only models with headroom are considered for placement.
The API returns the same scoring and fit analysis used by TUI and CLI modes. Results are computed on each request based on current system state.
