
GET /health

Liveness probe for health checks and monitoring.

Request

curl http://127.0.0.1:8787/health

Response

{
  "status": "ok",
  "node": {
    "name": "worker-1",
    "os": "linux"
  }
}

Fields

status
string
required
Always "ok" while the server is running
node
object
required
Node identity: host name and operating system

Use Cases

  • Kubernetes liveness/readiness probes
  • Load balancer health checks
  • Service discovery validation
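For probe consumers outside Kubernetes, a minimal client-side check is just parsing the body and testing `status`. A Python sketch (the helper name is hypothetical; the response shape is as documented above):

```python
import json

def is_healthy(body: str) -> bool:
    """Return True when a /health response body reports a running server."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False  # malformed body counts as unhealthy
    return payload.get("status") == "ok"

# Example response body from the docs above.
sample = '{"status": "ok", "node": {"name": "worker-1", "os": "linux"}}'
```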

GET /api/v1/system

Hardware detection endpoint returning node identity and detected system specifications.

Request

curl http://127.0.0.1:8787/api/v1/system

Response

{
  "node": {
    "name": "worker-1",
    "os": "linux"
  },
  "system": {
    "total_ram_gb": 62.23,
    "available_ram_gb": 41.08,
    "cpu_cores": 14,
    "cpu_name": "Intel(R) Core(TM) Ultra 7 165U",
    "has_gpu": true,
    "gpu_vram_gb": 24.0,
    "gpu_name": "NVIDIA RTX 4090",
    "gpu_count": 1,
    "unified_memory": false,
    "backend": "CUDA",
    "gpus": [
      {
        "name": "NVIDIA RTX 4090",
        "vram_gb": 24.0,
        "backend": "CUDA",
        "count": 1,
        "unified_memory": false
      }
    ]
  }
}

Fields

See Response Schemas - System Object for complete field documentation.

Use Cases

  • Cluster inventory and hardware discovery
  • Validating hardware requirements before placement
  • Displaying node capabilities in dashboards
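The second use case above can be sketched as a predicate over the `system` object. The thresholds and helper name below are illustrative, not part of the API:

```python
def meets_requirements(system: dict, min_ram_gb: float = 0.0,
                       min_vram_gb: float = 0.0, need_gpu: bool = False) -> bool:
    """Check a /api/v1/system `system` object against placement requirements."""
    if need_gpu and not system.get("has_gpu"):
        return False
    if system.get("available_ram_gb", 0.0) < min_ram_gb:
        return False
    vram = system.get("gpu_vram_gb") or 0.0  # gpu_vram_gb is null when no GPU
    return vram >= min_vram_gb

# Sample system object from the response above.
system = {
    "total_ram_gb": 62.23, "available_ram_gb": 41.08, "cpu_cores": 14,
    "has_gpu": True, "gpu_vram_gb": 24.0, "gpu_count": 1,
}
```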

GET /api/v1/models

Filtered model listing with scoring and fit analysis for the current node.

Request

curl "http://127.0.0.1:8787/api/v1/models?limit=20&min_fit=marginal&sort=score"

Query Parameters

All parameters are optional. See Query Parameters for details.
  • limit - Maximum number of models to return
  • perfect - Return only perfect fits
  • min_fit - Minimum fit level (perfect|good|marginal|too_tight)
  • runtime - Filter by inference runtime (any|mlx|llamacpp)
  • use_case - Filter by use case (coding|reasoning|chat|multimodal|embedding|general)
  • provider - Filter by provider substring
  • search - Free-text search across name/provider/params
  • sort - Sort column (score|tps|params|mem|ctx|date|use_case)
  • include_too_tight - Include unrunnable models (default: true)
  • max_context - Context length limit for memory estimation
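Since invalid enum values produce an HTTP 400 (see Error Responses), a client may validate parameters before sending the request. A Python sketch, assuming the value sets listed above (the helper name is hypothetical):

```python
from urllib.parse import urlencode

VALID_MIN_FIT = {"perfect", "good", "marginal", "too_tight"}
VALID_RUNTIME = {"any", "mlx", "llamacpp"}

def models_url(base: str, **params) -> str:
    """Build a /api/v1/models URL, validating enum parameters client-side
    so a typo fails fast instead of round-tripping to an HTTP 400."""
    if "min_fit" in params and params["min_fit"] not in VALID_MIN_FIT:
        raise ValueError(f"invalid min_fit: {params['min_fit']}")
    if "runtime" in params and params["runtime"] not in VALID_RUNTIME:
        raise ValueError(f"invalid runtime: {params['runtime']}")
    query = urlencode(params)
    return f"{base}/api/v1/models?{query}" if query else f"{base}/api/v1/models"
```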

Response

{
  "node": {
    "name": "worker-1",
    "os": "linux"
  },
  "system": {
    "total_ram_gb": 62.23,
    "available_ram_gb": 41.08,
    "cpu_cores": 14,
    "cpu_name": "Intel(R) Core(TM) Ultra 7 165U",
    "has_gpu": false,
    "gpu_vram_gb": null,
    "gpu_name": null,
    "gpu_count": 0,
    "unified_memory": false,
    "backend": "CPU (x86)",
    "gpus": []
  },
  "total_models": 23,
  "returned_models": 10,
  "filters": {
    "limit": 20,
    "perfect": null,
    "min_fit": "marginal",
    "runtime": null,
    "use_case": null,
    "provider": null,
    "search": null,
    "sort": "score",
    "max_context": null,
    "include_too_tight": true,
    "top_only": false
  },
  "models": [
    {
      "name": "Qwen/Qwen2.5-Coder-7B-Instruct",
      "provider": "Qwen",
      "parameter_count": "7B",
      "params_b": 7.0,
      "context_length": 32768,
      "use_case": "Coding",
      "category": "Coding",
      "release_date": "2025-03-14",
      "is_moe": false,
      "fit_level": "good",
      "fit_label": "Good",
      "run_mode": "gpu",
      "run_mode_label": "GPU",
      "score": 86.5,
      "score_components": {
        "quality": 87.0,
        "speed": 81.2,
        "fit": 90.1,
        "context": 88.0
      },
      "estimated_tps": 42.5,
      "runtime": "llamacpp",
      "runtime_label": "llama.cpp",
      "best_quant": "Q5_K_M",
      "memory_required_gb": 5.8,
      "memory_available_gb": 12.0,
      "utilization_pct": 48.3,
      "notes": [],
      "gguf_sources": []
    }
  ]
}

Fields

See Response Schemas for complete field documentation.

Use Cases

  • Browsing all models compatible with a node
  • Filtering by specific requirements (use case, runtime, fit level)
  • Custom sorting and ranking strategies
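When the server-side `sort` options are not enough, the returned `models` array can be re-ranked client-side. A sketch, assuming the `fit_level` and `estimated_tps` fields shown in the response above (the ranking strategy is illustrative):

```python
FIT_RANK = {"too_tight": 0, "marginal": 1, "good": 2, "perfect": 3}

def pick_candidates(models: list, min_fit: str = "good") -> list:
    """Filter a /api/v1/models `models` array by fit level, then re-sort
    by estimated throughput instead of the composite score."""
    floor = FIT_RANK[min_fit]
    runnable = [m for m in models if FIT_RANK.get(m.get("fit_level"), 0) >= floor]
    return sorted(runnable, key=lambda m: m.get("estimated_tps", 0.0), reverse=True)
```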

GET /api/v1/models/top

Top runnable models optimized for scheduling decisions. This is the key endpoint for cluster schedulers.

Request

curl "http://127.0.0.1:8787/api/v1/models/top?limit=5&min_fit=good&use_case=coding"

Query Parameters

Same as /api/v1/models with different defaults:
  • limit defaults to 5 (instead of unlimited)
  • include_too_tight defaults to false (excludes unrunnable models)
See Query Parameters for all options.

Response

Identical structure to /api/v1/models, but optimized for scheduling:
{
  "node": { "name": "worker-1", "os": "linux" },
  "system": { /* hardware specs */ },
  "total_models": 15,
  "returned_models": 5,
  "filters": {
    "limit": 5,
    "min_fit": "good",
    "use_case": "coding",
    "include_too_tight": false,
    "top_only": true
  },
  "models": [ /* top 5 runnable models */ ]
}

Use Cases

  • Scheduler polling: Query each node for top K runnable models
  • Fast placement decisions: Get best options without full model list
  • Workload-specific routing: Filter by use case for targeted placement
Example: poll every node in the cluster and extract each node's best candidate:
# Poll each node in cluster
for node in worker-{1..10}; do
  curl "http://${node}:8787/api/v1/models/top?limit=5&min_fit=good" \
    | jq '{node: .node.name, top_model: .models[0].name, score: .models[0].score}'
done
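Once per-node responses are collected, picking a fleet-wide placement reduces to comparing `(score, node, model)` tuples. A Python sketch, assuming responses keyed by node name (the function name is hypothetical):

```python
def best_placement(responses: dict) -> tuple:
    """Given {node_name: /api/v1/models/top response}, return the
    (node, model_name, score) of the highest-scoring candidate fleet-wide."""
    best = None
    for node, resp in responses.items():
        for model in resp.get("models", []):
            entry = (model["score"], node, model["name"])
            if best is None or entry > best:
                best = entry
    if best is None:
        raise RuntimeError("no runnable models reported by any node")
    score, node, name = best
    return node, name, score
```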

GET /api/v1/models/{name}

Model search by name: the {name} path segment is applied as a text-search filter.

Request

curl "http://127.0.0.1:8787/api/v1/models/Mistral?runtime=any"

Path Parameters

name
string
required
Model name search string (substring match, case-insensitive)

Query Parameters

All query parameters from /api/v1/models are supported. The {name} path parameter is automatically added as a search filter.

Response

Identical structure to /api/v1/models, filtered by name:
{
  "node": { "name": "worker-1", "os": "linux" },
  "system": { /* hardware specs */ },
  "total_models": 3,
  "returned_models": 3,
  "filters": {
    "search": "Mistral",
    "limit": 20,
    "include_too_tight": true
  },
  "models": [
    { "name": "mistralai/Mistral-7B-Instruct-v0.3", /* ... */ },
    { "name": "mistralai/Mixtral-8x7B-Instruct-v0.1", /* ... */ },
    { "name": "mistralai/Mistral-Small-Instruct-2501", /* ... */ }
  ]
}

Use Cases

  • Client-side drilldown after selecting a model family
  • Validating if a specific model runs on a node
  • Finding all variants of a model (e.g., all “Qwen” models)

Error Responses

HTTP 400 - Bad Request

Returned for invalid query parameter values:
{
  "error": "invalid min_fit value: use perfect|good|marginal|too_tight"
}
Common validation errors:
  • Invalid min_fit value
  • Invalid runtime value
  • Invalid use_case value
  • Invalid sort column

HTTP 500 - Internal Server Error

Returned for unexpected server errors:
{
  "error": "server error: <details>"
}
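Both error shapes carry a single `error` string, so one client-side handler covers them. A sketch (the exception and helper names are hypothetical):

```python
class ApiError(Exception):
    """Wraps the server's `error` string from a non-200 response."""

def check_response(status: int, body: dict) -> dict:
    """Return the parsed body on success, or raise ApiError with the
    server's `error` message for 400/500 responses as documented above."""
    if status == 200:
        return body
    raise ApiError(body.get("error", f"HTTP {status}"))
```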

Client Integration Best Practices

1. Polling Pattern for Schedulers

For each node agent:
  1. Call /health to verify availability
  2. Call /api/v1/system to get hardware specs
  3. Call /api/v1/models/top?limit=K&min_fit=good for scheduling candidates
  4. Attach node metadata and forward to central scheduler
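The four steps above can be sketched as one function per node, with the HTTP transport abstracted behind a `fetch` callable so the pattern stays transport-agnostic (all names here are hypothetical):

```python
from typing import Callable, Optional

def poll_node(fetch: Callable[[str], dict], k: int = 5) -> Optional[dict]:
    """Run the polling pattern against one node agent. `fetch` performs the
    HTTP GET for a path and returns the parsed JSON body."""
    health = fetch("/health")
    if health.get("status") != "ok":
        return None  # skip unhealthy nodes
    system = fetch("/api/v1/system")
    top = fetch(f"/api/v1/models/top?limit={k}&min_fit=good")
    # Attach node metadata for the central scheduler.
    return {"node": health["node"], "system": system["system"],
            "candidates": top["models"]}
```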

2. Conservative Placement Defaults

For production placement, start from:
  • min_fit=good
  • include_too_tight=false
  • sort=score
  • limit between 5 and 20

3. Per-Workload Targeting

Examples:
  • Coding workloads: use_case=coding
  • Embedding workloads: use_case=embedding
  • Runtime-constrained fleet: runtime=llamacpp

4. Forward-Compatible Parsing

Treat unknown fields as forward-compatible additions:
  • Parse only required fields your application depends on
  • Ignore unknown fields to support future API versions
  • Validate critical fields exist before accessing
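A sketch of this defensive parsing for a single model entry (the field selection is illustrative; only `name`, `fit_level`, and `score` are assumed required by this hypothetical client):

```python
REQUIRED_MODEL_FIELDS = ("name", "fit_level", "score")

def parse_model(raw: dict) -> dict:
    """Extract only the fields this client depends on, validating their
    presence; unknown fields are ignored so future API versions keep working."""
    missing = [f for f in REQUIRED_MODEL_FIELDS if f not in raw]
    if missing:
        raise KeyError(f"model entry missing required fields: {missing}")
    return {f: raw[f] for f in REQUIRED_MODEL_FIELDS}
```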
