The llmfit serve command starts an HTTP API that exposes node-local model fit analysis. It is designed for cluster schedulers, aggregators, and remote clients that need to query hardware compatibility across multiple nodes.
Starting the Server
llmfit serve accepts the following options:
- Host interface to bind. Use 0.0.0.0 to accept connections from any interface, or 127.0.0.1 for localhost-only.
- Port to listen on.
- Global override flags (--memory, --max-context) apply to all API responses, overriding hardware detection and memory estimation.
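As a sketch, a typical invocation might look like the following. The host and port flag names are assumptions based on the option descriptions above, and the port and context values are illustrative:

```shell
# Hypothetical invocation (flag names for host/port assumed): bind all
# interfaces on an illustrative port, and cap context for all responses
# via the documented --max-context override.
llmfit serve --host 0.0.0.0 --port 8080 --max-context 8192
```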
Base URL
The default local base URL uses 127.0.0.1. For remote access, replace 127.0.0.1 with the server's IP or hostname.
Endpoints
GET /health
Liveness probe. Returns a simple status object indicating the server is running.
Response:
- 200 OK - Server is healthy
GET /api/v1/system
Node identity and hardware specs. Returns detected CPU, RAM, GPU, and backend info.
Response:
- Node hostname
- Operating system (linux, macos, windows)
- Total system RAM in gigabytes
- Available system RAM in gigabytes
- Total CPU core count
- CPU model name
- Whether at least one GPU is detected
- Total GPU VRAM across all detected GPUs (null if no GPU)
- Primary GPU model name (null if no GPU)
- Number of detected GPUs
- Whether the system uses unified memory (Apple Silicon)
- Acceleration backend: CUDA, Metal, ROCm, SYCL, CPU (x86), CPU (ARM), Ascend
- Array of detected GPU objects with per-GPU details
- 200 OK - System info returned
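To illustrate consuming this endpoint, here is a minimal sketch of classifying a node for scheduling. The JSON key names (has_gpu, gpu_vram_gb) and the 48 GB threshold are assumptions for illustration, not confirmed by the server schema:

```python
# Sketch: derive a coarse capability class from /api/v1/system output.
# Key names and the VRAM threshold below are assumed for illustration.
def classify_node(system: dict) -> str:
    """Return a coarse capability class for placement decisions."""
    if not system.get("has_gpu", False):
        return "cpu-only"
    vram = system.get("gpu_vram_gb") or 0   # field is null when no GPU
    return "gpu-large" if vram >= 48 else "gpu-small"
```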
GET /api/v1/models
Filtered and sorted model fit list. Returns an array of models with fit analysis, scoring, and runtime info. Supports extensive query parameters for filtering.
Query Parameters:
- limit (integer): Maximum number of models to return. Alias: n.
- Perfect-only flag (boolean): When true, only return models with perfect fit level (overrides min_fit).
- min_fit (enum): Minimum fit level to include. Values: perfect, good, marginal, too_tight. Default: marginal (includes all but too_tight).
- runtime (enum): Filter by inference runtime. Values: any, mlx, llamacpp. Default: any.
- use_case (enum): Filter by use case category. Values: general, coding, reasoning, chat, multimodal, embedding.
- Provider filter (string): Provider substring filter (e.g., Meta, Qwen, Mistral).
- search (string): Free-text filter across name, provider, parameter count, use case, and category. All space-separated terms must match (AND logic).
- Sort column (enum): Values: score, tps, params, mem, ctx, date, use_case. Default: score.
- include_too_tight (boolean): Include unrunnable models (fit_level = too_tight). Default: true for /models, false for /models/top.
- Context cap (integer): Per-request context cap for memory estimation. Overrides the server's startup --max-context flag.
Response fields (per model):
- Full HuggingFace model name
- Model provider (Meta, Qwen, Mistral, etc.)
- Human-readable parameter count (e.g., "7B", "70B")
- Numeric parameter count in billions
- Maximum context window in tokens
- Model-declared use case string
- Inferred category: General, Coding, Reasoning, Chat, Multimodal, Embedding
- ISO 8601 date string (YYYY-MM-DD) or null
- Whether the model uses a Mixture-of-Experts architecture
- Fit level: perfect, good, marginal, too_tight
- Human-readable fit label
- Execution mode: gpu, moe, cpu_offload, cpu_only
- Human-readable run mode label
- Composite score (0-100) combining Quality, Speed, Fit, and Context dimensions
- Breakdown of the four scoring dimensions (each 0-100)
- Estimated tokens per second based on hardware and quantization
- Inference runtime: mlx, llamacpp
- Human-readable runtime label
- Best quantization that fits this hardware (e.g., Q5_K_M, Q4_K_M)
- Memory required to run the model, in GB
- Available memory on this node, in GB (VRAM or RAM depending on run mode)
- Memory utilization percentage (memory_required / memory_available * 100)
- Array of human-readable notes (e.g., MoE offloading, CPU-only warnings)
- Array of known GGUF download sources with provider and HuggingFace repo
Status codes:
- 200 OK - Models returned
- 400 Bad Request - Invalid filter values
- 500 Internal Server Error - Server error
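The filter parameters above compose into an ordinary query string. A minimal client-side sketch (the base URL and port are assumptions; the parameter names used here are the documented ones):

```python
from urllib.parse import urlencode

def models_url(base: str, **filters) -> str:
    """Build a /api/v1/models request URL from documented filter params
    (limit, min_fit, runtime, use_case, search, include_too_tight, ...).
    None values are dropped so callers can pass optional filters freely."""
    query = urlencode({k: v for k, v in filters.items() if v is not None})
    return f"{base}/api/v1/models" + (f"?{query}" if query else "")

# Example: coding models on the llama.cpp runtime with at least a good fit.
url = models_url("http://127.0.0.1:8080",
                 min_fit="good", runtime="llamacpp", use_case="coding")
```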
GET /api/v1/models/top
Key scheduling endpoint. Returns top runnable models for this node, with conservative defaults suitable for production placement decisions.
Query Parameters:
Same as /api/v1/models, with different defaults:
- limit: Defaults to 5 (vs. no limit on /models)
- include_too_tight: Defaults to false (vs. true on /models)
Response: Same schema as /api/v1/models, but returns only the top N runnable models by default.
Status codes:
- 200 OK - Top models returned
- 400 Bad Request - Invalid filter values
- 500 Internal Server Error - Server error
GET /api/v1/models/{name}
Path-constrained search. Equivalent to /api/v1/models?search={name}. Returns models matching the path parameter.
Path Parameters:
Model name or search term (URL-encoded)
Query Parameters: All query parameters from /api/v1/models are supported.
Response:
Same schema as /api/v1/models.
Status codes:
- 200 OK - Matching models returned
- 400 Bad Request - Invalid filter values
- 500 Internal Server Error - Server error
Error Handling
Invalid filter values return HTTP 400 with an error message in the response body.
Client Integration Patterns
1. Polling Pattern for Schedulers
For each node agent:
- Call GET /health to verify liveness
- Call GET /api/v1/system to get node identity and hardware
- Call GET /api/v1/models/top?limit=K&min_fit=good to get top runnable models
- Attach node metadata and forward to your central scheduler
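The steps above can be sketched as a single polling function. The base URL and the default limit are assumptions; the injectable fetch callable is a testing convenience, not part of the API:

```python
# Sketch of the scheduler polling loop described above.
import json
import urllib.request
from urllib.parse import urlencode

def poll_node(base_url: str, limit: int = 5, min_fit: str = "good",
              fetch=None) -> dict:
    """Collect liveness, hardware, and top-model info from one node agent.

    `fetch` is injectable for testing; by default it performs a real
    HTTP GET and decodes the JSON body.
    """
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url, timeout=5) as resp:
                return json.load(resp)
    health = fetch(f"{base_url}/health")                  # 1. liveness
    system = fetch(f"{base_url}/api/v1/system")           # 2. identity + hardware
    query = urlencode({"limit": limit, "min_fit": min_fit})
    top = fetch(f"{base_url}/api/v1/models/top?{query}")  # 3. top runnable models
    return {"health": health, "system": system, "top_models": top}
```

The aggregated dict is what you would attach node metadata to before forwarding to a central scheduler.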
2. Conservative Placement Defaults
For production placement, prefer the conservative defaults of /api/v1/models/top: a small limit, min_fit=good or stricter, and include_too_tight=false.
3. Per-Workload Targeting
Coding workloads: filter with use_case=coding.
4. Stable Parsing
Treat unknown fields as forward-compatible additions:
- Parse required fields you depend on
- Ignore unknown fields
- Validate enums against known values, fall back gracefully
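A minimal sketch of this stable-parsing guidance: read only the fields you need, ignore extras, and coerce unknown enum values to a safe fallback. The name key is an assumption for illustration; fit_level and its values come from this document:

```python
# Tolerant parser for one model entry from /api/v1/models responses.
KNOWN_FIT_LEVELS = {"perfect", "good", "marginal", "too_tight"}

def parse_model(raw: dict) -> dict:
    fit = raw.get("fit_level")
    if fit not in KNOWN_FIT_LEVELS:
        fit = "too_tight"   # unknown fit level: safest to treat as unrunnable
    return {
        "name": raw.get("name", "<unknown>"),  # key name assumed
        "fit_level": fit,
        # Any extra/unknown fields in `raw` are simply ignored.
    }
```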
Curl Examples
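The following requests exercise the documented endpoints and filters; the base URL and port are assumptions, so substitute your server's host and port:

```shell
# Base URL is an assumption; substitute your host and port.
BASE=http://127.0.0.1:8080

# Liveness check
curl -s "$BASE/health"

# Node identity and hardware
curl -s "$BASE/api/v1/system"

# Top 5 runnable models with at least a good fit
curl -s "$BASE/api/v1/models/top?limit=5&min_fit=good"

# Coding models on the llama.cpp runtime
curl -s "$BASE/api/v1/models?use_case=coding&runtime=llamacpp"

# Path-constrained search by model name
curl -s "$BASE/api/v1/models/Qwen"
```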
Testing
Validate API behavior with the included test script (scripts/test_api.py). It covers:
- Endpoint availability and response schemas
- Filter behavior (min_fit, runtime, use_case)
- Sort column correctness
- Error handling
Versioning
Current API prefix: /api/v1/
If you build long-lived clients, pin to /api/v1/... and validate behavior with scripts/test_api.py.
The API returns the same scoring and fit analysis used by TUI and CLI modes. Results are computed on each request based on current system state.
