
Synopsis

llmfit serve [OPTIONS]

Description

Starts an HTTP REST API server for cluster and node scheduling workflows. This allows other services to query llmfit’s model compatibility analysis programmatically. The API is designed for:
  • Kubernetes schedulers and operators
  • Multi-node orchestration systems
  • CI/CD pipelines
  • Infrastructure-as-code tools
  • Monitoring and observability platforms

Options

--host (string, default: "0.0.0.0")
Host interface to bind. Use "0.0.0.0" to listen on all interfaces, or "127.0.0.1" for localhost only.

--port (integer, default: 8787)
Port to listen on.

--memory (string)
Override the detected GPU VRAM size (e.g., "32G", "32000M", "1.5T").

--max-context (integer)
Cap the context length used for memory estimation, in tokens. Must be >= 1.
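
As an illustration of how size strings in this format decompose, here is a minimal Python sketch; the decimal multipliers (1G = 10^9 bytes) are an assumption made for the example, not documented llmfit behavior:

```python
# Hypothetical parser for --memory size strings such as "32G", "32000M", "1.5T".
# Assumes decimal multipliers (1G = 10^9 bytes); llmfit's actual units may differ.
UNITS = {"M": 10**6, "G": 10**9, "T": 10**12}

def parse_memory(size: str) -> float:
    """Return the size in gigabytes for a string like '32G' or '1.5T'."""
    unit = size[-1].upper()
    if unit not in UNITS:
        raise ValueError(f"unknown unit in {size!r}")
    return float(size[:-1]) * UNITS[unit] / 10**9

print(parse_memory("32G"))     # 32.0
print(parse_memory("32000M"))  # 32.0
print(parse_memory("1.5T"))    # 1500.0
```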

Usage Examples

Start Server

# Start on default port (8787)
llmfit serve

# Start on custom port
llmfit serve --port 8080

# Bind to localhost only (more secure)
llmfit serve --host 127.0.0.1 --port 8787

Override Detection

# Override detected VRAM
llmfit serve --memory 24G

# Cap context length
llmfit serve --max-context 8192

Production Deployment

# Run in background
nohup llmfit serve --host 0.0.0.0 --port 8787 > llmfit.log 2>&1 &

# With systemd
sudo systemctl start llmfit-api

# With Docker
docker run -p 8787:8787 llmfit serve

API Endpoints

Health Check

GET /health
Returns server health status. Response:
{
  "status": "ok",
  "node": {
    "name": "gpu-node-01",
    "os": "linux"
  }
}

System Information

GET /api/v1/system
Returns detected hardware specifications. Response:
{
  "node": {
    "name": "gpu-node-01",
    "os": "linux"
  },
  "system": {
    "total_ram_gb": 128.0,
    "available_ram_gb": 115.5,
    "cpu_cores": 32,
    "cpu_name": "AMD EPYC 7763",
    "has_gpu": true,
    "gpu_vram_gb": 80.0,
    "gpu_name": "NVIDIA A100 80GB PCIe",
    "gpu_count": 1,
    "unified_memory": false,
    "backend": "CUDA"
  }
}
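
A client consuming this payload might derive a memory budget from it, e.g. when deciding which models to schedule. The sketch below uses field names from the response above; the 90% headroom factor is purely illustrative:

```python
# Derive a memory budget from a /api/v1/system response.
# Field names come from the documented payload; the 90% headroom
# factor is an illustrative choice, not llmfit behavior.
sample = {
    "system": {
        "has_gpu": True,
        "gpu_vram_gb": 80.0,
        "available_ram_gb": 115.5,
        "unified_memory": False,
    }
}

def memory_budget_gb(payload: dict, headroom: float = 0.9) -> float:
    """Pick VRAM when a GPU is present, otherwise available RAM."""
    sysinfo = payload["system"]
    base = sysinfo["gpu_vram_gb"] if sysinfo["has_gpu"] else sysinfo["available_ram_gb"]
    return base * headroom

print(memory_budget_gb(sample))  # 72.0
```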

List Compatible Models

GET /api/v1/models?limit=20&min_fit=good&sort=score
Returns models compatible with the node’s hardware. Query Parameters:
  • limit (integer): Maximum results to return
  • n (integer): Alias for limit
  • perfect (boolean): Show only perfect fits
  • min_fit (enum): Minimum fit level (perfect, good, marginal, too_tight)
  • runtime (enum): Filter by runtime (any, mlx, llamacpp)
  • use_case (enum): Filter by use case (general, coding, reasoning, chat, multimodal, embedding)
  • provider (string): Filter by provider name
  • search (string): Search query
  • sort (enum): Sort column (score, tps, params, mem, ctx, date, use)
  • include_too_tight (boolean): Include models that are too tight
  • max_context (integer): Context length cap for memory estimation
Response:
{
  "node": {
    "name": "gpu-node-01",
    "os": "linux"
  },
  "system": { ... },
  "total_models": 45,
  "returned_models": 20,
  "filters": {
    "limit": 20,
    "min_fit": "good",
    "sort": "score"
  },
  "models": [
    {
      "name": "llama-3.1-405b",
      "provider": "Meta",
      "parameter_count": "405B",
      "fit_level": "perfect",
      "run_mode": "gpu",
      "score": 98.5,
      "estimated_tps": 18.2,
      "runtime": "llamacpp",
      "best_quant": "Q4_K_M",
      "memory_required_gb": 210.5,
      "utilization_pct": 87.3
    }
  ]
}
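
Responses can also be post-filtered client-side when a consumer needs stricter criteria than the query parameters express. The field names below come from the documented payload; the sample models are invented for illustration:

```python
# Client-side filtering of a /api/v1/models response.
# Field names (fit_level, score) come from the documented payload;
# the sample models are invented for this example.
models = [
    {"name": "llama-3.1-405b", "fit_level": "perfect", "score": 98.5},
    {"name": "mixtral-8x7b", "fit_level": "marginal", "score": 76.0},
    {"name": "qwen2.5-72b", "fit_level": "good", "score": 91.2},
]

def best_fits(models: list, levels=("perfect", "good")) -> list:
    """Keep only the requested fit levels, best score first."""
    kept = [m for m in models if m["fit_level"] in levels]
    return sorted(kept, key=lambda m: m["score"], reverse=True)

for m in best_fits(models):
    print(m["name"], m["score"])
# llama-3.1-405b 98.5
# qwen2.5-72b 91.2
```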

Top Recommendations

GET /api/v1/models/top?limit=5&use_case=coding&min_fit=good
Returns the top N recommended models for the node.
Query Parameters: same as /api/v1/models.
Response: same format as /api/v1/models.

Model by Name

GET /api/v1/models/{name}
Returns models matching the specified name (partial match). Path Parameters:
  • name (string): Model name or partial name
Query Parameters: same as /api/v1/models.
Response: same format as /api/v1/models.
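
Because all three model endpoints accept the same query parameters, a client can assemble the query string once with the standard library; the parameter names are taken from the list above:

```python
from urllib.parse import urlencode

def model_query(**params) -> str:
    """Drop unset parameters and URL-encode the rest,
    e.g. for /api/v1/models or /api/v1/models/top."""
    return urlencode({k: v for k, v in params.items() if v is not None})

qs = model_query(limit=5, use_case="coding", min_fit="good", provider=None)
print(qs)  # limit=5&use_case=coding&min_fit=good
```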

Example Output

Starting the Server

$ llmfit serve --host 0.0.0.0 --port 8787
llmfit API server listening on http://0.0.0.0:8787
  GET /health
  GET /api/v1/system
  GET /api/v1/models?limit=20&min_fit=marginal&sort=score
  GET /api/v1/models/top?limit=5&use_case=coding&min_fit=good
  GET /api/v1/models/<name>

Testing Endpoints

# Health check
curl http://localhost:8787/health

# System info
curl http://localhost:8787/api/v1/system

# Top 10 models
curl 'http://localhost:8787/api/v1/models?limit=10&sort=score'

# Best coding models
curl 'http://localhost:8787/api/v1/models/top?limit=5&use_case=coding'

# Search for Llama models
curl http://localhost:8787/api/v1/models/llama

Integration Examples

Kubernetes Scheduler

apiVersion: v1
kind: Pod
metadata:
  name: model-query
spec:
  containers:
  - name: query
    image: curlimages/curl
    command:
    - sh
    - -c
    - |
      curl -s 'http://llmfit-api:8787/api/v1/models/top?limit=1' | \
      jq -r '.models[0].name'

Python Client

import requests

class LlmfitClient:
    def __init__(self, base_url="http://localhost:8787"):
        self.base_url = base_url
    
    def get_system(self):
        r = requests.get(f"{self.base_url}/api/v1/system")
        return r.json()
    
    def get_top_models(self, limit=5, use_case=None, min_fit="good"):
        params = {"limit": limit, "min_fit": min_fit}
        if use_case:
            params["use_case"] = use_case
        r = requests.get(f"{self.base_url}/api/v1/models/top", params=params)
        return r.json()["models"]
    
    def search_models(self, query):
        r = requests.get(f"{self.base_url}/api/v1/models/{query}")
        return r.json()["models"]

# Usage
client = LlmfitClient()
system = client.get_system()
print(f"Node: {system['node']['name']}")
print(f"VRAM: {system['system']['gpu_vram_gb']} GB")

top = client.get_top_models(limit=3, use_case="coding")
for model in top:
    print(f"{model['name']}: {model['score']} score, {model['estimated_tps']} tok/s")

Bash Script

#!/bin/bash

API="http://localhost:8787"

# Get best model for this node
BEST_MODEL=$(curl -s "$API/api/v1/models/top?limit=1" | jq -r '.models[0].name')
echo "Best model for this node: $BEST_MODEL"

# Check if specific model fits
QUERY="llama-3.3-70b"
FIT=$(curl -s "$API/api/v1/models/$QUERY" | jq -r '.models[0].fit_level')
echo "$QUERY fit level: $FIT"

# Get all perfect fits
PERFECT=$(curl -s "$API/api/v1/models?perfect=true" | jq '.total_models')
echo "Perfect fits available: $PERFECT"

Terraform Provider

data "http" "llmfit_system" {
  url = "http://${var.llmfit_host}:8787/api/v1/system"
}

locals {
  system = jsondecode(data.http.llmfit_system.response_body)
  vram_gb = local.system.system.gpu_vram_gb
}

resource "kubernetes_deployment" "model_server" {
  metadata {
    name = "llm-server"
    labels = {
      node_vram = local.vram_gb
    }
  }
  # ...
}

Security Considerations

Bind to Localhost

For single-node use, bind to localhost:
llmfit serve --host 127.0.0.1

Reverse Proxy

Use nginx or similar for authentication:
location /llmfit/ {
    auth_basic "llmfit API";
    auth_basic_user_file /etc/nginx/.htpasswd;
    proxy_pass http://127.0.0.1:8787/;
}

Firewall Rules

# Allow only from specific subnet
sudo ufw allow from 10.0.0.0/8 to any port 8787

Network Policy (Kubernetes)

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: llmfit-api
spec:
  podSelector:
    matchLabels:
      app: llmfit-api
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: scheduling
    ports:
    - port: 8787
Related Commands

  • system - Show system specs
  • fit - Find compatible models
  • recommend - Get recommendations
  • info - Model information
