Response Envelope

All model-listing endpoints (/api/v1/models, /api/v1/models/top, /api/v1/models/{name}) return this envelope:
node
object
required
Node identity information
system
object
required
Hardware specifications (see System Object below)
total_models
integer
required
Total number of models matching filters (before limit applied)
returned_models
integer
required
Number of models in the response (after limit applied)
filters
object
required
Echo of active query parameters for audit trails
models
array
required
Array of model fit objects (see Model Object below)
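
A minimal sketch of unpacking this envelope in Python. The field names come from the table above; the sample payload and the `summarize_envelope` helper are illustrative, not part of the API:

```python
def summarize_envelope(envelope: dict) -> str:
    """Summarize a model-listing response envelope."""
    total = envelope["total_models"]        # matches before limit applied
    returned = envelope["returned_models"]  # matches after limit applied
    backend = envelope["system"]["backend"]
    return f"{returned}/{total} models on {backend}"

# Illustrative payload containing only the fields used above
sample = {
    "node": {},
    "system": {"backend": "CUDA"},
    "total_models": 42,
    "returned_models": 5,
    "filters": {},
    "models": [],
}
print(summarize_envelope(sample))  # 5/42 models on CUDA
```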

System Object

Hardware specifications detected on the node.
total_ram_gb
number
required
Total system RAM in gigabytes (rounded to 2 decimals)
available_ram_gb
number
required
Currently available system RAM in gigabytes
cpu_cores
integer
required
Number of logical CPU cores
cpu_name
string
required
CPU model name (e.g., "Intel(R) Core(TM) Ultra 7 165U")
has_gpu
boolean
required
Whether any GPU was detected
gpu_vram_gb
number | null
required
Total VRAM across all GPUs in gigabytes (null if no GPU)
gpu_name
string | null
required
Primary GPU name (null if no GPU)
gpu_count
integer
required
Number of GPUs detected
unified_memory
boolean
required
Whether system uses unified memory (Apple Silicon). When true, VRAM = system RAM.
backend
string
required
Detected inference backend:
  • "CUDA" - NVIDIA GPU
  • "ROCm" - AMD GPU
  • "Metal" - Apple Silicon GPU
  • "CPU (x86)" - x86_64 CPU fallback
  • "CPU (ARM)" - ARM CPU fallback
gpus
array
required
Array of individual GPU objects

Example

{
  "total_ram_gb": 62.23,
  "available_ram_gb": 41.08,
  "cpu_cores": 14,
  "cpu_name": "Intel(R) Core(TM) Ultra 7 165U",
  "has_gpu": true,
  "gpu_vram_gb": 24.0,
  "gpu_name": "NVIDIA RTX 4090",
  "gpu_count": 1,
  "unified_memory": false,
  "backend": "CUDA",
  "gpus": [
    {
      "name": "NVIDIA RTX 4090",
      "vram_gb": 24.0,
      "backend": "CUDA",
      "count": 1,
      "unified_memory": false
    }
  ]
}
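
The unified_memory flag matters when budgeting memory: on Apple Silicon, CPU and GPU share one pool, so VRAM equals system RAM. A hedged helper encoding that rule (field names from the System Object above; the fallback ordering is an assumption for illustration):

```python
def effective_vram_gb(system: dict) -> float:
    """Return the memory pool available for model weights, in GB.

    On unified-memory systems (Apple Silicon) VRAM equals system RAM,
    so the shared pool is the system RAM figure. Otherwise use the
    discrete GPU's VRAM, falling back to system RAM for CPU-only nodes.
    """
    if system["unified_memory"]:
        return system["total_ram_gb"]
    if system["has_gpu"] and system["gpu_vram_gb"] is not None:
        return system["gpu_vram_gb"]
    return system["total_ram_gb"]

discrete = {"unified_memory": False, "has_gpu": True,
            "gpu_vram_gb": 24.0, "total_ram_gb": 62.23}
print(effective_vram_gb(discrete))  # 24.0
```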

Model Object

Complete fit analysis for a single model on the current node.

Model Identity

name
string
required
Full model identifier (e.g., "meta-llama/Llama-3.3-70B-Instruct")
provider
string
required
Model provider/creator (e.g., "Meta", "Qwen", "Mistral AI")
parameter_count
string
required
Human-readable parameter count (e.g., "7B", "70B", "8x7B" for MoE)
params_b
number
required
Numeric parameter count in billions (rounded to 2 decimals)
context_length
integer
required
Maximum context window in tokens
release_date
string
required
Release date in YYYY-MM-DD format
is_moe
boolean
required
Whether model uses Mixture of Experts architecture

Use Case Classification

use_case
string
required
Primary use case specialization:
  • "General" - General-purpose
  • "Coding" - Code generation
  • "Reasoning" - Complex reasoning
  • "Chat" - Conversational
  • "Multimodal" - Vision + language
  • "Embedding" - Text embeddings
category
string
required
Use case label (same as use_case, for display purposes)
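
Clients often bucket models by specialization for display. A small sketch grouping by the use_case values listed above (the model entries here are trimmed to the two fields used):

```python
from collections import defaultdict

def group_by_use_case(models: list[dict]) -> dict[str, list[str]]:
    """Map each use_case value to the names of models carrying it."""
    groups: dict[str, list[str]] = defaultdict(list)
    for model in models:
        groups[model["use_case"]].append(model["name"])
    return dict(groups)

models = [
    {"name": "Qwen/Qwen2.5-Coder-7B-Instruct", "use_case": "Coding"},
    {"name": "meta-llama/Llama-3.3-70B-Instruct", "use_case": "General"},
]
print(group_by_use_case(models))
```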

Fit Analysis

fit_level
string
required
Model fit classification:
  • "perfect" - Fits entirely in VRAM with optimal quantization
  • "good" - Fits comfortably with reasonable performance
  • "marginal" - Barely fits, may require aggressive quantization
  • "too_tight" - Insufficient memory, unrunnable
fit_label
string
required
Human-readable fit level (e.g., "Perfect", "Good")
run_mode
string
required
Execution strategy:
  • "gpu" - Full GPU inference (weights in VRAM)
  • "moe_offload" - MoE with expert offloading
  • "cpu_offload" - GPU with layer offloading to RAM
  • "cpu_only" - CPU-only inference (weights in system RAM)
run_mode_label
string
required
Human-readable run mode (e.g., "GPU", "CPU Offload")

Scoring

score
number
required
Overall fit score (0-100, rounded to 1 decimal). Composite of quality, speed, fit, and context scores.
score_components
object
required
Breakdown of score components

Performance Estimates

estimated_tps
number
required
Estimated tokens per second throughput (rounded to 1 decimal)
runtime
string
required
Recommended inference runtime:
  • "mlx" - Apple MLX (Apple Silicon only)
  • "llamacpp" - llama.cpp (CUDA, ROCm, Metal, CPU)
runtime_label
string
required
Human-readable runtime (e.g., "MLX", "llama.cpp")
best_quant
string
required
Recommended quantization level (e.g., "Q5_K_M", "Q4_K_M", "Q8_0")
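
The runtime values above imply a selection constraint: MLX is Apple Silicon-only, while llama.cpp covers every backend. This sketch encodes only that stated constraint; the API's actual selection logic is not documented here and may weigh other factors:

```python
def pick_runtime(backend: str) -> str:
    """Illustrative runtime choice from the detected backend.

    "mlx" is only eligible on Apple Silicon (the "Metal" backend);
    "llamacpp" supports CUDA, ROCm, Metal, and CPU.
    """
    return "mlx" if backend == "Metal" else "llamacpp"

print(pick_runtime("Metal"))      # mlx
print(pick_runtime("CUDA"))       # llamacpp
print(pick_runtime("CPU (x86)"))  # llamacpp
```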

Memory Analysis

memory_required_gb
number
required
Memory required to run this model in gigabytes (rounded to 2 decimals)
memory_available_gb
number
required
Memory available on this node in gigabytes (VRAM for GPU, RAM for CPU)
utilization_pct
number
required
Memory utilization percentage (rounded to 1 decimal):
(memory_required_gb / memory_available_gb) * 100
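
The utilization figure can be reproduced directly from the two memory fields. Using the values from the example model object below (5.8 GB required, 12.0 GB available):

```python
def utilization_pct(memory_required_gb: float, memory_available_gb: float) -> float:
    """Memory utilization percentage, rounded to 1 decimal like the API."""
    return round(memory_required_gb / memory_available_gb * 100, 1)

print(utilization_pct(5.8, 12.0))  # 48.3
```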

Additional Metadata

notes
array
required
Array of warning/info strings (e.g., ["Requires layer offloading"]). Empty array if no notes.
gguf_sources
array
required
Array of GGUF source URLs (empty for most models, populated when available)

Example

{
  "name": "Qwen/Qwen2.5-Coder-7B-Instruct",
  "provider": "Qwen",
  "parameter_count": "7B",
  "params_b": 7.0,
  "context_length": 32768,
  "use_case": "Coding",
  "category": "Coding",
  "release_date": "2024-11-12",
  "is_moe": false,
  "fit_level": "good",
  "fit_label": "Good",
  "run_mode": "gpu",
  "run_mode_label": "GPU",
  "score": 86.5,
  "score_components": {
    "quality": 87.0,
    "speed": 81.2,
    "fit": 90.1,
    "context": 88.0
  },
  "estimated_tps": 42.5,
  "runtime": "llamacpp",
  "runtime_label": "llama.cpp",
  "best_quant": "Q5_K_M",
  "memory_required_gb": 5.8,
  "memory_available_gb": 12.0,
  "utilization_pct": 48.3,
  "notes": [],
  "gguf_sources": [
    "https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct-GGUF"
  ]
}

Fit Levels Explained

Fit Level | Code | Description
Perfect | perfect | Model fits entirely in VRAM with optimal quantization (Q5_K_M or higher). Expected to run at full speed with no degradation.
Good | good | Model fits comfortably with reasonable quantization (Q4_K_M or Q5_K_M). May use some layer offloading but maintains good performance.
Marginal | marginal | Model barely fits with aggressive quantization (Q3_K_M or lower) or significant layer offloading. Performance may be impacted.
Too Tight | too_tight | Insufficient memory to run the model. Excluded from /models/top by default.

Run Modes Explained

Run Mode | Code | Description
GPU | gpu | Full GPU inference: all model weights loaded in VRAM. Best performance.
MoE Offload | moe_offload | Mixture of Experts with selective expert offloading to RAM. Specialized path for MoE models.
CPU Offload | cpu_offload | GPU inference with some layers offloaded to system RAM. Hybrid execution for models that don’t fit entirely in VRAM.
CPU Only | cpu_only | CPU-only inference with weights in system RAM. Fallback when no GPU or insufficient VRAM.
Note: On unified memory systems (Apple Silicon), cpu_offload is skipped because VRAM and system RAM share the same pool.

Quantization Levels

Recommended quantization (best_quant field):
Quant | Bits per Weight | Quality | Use Case
Q8_0 | 8 bits | Highest | When VRAM is abundant
Q6_K | 6 bits | Very high | Balanced quality/size
Q5_K_M | 5 bits | High | Optimal default
Q4_K_M | 4 bits | Good | Memory-constrained
Q3_K_M | 3 bits | Acceptable | Tight memory budget
Q2_K | 2 bits | Degraded | Last resort
The API automatically selects the best quantization level that fits available memory while maximizing quality.

Score Components

The overall score field is a weighted composite:
score = (quality * 0.3) + (speed * 0.3) + (fit * 0.25) + (context * 0.15)
Component breakdown:
  • quality: Model capability based on parameter count and architecture
  • speed: Inference throughput (tokens per second)
  • fit: How well the model fits in available memory (higher = more headroom)
  • context: Available context window size
All components and the final score range from 0-100.
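
The weighted composite can be checked directly. The component values below are illustrative, not taken from any real response; note the API may apply additional rounding or adjustments beyond this raw weighting:

```python
# Weights from the score formula above
WEIGHTS = {"quality": 0.3, "speed": 0.3, "fit": 0.25, "context": 0.15}

def composite_score(components: dict[str, float]) -> float:
    """Weighted composite of the four component scores, rounded to 1 decimal."""
    raw = sum(components[name] * weight for name, weight in WEIGHTS.items())
    return round(raw, 1)

print(composite_score({"quality": 90.0, "speed": 80.0,
                       "fit": 85.0, "context": 75.0}))  # 83.5
```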

Forward Compatibility

Future API versions may add new fields. Clients should:
  1. Parse only required fields your application depends on
  2. Ignore unknown fields to support forward compatibility
  3. Validate critical fields exist before accessing
  4. Handle null values gracefully for optional fields
Example robust parsing:
for model in response["models"]:
    # Validate critical fields exist before accessing them
    if "name" not in model or "fit_level" not in model:
        continue  # skip malformed entries

    # Skip models classified as unrunnable
    if model["fit_level"] not in ("perfect", "good", "marginal"):
        continue

    # Safe to use model["name"] and model["fit_level"]; read optional or
    # nullable fields with model.get(...) so missing values are handled gracefully
