Response Envelope
All model-listing endpoints (/api/v1/models, /api/v1/models/top, /api/v1/models/{name}) return this envelope:
- Node identity information
- Hardware specifications (see System Object below)
- Total number of models matching filters (before limit applied)
- Number of models in the response (after limit applied)
- Echo of active query parameters for audit trails
- Array of model fit objects (see Model Object below)
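The envelope echoes the active query parameters, so a client can reconstruct exactly what it asked for. A minimal request-URL builder is sketched below; the base URL and the `limit` parameter name are assumptions for illustration, not documented values.

```python
from urllib.parse import urlencode

# Hypothetical base URL -- substitute your node's address.
BASE_URL = "http://localhost:8080"

def build_listing_url(path="/api/v1/models", **params):
    """Build a model-listing request URL. The server echoes the active
    query parameters back in the response envelope for audit trails."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE_URL}{path}" + (f"?{query}" if query else "")
```

For example, `build_listing_url("/api/v1/models/top", limit=5)` targets the top-models endpoint with a result cap, while `build_listing_url()` lists everything.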
System Object
Hardware specifications detected on the node.

- Total system RAM in gigabytes (rounded to 2 decimals)
- Currently available system RAM in gigabytes
- Number of logical CPU cores
- CPU model name (e.g., "Intel(R) Core(TM) Ultra 7 165U")
- Whether any GPU was detected
- Total VRAM across all GPUs in gigabytes (null if no GPU)
- Primary GPU name (null if no GPU)
- Number of GPUs detected
- Whether the system uses unified memory (Apple Silicon). When true, VRAM = system RAM.
- Detected inference backend:
  - "CUDA" - NVIDIA GPU
  - "ROCm" - AMD GPU
  - "Metal" - Apple Silicon GPU
  - "CPU (x86)" - x86_64 CPU fallback
  - "CPU (ARM)" - ARM CPU fallback
- Array of individual GPU objects
Example
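No concrete payload is reproduced here, so the sketch below shows a hypothetical System object as a Python dict. The field names are illustrative mappings of the descriptions above, not the API's actual keys.

```python
# Hypothetical System object -- field names are illustrative, not normative.
system = {
    "total_ram_gb": 31.75,      # Total system RAM (rounded to 2 decimals)
    "available_ram_gb": 21.35,  # Currently available system RAM
    "cpu_cores": 14,            # Logical CPU cores
    "cpu_name": "Intel(R) Core(TM) Ultra 7 165U",
    "has_gpu": False,           # Whether any GPU was detected
    "total_vram_gb": None,      # null when no GPU is present
    "gpu_name": None,           # null when no GPU is present
    "gpu_count": 0,             # Number of GPUs detected
    "unified_memory": False,    # True on Apple Silicon: VRAM == system RAM
    "backend": "CPU (x86)",     # CUDA / ROCm / Metal / CPU (x86) / CPU (ARM)
    "gpus": [],                 # Array of individual GPU objects
}
```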
Model Object
Complete fit analysis for a single model on the current node.

Model Identity

- Full model identifier (e.g., "meta-llama/Llama-3.3-70B-Instruct")
- Model provider/creator (e.g., "Meta", "Qwen", "Mistral AI")
- Human-readable parameter count (e.g., "7B", "70B", "8x7B" for MoE)
- Numeric parameter count in billions (rounded to 2 decimals)
- Maximum context window in tokens
- Release date in YYYY-MM-DD format
- Whether the model uses a Mixture of Experts architecture
Use Case Classification

- Primary use case specialization:
  - "General" - General-purpose
  - "Coding" - Code generation
  - "Reasoning" - Complex reasoning
  - "Chat" - Conversational
  - "Multimodal" - Vision + language
  - "Embedding" - Text embeddings
- Use case label (same as use_case, for display purposes)

Fit Analysis
- Model fit classification:
  - "perfect" - Fits entirely in VRAM with optimal quantization
  - "good" - Fits comfortably with reasonable performance
  - "marginal" - Barely fits; may require aggressive quantization
  - "too_tight" - Insufficient memory, unrunnable
- Human-readable fit level (e.g., "Perfect", "Good")
- Execution strategy:
  - "gpu" - Full GPU inference (weights in VRAM)
  - "moe_offload" - MoE with expert offloading
  - "cpu_offload" - GPU with layer offloading to RAM
  - "cpu_only" - CPU-only inference (weights in system RAM)
- Human-readable run mode (e.g., "GPU", "CPU Offload")

Scoring
- Overall fit score (0-100, rounded to 1 decimal); a composite of quality, speed, fit, and context scores.
- Breakdown of score components
Performance Estimates
- Estimated tokens-per-second throughput (rounded to 1 decimal)
- Recommended inference runtime:
  - "mlx" - Apple MLX (Apple Silicon only)
  - "llamacpp" - llama.cpp (CUDA, ROCm, Metal, CPU)
- Human-readable runtime (e.g., "MLX", "llama.cpp")
- Recommended quantization level (e.g., "Q5_K_M", "Q4_K_M", "Q8_0")

Memory Analysis
- Memory required to run this model, in gigabytes (rounded to 2 decimals)
- Memory available on this node, in gigabytes (VRAM for GPU, RAM for CPU)
- Memory utilization percentage (rounded to 1 decimal)
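The utilization figure relates the two memory fields above. The exact server-side formula is not spelled out here; the sketch below assumes the obvious definition, required over available, expressed as a percentage.

```python
def memory_utilization(mem_required_gb: float, mem_available_gb: float) -> float:
    """Assumed formula: required / available * 100, rounded to 1 decimal,
    matching the rounding stated for the memory fields."""
    return round(mem_required_gb / mem_available_gb * 100, 1)
```

So a model needing 8.00 GB on a node with 16.00 GB available would report 50.0.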
Additional Metadata
- Array of warning/info strings (e.g., ["Requires layer offloading"]); empty array if there are no notes.
- Array of GGUF source URLs (empty for most models, populated when available)
Example
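As with the System object, no concrete payload is reproduced here; the dict below is a hypothetical Model object whose field names are illustrative mappings of the descriptions above, not the API's actual keys, and whose values are made up for the example.

```python
# Hypothetical Model object -- field names and values are illustrative.
model = {
    "name": "meta-llama/Llama-3.3-70B-Instruct",
    "provider": "Meta",
    "params_label": "70B",         # Human-readable parameter count
    "params_b": 70.55,             # Numeric parameter count in billions
    "context_window": 131072,      # Maximum context window in tokens
    "release_date": "2024-12-06",  # YYYY-MM-DD
    "is_moe": False,
    "use_case": "General",
    "fit": "too_tight",            # perfect / good / marginal / too_tight
    "fit_label": "Too Tight",
    "run_mode": "cpu_only",        # gpu / moe_offload / cpu_offload / cpu_only
    "run_mode_label": "CPU Only",
    "score": 18.4,                 # 0-100 composite, rounded to 1 decimal
    "tokens_per_sec": 0.8,         # Estimated throughput
    "runtime": "llamacpp",         # mlx / llamacpp
    "best_quant": "Q3_K_M",
    "mem_required_gb": 31.2,
    "mem_available_gb": 21.35,
    "notes": ["Insufficient memory"],
    "gguf_sources": [],            # Empty for most models
}
```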
Fit Levels Explained
| Fit Level | Code | Description |
|---|---|---|
| Perfect | perfect | Model fits entirely in VRAM with optimal quantization (Q5_K_M or higher). Expected to run at full speed with no degradation. |
| Good | good | Model fits comfortably with reasonable quantization (Q4_K_M or Q5_K_M). May use some layer offloading but maintains good performance. |
| Marginal | marginal | Model barely fits with aggressive quantization (Q3_K_M or lower) or significant layer offloading. Performance may be impacted. |
| Too Tight | too_tight | Insufficient memory to run the model. Excluded from /models/top by default. |
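Since too_tight models are excluded from /models/top by default, a client filtering a full /models listing can mirror that behaviour. The sketch below assumes model objects carry their fit code under a `fit` key, which is an illustrative name.

```python
def runnable_models(models, include_too_tight=False):
    """Drop models that cannot run on this node, mirroring the default
    exclusion described for /models/top. `fit` is a hypothetical key."""
    if include_too_tight:
        return list(models)
    return [m for m in models if m.get("fit") != "too_tight"]
```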
Run Modes Explained
| Run Mode | Code | Description |
|---|---|---|
| GPU | gpu | Full GPU inference - all model weights loaded in VRAM. Best performance. |
| MoE Offload | moe_offload | Mixture of Experts with selective expert offloading to RAM. Specialized path for MoE models. |
| CPU Offload | cpu_offload | GPU inference with some layers offloaded to system RAM. Hybrid execution for models that don’t fit entirely in VRAM. |
| CPU Only | cpu_only | CPU-only inference with weights in system RAM. Fallback when no GPU or insufficient VRAM. |
On unified-memory systems (Apple Silicon), cpu_offload is skipped because VRAM and system RAM share the same pool, so there is no separate memory to offload into.
Quantization Levels
Recommended quantization (best_quant field):
| Quant | Bits per Weight | Quality | Use Case |
|---|---|---|---|
| Q8_0 | 8 bits | Highest | When VRAM is abundant |
| Q6_K | 6 bits | Very high | Balanced quality/size |
| Q5_K_M | 5 bits | High | Optimal default |
| Q4_K_M | 4 bits | Good | Memory-constrained |
| Q3_K_M | 3 bits | Acceptable | Tight memory budget |
| Q2_K | 2 bits | Degraded | Last resort |
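The bits-per-weight column allows a back-of-envelope weight-size estimate: parameters (in billions) times bits per weight, divided by 8, gives gigabytes. This is a rough sketch, not the API's exact sizing formula, and it ignores KV cache, activations, and runtime overhead.

```python
# Bits per weight from the table above.
QUANT_BITS = {"Q8_0": 8, "Q6_K": 6, "Q5_K_M": 5, "Q4_K_M": 4, "Q3_K_M": 3, "Q2_K": 2}

def approx_weight_gb(params_billion: float, quant: str) -> float:
    """Back-of-envelope weight size: params * bits / 8, in GB.
    Excludes KV cache, activations, and runtime overhead."""
    bits = QUANT_BITS[quant]
    return round(params_billion * bits / 8, 2)
```

For example, a 7B model at Q4_K_M needs roughly 3.5 GB for weights alone, which is why Q4_K_M suits memory-constrained nodes.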
Score Components
The overall score field is a weighted composite of four components:
- quality: Model capability based on parameter count and architecture
- speed: Inference throughput (tokens per second)
- fit: How well the model fits in available memory (higher = more headroom)
- context: Available context window size
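A weighted composite of those four components can be sketched as below. The actual weights are not documented in this section; equal weights here are purely an assumption for illustration.

```python
# Component weights are undocumented -- equal weights are an assumption.
WEIGHTS = {"quality": 0.25, "speed": 0.25, "fit": 0.25, "context": 0.25}

def composite_score(components: dict) -> float:
    """Weighted sum of quality/speed/fit/context sub-scores,
    rounded to 1 decimal like the documented score field."""
    total = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    return round(total, 1)
```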
Forward Compatibility
Future API versions may add new fields. Clients should:

- Parse only the fields your application depends on
- Ignore unknown fields to support forward compatibility
- Validate that critical fields exist before accessing them
- Handle null values gracefully for optional fields
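The guidelines above translate into defensive parsing like the sketch below: required keys are checked, unknown keys are ignored, and null or absent optional fields get safe defaults. The field names (`name`, `fit`, `notes`, `score`) are hypothetical.

```python
def parse_model(obj: dict) -> dict:
    """Forward-compatible extraction from a model fit object.
    Field names are hypothetical; adapt to the real schema."""
    for key in ("name", "fit"):  # critical fields this client depends on
        if key not in obj:
            raise ValueError(f"missing required field: {key}")
    return {
        "name": obj["name"],
        "fit": obj["fit"],
        "notes": obj.get("notes") or [],  # null or absent -> empty list
        "score": obj.get("score"),        # optional; may be None
    }
    # Any unknown fields in obj are simply ignored.
```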
