GET /api/v1/models
GET /api/v1/models/top
GET /api/v1/models/{name}
Pagination
limit

Maximum number of models to return.

Alias: `n` (shorthand)

Examples:
- `limit=10` - Return the top 10 models
- `n=5` - Same as `limit=5`
- No limit specified - `/models` returns all matching models; `/models/top` returns 5
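As a sketch of how these parameters go on the wire (the base URL here is a placeholder, not part of this API's specification), the query string can be built with standard tools:

```python
from urllib.parse import urlencode

# Hypothetical base URL; adjust for your deployment.
BASE = "http://localhost:8080/api/v1/models"

def models_url(**params):
    """Build a /api/v1/models request URL from query parameters."""
    return f"{BASE}?{urlencode(params)}" if params else BASE

print(models_url(limit=10))  # top 10 models
print(models_url(n=5))       # shorthand alias, same as limit=5
print(models_url())          # no limit: all matching models
```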
Fit Filtering
perfect

When `true`, return only models with a perfect fit level (100% GPU, optimal quantization). Overrides `min_fit` when set.

Valid values: `true`, `false`

min_fit

Minimum fit level to include. Excludes models below this threshold.

Valid values:
- `perfect` - Only models that fit entirely in VRAM with optimal quantization
- `good` - Models that fit comfortably with reasonable performance
- `marginal` - Models that barely fit or require heavy quantization
- `too_tight` - Unrunnable models (usually filtered out)

Fit levels are ordered: `perfect` > `good` > `marginal` > `too_tight`

include_too_tight

Include models with the `too_tight` fit level (unrunnable due to insufficient memory).

Use cases:
- Set to `false` for production scheduling (exclude unrunnable models)
- Set to `true` for exploratory analysis (see what's close to runnable)
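The filter semantics above can be mirrored client-side. This is an illustrative sketch of one reading of the documented rules (in particular, that `include_too_tight` alone governs `too_tight` models), not the server's implementation:

```python
# Fit levels from best to worst, as documented.
FIT_ORDER = ["perfect", "good", "marginal", "too_tight"]
RANK = {level: i for i, level in enumerate(FIT_ORDER)}

def passes(fit_level, min_fit="marginal", include_too_tight=True, perfect=False):
    """Decide whether one model passes the documented fit filters."""
    if perfect:                       # perfect=true overrides min_fit
        return fit_level == "perfect"
    if fit_level == "too_tight":      # governed solely by include_too_tight
        return include_too_tight
    return RANK[fit_level] <= RANK[min_fit]

assert passes("good")                                   # within marginal threshold
assert not passes("too_tight", include_too_tight=False) # removed for production
assert not passes("good", perfect=True)                 # perfect-only mode
```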
Runtime Filtering
runtime

Filter by inference runtime/backend.

Valid values:
- `any` - All runtimes (default)
- `mlx` - Apple MLX (Apple Silicon only)
- `llamacpp` - llama.cpp (CUDA, ROCm, Metal, CPU)

Aliases: `llama.cpp` and `llama_cpp` both map to `llamacpp`

Use Case Filtering
use_case

Filter by model use case specialization.

Valid values:
- `general` - General-purpose models
- `coding` - Code generation and completion
- `reasoning` - Complex reasoning and problem-solving
- `chat` - Conversational/instruction-following models
- `multimodal` - Vision + language models
- `embedding` - Text embedding models

Aliases: `code` → `coding`, `reason` → `reasoning`, `vision` → `multimodal`, `embed` → `embedding`
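The alias tables for `runtime` and `use_case` amount to a simple normalization step; a minimal sketch:

```python
# Documented alias tables for runtime and use_case values.
RUNTIME_ALIASES = {"llama.cpp": "llamacpp", "llama_cpp": "llamacpp"}
USE_CASE_ALIASES = {"code": "coding", "reason": "reasoning",
                    "vision": "multimodal", "embed": "embedding"}

def normalize(value, aliases):
    """Map an alias to its canonical value; pass canonical values through."""
    v = value.lower()
    return aliases.get(v, v)

assert normalize("llama.cpp", RUNTIME_ALIASES) == "llamacpp"
assert normalize("vision", USE_CASE_ALIASES) == "multimodal"
assert normalize("coding", USE_CASE_ALIASES) == "coding"
```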
Text Filtering
Filter by model provider (case-insensitive substring match).

Matches against: the model's provider field
search

Free-text search across multiple fields (case-insensitive substring match).

Searches:
- Model name
- Provider name
- Parameter count (e.g., "7B", "70B")
- Use case
- Category label
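The search semantics can be sketched as a substring test over those fields. The field names in this sketch are illustrative; the server's actual schema may differ:

```python
def matches_search(model, query):
    """Case-insensitive substring match across the documented search fields."""
    fields = [model.get("name", ""), model.get("provider", ""),
              model.get("params", ""), model.get("use_case", ""),
              model.get("category", "")]
    q = query.lower()
    return any(q in field.lower() for field in fields)

m = {"name": "ExampleLM-7B", "provider": "Acme", "params": "7B",
     "use_case": "coding", "category": "general"}
assert matches_search(m, "7b")     # matches parameter count
assert matches_search(m, "acme")   # matches provider
assert not matches_search(m, "70B")
```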
Sorting
sort

Sort order for returned models.

Valid values:
- `score` - Overall fit score (default, descending)
- `tps` - Estimated tokens per second (descending)
- `params` - Parameter count (descending)
- `mem` - Memory utilization percentage (ascending)
- `ctx` - Context length (descending)
- `date` - Release date (newest first)
- `use_case` - Use case category (alphabetical)

Aliases:
- `tokens`, `throughput` → `tps`
- `parameters` → `params`
- `memory`, `mem_pct`, `utilization` → `mem`
- `context` → `ctx`
- `release`, `released` → `date`
- `use`, `usecase` → `use_case`
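Taken together, the sort keys, their directions, and the alias table can be sketched as follows (client-side illustration only; the server sorts for you):

```python
# Descending keys per the documentation; 'mem' and 'use_case' sort ascending.
DESCENDING = {"score", "tps", "params", "ctx", "date"}

SORT_ALIASES = {"tokens": "tps", "throughput": "tps", "parameters": "params",
                "memory": "mem", "mem_pct": "mem", "utilization": "mem",
                "context": "ctx", "release": "date", "released": "date",
                "use": "use_case", "usecase": "use_case"}

def sort_models(models, sort="score"):
    """Sort a list of model dicts by a canonical or aliased sort key."""
    key = SORT_ALIASES.get(sort, sort)
    return sorted(models, key=lambda m: m[key], reverse=key in DESCENDING)

models = [{"score": 80, "mem": 95}, {"score": 92, "mem": 60}]
assert sort_models(models)[0]["score"] == 92          # score: best first
assert sort_models(models, "memory")[0]["mem"] == 60  # mem alias: ascending
```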
Context Configuration
max_context

Maximum context length to use for memory estimation (in tokens). Overrides the global `--max-context` flag for this request.

Use cases:
- Query "what if" scenarios with different context requirements
- Find models suitable for specific context window needs
- Per-request memory budget constraints

Changing the context length affects:
- Memory requirements
- Fit levels
- Score calculations
- Quantization recommendations
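To see why context length drives memory estimates, here is a rough, illustrative KV-cache calculation. The actual estimator's formula is not specified in this document, and the model dimensions below are merely typical of a 7B-class transformer:

```python
def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=32, head_dim=128,
                   bytes_per_elem=2):
    """Approximate fp16 KV-cache size: keys + values across every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

# Doubling max_context doubles the KV-cache share of the memory budget,
# which can push a model from one fit level to another.
gib = 1024 ** 3
assert kv_cache_bytes(8192) == 4 * gib   # ~4 GiB at an 8k context
assert kv_cache_bytes(16384) == 8 * gib  # ~8 GiB at a 16k context
```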
Common Combinations
Production Scheduling
Conservative defaults for reliable placement.

Coding Workload

Find top coding models.

High-Throughput Models

Sort by inference speed.

Large Context Requirements

Find models for long-context tasks.

Provider-Specific

All models from a specific provider.

Runtime-Constrained

llama.cpp-only deployment.

Error Handling
Invalid parameter values return HTTP 400 with error details. This applies to, among others, invalid `min_fit`, `runtime`, `use_case`, and `sort` values.
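Since the valid value sets are documented, a client can validate parameters before sending a request that would end in HTTP 400. A minimal sketch (canonical values only; aliases would need to be normalized first):

```python
# Documented valid values for the enumerated parameters.
VALID = {
    "min_fit": {"perfect", "good", "marginal", "too_tight"},
    "runtime": {"any", "mlx", "llamacpp"},
    "use_case": {"general", "coding", "reasoning", "chat",
                 "multimodal", "embedding"},
    "sort": {"score", "tps", "params", "mem", "ctx", "date", "use_case"},
}

def validate(params):
    """Return (param, bad_value) pairs for any out-of-range enumerated value."""
    return [(k, v) for k, v in params.items()
            if k in VALID and v not in VALID[k]]

assert validate({"min_fit": "great"}) == [("min_fit", "great")]
assert validate({"sort": "tps", "runtime": "mlx"}) == []
```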
Parameter Precedence
When multiple filters interact:

- `perfect=true` overrides `min_fit`
- `include_too_tight=false` removes `too_tight` models regardless of `min_fit`
- The search term in the path (`/models/{name}`) is added to the query `search` filter
- `max_context` overrides the global `--max-context` flag
- `limit` is applied after all filtering and sorting
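The common combinations described above can be expressed as query strings. The parameter choices in this sketch are assumptions derived from the documented semantics, and the base URL is a placeholder:

```python
from urllib.parse import urlencode

BASE = "http://localhost:8080/api/v1"  # hypothetical deployment

# One possible encoding of several of the combinations above.
COMBOS = {
    "production_scheduling": ("/models", {"min_fit": "good",
                                          "include_too_tight": "false"}),
    "coding_workload":       ("/models/top", {"use_case": "coding"}),
    "high_throughput":       ("/models", {"sort": "tps"}),
    "large_context":         ("/models", {"max_context": 32768, "sort": "ctx"}),
    "runtime_constrained":   ("/models", {"runtime": "llamacpp"}),
}

def url_for(combo):
    """Build the request URL for a named combination."""
    path, params = COMBOS[combo]
    return f"{BASE}{path}?{urlencode(params)}"

print(url_for("coding_workload"))
# e.g. http://localhost:8080/api/v1/models/top?use_case=coding
```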
Default Values by Endpoint
| Parameter | /api/v1/models | /api/v1/models/top |
|---|---|---|
| limit | unlimited | 5 |
| min_fit | marginal | marginal |
| include_too_tight | true | false |
| sort | score | score |
| perfect | false | false |
| All others | none | none |
