GET /api/v1/models
GET /api/v1/models/top
GET /api/v1/models/{name}
Pagination
limit

Maximum number of models to return.

Alias: `n` (shorthand)

Examples:
- `limit=10` - Return the top 10 models
- `n=5` - Same as `limit=5`
- No limit specified - `/models` returns all matching models; `/models/top` returns 5
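As a sketch of how these parameters go on the wire (the base URL here is a placeholder, not part of this API's specification), the query string can be built with standard tools:

```python
from urllib.parse import urlencode

# Hypothetical base URL; adjust for your deployment.
BASE = "http://localhost:8080/api/v1/models"

def models_url(**params):
    """Build a /api/v1/models request URL from query parameters."""
    return f"{BASE}?{urlencode(params)}" if params else BASE

print(models_url(limit=10))  # top 10 models
print(models_url(n=5))       # shorthand alias, same as limit=5
print(models_url())          # no limit: all matching models
```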
Fit Filtering
perfect

When `true`, return only models with a perfect fit level (100% GPU, optimal quantization). Overrides `min_fit` when set.

Valid values: `true`, `false`

min_fit

Minimum fit level to include. Excludes models below this threshold.

Valid values:
- `perfect` - Only models that fit entirely in VRAM with optimal quantization
- `good` - Models that fit comfortably with reasonable performance
- `marginal` - Models that barely fit or require heavy quantization
- `too_tight` - Unrunnable models (usually filtered out)

Fit levels are ordered: `perfect` > `good` > `marginal` > `too_tight`

include_too_tight

Include models with the `too_tight` fit level (unrunnable due to insufficient memory).

Use cases:
- Set to `false` for production scheduling (exclude unrunnable models)
- Set to `true` for exploratory analysis (see what's close to runnable)
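The filter semantics above can be mirrored client-side. This is an illustrative sketch of one reading of the documented rules (in particular, that `include_too_tight` alone governs `too_tight` models), not the server's implementation:

```python
# Fit levels from best to worst, as documented.
FIT_ORDER = ["perfect", "good", "marginal", "too_tight"]
RANK = {level: i for i, level in enumerate(FIT_ORDER)}

def passes(fit_level, min_fit="marginal", include_too_tight=True, perfect=False):
    """Decide whether one model passes the documented fit filters."""
    if perfect:                       # perfect=true overrides min_fit
        return fit_level == "perfect"
    if fit_level == "too_tight":      # governed solely by include_too_tight
        return include_too_tight
    return RANK[fit_level] <= RANK[min_fit]

assert passes("good")                                   # within marginal threshold
assert not passes("too_tight", include_too_tight=False) # removed for production
assert not passes("good", perfect=True)                 # perfect-only mode
```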
Runtime Filtering
runtime

Filter by inference runtime/backend.

Valid values:
- `any` - All runtimes (default)
- `mlx` - Apple MLX (Apple Silicon only)
- `llamacpp` - llama.cpp (CUDA, ROCm, Metal, CPU)

Aliases: `llama.cpp` and `llama_cpp` both map to `llamacpp`

Use Case Filtering
use_case

Filter by model use case specialization.

Valid values:
- `general` - General-purpose models
- `coding` - Code generation and completion
- `reasoning` - Complex reasoning and problem-solving
- `chat` - Conversational/instruction-following models
- `multimodal` - Vision + language models
- `embedding` - Text embedding models

Aliases: `code` → `coding`, `reason` → `reasoning`, `vision` → `multimodal`, `embed` → `embedding`
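The alias tables for `runtime` and `use_case` amount to a simple normalization step; a minimal sketch:

```python
# Documented alias tables for runtime and use_case values.
RUNTIME_ALIASES = {"llama.cpp": "llamacpp", "llama_cpp": "llamacpp"}
USE_CASE_ALIASES = {"code": "coding", "reason": "reasoning",
                    "vision": "multimodal", "embed": "embedding"}

def normalize(value, aliases):
    """Map an alias to its canonical value; pass canonical values through."""
    v = value.lower()
    return aliases.get(v, v)

assert normalize("llama.cpp", RUNTIME_ALIASES) == "llamacpp"
assert normalize("vision", USE_CASE_ALIASES) == "multimodal"
assert normalize("coding", USE_CASE_ALIASES) == "coding"
```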
Text Filtering
Filter by model provider (case-insensitive substring match).

Matches against: the model's provider field
search

Free-text search across multiple fields (case-insensitive substring match).

Searches:
- Model name
- Provider name
- Parameter count (e.g., "7B", "70B")
- Use case
- Category label
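The search semantics can be sketched as a substring test over those fields. The field names in this sketch are illustrative; the server's actual schema may differ:

```python
def matches_search(model, query):
    """Case-insensitive substring match across the documented search fields."""
    fields = [model.get("name", ""), model.get("provider", ""),
              model.get("params", ""), model.get("use_case", ""),
              model.get("category", "")]
    q = query.lower()
    return any(q in field.lower() for field in fields)

m = {"name": "ExampleLM-7B", "provider": "Acme", "params": "7B",
     "use_case": "coding", "category": "general"}
assert matches_search(m, "7b")     # matches parameter count
assert matches_search(m, "acme")   # matches provider
assert not matches_search(m, "70B")
```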
Sorting
sort

Sort order for returned models.

Valid values:
- `score` - Overall fit score (default, descending)
- `tps` - Estimated tokens per second (descending)
- `params` - Parameter count (descending)
- `mem` - Memory utilization percentage (ascending)
- `ctx` - Context length (descending)
- `date` - Release date (newest first)
- `use_case` - Use case category (alphabetical)

Aliases:
- `tokens`, `throughput` → `tps`
- `parameters` → `params`
- `memory`, `mem_pct`, `utilization` → `mem`
- `context` → `ctx`
- `release`, `released` → `date`
- `use`, `usecase` → `use_case`
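Taken together, the sort keys, their directions, and the alias table can be sketched as follows (client-side illustration only; the server sorts for you):

```python
# Descending keys per the documentation; 'mem' and 'use_case' sort ascending.
DESCENDING = {"score", "tps", "params", "ctx", "date"}

SORT_ALIASES = {"tokens": "tps", "throughput": "tps", "parameters": "params",
                "memory": "mem", "mem_pct": "mem", "utilization": "mem",
                "context": "ctx", "release": "date", "released": "date",
                "use": "use_case", "usecase": "use_case"}

def sort_models(models, sort="score"):
    """Sort a list of model dicts by a canonical or aliased sort key."""
    key = SORT_ALIASES.get(sort, sort)
    return sorted(models, key=lambda m: m[key], reverse=key in DESCENDING)

models = [{"score": 80, "mem": 95}, {"score": 92, "mem": 60}]
assert sort_models(models)[0]["score"] == 92          # score: best first
assert sort_models(models, "memory")[0]["mem"] == 60  # mem alias: ascending
```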
Context Configuration
max_context

Maximum context length to use for memory estimation (in tokens). Overrides the global `--max-context` flag for this request.

Use cases:
- Query "what if" scenarios with different context requirements
- Find models suitable for specific context window needs
- Per-request memory budget constraints

Changing the context length affects:
- Memory requirements
- Fit levels
- Score calculations
- Quantization recommendations
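To see why context length drives memory estimates, here is a rough, illustrative KV-cache calculation. The actual estimator's formula is not specified in this document, and the model dimensions below are merely typical of a 7B-class transformer:

```python
def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=32, head_dim=128,
                   bytes_per_elem=2):
    """Approximate fp16 KV-cache size: keys + values across every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

# Doubling max_context doubles the KV-cache share of the memory budget,
# which can push a model from one fit level to another.
gib = 1024 ** 3
assert kv_cache_bytes(8192) == 4 * gib   # ~4 GiB at an 8k context
assert kv_cache_bytes(16384) == 8 * gib  # ~8 GiB at a 16k context
```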
Common Combinations
Production Scheduling
Conservative defaults for reliable placement.

Coding Workload

Find top coding models.

High-Throughput Models

Sort by inference speed.

Large Context Requirements

Find models for long-context tasks.

Provider-Specific

All models from a specific provider.

Runtime-Constrained

llama.cpp-only deployment.

Error Handling
Invalid parameter values return HTTP 400 with error details. This applies to, among others, invalid `min_fit`, `runtime`, `use_case`, and `sort` values.
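Since the valid value sets are documented, a client can validate parameters before sending a request that would end in HTTP 400. A minimal sketch (canonical values only; aliases would need to be normalized first):

```python
# Documented valid values for the enumerated parameters.
VALID = {
    "min_fit": {"perfect", "good", "marginal", "too_tight"},
    "runtime": {"any", "mlx", "llamacpp"},
    "use_case": {"general", "coding", "reasoning", "chat",
                 "multimodal", "embedding"},
    "sort": {"score", "tps", "params", "mem", "ctx", "date", "use_case"},
}

def validate(params):
    """Return (param, bad_value) pairs for any out-of-range enumerated value."""
    return [(k, v) for k, v in params.items()
            if k in VALID and v not in VALID[k]]

assert validate({"min_fit": "great"}) == [("min_fit", "great")]
assert validate({"sort": "tps", "runtime": "mlx"}) == []
```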
Parameter Precedence
When multiple filters interact:

- `perfect=true` overrides `min_fit`
- `include_too_tight=false` removes `too_tight` models regardless of `min_fit`
- The search term in the path (`/models/{name}`) is added to the query `search` filter
- `max_context` overrides the global `--max-context` flag
- `limit` is applied after all filtering and sorting
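The common combinations described above can be expressed as query strings. The parameter choices in this sketch are assumptions derived from the documented semantics, and the base URL is a placeholder:

```python
from urllib.parse import urlencode

BASE = "http://localhost:8080/api/v1"  # hypothetical deployment

# One possible encoding of several of the combinations above.
COMBOS = {
    "production_scheduling": ("/models", {"min_fit": "good",
                                          "include_too_tight": "false"}),
    "coding_workload":       ("/models/top", {"use_case": "coding"}),
    "high_throughput":       ("/models", {"sort": "tps"}),
    "large_context":         ("/models", {"max_context": 32768, "sort": "ctx"}),
    "runtime_constrained":   ("/models", {"runtime": "llamacpp"}),
}

def url_for(combo):
    """Build the request URL for a named combination."""
    path, params = COMBOS[combo]
    return f"{BASE}{path}?{urlencode(params)}"

print(url_for("coding_workload"))
# e.g. http://localhost:8080/api/v1/models/top?use_case=coding
```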
Default Values by Endpoint
| Parameter | /api/v1/models | /api/v1/models/top |
|---|---|---|
| limit | unlimited | 5 |
| min_fit | marginal | marginal |
| include_too_tight | true | false |
| sort | score | score |
| perfect | false | false |
| All others | none | none |
