## Starting the server

The model is selected with the `--model` flag at startup. Only one model is kept in memory at a time; requesting a different model evicts the current one.
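For example, a typical launch might look like the following (a sketch: the module path and the `--port` flag are assumptions; adjust to your install):

```shell
# Launch the server with an initial model.
# The module name and --port flag here are assumptions.
python -m mlx_vlm.server --model mlx-community/Qwen2-VL-2B-Instruct-4bit --port 8000
```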
## GET /models

List the MLX models that are already downloaded to the local Hugging Face cache.

Aliases: `GET /v1/models`

### Response

The `object` field is always `"list"`.

### Example
## POST /chat/completions

OpenAI-compatible chat endpoint. Accepts a list of messages with optional image and audio content. Supports both streaming (Server-Sent Events) and non-streaming responses.

Aliases: `POST /v1/chat/completions`
### Request body

- `model`: Hugging Face repository ID or local path to the model (e.g. `"mlx-community/Qwen2-VL-2B-Instruct-4bit"`). The model is loaded on first use and cached until a different model is requested.
- `messages`: Conversation history. Each element is a message object.
- `max_tokens`: Maximum number of tokens to generate.
- `stream`: When `true`, the response is streamed as Server-Sent Events. Each event is a JSON object prefixed with `data: `; the stream ends with `data: [DONE]`.
- `temperature`: Sampling temperature. `0.0` uses greedy decoding.
- `top_p`: Nucleus sampling probability mass.
- Top-k: restrict sampling to the k most likely tokens.
- Minimum-p: minimum probability threshold relative to the most likely token.
- Repetition penalty: penalty for repeated tokens; values above `1.0` discourage repetition.
- Seed: random seed for reproducible generation.
- Adapter path: path to LoRA adapter weights to apply on top of the base model.
- Image resizing: resize images before processing; pass `[size]` for a square or `[height, width]` for a specific shape.
- `enable_thinking`: Enable thinking mode in the chat template (e.g. for Qwen3.5 reasoning models).
- Thinking budget: maximum tokens allowed inside a thinking block; requires `enable_thinking: true`.

### Non-streaming response
### Streaming response (SSE)

Each chunk is sent as `data: <json>\n\n`. The stream ends with `data: [DONE]\n\n`.
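A stream in this framing can be consumed with a small parser like the sketch below, which assumes each event fits on one `data:` line and that chunks use the OpenAI-style `choices[0].delta.content` shape (the synthetic chunks here are illustrative, not real server output):

```python
import json

def iter_sse_chunks(lines):
    """Yield parsed JSON chunks from 'data: <json>' lines, stopping at [DONE]."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank separators between events
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return
        yield json.loads(payload)

# Synthetic stream for illustration only.
stream = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "",
    "data: [DONE]",
]
text = "".join(c["choices"][0]["delta"]["content"] for c in iter_sse_chunks(stream))
print(text)  # Hello
```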
### Examples
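A minimal request payload for an image-plus-text turn, assuming OpenAI-style field names (`messages`, `max_tokens`, `temperature`) and OpenAI-style content parts for images (the exact image-part shape the server accepts is an assumption):

```python
import json

payload = {
    "model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                # OpenAI-style image part; the exact shape the server
                # accepts is an assumption here.
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/cat.png"}},
            ],
        }
    ],
    "max_tokens": 128,
    "temperature": 0.7,
    "stream": False,
}

# Serialize for an HTTP POST to /chat/completions.
body = json.dumps(payload)
print(json.loads(body)["model"])
```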
## POST /responses

OpenAI Responses API-compatible endpoint. Mirrors the structure of the OpenAI Python SDK's `client.responses.create` method.

Aliases: `POST /v1/responses`
### Request body

- `model`: Model identifier (Hugging Face repo ID or local path).
- `input`: Input text string, or a list of message objects in the same format as `/chat/completions`. When a `system`-role message is present, it is used as the model's system instructions.
- Max tokens: maximum number of tokens to generate.
- `stream`: Stream the response using the OpenAI Responses streaming event format.
- `temperature`: Sampling temperature.
- `top_p`: Nucleus sampling probability mass.
### Non-streaming response

### Streaming events

When `stream: true`, the server emits a sequence of Server-Sent Events that mirror the OpenAI Responses streaming protocol:
| Event type | Description |
|---|---|
| `response.created` | Emitted once when the response object is first created. |
| `response.in_progress` | Emitted once when generation starts. |
| `response.output_item.added` | A new output message item has been added. |
| `response.content_part.added` | A new content part (text block) has been opened. |
| `response.output_text.delta` | A text token delta; the `delta` field contains the new text segment. |
| `response.output_text.done` | The full text for the content part is complete. |
| `response.content_part.done` | The content part has closed. |
| `response.output_item.done` | The output message item has completed. |
| `response.completed` | Final event; includes full usage stats. |
### Example
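A sketch of consuming the event sequence above using synthetic events (only the `type` and `delta` fields documented above are used; any other payload fields are omitted):

```python
def collect_text(events):
    """Accumulate output text from Responses streaming events."""
    parts = []
    for ev in events:
        if ev["type"] == "response.output_text.delta":
            parts.append(ev["delta"])
        elif ev["type"] == "response.completed":
            break  # final event; full usage stats would arrive here
    return "".join(parts)

# Synthetic event sequence mirroring the table above.
events = [
    {"type": "response.created"},
    {"type": "response.in_progress"},
    {"type": "response.output_item.added"},
    {"type": "response.content_part.added"},
    {"type": "response.output_text.delta", "delta": "Hi "},
    {"type": "response.output_text.delta", "delta": "there"},
    {"type": "response.output_text.done"},
    {"type": "response.content_part.done"},
    {"type": "response.output_item.done"},
    {"type": "response.completed"},
]
print(collect_text(events))  # Hi there
```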
## GET /health

Check whether the server is running and which model (if any) is currently loaded.

### Response

- Always `"healthy"` when the server is reachable.
- Path or repo ID of the currently loaded model, or `null` if no model is loaded.
- Path to the currently loaded adapter, or `null`.

### Example
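A hypothetical health check against a sample body matching the fields described above (the `status`, `model`, and `adapter` key names are assumptions):

```python
import json

# Sample body; key names are assumptions based on the fields described above.
body = json.loads('{"status": "healthy", "model": null, "adapter": null}')

if body["model"] is None:
    print("server up, no model loaded")
else:
    print(f"server up, serving {body['model']}")
```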
## POST /unload

Unload the currently loaded model from memory and clear the MLX cache. Useful for freeing GPU/unified memory before loading a different model.

### Response

The status value is `"success"` when a model was unloaded and `"no_model_loaded"` otherwise.

### Example
## Server environment variables

Several generation parameters can be set via environment variables. These apply to every request handled by the server:

| Variable | Default | Description |
|---|---|---|
| `PREFILL_STEP_SIZE` | `2048` | Tokens processed per prefill step. Lower values reduce peak memory. |
| `KV_BITS` | `0` (disabled) | Bits for KV cache quantization. Set to `4` or `8` to enable. |
| `KV_GROUP_SIZE` | `64` | Group size for KV cache quantization. |
| `MAX_KV_SIZE` | `0` (disabled) | Maximum KV cache size in tokens. |
| `QUANTIZED_KV_START` | `5000` | Token index at which KV cache quantization starts. |
| `MLX_TRUST_REMOTE_CODE` | `false` | Set to `true` to trust remote code for all loaded models. |
| `PRELOAD_MODEL` | — | Model path or repo ID to load at server startup. |
| `PRELOAD_ADAPTER` | — | Adapter path to load at server startup alongside `PRELOAD_MODEL`. |
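For instance, a memory-lean configuration using the variables above might look like this (the specific values are illustrative, not recommendations):

```shell
# Quantize the KV cache to 8 bits after the first 5000 tokens,
# and use smaller prefill steps to reduce peak memory.
export PREFILL_STEP_SIZE=1024
export KV_BITS=8
export KV_GROUP_SIZE=64
export QUANTIZED_KV_START=5000

# Optionally preload a model at server startup.
export PRELOAD_MODEL="mlx-community/Qwen2-VL-2B-Instruct-4bit"
```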
The server keeps exactly one model loaded at a time. Sending a request with a different `model` value automatically evicts the current model before loading the new one.