
Starting the server

# Default host 0.0.0.0, port 8080
mlx_vlm.server

# Preload a model at startup
mlx_vlm.server --model mlx-community/Qwen2-VL-2B-Instruct-4bit --port 8080

# With a LoRA adapter
mlx_vlm.server --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --adapter-path ./my-adapter

# Trust remote code (required by some models)
mlx_vlm.server --trust-remote-code
The server loads models on the first request if no --model is given at startup. Only one model is kept in memory at a time; requesting a different model evicts the current one.

GET /models

List all MLX models that are already downloaded in the local Hugging Face cache. Aliases: GET /v1/models

Response

{
  "object": "list",
  "data": [
    {
      "id": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
      "object": "model",
      "created": 1718000000
    },
    {
      "id": "mlx-community/Qwen2.5-VL-3B-Instruct-8bit",
      "object": "model",
      "created": 1719000000
    }
  ]
}
object (string)
Always "list".

data (array)
List of model objects, one for each MLX model found in the local Hugging Face cache. Each entry has an id (repo ID), an object field that is always "model", and a created Unix timestamp.

Example

curl "http://localhost:8080/models"
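The endpoint can also be queried from Python with only the standard library. A minimal sketch (the helper names are illustrative, and a server is assumed to be listening on localhost:8080):

```python
import json
import urllib.request


def model_ids(listing: dict) -> list:
    """Extract the model IDs from a /models response body."""
    return [m["id"] for m in listing.get("data", [])]


def list_models(base_url="http://localhost:8080"):
    """Fetch /models from a running server and return the cached model IDs."""
    with urllib.request.urlopen(f"{base_url}/models") as resp:
        return model_ids(json.load(resp))


# Example (requires a running server):
# print(list_models())
```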

POST /chat/completions

OpenAI-compatible chat endpoint. Accepts a list of messages with optional image and audio content. Supports both streaming (Server-Sent Events) and non-streaming responses. Aliases: POST /v1/chat/completions

Request body

model (string, required)
Hugging Face repository ID or local path to the model (e.g. "mlx-community/Qwen2-VL-2B-Instruct-4bit"). The model is loaded on first use and cached until a different model is requested.

messages (array, required)
Conversation history. Each element is a message object with a role ("system", "user", or "assistant") and content, which is either a plain string or an array of content parts (text, image, or audio).
max_tokens (integer, default: 256)
Maximum number of tokens to generate.

stream (boolean, default: false)
When true, the response is streamed as Server-Sent Events. Each event is a JSON object prefixed with data: . The stream ends with data: [DONE].

temperature (float, default: 0.0)
Sampling temperature. 0.0 uses greedy decoding.

top_p (float, default: 1.0)
Nucleus sampling probability mass.

top_k (integer, default: None)
Restrict sampling to the top-k tokens.

min_p (float, default: None)
Minimum probability threshold relative to the most likely token.

repetition_penalty (float, default: None)
Penalty for repeated tokens. Values above 1.0 discourage repetition.

seed (integer, default: 0)
Random seed for reproducible generation.

adapter_path (string, default: None)
Path to LoRA adapter weights to apply on top of the base model.

resize_shape (array, default: None)
Resize images before processing. Pass [size] for a square or [height, width] for a specific shape.

enable_thinking (boolean, default: false)
Enable thinking mode in the chat template (e.g. for Qwen3.5 reasoning models).

thinking_budget (integer, default: None)
Maximum tokens allowed inside a thinking block. Requires enable_thinking: true.

Non-streaming response

{
  "model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
  "choices": [
    {
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": "The image shows two cats sleeping on a red sofa."
      }
    }
  ],
  "usage": {
    "input_tokens": 42,
    "output_tokens": 18,
    "total_tokens": 60,
    "prompt_tps": 310.5,
    "generation_tps": 42.3,
    "peak_memory": 3.84
  }
}

Streaming response (SSE)

Each chunk is sent as data: <json>\n\n. The stream ends with data: [DONE]\n\n.
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1718000000,"model":"mlx-community/Qwen2-VL-2B-Instruct-4bit","choices":[{"index":0,"finish_reason":null,"delta":{"role":"assistant","content":"The"}}],"usage":{"input_tokens":42,"output_tokens":1,...}}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk",...,"choices":[{"index":0,"finish_reason":null,"delta":{"role":"assistant","content":" image"}}],...}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk",...,"choices":[{"index":0,"finish_reason":"stop","delta":{"role":"assistant","content":""}}],...}

data: [DONE]
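Without an SDK, the client has to split the stream on the data: prefix itself. A minimal parser sketch (the helper names are illustrative; it assumes the chunk format shown above):

```python
import json
from typing import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield parsed JSON chunks from 'data: <json>' lines, stopping at [DONE]."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return
        yield json.loads(payload)


def collect_text(lines: Iterable[str]) -> str:
    """Concatenate the content deltas from a chat.completion.chunk stream."""
    parts = []
    for chunk in iter_sse_events(lines):
        for choice in chunk.get("choices", []):
            parts.append(choice.get("delta", {}).get("content") or "")
    return "".join(parts)
```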

Examples

curl -X POST "http://localhost:8080/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    "messages": [
      {
        "role": "user",
        "content": "Hello, how are you?"
      }
    ],
    "max_tokens": 100,
    "stream": true
  }'
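The same request can be assembled in plain Python. A hedged sketch: chat_payload and chat are illustrative helpers, and the image_url content-part shape below follows the OpenAI multimodal convention rather than anything stated on this page:

```python
import json
import urllib.request


def chat_payload(model, text, image_url=None, **params):
    """Build a /chat/completions request body; extra keyword params
    (max_tokens, temperature, ...) are passed through as documented above."""
    if image_url is None:
        content = text
    else:
        # Multimodal content: OpenAI-style parts with an image_url entry
        # (assumed shape, mirroring the OpenAI chat API).
        content = [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": image_url}},
        ]
    return {"model": model,
            "messages": [{"role": "user", "content": content}],
            **params}


def chat(base_url, payload):
    """POST the payload to /chat/completions and return the parsed response."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Example (requires a running server):
# body = chat_payload("mlx-community/Qwen2-VL-2B-Instruct-4bit",
#                     "What is in this image?", "https://example.com/cats.jpg",
#                     max_tokens=100)
# print(chat("http://localhost:8080", body)["choices"][0]["message"]["content"])
```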

POST /responses

OpenAI Responses API-compatible endpoint. Mirrors the structure of the OpenAI Python SDK’s client.responses.create method. Aliases: POST /v1/responses

Request body

model (string, required)
Model identifier (Hugging Face repo ID or local path).

input (string | array, required)
Input text string, or a list of message objects in the same format as /chat/completions. When a system role message is present it is used as the model's system instructions.

max_output_tokens (integer, default: 256)
Maximum number of tokens to generate.

stream (boolean, default: false)
Stream the response using the OpenAI Responses streaming event format.

temperature (float, default: 0.0)
Sampling temperature.

top_p (float, default: 1.0)
Nucleus sampling probability mass.

Non-streaming response

{
  "id": "resp_abc123",
  "object": "response",
  "created_at": 1718000000,
  "status": "completed",
  "instructions": null,
  "max_output_tokens": 256,
  "model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
  "output": [
    {
      "role": "assistant",
      "content": [
        {
          "type": "output_text",
          "text": "The image shows two cats on a red sofa."
        }
      ]
    }
  ],
  "output_text": "The image shows two cats on a red sofa.",
  "temperature": 0.0,
  "top_p": 1.0,
  "truncation": "disabled",
  "usage": {
    "input_tokens": 38,
    "output_tokens": 12,
    "total_tokens": 50
  }
}

Streaming events

When stream: true, the server emits a sequence of Server-Sent Events that mirror the OpenAI Responses streaming protocol:
response.created: Emitted once when the response object is first created.
response.in_progress: Emitted once when generation starts.
response.output_item.added: A new output message item has been added.
response.content_part.added: A new content part (text block) has been opened.
response.output_text.delta: A text token delta. The delta field contains the new text segment.
response.output_text.done: Full text for the content part is complete.
response.content_part.done: The content part has closed.
response.output_item.done: The output message item has completed.
response.completed: Final event; includes full usage stats.
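Consuming the sequence usually comes down to filtering for response.output_text.delta events. An illustrative sketch over already-parsed event dicts (the helper name is an assumption):

```python
def accumulate_output_text(events):
    """Rebuild the final text from a Responses streaming event sequence.

    Assumes each event is a dict with a 'type' key and that
    response.output_text.delta events carry the new text in 'delta'."""
    parts = []
    completed = False
    for event in events:
        kind = event.get("type")
        if kind == "response.output_text.delta":
            parts.append(event.get("delta", ""))
        elif kind == "response.completed":
            completed = True  # final event; usage stats live on this event
    return "".join(parts), completed
```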

Example

curl -X POST "http://localhost:8080/responses" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    "input": [
      {
        "role": "user",
        "content": [
          {"type": "input_text", "text": "What is in this image?"},
          {"type": "input_image", "image_url": "/path/to/image.jpg"}
        ]
      }
    ],
    "max_output_tokens": 100
  }'
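The request body can likewise be built programmatically. An illustrative helper (the function name is an assumption) using the input_text / input_image part types from the example above:

```python
def responses_payload(model, text, image_path=None, **params):
    """Build a /responses request body from a user prompt and optional image,
    using the documented input_text / input_image content-part types."""
    parts = [{"type": "input_text", "text": text}]
    if image_path is not None:
        parts.append({"type": "input_image", "image_url": image_path})
    return {"model": model,
            "input": [{"role": "user", "content": parts}],
            **params}


# Example body, equivalent to the curl request above:
# responses_payload("mlx-community/Qwen2-VL-2B-Instruct-4bit",
#                   "What is in this image?", "/path/to/image.jpg",
#                   max_output_tokens=100)
```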

GET /health

Check whether the server is running and which model (if any) is currently loaded.

Response

{
  "status": "healthy",
  "loaded_model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
  "loaded_adapter": null
}
status (string)
Always "healthy" when the server is reachable.

loaded_model (string | null)
Path or repo ID of the currently loaded model, or null if no model is loaded.

loaded_adapter (string | null)
Path to the currently loaded adapter, or null.

Example

curl "http://localhost:8080/health"
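This endpoint makes a convenient readiness probe, e.g. before sending the first (slow, model-loading) request. A sketch with an injectable fetch callable so the retry logic can be exercised offline (helper names are illustrative):

```python
import json
import time
import urllib.request


def wait_until_healthy(base_url="http://localhost:8080", timeout=30.0, fetch=None):
    """Poll GET /health until the server answers, or raise TimeoutError.

    Returns the parsed health body ('status', 'loaded_model', 'loaded_adapter').
    fetch is injectable for testing; by default it performs a real HTTP GET."""
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url) as resp:
                return json.load(resp)
    deadline = time.monotonic() + timeout
    while True:
        try:
            return fetch(f"{base_url}/health")
        except OSError:  # connection refused, URLError, etc.
            if time.monotonic() >= deadline:
                raise TimeoutError("server did not become healthy in time")
            time.sleep(0.5)
```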

POST /unload

Unload the currently loaded model from memory and clear the MLX cache. Useful for freeing GPU/unified memory before loading a different model.

Response — model was loaded

{
  "status": "success",
  "message": "Model unloaded successfully",
  "unloaded": {
    "model_name": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    "adapter_name": null
  }
}

Response — no model was loaded

{
  "status": "no_model_loaded",
  "message": "No model is currently loaded"
}
status (string)
"success" when a model was unloaded, "no_model_loaded" otherwise.

unloaded (object | null)
The model_name and adapter_name that were just evicted, or null when no model was loaded.

Example

curl -X POST "http://localhost:8080/unload"
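A thin wrapper sketch (names illustrative) that calls the endpoint and summarizes the two documented response shapes:

```python
import json
import urllib.request


def unload_model(base_url="http://localhost:8080"):
    """POST /unload and return the parsed response body."""
    req = urllib.request.Request(f"{base_url}/unload", method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def describe_unload(body):
    """Human-readable summary of an /unload response body."""
    if body.get("status") == "success":
        return f"unloaded {body['unloaded']['model_name']}"
    return "nothing was loaded"


# Example (requires a running server):
# print(describe_unload(unload_model()))
```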

Server environment variables

Several generation parameters can be set via environment variables. These apply to every request handled by the server:
PREFILL_STEP_SIZE (default: 2048): Tokens processed per prefill step. Lower values reduce peak memory.
KV_BITS (default: 0, disabled): Bits for KV cache quantization. Set to 4 or 8 to enable.
KV_GROUP_SIZE (default: 64): Group size for KV cache quantization.
MAX_KV_SIZE (default: 0, disabled): Maximum KV cache size in tokens.
QUANTIZED_KV_START (default: 5000): Token index at which KV cache quantization starts.
MLX_TRUST_REMOTE_CODE (default: false): Set to true to trust remote code for all loaded models.
PRELOAD_MODEL: Model path or repo ID to load at server startup.
PRELOAD_ADAPTER: Adapter path to load at server startup alongside PRELOAD_MODEL.
# Example: start with KV cache quantization enabled
KV_BITS=4 mlx_vlm.server --model mlx-community/Qwen2.5-VL-3B-Instruct-8bit
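A sketch of how these variables resolve to effective values. The names and defaults come from the table above; the parsing helper itself is illustrative, not the server's actual code:

```python
import os


def server_generation_config(env=os.environ):
    """Resolve the documented environment variables against their defaults.

    A value of 0 for KV_BITS or MAX_KV_SIZE means the feature is disabled,
    which this sketch maps to None."""
    return {
        "prefill_step_size": int(env.get("PREFILL_STEP_SIZE", "2048")),
        "kv_bits": int(env.get("KV_BITS", "0")) or None,
        "kv_group_size": int(env.get("KV_GROUP_SIZE", "64")),
        "max_kv_size": int(env.get("MAX_KV_SIZE", "0")) or None,
        "quantized_kv_start": int(env.get("QUANTIZED_KV_START", "5000")),
        "trust_remote_code": env.get("MLX_TRUST_REMOTE_CODE", "false").lower() == "true",
    }
```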
The server keeps exactly one model loaded at a time. Sending a request with a different model value automatically evicts the current model before loading the new one.
