
Starting the server

# Default host 0.0.0.0, port 8080
mlx_vlm.server

# Preload a model at startup
mlx_vlm.server --model mlx-community/Qwen2-VL-2B-Instruct-4bit --port 8080

# With a LoRA adapter
mlx_vlm.server --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --adapter-path ./my-adapter

# Trust remote code (required by some models)
mlx_vlm.server --trust-remote-code
The server loads models on the first request if no --model is given at startup. Only one model is kept in memory at a time; requesting a different model evicts the current one.

GET /models

List all MLX models that are already downloaded in the local Hugging Face cache. Aliases: GET /v1/models

Response

{
  "object": "list",
  "data": [
    {
      "id": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
      "object": "model",
      "created": 1718000000
    },
    {
      "id": "mlx-community/Qwen2.5-VL-3B-Instruct-8bit",
      "object": "model",
      "created": 1719000000
    }
  ]
}
object (string)
Always "list".

data (array)
List of model objects, one for each MLX model found in the local Hugging Face cache. Each entry has an id (repo ID), an object field that is always "model", and a created Unix timestamp.

Example

curl "http://localhost:8080/models"
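The endpoint can also be queried from Python with only the standard library. A minimal sketch (the helper names are illustrative, and a server is assumed to be listening on localhost:8080):

```python
import json
import urllib.request


def model_ids(listing: dict) -> list:
    """Extract the model IDs from a /models response body."""
    return [m["id"] for m in listing.get("data", [])]


def list_models(base_url="http://localhost:8080"):
    """Fetch /models from a running server and return the cached model IDs."""
    with urllib.request.urlopen(f"{base_url}/models") as resp:
        return model_ids(json.load(resp))


# Example (requires a running server):
# print(list_models())
```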

POST /chat/completions

OpenAI-compatible chat endpoint. Accepts a list of messages with optional image and audio content. Supports both streaming (Server-Sent Events) and non-streaming responses. Aliases: POST /v1/chat/completions

Request body

model (string, required)
Hugging Face repository ID or local path to the model (e.g. "mlx-community/Qwen2-VL-2B-Instruct-4bit"). The model is loaded on first use and cached until a different model is requested.

messages (array, required)
Conversation history. Each element is a message object with a role ("system", "user", or "assistant") and content, which is either a plain string or an array of content parts (text, image, or audio).
max_tokens (integer, default: 256)
Maximum number of tokens to generate.

stream (boolean, default: false)
When true, the response is streamed as Server-Sent Events. Each event is a JSON object prefixed with data: . The stream ends with data: [DONE].

temperature (float, default: 0.0)
Sampling temperature. 0.0 uses greedy decoding.

top_p (float, default: 1.0)
Nucleus sampling probability mass.

top_k (integer, default: None)
Restrict sampling to the top-k tokens.

min_p (float, default: None)
Minimum probability threshold relative to the most likely token.

repetition_penalty (float, default: None)
Penalty for repeated tokens. Values above 1.0 discourage repetition.

seed (integer, default: 0)
Random seed for reproducible generation.

adapter_path (string, default: None)
Path to LoRA adapter weights to apply on top of the base model.

resize_shape (array, default: None)
Resize images before processing. Pass [size] for a square or [height, width] for a specific shape.

enable_thinking (boolean, default: false)
Enable thinking mode in the chat template (e.g. for Qwen3.5 reasoning models).

thinking_budget (integer, default: None)
Maximum tokens allowed inside a thinking block. Requires enable_thinking: true.

Non-streaming response

{
  "model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
  "choices": [
    {
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": "The image shows two cats sleeping on a red sofa."
      }
    }
  ],
  "usage": {
    "input_tokens": 42,
    "output_tokens": 18,
    "total_tokens": 60,
    "prompt_tps": 310.5,
    "generation_tps": 42.3,
    "peak_memory": 3.84
  }
}

Streaming response (SSE)

Each chunk is sent as data: <json>\n\n. The stream ends with data: [DONE]\n\n.
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1718000000,"model":"mlx-community/Qwen2-VL-2B-Instruct-4bit","choices":[{"index":0,"finish_reason":null,"delta":{"role":"assistant","content":"The"}}],"usage":{"input_tokens":42,"output_tokens":1,...}}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk",...,"choices":[{"index":0,"finish_reason":null,"delta":{"role":"assistant","content":" image"}}],...}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk",...,"choices":[{"index":0,"finish_reason":"stop","delta":{"role":"assistant","content":""}}],...}

data: [DONE]
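Without an SDK, the client has to split the stream on the data: prefix itself. A minimal parser sketch (the helper names are illustrative; it assumes the chunk format shown above):

```python
import json
from typing import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield parsed JSON chunks from 'data: <json>' lines, stopping at [DONE]."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return
        yield json.loads(payload)


def collect_text(lines: Iterable[str]) -> str:
    """Concatenate the content deltas from a chat.completion.chunk stream."""
    parts = []
    for chunk in iter_sse_events(lines):
        for choice in chunk.get("choices", []):
            parts.append(choice.get("delta", {}).get("content") or "")
    return "".join(parts)
```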

Examples

curl -X POST "http://localhost:8080/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    "messages": [
      {
        "role": "user",
        "content": "Hello, how are you?"
      }
    ],
    "max_tokens": 100,
    "stream": true
  }'
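The same request can be assembled in plain Python. A hedged sketch: chat_payload and chat are illustrative helpers, and the image_url content-part shape below follows the OpenAI multimodal convention rather than anything stated on this page:

```python
import json
import urllib.request


def chat_payload(model, text, image_url=None, **params):
    """Build a /chat/completions request body; extra keyword params
    (max_tokens, temperature, ...) are passed through as documented above."""
    if image_url is None:
        content = text
    else:
        # Multimodal content: OpenAI-style parts with an image_url entry
        # (assumed shape, mirroring the OpenAI chat API).
        content = [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": image_url}},
        ]
    return {"model": model,
            "messages": [{"role": "user", "content": content}],
            **params}


def chat(base_url, payload):
    """POST the payload to /chat/completions and return the parsed response."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Example (requires a running server):
# body = chat_payload("mlx-community/Qwen2-VL-2B-Instruct-4bit",
#                     "What is in this image?", "https://example.com/cats.jpg",
#                     max_tokens=100)
# print(chat("http://localhost:8080", body)["choices"][0]["message"]["content"])
```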

POST /responses

OpenAI Responses API-compatible endpoint. Mirrors the structure of the OpenAI Python SDK’s client.responses.create method. Aliases: POST /v1/responses

Request body

model (string, required)
Model identifier (Hugging Face repo ID or local path).

input (string | array, required)
Input text string, or a list of message objects in the same format as /chat/completions. When a system role message is present it is used as the model's system instructions.

max_output_tokens (integer, default: 256)
Maximum number of tokens to generate.

stream (boolean, default: false)
Stream the response using the OpenAI Responses streaming event format.

temperature (float, default: 0.0)
Sampling temperature.

top_p (float, default: 1.0)
Nucleus sampling probability mass.

Non-streaming response

{
  "id": "resp_abc123",
  "object": "response",
  "created_at": 1718000000,
  "status": "completed",
  "instructions": null,
  "max_output_tokens": 256,
  "model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
  "output": [
    {
      "role": "assistant",
      "content": [
        {
          "type": "output_text",
          "text": "The image shows two cats on a red sofa."
        }
      ]
    }
  ],
  "output_text": "The image shows two cats on a red sofa.",
  "temperature": 0.0,
  "top_p": 1.0,
  "truncation": "disabled",
  "usage": {
    "input_tokens": 38,
    "output_tokens": 12,
    "total_tokens": 50
  }
}

Streaming events

When stream: true, the server emits a sequence of Server-Sent Events that mirror the OpenAI Responses streaming protocol:
response.created: Emitted once when the response object is first created.
response.in_progress: Emitted once when generation starts.
response.output_item.added: A new output message item has been added.
response.content_part.added: A new content part (text block) has been opened.
response.output_text.delta: A text token delta. The delta field contains the new text segment.
response.output_text.done: Full text for the content part is complete.
response.content_part.done: The content part has closed.
response.output_item.done: The output message item has completed.
response.completed: Final event; includes full usage stats.
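Consuming the sequence usually comes down to filtering for response.output_text.delta events. An illustrative sketch over already-parsed event dicts (the helper name is an assumption):

```python
def accumulate_output_text(events):
    """Rebuild the final text from a Responses streaming event sequence.

    Assumes each event is a dict with a 'type' key and that
    response.output_text.delta events carry the new text in 'delta'."""
    parts = []
    completed = False
    for event in events:
        kind = event.get("type")
        if kind == "response.output_text.delta":
            parts.append(event.get("delta", ""))
        elif kind == "response.completed":
            completed = True  # final event; usage stats live on this event
    return "".join(parts), completed
```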

Example

curl -X POST "http://localhost:8080/responses" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    "input": [
      {
        "role": "user",
        "content": [
          {"type": "input_text", "text": "What is in this image?"},
          {"type": "input_image", "image_url": "/path/to/image.jpg"}
        ]
      }
    ],
    "max_output_tokens": 100
  }'
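The request body can likewise be built programmatically. An illustrative helper (the function name is an assumption) using the input_text / input_image part types from the example above:

```python
def responses_payload(model, text, image_path=None, **params):
    """Build a /responses request body from a user prompt and optional image,
    using the documented input_text / input_image content-part types."""
    parts = [{"type": "input_text", "text": text}]
    if image_path is not None:
        parts.append({"type": "input_image", "image_url": image_path})
    return {"model": model,
            "input": [{"role": "user", "content": parts}],
            **params}


# Example body, equivalent to the curl request above:
# responses_payload("mlx-community/Qwen2-VL-2B-Instruct-4bit",
#                   "What is in this image?", "/path/to/image.jpg",
#                   max_output_tokens=100)
```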

GET /health

Check whether the server is running and which model (if any) is currently loaded.

Response

{
  "status": "healthy",
  "loaded_model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
  "loaded_adapter": null
}
status (string)
Always "healthy" when the server is reachable.

loaded_model (string | null)
Path or repo ID of the currently loaded model, or null if no model is loaded.

loaded_adapter (string | null)
Path to the currently loaded adapter, or null.

Example

curl "http://localhost:8080/health"
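This endpoint makes a convenient readiness probe, e.g. before sending the first (slow, model-loading) request. A sketch with an injectable fetch callable so the retry logic can be exercised offline (helper names are illustrative):

```python
import json
import time
import urllib.request


def wait_until_healthy(base_url="http://localhost:8080", timeout=30.0, fetch=None):
    """Poll GET /health until the server answers, or raise TimeoutError.

    Returns the parsed health body ('status', 'loaded_model', 'loaded_adapter').
    fetch is injectable for testing; by default it performs a real HTTP GET."""
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url) as resp:
                return json.load(resp)
    deadline = time.monotonic() + timeout
    while True:
        try:
            return fetch(f"{base_url}/health")
        except OSError:  # connection refused, URLError, etc.
            if time.monotonic() >= deadline:
                raise TimeoutError("server did not become healthy in time")
            time.sleep(0.5)
```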

POST /unload

Unload the currently loaded model from memory and clear the MLX cache. Useful for freeing GPU/unified memory before loading a different model.

Response — model was loaded

{
  "status": "success",
  "message": "Model unloaded successfully",
  "unloaded": {
    "model_name": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    "adapter_name": null
  }
}

Response — no model was loaded

{
  "status": "no_model_loaded",
  "message": "No model is currently loaded"
}
status (string)
"success" when a model was unloaded, "no_model_loaded" otherwise.

unloaded (object | null)
The model_name and adapter_name that were just evicted, or null when no model was loaded.

Example

curl -X POST "http://localhost:8080/unload"
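A thin wrapper sketch (names illustrative) that calls the endpoint and summarizes the two documented response shapes:

```python
import json
import urllib.request


def unload_model(base_url="http://localhost:8080"):
    """POST /unload and return the parsed response body."""
    req = urllib.request.Request(f"{base_url}/unload", method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def describe_unload(body):
    """Human-readable summary of an /unload response body."""
    if body.get("status") == "success":
        return f"unloaded {body['unloaded']['model_name']}"
    return "nothing was loaded"


# Example (requires a running server):
# print(describe_unload(unload_model()))
```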

Server environment variables

Several generation parameters can be set via environment variables. These apply to every request handled by the server:
PREFILL_STEP_SIZE (default: 2048): Tokens processed per prefill step. Lower values reduce peak memory.
KV_BITS (default: 0, disabled): Bits for KV cache quantization. Set to 4 or 8 to enable.
KV_GROUP_SIZE (default: 64): Group size for KV cache quantization.
MAX_KV_SIZE (default: 0, disabled): Maximum KV cache size in tokens.
QUANTIZED_KV_START (default: 5000): Token index at which KV cache quantization starts.
MLX_TRUST_REMOTE_CODE (default: false): Set to true to trust remote code for all loaded models.
PRELOAD_MODEL: Model path or repo ID to load at server startup.
PRELOAD_ADAPTER: Adapter path to load at server startup alongside PRELOAD_MODEL.
# Example: start with KV cache quantization enabled
KV_BITS=4 mlx_vlm.server --model mlx-community/Qwen2.5-VL-3B-Instruct-8bit
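A sketch of how these variables resolve to effective values. The names and defaults come from the table above; the parsing helper itself is illustrative, not the server's actual code:

```python
import os


def server_generation_config(env=os.environ):
    """Resolve the documented environment variables against their defaults.

    A value of 0 for KV_BITS or MAX_KV_SIZE means the feature is disabled,
    which this sketch maps to None."""
    return {
        "prefill_step_size": int(env.get("PREFILL_STEP_SIZE", "2048")),
        "kv_bits": int(env.get("KV_BITS", "0")) or None,
        "kv_group_size": int(env.get("KV_GROUP_SIZE", "64")),
        "max_kv_size": int(env.get("MAX_KV_SIZE", "0")) or None,
        "quantized_kv_start": int(env.get("QUANTIZED_KV_START", "5000")),
        "trust_remote_code": env.get("MLX_TRUST_REMOTE_CODE", "false").lower() == "true",
    }
```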
The server keeps exactly one model loaded at a time. Sending a request with a different model value automatically evicts the current model before loading the new one.
