mlx_vlm.server starts a FastAPI server that exposes an OpenAI-compatible chat completions API. Any client that works with the OpenAI API format can send requests to it.

Starting the server

1. Start with default settings

mlx_vlm.server --port 8080
The server starts without loading any model. The first request that includes a model field triggers a download and load automatically.
2. Preload a model at startup

Pass --model to load a specific model before any requests arrive:
mlx_vlm.server --model mlx-community/Qwen2-VL-2B-Instruct-4bit --port 8080
3. Use an adapter

Combine --model with --adapter-path to load LoRA weights alongside the base model:
mlx_vlm.server \
  --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --adapter-path ./my-lora-adapter \
  --port 8080

Server flags

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| --model | string | None | Hugging Face repo ID or local path to preload at startup |
| --adapter-path | string | None | Path to LoRA adapter weights for the preloaded model |
| --host | string | 0.0.0.0 | Host address to bind the server to |
| --port | int | 8080 | Port number to listen on |
| --trust-remote-code | flag | False | Trust remote code when loading models from Hugging Face Hub |
| --prefill-step-size | int | 2048 | Tokens processed per prefill chunk; lower values reduce peak memory |
| --kv-bits | int | 0 (disabled) | Quantize the KV cache to this many bits (e.g. 4 or 8) |
| --kv-group-size | int | 64 | Group size for KV cache quantization |
| --max-kv-size | int | 0 (disabled) | Maximum KV cache size in tokens |
| --quantized-kv-start | int | 5000 | Token index at which KV cache quantization starts |
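The memory-related flags can be combined to bound memory use on long generations. A sketch of such an invocation; the specific values are illustrative, not recommendations:

```shell
# Preload a model with a smaller prefill chunk and a 4-bit quantized
# KV cache. Quantization kicks in once the cache passes 5000 tokens.
mlx_vlm.server \
  --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --port 8080 \
  --prefill-step-size 1024 \
  --kv-bits 4 \
  --quantized-kv-start 5000
```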

Trust remote code via environment variable

As an alternative to the --trust-remote-code flag, set the MLX_TRUST_REMOTE_CODE environment variable:

MLX_TRUST_REMOTE_CODE=true mlx_vlm.server --port 8080

Endpoints

| Method | Path | Description |
| --- | --- | --- |
| GET | /models | List models available in the local Hugging Face cache |
| GET | /v1/models | Same as /models (OpenAI-compatible path) |
| POST | /chat/completions | Chat-style generation with text, image, and audio support |
| POST | /v1/chat/completions | Same as /chat/completions (OpenAI-compatible path) |
| POST | /responses | OpenAI Responses API-compatible endpoint |
| POST | /v1/responses | Same as /responses (OpenAI-compatible path) |
| GET | /health | Check whether the server is running |
| POST | /unload | Unload the current model from memory |
The server caches one model at a time. When a request arrives for a different model, the current model is evicted from memory and the new one is loaded.

Request parameters

body.model (string, required)
The model to use for this request. Accepts a Hugging Face repo ID or a local path. If the model is not already loaded, the server loads it on demand.

body.messages (array, required)
The conversation history in OpenAI message format. Each message has a role (system, user, or assistant) and content. Content may be a plain string or an array of typed objects for multi-modal inputs.

body.max_tokens (integer, default: None)
Maximum number of tokens to generate.

body.temperature (float, default: 0.0)
Sampling temperature. 0 uses greedy decoding.

body.top_p (float, default: 1.0)
Nucleus sampling threshold.

body.top_k (integer, default: 0)
Top-k sampling cutoff. 0 disables top-k filtering.

body.min_p (float, default: 0.0)
Minimum token probability relative to the highest-probability token.

body.repetition_penalty (float, default: None)
Penalty applied to repeated tokens. Values above 1.0 discourage repetition.

body.stream (boolean, default: false)
When true, the server streams tokens using server-sent events (SSE) as they are generated.
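As a sketch of the typed-object content format described under body.messages, here is a request body combining a plain-string message with a multi-modal one. The image_url part type follows the OpenAI Chat Completions convention; treat the exact part field names as an assumption to verify against your mlx_vlm version:

```python
import json

# A conversation mixing plain-string content and typed-object content.
# The "image_url" part shape follows the OpenAI convention (assumption).
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
        ],
    },
]

body = {
    "model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    "messages": messages,
    "max_tokens": 100,
}
print(json.dumps(body, indent=2))
```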

curl examples

List the models available in the local Hugging Face cache:

curl "http://localhost:8080/models"
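A chat completion request against the same server. The payload is written to a file first so it can be inspected, and the curl is guarded so the example degrades gracefully when no server is running:

```shell
# Chat completion with a plain-text message (POST /v1/chat/completions).
cat > /tmp/chat_request.json <<'EOF'
{
  "model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
  "messages": [{"role": "user", "content": "What is 2 + 2?"}],
  "max_tokens": 50
}
EOF
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @/tmp/chat_request.json || echo "request failed (is the server running?)"
```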

OpenAI compatibility

The /v1/chat/completions endpoint follows the OpenAI Chat Completions API format. You can point any OpenAI-compatible client at http://localhost:8080 by setting the base URL:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="mlx-community/Qwen2-VL-2B-Instruct-4bit",
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
    max_tokens=50,
)
print(response.choices[0].message.content)
Streaming is supported via the standard stream=True parameter. The server sends tokens as server-sent events (SSE) using the text/event-stream content type.
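A streamed response body can be consumed by splitting on the "data: " event lines that SSE uses. A minimal sketch of a parser, assuming the OpenAI-style chunk schema (choices[].delta.content) and the "[DONE]" end sentinel:

```python
import json

def parse_sse_chunks(raw: str):
    """Extract the JSON chunks carried by a text/event-stream body.

    Each event line starts with "data: "; the stream ends with the
    sentinel "data: [DONE]". The chunk schema below mirrors the OpenAI
    streaming format (assumption; verify against your server version).
    """
    chunks = []
    for line in raw.splitlines():
        line = line.strip()
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunks.append(json.loads(payload))
    return chunks

# Simulated stream: two token deltas followed by the end sentinel.
sample = (
    'data: {"choices": [{"delta": {"content": "Hel"}}]}\n'
    'data: {"choices": [{"delta": {"content": "lo"}}]}\n'
    'data: [DONE]\n'
)
text = "".join(c["choices"][0]["delta"]["content"] for c in parse_sse_chunks(sample))
print(text)  # Hello
```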
The server loads one model at a time. If you send requests for multiple different models concurrently, each model swap unloads the previous one, which incurs download and initialization time.