mlx_vlm.server starts a FastAPI server that exposes an OpenAI-compatible chat completions API. Any client that works with the OpenAI API format can send requests to it.

Starting the server

1. Start with default settings

mlx_vlm.server --port 8080
The server starts without loading any model. The first request that includes a model field triggers a download and load automatically.
2. Preload a model at startup

Pass --model to load a specific model before any requests arrive:
mlx_vlm.server --model mlx-community/Qwen2-VL-2B-Instruct-4bit --port 8080
3. Use an adapter

Combine --model with --adapter-path to load LoRA weights alongside the base model:
mlx_vlm.server \
  --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --adapter-path ./my-lora-adapter \
  --port 8080

Server flags

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| --model | string | None | Hugging Face repo ID or local path to preload at startup |
| --adapter-path | string | None | Path to LoRA adapter weights for the preloaded model |
| --host | string | 0.0.0.0 | Host address to bind the server to |
| --port | int | 8080 | Port number to listen on |
| --trust-remote-code | flag | False | Trust remote code when loading models from Hugging Face Hub |
| --prefill-step-size | int | 2048 | Tokens processed per prefill chunk; lower values reduce peak memory |
| --kv-bits | int | 0 (disabled) | Quantize the KV cache to this many bits (e.g. 4 or 8) |
| --kv-group-size | int | 64 | Group size for KV cache quantization |
| --max-kv-size | int | 0 (disabled) | Maximum KV cache size in tokens |
| --quantized-kv-start | int | 5000 | Token index at which KV cache quantization starts |
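The memory-related flags can be combined to bound memory use on long generations. A sketch of such an invocation; the specific values are illustrative, not recommendations:

```shell
# Preload a model with a smaller prefill chunk and a 4-bit quantized
# KV cache. Quantization kicks in once the cache passes 5000 tokens.
mlx_vlm.server \
  --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --port 8080 \
  --prefill-step-size 1024 \
  --kv-bits 4 \
  --quantized-kv-start 5000
```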

Trust remote code via environment variable

As an alternative to the --trust-remote-code flag, set the MLX_TRUST_REMOTE_CODE environment variable:

MLX_TRUST_REMOTE_CODE=true mlx_vlm.server --port 8080

Endpoints

| Method | Path | Description |
| --- | --- | --- |
| GET | /models | List models available in the local Hugging Face cache |
| GET | /v1/models | Same as /models (OpenAI-compatible path) |
| POST | /chat/completions | Chat-style generation with text, image, and audio support |
| POST | /v1/chat/completions | Same as /chat/completions (OpenAI-compatible path) |
| POST | /responses | OpenAI Responses API-compatible endpoint |
| POST | /v1/responses | Same as /responses (OpenAI-compatible path) |
| GET | /health | Check whether the server is running |
| POST | /unload | Unload the current model from memory |
The server caches one model at a time. When a request arrives for a different model, the current model is evicted from memory and the new one is loaded.

Request parameters

body.model (string, required)
The model to use for this request. Accepts a Hugging Face repo ID or a local path. If the model is not already loaded, the server loads it on demand.

body.messages (array, required)
The conversation history in OpenAI message format. Each message has a role (system, user, or assistant) and content. Content may be a plain string or an array of typed objects for multi-modal inputs.

body.max_tokens (integer, default: None)
Maximum number of tokens to generate.

body.temperature (float, default: 0.0)
Sampling temperature. 0 uses greedy decoding.

body.top_p (float, default: 1.0)
Nucleus sampling threshold.

body.top_k (integer, default: 0)
Top-k sampling cutoff. 0 disables top-k filtering.

body.min_p (float, default: 0.0)
Minimum token probability relative to the highest-probability token.

body.repetition_penalty (float, default: None)
Penalty applied to repeated tokens. Values above 1.0 discourage repetition.

body.stream (boolean, default: false)
When true, the server streams tokens using server-sent events (SSE) as they are generated.
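As a sketch of the typed-object content format described under body.messages, here is a request body combining a plain-string message with a multi-modal one. The image_url part type follows the OpenAI Chat Completions convention; treat the exact part field names as an assumption to verify against your mlx_vlm version:

```python
import json

# A conversation mixing plain-string content and typed-object content.
# The "image_url" part shape follows the OpenAI convention (assumption).
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
        ],
    },
]

body = {
    "model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    "messages": messages,
    "max_tokens": 100,
}
print(json.dumps(body, indent=2))
```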

curl examples

List the models available in the local Hugging Face cache:

curl "http://localhost:8080/models"
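A chat completion request against the same server. The payload is written to a file first so it can be inspected, and the curl is guarded so the example degrades gracefully when no server is running:

```shell
# Chat completion with a plain-text message (POST /v1/chat/completions).
cat > /tmp/chat_request.json <<'EOF'
{
  "model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
  "messages": [{"role": "user", "content": "What is 2 + 2?"}],
  "max_tokens": 50
}
EOF
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @/tmp/chat_request.json || echo "request failed (is the server running?)"
```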

OpenAI compatibility

The /v1/chat/completions endpoint follows the OpenAI Chat Completions API format. You can point any OpenAI-compatible client at http://localhost:8080 by setting the base URL:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="mlx-community/Qwen2-VL-2B-Instruct-4bit",
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
    max_tokens=50,
)
print(response.choices[0].message.content)
Streaming is supported via the standard stream=True parameter. The server sends tokens as server-sent events (SSE) using the text/event-stream content type.
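A streamed response body can be consumed by splitting on the "data: " event lines that SSE uses. A minimal sketch of a parser, assuming the OpenAI-style chunk schema (choices[].delta.content) and the "[DONE]" end sentinel:

```python
import json

def parse_sse_chunks(raw: str):
    """Extract the JSON chunks carried by a text/event-stream body.

    Each event line starts with "data: "; the stream ends with the
    sentinel "data: [DONE]". The chunk schema below mirrors the OpenAI
    streaming format (assumption; verify against your server version).
    """
    chunks = []
    for line in raw.splitlines():
        line = line.strip()
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunks.append(json.loads(payload))
    return chunks

# Simulated stream: two token deltas followed by the end sentinel.
sample = (
    'data: {"choices": [{"delta": {"content": "Hel"}}]}\n'
    'data: {"choices": [{"delta": {"content": "lo"}}]}\n'
    'data: [DONE]\n'
)
text = "".join(c["choices"][0]["delta"]["content"] for c in parse_sse_chunks(sample))
print(text)  # Hello
```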
The server loads one model at a time. If you send requests for multiple different models concurrently, each model swap unloads the previous one, which incurs download and initialization time.