mlx_vlm.server starts a FastAPI server that exposes an OpenAI-compatible chat completions API. Any client that works with the OpenAI API format can send requests to it.
## Starting the server
Start the server with default settings:
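A minimal invocation, assuming the package is installed and the module is runnable with `python -m` (as the module path `mlx_vlm.server` suggests):

```shell
# Launch the server on the default host and port (0.0.0.0:8080)
python -m mlx_vlm.server
```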
If no model is preloaded at startup, the `model` field of the first request triggers a download and load automatically.

## Server flags
| Flag | Type | Default | Description |
|---|---|---|---|
| `--model` | string | None | Hugging Face repo ID or local path to preload at startup |
| `--adapter-path` | string | None | Path to LoRA adapter weights for the preloaded model |
| `--host` | string | 0.0.0.0 | Host address to bind the server to |
| `--port` | int | 8080 | Port number to listen on |
| `--trust-remote-code` | flag | False | Trust remote code when loading models from Hugging Face Hub |
| `--prefill-step-size` | int | 2048 | Tokens processed per prefill chunk; lower values reduce peak memory |
| `--kv-bits` | int | 0 (disabled) | Quantize the KV cache to this many bits (e.g. 4 or 8) |
| `--kv-group-size` | int | 64 | Group size for KV cache quantization |
| `--max-kv-size` | int | 0 (disabled) | Maximum KV cache size in tokens |
| `--quantized-kv-start` | int | 5000 | Token index at which KV cache quantization starts |
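The memory-related flags above can be combined for long contexts; a sketch (the model path is a placeholder for any repo ID your server can load):

```shell
# Preload a model and quantize the KV cache to 8 bits
# once the context passes 5000 tokens (the default start index)
python -m mlx_vlm.server \
  --model <repo-id-or-local-path> \
  --port 8080 \
  --kv-bits 8 \
  --kv-group-size 64
```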
### Trust remote code via environment variable
## Endpoints
| Method | Path | Description |
|---|---|---|
| GET | `/models` | List models available in the local Hugging Face cache |
| GET | `/v1/models` | Same as `/models` (OpenAI-compatible path) |
| POST | `/chat/completions` | Chat-style generation with text, image, and audio support |
| POST | `/v1/chat/completions` | Same as `/chat/completions` (OpenAI-compatible path) |
| POST | `/responses` | OpenAI Responses API-compatible endpoint |
| POST | `/v1/responses` | Same as `/responses` (OpenAI-compatible path) |
| GET | `/health` | Check whether the server is running |
| POST | `/unload` | Unload the current model from memory |
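Quick checks against a running server, using the endpoints from the table above:

```shell
# Is the server up?
curl http://localhost:8080/health

# List models in the local Hugging Face cache
curl http://localhost:8080/v1/models

# Free the memory held by the currently loaded model
curl -X POST http://localhost:8080/unload
```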
## Request parameters
### `model`
The model to use for this request. Accepts a Hugging Face repo ID or a local path. If the model is not already loaded, the server loads it on demand.

### `messages`
The conversation history in OpenAI message format. Each message has a role (`system`, `user`, or `assistant`) and content. Content may be a plain string or an array of typed objects for multi-modal inputs.

### `max_tokens`
Maximum number of tokens to generate.

### `temperature`
Sampling temperature. `0` uses greedy decoding.

### `top_p`
Nucleus sampling threshold.

### `top_k`
Top-k sampling cutoff. `0` disables top-k filtering.

### `min_p`
Minimum token probability relative to the highest-probability token.

### `repetition_penalty`
Penalty applied to repeated tokens. Values above `1.0` discourage repetition.

### `stream`
When `true`, the server streams tokens using server-sent events (SSE) as they are generated.

## curl examples
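A basic text-only request, followed by an image request; the model ID is a placeholder, and the image example assumes the standard OpenAI `image_url` content type for the multi-modal array:

```shell
# Text-only chat completion
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<repo-id-or-local-path>",
    "messages": [
      {"role": "user", "content": "Describe the MLX framework in one sentence."}
    ],
    "max_tokens": 128,
    "temperature": 0.7
  }'

# Image input via a content array of typed objects
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<repo-id-or-local-path>",
    "messages": [
      {"role": "user", "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}}
      ]}
    ]
  }'
```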
## OpenAI compatibility
The `/v1/chat/completions` endpoint follows the OpenAI Chat Completions API format. You can point any OpenAI-compatible client at `http://localhost:8080` by setting the base URL:
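A sketch using the official `openai` Python client (`pip install openai`); the model ID is a placeholder, and whether the `/v1` prefix belongs in the base URL depends on the client (this server accepts both prefixed and unprefixed paths):

```python
from openai import OpenAI

# A local server generally ignores the API key, but the client requires one.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="<repo-id-or-local-path>",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```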
## Streaming

Enable streaming with the `stream` parameter. The server sends tokens as server-sent events (SSE) using the `text/event-stream` content type.
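Each SSE event is a `data:` line carrying a JSON chunk; in the OpenAI streaming format the text lives at `choices[0].delta.content`, and a final `data: [DONE]` line ends the stream. A small client-side parser sketch, assuming that chunk shape:

```python
import json

def parse_sse_chunks(raw: str) -> list[str]:
    """Extract text deltas from an OpenAI-style SSE response body."""
    deltas = []
    for line in raw.splitlines():
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines between events
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # sentinel marking the end of the stream
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            deltas.append(delta)
    return deltas

# An example body in the shape OpenAI-compatible servers emit:
raw = (
    'data: {"choices": [{"delta": {"role": "assistant"}}]}\n\n'
    'data: {"choices": [{"delta": {"content": "Hello"}}]}\n\n'
    'data: {"choices": [{"delta": {"content": ", world"}}]}\n\n'
    'data: [DONE]\n\n'
)
print("".join(parse_sse_chunks(raw)))  # prints "Hello, world"
```

The first chunk carries only the `role`, so the parser skips any chunk whose delta has no `content` key.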