MLX-VLM registers four CLI entry points when you install the package. Each one maps directly to a Python module’s main() function.
| Command | Module | Purpose |
|---|---|---|
| `mlx_vlm.generate` | `mlx_vlm.generate` | One-shot text generation from text, images, or audio |
| `mlx_vlm.chat_ui` | `mlx_vlm.chat_ui` | Launch a Gradio chat interface |
| `mlx_vlm.convert` | `mlx_vlm.convert` | Convert and quantize Hugging Face checkpoints |
| `mlx_vlm.server` | `mlx_vlm.server` | Start an OpenAI-compatible HTTP server |

mlx_vlm.generate

The primary inference command. Supports text-only, image, audio, and multi-modal inputs.

Examples

```shell
mlx_vlm.generate \
  --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --max-tokens 100 \
  --prompt "Hello, how are you?"
```
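For a vision input, the same command takes one or more `--image` flags alongside the prompt. A minimal sketch, assuming a local image file (the path below is illustrative; a URL also works):

```shell
# Hypothetical image path; substitute your own file or URL.
mlx_vlm.generate \
  --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --max-tokens 100 \
  --image ./photo.jpg \
  --prompt "What is in this picture?"
```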

Thinking budget

For reasoning models such as Qwen3.5, you can cap the number of tokens spent inside the thinking block:
```shell
mlx_vlm.generate \
  --model mlx-community/Qwen3.5-2B-4bit \
  --enable-thinking \
  --thinking-budget 50 \
  --thinking-start-token "<think>" \
  --thinking-end-token "</think>" \
  --prompt "Solve 2+2"
```
When the budget is exceeded, the model is forced to emit `\n</think>` and transition to its answer. If `--enable-thinking` is set but the model's chat template does not support it, the budget is applied only when the model generates the start token on its own.

Activation quantization (CUDA)

Models quantized with mxfp8 or nvfp4 require activation quantization on NVIDIA GPUs. Use the `-qa` shorthand or the full `--quantize-activations` flag:

```shell
mlx_vlm.generate \
  --model /path/to/mxfp8-model \
  --image /path/to/image.jpg \
  --prompt "Describe this image" \
  -qa
```

On Apple Silicon (Metal), mxfp8 and nvfp4 models work without the `-qa` flag.

Flag reference

| Flag | Type | Default | Description |
|---|---|---|---|
| `--model` | string | `mlx-community/nanoLLaVA-1.5-8bit` | Hugging Face repo ID or local model path |
| `--adapter-path` | string | None | Path to LoRA adapter weights |
| `--image` | string (one or more) | None | URL(s) or local path(s) of images to process |
| `--audio` | string (one or more) | None | URL(s) or local path(s) of audio files to process |
| `--resize-shape` | int (one or two values) | None | Resize images to this shape before processing |
| `--prompt` | string | `"What are these?"` | Text prompt sent to the model |
| `--system` | string | None | System message prepended to the conversation |
| `--max-tokens` | int | 256 | Maximum number of tokens to generate |
| `--temperature` | float | 0.0 | Sampling temperature; 0 uses argmax (greedy) |
| `--max-kv-size` | int | None | Maximum KV cache size for long-context prompts |
| `--kv-bits` | int | None | Quantize the KV cache to this many bits |
| `--kv-group-size` | int | 64 | Group size used when quantizing the KV cache |
| `--quantized-kv-start` | int | 5000 | Token index at which KV cache quantization begins |
| `--prefill-step-size` | int | 2048 | Tokens processed per prefill chunk; lower values reduce peak memory |
| `--enable-thinking` | flag | False | Activate thinking mode in the chat template |
| `--thinking-budget` | int | None | Maximum tokens allowed inside a thinking block |
| `--thinking-start-token` | string | `<think>` | Token that opens a thinking block |
| `--thinking-end-token` | string | `</think>` | Token that closes a thinking block |
| `--quantize-activations` / `-qa` | flag | False | Enable activation quantization for mxfp8/nvfp4 models |
| `--processor-kwargs` | JSON string | `{}` | Extra kwargs forwarded to the processor (e.g. `'{"cropping": false}'`) |
| `--eos-tokens` | string (one or more) | None | Additional end-of-sequence tokens |
| `--skip-special-tokens` | flag | False | Omit special tokens from the decoded output |
| `--chat` | flag | False | Enter multi-turn chat mode |
| `--verbose` | flag | True | Print tokens and timing statistics as they are generated |
| `--trust-remote-code` | flag | False | Allow execution of remote code when loading the model |
| `--revision` | string | `"main"` | Model branch, tag, or commit to use |
| `--force-download` | flag | False | Re-download the model even if it is already cached |
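The KV-cache flags compose. As a sketch, a long-context run that quantizes the cache to 8 bits once generation passes token 5000, with a smaller prefill chunk to lower peak memory (model name and values are illustrative, not recommendations):

```shell
mlx_vlm.generate \
  --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --prompt "Summarize the meeting notes" \
  --max-tokens 512 \
  --kv-bits 8 \
  --quantized-kv-start 5000 \
  --prefill-step-size 1024
```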

mlx_vlm.chat_ui

Launches an interactive Gradio chat interface in your browser.
```shell
mlx_vlm.chat_ui --model mlx-community/Qwen2-VL-2B-Instruct-4bit
```

The UI loads the model at startup and exposes a chat window that accepts text and image inputs. You can switch models from within the interface; the previous model is unloaded from memory before the new one is loaded.

The `gradio` package is an optional dependency. Install it with `pip install 'mlx-vlm[ui]'` before using this command.
| Flag | Type | Default | Description |
|---|---|---|---|
| `--model` | string | `qnguyen3/nanoLLaVA` | Hugging Face repo ID or local path of the model to load at startup |

mlx_vlm.convert

Converts a Hugging Face Vision Language Model checkpoint to MLX format with optional quantization.
```shell
mlx_vlm.convert \
  --hf-path mlx-community/Qwen2-VL-2B-Instruct \
  --mlx-path ./Qwen2-VL-2B-Instruct-4bit \
  -q
```
See the Model conversion guide for the full flag reference and quantization options.
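The converted checkpoint is a regular local model directory, so it can be passed straight back to the other commands via `--model`. A sketch using the output path from the example above:

```shell
mlx_vlm.generate \
  --model ./Qwen2-VL-2B-Instruct-4bit \
  --max-tokens 100 \
  --prompt "Hello, how are you?"
```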

mlx_vlm.server

Starts an OpenAI-compatible FastAPI server. See the REST API server page for the complete reference.
```shell
mlx_vlm.server --port 8080
```
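Since the server is OpenAI-compatible, any OpenAI-style client can talk to it. A minimal sketch with `curl`, assuming the standard `/v1/chat/completions` route and the request-body shape of the OpenAI Chat Completions API (check the REST API server page for the exact fields this server accepts):

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 50
  }'
```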
