MLX-VLM registers four CLI entry points when you install the package. Each one maps directly to a Python module’s main() function.
| Command | Module | Purpose |
|---|---|---|
| mlx_vlm.generate | mlx_vlm.generate | One-shot text generation from text, images, or audio |
| mlx_vlm.chat_ui | mlx_vlm.chat_ui | Launch a Gradio chat interface |
| mlx_vlm.convert | mlx_vlm.convert | Convert and quantize Hugging Face checkpoints |
| mlx_vlm.server | mlx_vlm.server | Start an OpenAI-compatible HTTP server |
mlx_vlm.generate
The primary inference command. Supports text-only, image, audio, and multi-modal inputs.
Examples
Text generation:
mlx_vlm.generate \
  --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --max-tokens 100 \
  --prompt "Hello, how are you?"
Image, audio, and multi-modal (image + audio) generation use the same command with the --image and/or --audio flags listed in the flag reference below.
Thinking budget
For reasoning models such as Qwen3.5, you can cap the number of tokens spent inside the thinking block:
mlx_vlm.generate \
--model mlx-community/Qwen3.5-2B-4bit \
--enable-thinking \
--thinking-budget 50 \
--thinking-start-token "<think>" \
--thinking-end-token "</think>" \
--prompt "Solve 2+2"
When the budget is exceeded, the model is forced to emit \n</think> and transition to its answer. If --enable-thinking is set but the model’s chat template does not support it, the budget is applied only when the model generates the start token on its own.
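The enforcement rule above can be sketched in plain Python. This is a simplified illustration only, not the library's actual implementation: it works on decoded token strings, whereas the real CLI enforces the budget on token IDs inside the sampling loop.

```python
def apply_thinking_budget(tokens, budget, start="<think>", end="</think>"):
    """Force the end token once `budget` tokens have been spent thinking."""
    out = []
    thinking = False
    spent = 0
    for tok in tokens:
        if not thinking and tok == start:
            # The budget applies only once a thinking block is open,
            # matching the fallback behaviour described above.
            thinking = True
            out.append(tok)
        elif thinking and tok == end:
            # The model closed the block on its own, within budget.
            thinking = False
            out.append(tok)
        elif thinking and spent >= budget:
            # Budget exhausted: force "\n</think>" and treat the current
            # token as the start of the answer.
            out.append("\n" + end)
            thinking = False
            out.append(tok)
        elif thinking:
            spent += 1
            out.append(tok)
        else:
            out.append(tok)
    return out
```

With a budget of 2, the stream `<think> a b c answer` is cut after two thinking tokens: the forced `\n</think>` is emitted and `c answer` becomes answer text. Without a start token, the budget never triggers.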
Activation quantization (CUDA)
Models quantized with mxfp8 or nvfp4 require activation quantization on NVIDIA GPUs. Use the -qa shorthand or the full flag:
mlx_vlm.generate \
--model /path/to/mxfp8-model \
--image /path/to/image.jpg \
--prompt "Describe this image" \
-qa
On Apple Silicon (Metal), mxfp8 and nvfp4 models work without the -qa flag.
Flag reference
| Flag | Type | Default | Description |
|---|---|---|---|
| --model | string | mlx-community/nanoLLaVA-1.5-8bit | Hugging Face repo ID or local model path |
| --adapter-path | string | None | Path to LoRA adapter weights |
| --image | string (one or more) | None | URL(s) or local path(s) of images to process |
| --audio | string (one or more) | None | URL(s) or local path(s) of audio files to process |
| --resize-shape | int (one or two values) | None | Resize images to this shape before processing |
| --prompt | string | "What are these?" | Text prompt sent to the model |
| --system | string | None | System message prepended to the conversation |
| --max-tokens | int | 256 | Maximum number of tokens to generate |
| --temperature | float | 0.0 | Sampling temperature; 0 uses argmax (greedy) |
| --max-kv-size | int | None | Maximum KV cache size for long-context prompts |
| --kv-bits | int | None | Quantize the KV cache to this many bits |
| --kv-group-size | int | 64 | Group size used when quantizing the KV cache |
| --quantized-kv-start | int | 5000 | Token index at which KV cache quantization begins |
| --prefill-step-size | int | 2048 | Tokens processed per prefill chunk; lower values reduce peak memory |
| --enable-thinking | flag | False | Activate thinking mode in the chat template |
| --thinking-budget | int | None | Maximum tokens allowed inside a thinking block |
| --thinking-start-token | string | <think> | Token that opens a thinking block |
| --thinking-end-token | string | </think> | Token that closes a thinking block |
| --quantize-activations / -qa | flag | False | Enable activation quantization for mxfp8/nvfp4 models |
| --processor-kwargs | JSON string | {} | Extra kwargs forwarded to the processor (e.g. '{"cropping": false}') |
| --eos-tokens | string (one or more) | None | Additional end-of-sequence tokens |
| --skip-special-tokens | flag | False | Omit special tokens from the decoded output |
| --chat | flag | False | Enter multi-turn chat mode |
| --verbose | flag | True | Print tokens and timing statistics as they are generated |
| --trust-remote-code | flag | False | Allow execution of remote code when loading the model |
| --revision | string | "main" | Model branch, tag, or commit to use |
| --force-download | flag | False | Re-download the model even if it is already cached |
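Because --processor-kwargs takes a JSON object, the value has to survive both shell quoting and JSON parsing. A quick way to sanity-check a value before passing it (plain Python, independent of mlx-vlm; the exact parsing inside the CLI may differ, but the value must at minimum be a valid JSON object):

```python
import json

# Value you would pass on the command line, single-quoted for the shell:
#   --processor-kwargs '{"cropping": false}'
raw = '{"cropping": false}'

kwargs = json.loads(raw)          # must parse without error...
assert isinstance(kwargs, dict)   # ...and be an object, not a list or scalar
print(kwargs)                     # {'cropping': False}
```

Note the JSON spelling of booleans: false/true in the flag value, which become Python False/True after parsing.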
mlx_vlm.chat_ui
Launches an interactive Gradio chat interface in your browser.
mlx_vlm.chat_ui --model mlx-community/Qwen2-VL-2B-Instruct-4bit
The UI loads the model at startup and exposes a chat window that accepts text and image inputs. You can switch models from within the interface; the previous model is unloaded from memory before the new one is loaded.
The gradio package is an optional dependency. Install it with pip install 'mlx-vlm[ui]' before using this command.
| Flag | Type | Default | Description |
|---|---|---|---|
| --model | string | qnguyen3/nanoLLaVA | Hugging Face repo ID or local path of the model to load at startup |
mlx_vlm.convert
Converts a Hugging Face Vision Language Model checkpoint to MLX format with optional quantization.
mlx_vlm.convert \
--hf-path mlx-community/Qwen2-VL-2B-Instruct \
--mlx-path ./Qwen2-VL-2B-Instruct-4bit \
-q
See the Model conversion guide for the full flag reference and quantization options.
mlx_vlm.server
Starts an OpenAI-compatible FastAPI server. See the REST API server page for the complete reference.
mlx_vlm.server --port 8080
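Since the server is OpenAI-compatible, any OpenAI-style client should work against it. Below is a minimal standard-library sketch of a multi-modal chat-completions request; the /v1/chat/completions path, payload shape, and model field follow the generic OpenAI schema and are assumptions here, so check the REST API server page for the exact contract.

```python
import json
import urllib.request

# Chat-completions payload in the generic OpenAI multi-modal shape.
payload = {
    "model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image"},
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
            ],
        }
    ],
    "max_tokens": 100,
}

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",  # assumes the server started above
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```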