MLX-VLM registers four CLI entry points when you install the package. Each one maps directly to a Python module’s main() function.
| Command | Module | Purpose |
|---|---|---|
| `mlx_vlm.generate` | `mlx_vlm.generate` | One-shot text generation from text, images, or audio |
| `mlx_vlm.chat_ui` | `mlx_vlm.chat_ui` | Launch a Gradio chat interface |
| `mlx_vlm.convert` | `mlx_vlm.convert` | Convert and quantize Hugging Face checkpoints |
| `mlx_vlm.server` | `mlx_vlm.server` | Start an OpenAI-compatible HTTP server |

mlx_vlm.generate

The primary inference command. Supports text-only, image, audio, and multi-modal inputs.

Examples

```shell
mlx_vlm.generate \
  --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --max-tokens 100 \
  --prompt "Hello, how are you?"
```
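For a vision input, the same command takes one or more `--image` flags alongside the prompt. A minimal sketch, assuming a local image file (the path below is illustrative; a URL also works):

```shell
# Hypothetical image path; substitute your own file or URL.
mlx_vlm.generate \
  --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --max-tokens 100 \
  --image ./photo.jpg \
  --prompt "What is in this picture?"
```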

Thinking budget

For reasoning models such as Qwen3.5, you can cap the number of tokens spent inside the thinking block:
```shell
mlx_vlm.generate \
  --model mlx-community/Qwen3.5-2B-4bit \
  --enable-thinking \
  --thinking-budget 50 \
  --thinking-start-token "<think>" \
  --thinking-end-token "</think>" \
  --prompt "Solve 2+2"
```
When the budget is exceeded, the model is forced to emit `\n</think>` and transition to its answer. If `--enable-thinking` is set but the model's chat template does not support it, the budget is applied only when the model generates the start token on its own.

Activation quantization (CUDA)

Models quantized with mxfp8 or nvfp4 require activation quantization on NVIDIA GPUs. Use the `-qa` shorthand or the full `--quantize-activations` flag:

```shell
mlx_vlm.generate \
  --model /path/to/mxfp8-model \
  --image /path/to/image.jpg \
  --prompt "Describe this image" \
  -qa
```

On Apple Silicon (Metal), mxfp8 and nvfp4 models work without the `-qa` flag.

Flag reference

| Flag | Type | Default | Description |
|---|---|---|---|
| `--model` | string | `mlx-community/nanoLLaVA-1.5-8bit` | Hugging Face repo ID or local model path |
| `--adapter-path` | string | None | Path to LoRA adapter weights |
| `--image` | string (one or more) | None | URL(s) or local path(s) of images to process |
| `--audio` | string (one or more) | None | URL(s) or local path(s) of audio files to process |
| `--resize-shape` | int (one or two values) | None | Resize images to this shape before processing |
| `--prompt` | string | `"What are these?"` | Text prompt sent to the model |
| `--system` | string | None | System message prepended to the conversation |
| `--max-tokens` | int | 256 | Maximum number of tokens to generate |
| `--temperature` | float | 0.0 | Sampling temperature; 0 uses argmax (greedy) |
| `--max-kv-size` | int | None | Maximum KV cache size for long-context prompts |
| `--kv-bits` | int | None | Quantize the KV cache to this many bits |
| `--kv-group-size` | int | 64 | Group size used when quantizing the KV cache |
| `--quantized-kv-start` | int | 5000 | Token index at which KV cache quantization begins |
| `--prefill-step-size` | int | 2048 | Tokens processed per prefill chunk; lower values reduce peak memory |
| `--enable-thinking` | flag | False | Activate thinking mode in the chat template |
| `--thinking-budget` | int | None | Maximum tokens allowed inside a thinking block |
| `--thinking-start-token` | string | `<think>` | Token that opens a thinking block |
| `--thinking-end-token` | string | `</think>` | Token that closes a thinking block |
| `--quantize-activations` / `-qa` | flag | False | Enable activation quantization for mxfp8/nvfp4 models |
| `--processor-kwargs` | JSON string | `{}` | Extra kwargs forwarded to the processor (e.g. `'{"cropping": false}'`) |
| `--eos-tokens` | string (one or more) | None | Additional end-of-sequence tokens |
| `--skip-special-tokens` | flag | False | Omit special tokens from the decoded output |
| `--chat` | flag | False | Enter multi-turn chat mode |
| `--verbose` | flag | True | Print tokens and timing statistics as they are generated |
| `--trust-remote-code` | flag | False | Allow execution of remote code when loading the model |
| `--revision` | string | `"main"` | Model branch, tag, or commit to use |
| `--force-download` | flag | False | Re-download the model even if it is already cached |
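The KV-cache flags compose. As a sketch, a long-context run that quantizes the cache to 8 bits once generation passes token 5000, with a smaller prefill chunk to lower peak memory (model name and values are illustrative, not recommendations):

```shell
mlx_vlm.generate \
  --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --prompt "Summarize the meeting notes" \
  --max-tokens 512 \
  --kv-bits 8 \
  --quantized-kv-start 5000 \
  --prefill-step-size 1024
```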

mlx_vlm.chat_ui

Launches an interactive Gradio chat interface in your browser.
```shell
mlx_vlm.chat_ui --model mlx-community/Qwen2-VL-2B-Instruct-4bit
```

The UI loads the model at startup and exposes a chat window that accepts text and image inputs. You can switch models from within the interface; the previous model is unloaded from memory before the new one is loaded.

The `gradio` package is an optional dependency. Install it with `pip install 'mlx-vlm[ui]'` before using this command.
| Flag | Type | Default | Description |
|---|---|---|---|
| `--model` | string | `qnguyen3/nanoLLaVA` | Hugging Face repo ID or local path of the model to load at startup |

mlx_vlm.convert

Converts a Hugging Face Vision Language Model checkpoint to MLX format with optional quantization.
```shell
mlx_vlm.convert \
  --hf-path mlx-community/Qwen2-VL-2B-Instruct \
  --mlx-path ./Qwen2-VL-2B-Instruct-4bit \
  -q
```
See the Model conversion guide for the full flag reference and quantization options.
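The converted checkpoint is a regular local model directory, so it can be passed straight back to the other commands via `--model`. A sketch using the output path from the example above:

```shell
mlx_vlm.generate \
  --model ./Qwen2-VL-2B-Instruct-4bit \
  --max-tokens 100 \
  --prompt "Hello, how are you?"
```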

mlx_vlm.server

Starts an OpenAI-compatible FastAPI server. See the REST API server page for the complete reference.
```shell
mlx_vlm.server --port 8080
```
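Since the server is OpenAI-compatible, any OpenAI-style client can talk to it. A minimal sketch with `curl`, assuming the standard `/v1/chat/completions` route and the request-body shape of the OpenAI Chat Completions API (check the REST API server page for the exact fields this server accepts):

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 50
  }'
```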
