Some vision-language models generate a chain-of-thought reasoning trace before producing their final answer. This internal reasoning — called the “thinking block” — is delimited by special tokens (e.g. <think> and </think>). The model works through the problem step by step inside the block, then emits its answer after the closing token. MLX-VLM exposes flags to activate thinking mode and to limit how many tokens the model is allowed to spend reasoning. This is useful for controlling latency and cost when you don’t need an exhaustive reasoning trace.
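For example, a raw completion with thinking enabled looks like a `<think>…</think>` block followed by the answer. A minimal Python sketch for separating the two (the helper name is illustrative, not part of MLX-VLM):

```python
import re

def split_thinking(text, start="<think>", end="</think>"):
    """Split raw model output into (thinking trace, final answer)."""
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    m = re.search(pattern, text, re.DOTALL)
    if not m:
        # No thinking block: the whole output is the answer.
        return "", text.strip()
    return m.group(1).strip(), text[m.end():].strip()

raw = "<think>2 + 2 equals 4.</think>\nThe answer is 4."
trace, answer = split_thinking(raw)
print(trace)   # 2 + 2 equals 4.
print(answer)  # The answer is 4.
```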

Flags

| Flag | Description |
| --- | --- |
| `--enable-thinking` | Activate thinking mode in the chat template |
| `--thinking-budget` | Maximum number of tokens allowed inside the thinking block |
| `--thinking-start-token` | Token that opens the thinking block (default: `<think>`) |
| `--thinking-end-token` | Token that closes the thinking block (default: `</think>`) |

How the budget works

When --thinking-budget N is set, MLX-VLM counts the tokens generated after --thinking-start-token. Once the model has produced N tokens inside the thinking block, it is forced to emit \n</think> (or whatever --thinking-end-token is set to) and generation continues with the final answer.
Not all models support --enable-thinking: the flag instructs the chat template to activate thinking mode, so if the template has no thinking mode the flag does nothing to the prompt. Budget enforcement is independent of the template, however; it still applies whenever the model emits the start token on its own.
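The enforcement logic can be sketched in pure Python over a stream of decoded tokens. This is a hypothetical simplification: the real implementation works on token ids inside the sampler, and forcing the close token changes what the model generates next, whereas here the remainder of the simulated block is simply discarded.

```python
def apply_thinking_budget(stream, budget, start="<think>", end="</think>"):
    """Simulate budget enforcement over an iterable of decoded tokens."""
    out, in_block, spent, skipping = [], False, 0, False
    for tok in stream:
        if skipping:
            # The block was force-closed; discard the rest of it.
            if tok == end:
                skipping = False
            continue
        if tok == start:
            in_block, spent = True, 0
            out.append(tok)
        elif in_block and tok == end:
            # Model closed the block itself, under budget.
            in_block = False
            out.append(tok)
        elif in_block:
            if spent >= budget:
                # Budget exhausted: force the closing token.
                out.append("\n" + end)
                in_block, skipping = False, True
            else:
                spent += 1
                out.append(tok)
        else:
            out.append(tok)
    return out

tokens = ["<think>", "a", "b", "c", "d", "</think>", "The", "answer"]
print(apply_thinking_budget(tokens, 2))
# ['<think>', 'a', 'b', '\n</think>', 'The', 'answer']
```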

CLI example

mlx_vlm.generate \
  --model mlx-community/Qwen3.5-2B-4bit \
  --enable-thinking \
  --thinking-budget 50 \
  --thinking-start-token "<think>" \
  --thinking-end-token "</think>" \
  --prompt "Solve 2+2"
In this example, the model is allowed up to 50 tokens of reasoning before being forced to close the thinking block and provide its answer.

Adjusting the budget

A small budget keeps latency low. Use this when you need a quick answer and the problem is straightforward.
mlx_vlm.generate \
  --model mlx-community/Qwen3.5-2B-4bit \
  --enable-thinking \
  --thinking-budget 32 \
  --prompt "What is the capital of France?"

Custom thinking tokens

Some models use different tokens to delimit the thinking block. You can override the defaults:
mlx_vlm.generate \
  --model /path/to/custom-reasoning-model \
  --enable-thinking \
  --thinking-budget 100 \
  --thinking-start-token "<reasoning>" \
  --thinking-end-token "</reasoning>" \
  --prompt "What is shown in this image?" \
  --image /path/to/image.jpg
Make sure --thinking-start-token and --thinking-end-token match the tokens the model actually uses. If they don’t match, budget enforcement won’t trigger and --enable-thinking may produce unexpected output.
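One quick way to verify the pairing is to inspect a sample of the model's raw output for the configured tokens. A minimal sketch (the helper name is illustrative):

```python
def thinking_tokens_present(sample, start="<reasoning>", end="</reasoning>"):
    """Return True if `start` appears before a matching `end`
    in a sample of raw model output."""
    i = sample.find(start)
    j = sample.find(end, i + len(start)) if i != -1 else -1
    return i != -1 and j != -1

sample = "<reasoning>The image shows a cat.</reasoning> A cat."
print(thinking_tokens_present(sample))                          # True
print(thinking_tokens_present(sample, "<think>", "</think>"))   # False
```

If this returns False for the tokens you plan to pass, the model is using different delimiters and the budget will never trigger.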

Supported models

Thinking mode has been tested with the following models:

Qwen3.5

Pass --enable-thinking to activate the thinking mode in Qwen3.5’s chat template. Use <think> / </think> (the defaults).

Phi-4 Reasoning Vision

Phi-4 Reasoning Vision supports chain-of-thought reasoning for visual tasks. See the model-specific guide for prompt formats.
Other models may generate thinking blocks on their own without needing --enable-thinking. For those models, --thinking-budget still applies as long as the model emits the start token.
