Some vision-language models generate a chain-of-thought reasoning trace before producing their final answer. This internal reasoning — called the “thinking block” — is delimited by special tokens (e.g. <think> and </think>). The model works through the problem step by step inside the block, then emits its answer after the closing token. MLX-VLM exposes flags to activate thinking mode and to limit how many tokens the model is allowed to spend reasoning. This is useful for controlling latency and cost when you don’t need an exhaustive reasoning trace.
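For example, a raw completion with thinking enabled looks like a `<think>…</think>` block followed by the answer. A minimal Python sketch for separating the two (the helper name is illustrative, not part of MLX-VLM):

```python
import re

def split_thinking(text, start="<think>", end="</think>"):
    """Split raw model output into (thinking trace, final answer)."""
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    m = re.search(pattern, text, re.DOTALL)
    if not m:
        # No thinking block: the whole output is the answer.
        return "", text.strip()
    return m.group(1).strip(), text[m.end():].strip()

raw = "<think>2 + 2 equals 4.</think>\nThe answer is 4."
trace, answer = split_thinking(raw)
print(trace)   # 2 + 2 equals 4.
print(answer)  # The answer is 4.
```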

Flags

| Flag | Description |
| --- | --- |
| `--enable-thinking` | Activate thinking mode in the chat template |
| `--thinking-budget` | Maximum number of tokens allowed inside the thinking block |
| `--thinking-start-token` | Token that opens the thinking block (default: `<think>`) |
| `--thinking-end-token` | Token that closes the thinking block (default: `</think>`) |

How the budget works

When --thinking-budget N is set, MLX-VLM counts the tokens generated after --thinking-start-token. Once the model has produced N tokens inside the thinking block, it is forced to emit \n</think> (or whatever --thinking-end-token is set to) and generation continues with the final answer.
Not all models support --enable-thinking: the flag instructs the chat template to activate thinking mode, so if the template has no thinking mode the flag does nothing to the prompt. Budget enforcement is independent of the template, however; it still applies whenever the model emits the start token on its own.
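The enforcement logic can be sketched in pure Python over a stream of decoded tokens. This is a hypothetical simplification: the real implementation works on token ids inside the sampler, and forcing the close token changes what the model generates next, whereas here the remainder of the simulated block is simply discarded.

```python
def apply_thinking_budget(stream, budget, start="<think>", end="</think>"):
    """Simulate budget enforcement over an iterable of decoded tokens."""
    out, in_block, spent, skipping = [], False, 0, False
    for tok in stream:
        if skipping:
            # The block was force-closed; discard the rest of it.
            if tok == end:
                skipping = False
            continue
        if tok == start:
            in_block, spent = True, 0
            out.append(tok)
        elif in_block and tok == end:
            # Model closed the block itself, under budget.
            in_block = False
            out.append(tok)
        elif in_block:
            if spent >= budget:
                # Budget exhausted: force the closing token.
                out.append("\n" + end)
                in_block, skipping = False, True
            else:
                spent += 1
                out.append(tok)
        else:
            out.append(tok)
    return out

tokens = ["<think>", "a", "b", "c", "d", "</think>", "The", "answer"]
print(apply_thinking_budget(tokens, 2))
# ['<think>', 'a', 'b', '\n</think>', 'The', 'answer']
```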

CLI example

mlx_vlm.generate \
  --model mlx-community/Qwen3.5-2B-4bit \
  --enable-thinking \
  --thinking-budget 50 \
  --thinking-start-token "<think>" \
  --thinking-end-token "</think>" \
  --prompt "Solve 2+2"
In this example, the model is allowed up to 50 tokens of reasoning before being forced to close the thinking block and provide its answer.

Adjusting the budget

A small budget keeps latency low. Use this when you need a quick answer and the problem is straightforward.
mlx_vlm.generate \
  --model mlx-community/Qwen3.5-2B-4bit \
  --enable-thinking \
  --thinking-budget 32 \
  --prompt "What is the capital of France?"

Custom thinking tokens

Some models use different tokens to delimit the thinking block. You can override the defaults:
mlx_vlm.generate \
  --model /path/to/custom-reasoning-model \
  --enable-thinking \
  --thinking-budget 100 \
  --thinking-start-token "<reasoning>" \
  --thinking-end-token "</reasoning>" \
  --prompt "What is shown in this image?" \
  --image /path/to/image.jpg
Make sure --thinking-start-token and --thinking-end-token match the tokens the model actually uses. If they don’t match, budget enforcement won’t trigger and --enable-thinking may produce unexpected output.
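One quick way to verify the pairing is to inspect a sample of the model's raw output for the configured tokens. A minimal sketch (the helper name is illustrative):

```python
def thinking_tokens_present(sample, start="<reasoning>", end="</reasoning>"):
    """Return True if `start` appears before a matching `end`
    in a sample of raw model output."""
    i = sample.find(start)
    j = sample.find(end, i + len(start)) if i != -1 else -1
    return i != -1 and j != -1

sample = "<reasoning>The image shows a cat.</reasoning> A cat."
print(thinking_tokens_present(sample))                          # True
print(thinking_tokens_present(sample, "<think>", "</think>"))   # False
```

If this returns False for the tokens you plan to pass, the model is using different delimiters and the budget will never trigger.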

Supported models

Thinking mode has been tested with the following models:

Qwen3.5

Pass --enable-thinking to activate the thinking mode in Qwen3.5’s chat template. Use <think> / </think> (the defaults).

Phi-4 Reasoning Vision

Phi-4 Reasoning Vision supports chain-of-thought reasoning for visual tasks. See the model-specific guide for prompt formats.
Other models may generate thinking blocks on their own without needing --enable-thinking. For those models, --thinking-budget still applies as long as the model emits the start token.
