`<think>` and `</think>`). The model works through the problem step by step inside the block, then emits its answer after the closing token.
MLX-VLM exposes flags to activate thinking mode and to limit how many tokens the model is allowed to spend reasoning. This is useful for controlling latency and cost when you don’t need an exhaustive reasoning trace.
## Flags
| Flag | Description |
|---|---|
| `--enable-thinking` | Activate thinking mode in the chat template |
| `--thinking-budget` | Maximum tokens allowed inside the thinking block |
| `--thinking-start-token` | Token that opens the thinking block (default: `<think>`) |
| `--thinking-end-token` | Token that closes the thinking block (default: `</think>`) |
## How the budget works
When `--thinking-budget N` is set, MLX-VLM counts the tokens generated after `--thinking-start-token`. Once the model has produced N tokens inside the thinking block, it is forced to emit `\n</think>` (or whatever `--thinking-end-token` is set to) and then continues with the answer.
Not all models support `--enable-thinking`: the flag only instructs the chat template to activate thinking mode. If the template doesn't have a thinking mode, the flag has no effect on the template itself, but budget enforcement still applies if the model generates the start token on its own.

## CLI example
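A minimal sketch of a run with thinking mode enabled; the model path, image file, and prompt below are illustrative placeholders, not values from this guide:

```shell
# Generate with thinking mode on, capping reasoning at 512 tokens.
# Model path, image, and prompt are illustrative placeholders.
python -m mlx_vlm.generate \
  --model mlx-community/Qwen3.5-VL-4bit \
  --prompt "How many people are in this image?" \
  --image photo.jpg \
  --enable-thinking \
  --thinking-budget 512
```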
## Adjusting the budget
- Short budget (fast)
- Large budget (thorough)
- No budget (unlimited)
A small budget keeps latency low. Use this when you need a quick answer and the problem is straightforward.
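The three settings above might look like this on the command line; the budget values are illustrative, and the trailing `...` stands in for the usual model/prompt/image arguments:

```shell
# Short budget: cap reasoning at 256 tokens for low latency.
python -m mlx_vlm.generate --enable-thinking --thinking-budget 256 ...

# Large budget: allow up to 4096 reasoning tokens for harder problems.
python -m mlx_vlm.generate --enable-thinking --thinking-budget 4096 ...

# No budget: omit --thinking-budget and let the model reason freely.
python -m mlx_vlm.generate --enable-thinking ...
```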
## Custom thinking tokens
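A hedged sketch of a run that overrides the default delimiters, assuming a model whose reasoning block uses a `<reasoning>` / `</reasoning>` token pair (both the token strings and the model path are hypothetical placeholders):

```shell
# Override the default thinking delimiters; the <reasoning> tags are
# hypothetical examples of a model-specific token pair.
python -m mlx_vlm.generate \
  --model <model-path> \
  --prompt "Describe this chart." \
  --image chart.png \
  --enable-thinking \
  --thinking-budget 1024 \
  --thinking-start-token "<reasoning>" \
  --thinking-end-token "</reasoning>"
```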
Some models use different tokens to delimit the thinking block. You can override the defaults with `--thinking-start-token` and `--thinking-end-token`.

## Supported models

Thinking mode has been tested with the following models:

### Qwen3.5
Pass `--enable-thinking` to activate thinking mode in Qwen3.5's chat template. Use `<think>` / `</think>` (the defaults).

### Phi-4 Reasoning Vision
Phi-4 Reasoning Vision supports chain-of-thought reasoning for visual tasks. See the model-specific guide for prompt formats.
Other models may generate thinking blocks on their own without needing `--enable-thinking`. For those models, `--thinking-budget` still applies as long as the model emits the start token.