apply_chat_template()

Format a prompt (or conversation history) into the string that the model’s tokenizer expects. Internally looks up the model type from the config, builds message objects in the correct format for that model, then applies the tokenizer’s chat template.
from mlx_vlm.prompt_utils import apply_chat_template

formatted = apply_chat_template(processor, config, "Describe this image.", num_images=1)

Signature

def apply_chat_template(
    processor,
    config: Union[Dict[str, Any], Any],
    prompt: Union[str, Dict[str, Any], List[Any]],
    add_generation_prompt: bool = True,
    return_messages: bool = False,
    num_images: int = 0,
    num_audios: int = 0,
    **kwargs,
) -> Union[List[Dict[str, Any]], str, Any]:

Parameters

processor
ProcessorMixin
required
The processor returned by load(). Used to apply the tokenizer’s chat template.
config
dict | Any
required
The model configuration dict or config object. Must contain a model_type key. Retrieve with model.config or load_config().
prompt
str | dict | list
required
The user input. Three forms are accepted:
  • str — A single text prompt.
  • dict — A single message with role and content keys.
  • list — A conversation history: a list of strings, dicts, or objects with .role / .content attributes.
add_generation_prompt
bool
default:"True"
Append the model’s generation-prompt suffix (e.g. <|im_start|>assistant) so the model knows to generate a response.
return_messages
bool
default:"False"
When True, return the intermediate list of message dicts instead of the fully rendered chat template string. Useful for inspection or custom post-processing.
num_images
int
default:"0"
Number of images in the input. Image tokens are injected into the first user message.
num_audios
int
default:"0"
Number of audio files in the input. Audio tokens are injected into the first user message.
enable_thinking
bool
default:"False"
Pass enable_thinking=True to the underlying chat template for reasoning models such as Qwen3.
video
str
default:"None"
Path or URL to a video file. When provided, a video message is constructed instead of image tokens. Supported by Qwen2-VL, Qwen2.5-VL, Qwen3-VL, and related models.
max_pixels
int
default:"50176"
Maximum pixels per video frame (224 * 224 = 50176 by default). Used with video.
fps
int
default:"1"
Frames per second to extract from the video. Used with video.

Returns

formatted_prompt
str | list
The chat-template-rendered string ready to pass to generate(). When return_messages=True, returns the list of message dicts before template rendering. For paligemma, molmo, and florence2 model types, returns only the last message dict.

Examples

from mlx_vlm import load
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("mlx-community/Qwen2-VL-2B-Instruct-4bit")

prompt = apply_chat_template(
    processor, model.config, "What is the capital of France?"
)
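Conceptually, the three accepted prompt forms (string, single message dict, or conversation list) are first normalized into a list of message dicts before the chat template is applied. A simplified sketch of that normalization (illustrative only, not the library's actual internals; normalize_prompt is a hypothetical helper):

```python
from typing import Any, Dict, List, Union


def normalize_prompt(prompt: Union[str, Dict[str, Any], List[Any]]) -> List[Dict[str, Any]]:
    """Normalize the three accepted prompt forms into a list of
    {"role", "content"} message dicts (illustrative sketch)."""
    if isinstance(prompt, str):
        # A bare string becomes a single user message.
        return [{"role": "user", "content": prompt}]
    if isinstance(prompt, dict):
        # A single message dict is wrapped in a one-element list.
        return [prompt]
    messages: List[Dict[str, Any]] = []
    for item in prompt:
        if isinstance(item, str):
            messages.append({"role": "user", "content": item})
        elif isinstance(item, dict):
            messages.append(item)
        else:
            # Objects exposing .role / .content attributes.
            messages.append({"role": item.role, "content": item.content})
    return messages
```

This is why a plain string and a one-message conversation list produce the same rendered prompt.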

get_message_json()

Build a single message dict in the format expected by a specific model. This is called internally by apply_chat_template() but is also available for constructing messages manually.

Signature

def get_message_json(
    model_name: str,
    prompt: str,
    role: str = "user",
    skip_image_token: bool = False,
    skip_audio_token: bool = False,
    num_images: int = 0,
    num_audios: int = 0,
    **kwargs,
) -> Union[str, Dict[str, Any]]:

Parameters

model_name
str
required
The model type string (the model_type value from the config, e.g. "qwen2_vl", "llava", "gemma3").
prompt
str
required
The text content of the message.
role
str
default:"user"
Message role: "user", "assistant", or "system".
skip_image_token
bool
default:"False"
Skip injecting image tokens even when num_images > 0. Used for all but the first turn in multi-turn conversations.
skip_audio_token
bool
default:"False"
Skip injecting audio tokens even when num_audios > 0.
num_images
int
default:"0"
Number of image tokens to inject.
num_audios
int
default:"0"
Number of audio tokens to inject.
video
str
default:"None"
Video path or URL. When provided, returns a video message dict instead of an image message.

Returns

message
str | dict
A message dict (e.g. {"role": "user", "content": [...]}) or a plain string, depending on the model’s expected format.

Example

from mlx_vlm.prompt_utils import get_message_json

# Qwen2-VL expects a list-with-image format
msg = get_message_json("qwen2_vl", "Describe this image.", num_images=1)
# {"role": "user", "content": [{"type": "text", "text": "Describe this image."}, {"type": "image"}]}

# LLaVA uses the same list-with-image format
msg = get_message_json("llava", "Describe this image.", num_images=1)
# {"role": "user", "content": [{"type": "text", "text": "Describe this image."}, {"type": "image"}]}
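In multi-turn conversations only the first user turn carries the image entry, which is what skip_image_token is for. A self-contained sketch of that pattern (illustrative only; build_turn is a hypothetical stand-in for get_message_json using the list-with-image format):

```python
from typing import Any, Dict, List


def build_turn(prompt: str, role: str = "user",
               num_images: int = 0, skip_image_token: bool = False) -> Dict[str, Any]:
    """Build a list-with-image message: text first, then one
    {"type": "image"} entry per image, unless skip_image_token is set."""
    content: List[Dict[str, Any]] = [{"type": "text", "text": prompt}]
    if not skip_image_token:
        content.extend({"type": "image"} for _ in range(num_images))
    return {"role": role, "content": content}


# Only the first user turn gets the image entry.
history = [
    build_turn("Describe this image.", num_images=1),
    build_turn("A cat on a sofa.", role="assistant"),
    build_turn("What color is the cat?", num_images=1, skip_image_token=True),
]
```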

MessageFormat enum

MessageFormat defines the message-structure strategy for each model family. apply_chat_template and get_message_json look up the appropriate format from the MODEL_CONFIG mapping.
class MessageFormat(Enum):
    LIST_WITH_IMAGE                      = "list_with_image"
    LIST_WITH_IMAGE_FIRST                = "list_with_image_first"
    LIST_WITH_IMAGE_URL_FIRST            = "list_with_image_url_first"
    LIST_WITH_IMAGE_TYPE                 = "list_with_image_type"
    LIST_WITH_IMAGE_TYPE_TEXT            = "list_with_image_type_text"
    LIST_WITH_IMAGE_TYPE_TEXT_IMAGE_LAST = "list_with_image_type_text_image_last"
    IMAGE_TOKEN                          = "image_token"
    IMAGE_TOKEN_PIPE                     = "image_token_pipe"
    START_IMAGE_TOKEN                    = "start_image_token"
    IMAGE_TOKEN_NEWLINE                  = "image_token_newline"
    NUMBERED_IMAGE_TOKENS                = "numbered_image_tokens"
    PROMPT_ONLY                          = "prompt_only"
    PROMPT_WITH_IMAGE_TOKEN              = "prompt_with_image_token"
    PROMPT_WITH_START_IMAGE_TOKEN        = "prompt_with_start_image_token"
    VIDEO_WITH_TEXT                      = "video_with_text"
Each value, its behavior, and example models:
LIST_WITH_IMAGE
Content is a list; {"type": "image"} appended after text
Models: qwen2_vl, llava, mllama, idefics2
LIST_WITH_IMAGE_FIRST
Image tokens placed before text in the list
Models: qwen2_5_vl, qwen3_vl, smolvlm, mistral3
LIST_WITH_IMAGE_URL_FIRST
Uses {"type": "image_url"} objects, image first
Models: ernie4_5_moe_vl
LIST_WITH_IMAGE_TYPE
Content items use {"type": "content"} text nodes
Models: internvl_chat
LIST_WITH_IMAGE_TYPE_TEXT
Content items use {"type": "text"} nodes
Models: gemma3n, pixtral
IMAGE_TOKEN
Image token <image> prepended to text string
Models: multi_modality, minicpmo
IMAGE_TOKEN_PIPE
Image token <|image|> prepended to text string
Models: jina_vlm
START_IMAGE_TOKEN
<start_of_image> appended after text
Models: gemma3
IMAGE_TOKEN_NEWLINE
<image>\n prepended to text
Models: deepseek_vl_v2, phi4-siglip
NUMBERED_IMAGE_TOKENS
<|image_1|>, <|image_2|> … prepended
Models: phi3_v, phi4mm
PROMPT_ONLY
No image tokens; returns prompt string as-is
Models: florence2, molmo
PROMPT_WITH_IMAGE_TOKEN
<image> × N prepended to prompt string
Models: paligemma
VIDEO_WITH_TEXT
Video message dict with text
Models: Qwen2-VL, Qwen2.5-VL, Qwen3-VL
You rarely need to use MessageFormat directly. apply_chat_template() selects the correct format automatically based on model_type.
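The string-based strategies reduce to simple token prepending or appending. A rough illustration of three of them (not the library's code; the function names here are made up for the sketch):

```python
def image_token(prompt: str, num_images: int) -> str:
    # IMAGE_TOKEN: prepend one <image> token per image.
    return "<image>" * num_images + prompt


def numbered_image_tokens(prompt: str, num_images: int) -> str:
    # NUMBERED_IMAGE_TOKENS: prepend <|image_1|>, <|image_2|>, ...
    return "".join(f"<|image_{i}|>" for i in range(1, num_images + 1)) + prompt


def start_image_token(prompt: str, num_images: int) -> str:
    # START_IMAGE_TOKEN: append <start_of_image> after the text.
    return prompt + "<start_of_image>" * num_images
```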
