apply_chat_template()

Format a prompt (or conversation history) into the string that the model’s tokenizer expects. Internally looks up the model type from the config, builds message objects in the correct format for that model, then applies the tokenizer’s chat template.
from mlx_vlm.prompt_utils import apply_chat_template

formatted = apply_chat_template(processor, config, "Describe this image.", num_images=1)

Signature

def apply_chat_template(
    processor,
    config: Union[Dict[str, Any], Any],
    prompt: Union[str, Dict[str, Any], List[Any]],
    add_generation_prompt: bool = True,
    return_messages: bool = False,
    num_images: int = 0,
    num_audios: int = 0,
    **kwargs,
) -> Union[List[Dict[str, Any]], str, Any]:

Parameters

processor
ProcessorMixin
required
The processor returned by load(). Used to apply the tokenizer’s chat template.
config
dict | Any
required
The model configuration dict or config object. Must contain a model_type key. Retrieve with model.config or load_config().
prompt
str | dict | list
required
The user input. Three forms are accepted:
  • str — A single text prompt.
  • dict — A single message with role and content keys.
  • list — A conversation history: a list of strings, dicts, or objects with .role / .content attributes.
add_generation_prompt
bool
default:"True"
Append the model’s generation-prompt suffix (e.g. <|im_start|>assistant) so the model knows to generate a response.
return_messages
bool
default:"False"
When True, return the intermediate list of message dicts instead of the fully rendered chat template string. Useful for inspection or custom post-processing.
num_images
int
default:"0"
Number of images in the input. Image tokens are injected into the first user message.
num_audios
int
default:"0"
Number of audio files in the input. Audio tokens are injected into the first user message.
enable_thinking
bool
default:"False"
Pass enable_thinking=True to the underlying chat template for reasoning models such as Qwen3.
video
str
default:"None"
Path or URL to a video file. When provided, a video message is constructed instead of image tokens. Supported by Qwen2-VL, Qwen2.5-VL, Qwen3-VL, and related models.
max_pixels
int
default:"50176"
Maximum pixels per video frame (224 * 224 = 50176 by default). Used with video.
fps
int
default:"1"
Frames per second to extract from the video. Used with video.

Returns

formatted_prompt
str | list
The chat-template-rendered string ready to pass to generate(). When return_messages=True, returns the list of message dicts before template rendering. For paligemma, molmo, and florence2 model types, returns only the last message dict.

Examples

from mlx_vlm import load
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("mlx-community/Qwen2-VL-2B-Instruct-4bit")

prompt = apply_chat_template(
    processor, model.config, "What is the capital of France?"
)
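Conceptually, the three accepted prompt forms (string, single message dict, or conversation list) are first normalized into a list of message dicts before the chat template is applied. A simplified sketch of that normalization (illustrative only, not the library's actual internals; normalize_prompt is a hypothetical helper):

```python
from typing import Any, Dict, List, Union


def normalize_prompt(prompt: Union[str, Dict[str, Any], List[Any]]) -> List[Dict[str, Any]]:
    """Normalize the three accepted prompt forms into a list of
    {"role", "content"} message dicts (illustrative sketch)."""
    if isinstance(prompt, str):
        # A bare string becomes a single user message.
        return [{"role": "user", "content": prompt}]
    if isinstance(prompt, dict):
        # A single message dict is wrapped in a one-element list.
        return [prompt]
    messages: List[Dict[str, Any]] = []
    for item in prompt:
        if isinstance(item, str):
            messages.append({"role": "user", "content": item})
        elif isinstance(item, dict):
            messages.append(item)
        else:
            # Objects exposing .role / .content attributes.
            messages.append({"role": item.role, "content": item.content})
    return messages
```

This is why a plain string and a one-message conversation list produce the same rendered prompt.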

get_message_json()

Build a single message dict in the format expected by a specific model. This is called internally by apply_chat_template() but is also available for constructing messages manually.

Signature

def get_message_json(
    model_name: str,
    prompt: str,
    role: str = "user",
    skip_image_token: bool = False,
    skip_audio_token: bool = False,
    num_images: int = 0,
    num_audios: int = 0,
    **kwargs,
) -> Union[str, Dict[str, Any]]:

Parameters

model_name
str
required
The model type string (the model_type value from the config, e.g. "qwen2_vl", "llava", "gemma3").
prompt
str
required
The text content of the message.
role
str
default:"user"
Message role: "user", "assistant", or "system".
skip_image_token
bool
default:"False"
Skip injecting image tokens even when num_images > 0. Used for all but the first turn in multi-turn conversations.
skip_audio_token
bool
default:"False"
Skip injecting audio tokens even when num_audios > 0.
num_images
int
default:"0"
Number of image tokens to inject.
num_audios
int
default:"0"
Number of audio tokens to inject.
video
str
default:"None"
Video path or URL. When provided, returns a video message dict instead of an image message.

Returns

message
str | dict
A message dict (e.g. {"role": "user", "content": [...]}) or a plain string, depending on the model’s expected format.

Example

from mlx_vlm.prompt_utils import get_message_json

# Qwen2-VL expects a list-with-image format
msg = get_message_json("qwen2_vl", "Describe this image.", num_images=1)
# {"role": "user", "content": [{"type": "text", "text": "Describe this image."}, {"type": "image"}]}

# LLaVA uses the same list-with-image format
msg = get_message_json("llava", "Describe this image.", num_images=1)
# {"role": "user", "content": [{"type": "text", "text": "Describe this image."}, {"type": "image"}]}
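In multi-turn conversations only the first user turn carries the image entry, which is what skip_image_token is for. A self-contained sketch of that pattern (illustrative only; build_turn is a hypothetical stand-in for get_message_json using the list-with-image format):

```python
from typing import Any, Dict, List


def build_turn(prompt: str, role: str = "user",
               num_images: int = 0, skip_image_token: bool = False) -> Dict[str, Any]:
    """Build a list-with-image message: text first, then one
    {"type": "image"} entry per image, unless skip_image_token is set."""
    content: List[Dict[str, Any]] = [{"type": "text", "text": prompt}]
    if not skip_image_token:
        content.extend({"type": "image"} for _ in range(num_images))
    return {"role": role, "content": content}


# Only the first user turn gets the image entry.
history = [
    build_turn("Describe this image.", num_images=1),
    build_turn("A cat on a sofa.", role="assistant"),
    build_turn("What color is the cat?", num_images=1, skip_image_token=True),
]
```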

MessageFormat enum

MessageFormat defines the message-structure strategy for each model family. apply_chat_template and get_message_json look up the appropriate format from the MODEL_CONFIG mapping.
class MessageFormat(Enum):
    LIST_WITH_IMAGE                      = "list_with_image"
    LIST_WITH_IMAGE_FIRST                = "list_with_image_first"
    LIST_WITH_IMAGE_URL_FIRST            = "list_with_image_url_first"
    LIST_WITH_IMAGE_TYPE                 = "list_with_image_type"
    LIST_WITH_IMAGE_TYPE_TEXT            = "list_with_image_type_text"
    LIST_WITH_IMAGE_TYPE_TEXT_IMAGE_LAST = "list_with_image_type_text_image_last"
    IMAGE_TOKEN                          = "image_token"
    IMAGE_TOKEN_PIPE                     = "image_token_pipe"
    START_IMAGE_TOKEN                    = "start_image_token"
    IMAGE_TOKEN_NEWLINE                  = "image_token_newline"
    NUMBERED_IMAGE_TOKENS                = "numbered_image_tokens"
    PROMPT_ONLY                          = "prompt_only"
    PROMPT_WITH_IMAGE_TOKEN              = "prompt_with_image_token"
    PROMPT_WITH_START_IMAGE_TOKEN        = "prompt_with_start_image_token"
    VIDEO_WITH_TEXT                      = "video_with_text"
Each value, its behavior, and example models:
LIST_WITH_IMAGE
Content is a list; {"type": "image"} appended after text
Models: qwen2_vl, llava, mllama, idefics2
LIST_WITH_IMAGE_FIRST
Image tokens placed before text in the list
Models: qwen2_5_vl, qwen3_vl, smolvlm, mistral3
LIST_WITH_IMAGE_URL_FIRST
Uses {"type": "image_url"} objects, image first
Models: ernie4_5_moe_vl
LIST_WITH_IMAGE_TYPE
Content items use {"type": "content"} text nodes
Models: internvl_chat
LIST_WITH_IMAGE_TYPE_TEXT
Content items use {"type": "text"} nodes
Models: gemma3n, pixtral
IMAGE_TOKEN
Image token <image> prepended to text string
Models: multi_modality, minicpmo
IMAGE_TOKEN_PIPE
Image token <|image|> prepended to text string
Models: jina_vlm
START_IMAGE_TOKEN
<start_of_image> appended after text
Models: gemma3
IMAGE_TOKEN_NEWLINE
<image>\n prepended to text
Models: deepseek_vl_v2, phi4-siglip
NUMBERED_IMAGE_TOKENS
<|image_1|>, <|image_2|> … prepended
Models: phi3_v, phi4mm
PROMPT_ONLY
No image tokens; returns prompt string as-is
Models: florence2, molmo
PROMPT_WITH_IMAGE_TOKEN
<image> × N prepended to prompt string
Models: paligemma
VIDEO_WITH_TEXT
Video message dict with text
Models: Qwen2-VL, Qwen2.5-VL, Qwen3-VL
You rarely need to use MessageFormat directly. apply_chat_template() selects the correct format automatically based on model_type.
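The string-based strategies reduce to simple token prepending or appending. A rough illustration of three of them (not the library's code; the function names here are made up for the sketch):

```python
def image_token(prompt: str, num_images: int) -> str:
    # IMAGE_TOKEN: prepend one <image> token per image.
    return "<image>" * num_images + prompt


def numbered_image_tokens(prompt: str, num_images: int) -> str:
    # NUMBERED_IMAGE_TOKENS: prepend <|image_1|>, <|image_2|>, ...
    return "".join(f"<|image_{i}|>" for i in range(1, num_images + 1)) + prompt


def start_image_token(prompt: str, num_images: int) -> str:
    # START_IMAGE_TOKEN: append <start_of_image> after the text.
    return prompt + "<start_of_image>" * num_images
```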
