Most models in MLX-VLM work identically: call apply_chat_template, pass images, and call generate. A handful of models, however, have special prompt tokens, unique capabilities, or non-standard generation output, and these require extra care.
For every model covered here, apply_chat_template still handles prompt construction automatically. The special tokens and formats described below are the underlying format that the template builds for you — you only need to write them manually if you bypass apply_chat_template.
DeepSeek-OCR (deepseekocr) and DeepSeek-OCR-2 (deepseekocr_2) are SAM + Qwen2 encoder models optimized for document understanding, text extraction, and visual grounding. OCR-2 uses a wider projection dimension (1280 vs. 1024) but is otherwise identical in usage.

Both models use a dynamic-resolution image pipeline: images are split into a global view (1024×1024 → 256 tokens) and up to six local patches (768×768 → 144 tokens each), for a total of up to 1,121 visual tokens.

Prompt formats

- General OCR: `<|grounding|>OCR this image.`
- Document to Markdown: `<|grounding|>Convert the document to markdown.`
- Free OCR (no layout): `Free OCR.`
- Parse figures/charts: `Parse the figure.`
- Image description: `Describe this image in detail.`
- Text localization: `Locate <|ref|>your text here<|/ref|> in the image.`
Localization output format: <|/ref|><|det|>[[x1, y1, x2, y2]]<|/det|> — coordinates are normalized to the 0–1000 range.
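Because coordinates are normalized to the 0–1000 range regardless of image size, you typically rescale them to pixels yourself. A minimal helper (not part of MLX-VLM, just arithmetic):

```python
def denormalize_bbox(bbox, image_width, image_height):
    """Scale a bounding box from DeepSeek-OCR's normalized 0-1000
    range to pixel coordinates for a given image size."""
    x1, y1, x2, y2 = bbox
    return (
        x1 * image_width / 1000,
        y1 * image_height / 1000,
        x2 * image_width / 1000,
        y2 * image_height / 1000,
    )

# A box covering the right half of a 2000x1000 image
print(denormalize_bbox([500, 0, 1000, 1000], 2000, 1000))
# (1000.0, 0.0, 2000.0, 1000.0)
```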

Special tokens

- `<|grounding|>`: enables structured output mode (OCR, markdown, tables)
- `<|ref|>...<|/ref|>`: wraps the text to locate in the image
- `<|det|>...<|/det|>`: wraps the bounding-box output

CLI examples

mlx_vlm.generate \
    --model mlx-community/DeepSeek-OCR-bf16 \
    --image receipt.jpg \
    --prompt "<|grounding|>OCR this image." \
    --max-tokens 1000

Python — text localization with bounding-box parsing

import re
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("mlx-community/DeepSeek-OCR-bf16")

text_to_find = "Total liabilities"
prompt = f"Locate <|ref|>{text_to_find}<|/ref|> in the image."
formatted_prompt = apply_chat_template(processor, model.config, prompt, num_images=1)

result = generate(
    model=model,
    processor=processor,
    image="financial_table.png",
    prompt=formatted_prompt,
    max_tokens=100,
    temperature=0.0,
)

# Output format: <|/ref|><|det|>[[x1, y1, x2, y2]]<|/det|>
# Coordinates are normalized to 0-1000
match = re.search(r'\[\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]\]', result.text)
if match:
    x1, y1, x2, y2 = map(int, match.groups())
    print(f"Bounding box: ({x1}, {y1}) to ({x2}, {y2})")

Controlling dynamic resolution

By default, the model uses 1–6 local patches. You can trade speed for detail by limiting the patch count:
# Global view only — fastest (257 tokens)
mlx_vlm.generate \
    --model mlx-community/DeepSeek-OCR-bf16 \
    --image document.png \
    --prompt "<|grounding|>OCR this image." \
    --max-tokens 1000 \
    --processor-kwargs '{"cropping": false}'

# Up to 3 patches — balanced
mlx_vlm.generate \
    --model mlx-community/DeepSeek-OCR-bf16 \
    --image document.png \
    --prompt "<|grounding|>OCR this image." \
    --max-tokens 1000 \
    --processor-kwargs '{"cropping": true, "max_patches": 3}'
| Configuration | Patches | Visual tokens |
| --- | --- | --- |
| cropping=False | 0 | 257 |
| max_patches=1 | 1 | 401 |
| max_patches=3 | 1–3 | 401–689 |
| max_patches=6 (default) | 1–6 | 401–1121 |
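The token counts above fit a simple formula: 257 for the global view (one more than the 256 image tokens stated earlier, presumably a separator) plus 144 per local patch. A quick sketch for budgeting context, not an MLX-VLM API:

```python
GLOBAL_VIEW_TOKENS = 257  # 1024x1024 global view (incl. one extra token)
PATCH_TOKENS = 144        # per 768x768 local patch

def visual_token_budget(num_patches):
    """Total visual tokens for a given number of local patches (0-6)."""
    return GLOBAL_VIEW_TOKENS + PATCH_TOKENS * num_patches

for n in (0, 1, 3, 6):
    print(n, visual_token_budget(n))
```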
Use temperature=0.0 for OCR tasks to get deterministic output. Increase max_tokens to 2000 or more for full-page documents.
DOTS models (dots_ocr) are vision-language models from rednote-hilab designed for document parsing, layout analysis, table and formula extraction, and structured JSON output. dots.mocr extends dots.ocr with stronger multilingual parsing.

Layout JSON extraction

The most powerful prompt instructs the model to output a structured JSON with bounding boxes, element categories, and formatted text:
mlx_vlm.generate \
  --model mlx-community/dots.mocr-4bit \
  --image page.png \
  --prompt "Please output the layout information from the PDF image, including each layout element's bbox, its category, and the corresponding text content within the bbox.

1. Bbox format: [x1, y1, x2, y2]

2. Layout Categories: The possible categories are ['Caption', 'Footnote', 'Formula', 'List-item', 'Page-footer', 'Page-header', 'Picture', 'Section-header', 'Table', 'Text', 'Title'].

3. Text Extraction & Formatting Rules:
    - Picture: For the 'Picture' category, the text field should be omitted.
    - Formula: Format its text as LaTeX.
    - Table: Format its text as HTML.
    - All Others (Text, Title, etc.): Format their text as Markdown.

4. Constraints:
    - The output text must be the original text from the image, with no translation.
    - All layout elements must be sorted according to human reading order.

5. Final Output: The entire output must be a single JSON object." \
  --max-tokens 5000

Python — layout JSON extraction

from mlx_vlm import generate, load
from mlx_vlm.prompt_utils import apply_chat_template

MODEL = "mlx-community/dots.mocr-4bit"
PROMPT = """Please output the layout information from the PDF image, including each layout element's bbox, its category, and the corresponding text content within the bbox.

1. Bbox format: [x1, y1, x2, y2]
2. Layout Categories: ['Caption', 'Footnote', 'Formula', 'List-item', 'Page-footer', 'Page-header', 'Picture', 'Section-header', 'Table', 'Text', 'Title']
3. Text Extraction & Formatting Rules:
    - Formula: Format its text as LaTeX.
    - Table: Format its text as HTML.
    - All Others: Format their text as Markdown.
4. All layout elements must be sorted in reading order.
5. Final Output: The entire output must be a single JSON object."""

model, processor = load(MODEL)
formatted_prompt = apply_chat_template(processor, model.config, PROMPT, num_images=1)

result = generate(
    model=model,
    processor=processor,
    prompt=formatted_prompt,
    image="page.png",
    max_tokens=5000,
    temperature=0.0,
)
print(result.text)
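Downstream code typically parses the returned layout JSON and regroups it by category. A minimal sketch, assuming the output parses as a JSON array of elements with bbox, category, and (optionally) text fields; the exact schema is whatever the prompt requests, so validate against real outputs before relying on this shape:

```python
import json

# Illustrative model output; real field names and nesting may differ.
raw_output = """[
  {"bbox": [72, 60, 540, 96], "category": "Title", "text": "# Quarterly Report"},
  {"bbox": [72, 120, 540, 380], "category": "Table", "text": "<table>...</table>"},
  {"bbox": [72, 400, 300, 520], "category": "Picture"}
]"""

elements = json.loads(raw_output)

# Group elements by layout category, preserving reading order within each
by_category = {}
for element in elements:
    by_category.setdefault(element["category"], []).append(element)

for category, items in sorted(by_category.items()):
    print(f"{category}: {len(items)} element(s)")
```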
For long or layout-heavy documents, increase --max-tokens significantly (5000+). Keep prompts explicit about the desired schema and element ordering.
phi4_siglip is Microsoft’s Phi-4 Reasoning Vision model (microsoft/Phi-4-reasoning-vision-15B), a ~15B-parameter model that combines a Phi-3 language backbone with a SigLIP2 NaFlex vision encoder.

The SigLIP2 NaFlex encoder processes images at variable resolution, producing 256–3600 patches depending on input size and aspect ratio, making it efficient across different image dimensions.

Key properties

| Property | Value |
| --- | --- |
| HF model ID | microsoft/Phi-4-reasoning-vision-15B |
| Architecture | Phi-3 language + SigLIP2 NaFlex vision + 2-layer MLP GELU projector |
| Parameters | ~15B |
| Image token | <image> (index -200) |
| Vision patches | 256–3600 (variable, NaFlex dynamic resolution) |

CLI

mlx_vlm.generate \
    --model microsoft/Phi-4-reasoning-vision-15B \
    --image path/to/image.jpg \
    --prompt "Describe this image in detail." \
    --max-tokens 512 \
    --temp 0.0

Python

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("microsoft/Phi-4-reasoning-vision-15B")

formatted = apply_chat_template(
    processor, model.config, "Describe this image.", num_images=1
)

output = generate(
    model,
    processor,
    formatted,
    ["path/to/image.jpg"],
    max_tokens=512,
    temperature=0.0,
)
print(output)
NaFlex produces variable-length image feature sequences, so effective context length depends on input resolution. High-resolution images consume significantly more context.
phi4mm is Microsoft’s Phi-4 Multimodal Instruct model (microsoft/Phi-4-multimodal-instruct) — a tri-modal model supporting text, image, and audio inputs simultaneously.

Architecture highlights

| Component | Details |
| --- | --- |
| Language model | Phi-4 (32 layers, 3072 hidden, 24 heads, 8 KV heads) |
| Vision encoder | SigLIP-2 (27 layers, 1152 hidden, 16 heads) |
| Audio encoder | Cascaded Conformer (24 blocks, 1024 dim, 16 heads) |
| Vision projector | 2-layer MLP (1152 → 3072 → 3072, GELU) |
The model ships with two LoRA adapters — a vision LoRA (r=256) and a speech LoRA (r=320) — which are merged automatically at load time. set_modality() selects the correct adapter at runtime based on input types.

CLI examples

mlx_vlm.generate \
  --model microsoft/Phi-4-multimodal-instruct \
  --image /path/to/image.jpg \
  --prompt "Describe this image." \
  --max-tokens 256

Python

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("microsoft/Phi-4-multimodal-instruct")

image = ["/path/to/image.jpg"]
audio = ["/path/to/audio.wav"]
prompt = "What animals are in the image?"

formatted_prompt = apply_chat_template(
    processor,
    model.config,
    prompt,
    num_images=len(image),
    num_audios=len(audio),
)

result = generate(
    model=model,
    processor=processor,
    prompt=formatted_prompt,
    image=image,
    audio=audio,
    max_tokens=256,
    temperature=0.0,
)
print(result.text)
Audio is processed as a 16 kHz mono waveform; the processor resamples other inputs automatically. The <|image_1|> and <|audio_1|> placeholders are inserted by apply_chat_template, so do not add them manually.
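If you preprocess audio yourself before handing it to the processor, downmixing and resampling to 16 kHz mono looks roughly like the sketch below. This naive linear-interpolation resampler is for illustration only; use a proper polyphase or sinc resampler (e.g. from a DSP library) for production quality:

```python
import numpy as np

def to_16k_mono(waveform: np.ndarray, sample_rate: int) -> np.ndarray:
    """Downmix to mono and linearly resample to 16 kHz (naive sketch)."""
    if waveform.ndim == 2:  # (channels, samples) -> mono
        waveform = waveform.mean(axis=0)
    if sample_rate == 16000:
        return waveform
    duration = waveform.shape[0] / sample_rate
    n_out = int(round(duration * 16000))
    old_t = np.linspace(0.0, duration, num=waveform.shape[0], endpoint=False)
    new_t = np.linspace(0.0, duration, num=n_out, endpoint=False)
    return np.interp(new_t, old_t, waveform)

# One second of 44.1 kHz stereo -> 16000 mono samples
audio = np.random.randn(2, 44100)
print(to_16k_mono(audio, 44100).shape)  # (16000,)
```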
minicpmo is an omni model (openbmb/MiniCPM-o-4_5) that supports text, image, and audio understanding. The processor and tokenizer code is ported in-tree, so --trust-remote-code is not required.

MiniCPM-o includes a thinking mode that is enabled by default; you can disable it via chat_template_kwargs.

CLI examples

mlx_vlm.generate \
  --model openbmb/MiniCPM-o-4_5 \
  --image /path/to/image.jpg \
  --prompt "Describe this image briefly." \
  --max-tokens 128 \
  --temperature 0

Python

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("openbmb/MiniCPM-o-4_5")

image = ["/path/to/image.jpg"]
audio = ["/path/to/audio.wav"]
prompt = "Summarize the visual and audio content."

formatted_prompt = apply_chat_template(
    processor,
    model.config,
    prompt,
    num_images=len(image),
    num_audios=len(audio),
    chat_template_kwargs={"enable_thinking": False},
)

result = generate(
    model=model,
    processor=processor,
    prompt=formatted_prompt,
    image=image,
    audio=audio,
    max_tokens=256,
    temperature=0.0,
)
print(result.text)
Do not manually add <image> or <audio> markers when using apply_chat_template — placeholders are inserted automatically.
molmo_point is Allen AI’s MolmoPoint model (allenai/MolmoPoint-8B), a vision-language model with pixel-precise pointing and grounding capabilities. In addition to standard VQA and description tasks, it can return exact (x, y) coordinates for objects named in the prompt.

How pointing works

MolmoPoint extends the standard vocabulary with three special token types:
  1. Patch tokens — select which image patch contains the target object
  2. Subpatch tokens — select a specific ViT sub-patch within that patch
  3. Location tokens — refine the point to a 3×3 grid cell within the sub-patch
Each point in the output is encoded as <POINT_patch> <POINT_subpatch> <POINT_location> object_id.
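For intuition, the three-level hierarchy can be read as successive grid refinements. The sketch below is purely illustrative: the grid sizes and row-major cell ordering are hypothetical assumptions, not MolmoPoint's actual token layout; decode real outputs with extract_points_from_text from point_utils as in the Python example further down.

```python
def decode_point(patch_idx, subpatch_idx, loc_idx,
                 patches_per_side, subpatches_per_side):
    """Hypothetical decoding of a (patch, subpatch, 3x3 location)
    triple to a normalized (x, y) in [0, 1], assuming row-major grids
    and taking the center of the final cell."""
    px, py = patch_idx % patches_per_side, patch_idx // patches_per_side
    sx, sy = subpatch_idx % subpatches_per_side, subpatch_idx // subpatches_per_side
    lx, ly = loc_idx % 3, loc_idx // 3
    x = (px + (sx + (lx + 0.5) / 3) / subpatches_per_side) / patches_per_side
    y = (py + (sy + (ly + 0.5) / 3) / subpatches_per_side) / patches_per_side
    return x, y

# Center cell of the center sub-patch of patch 0, in a 2x2 / 3x3 layout
print(decode_point(0, 4, 4, 2, 3))  # (0.25, 0.25)
```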

CLI

# Standard description
python -m mlx_vlm.generate \
    --model allenai/MolmoPoint-8B \
    --image path/to/image.jpg \
    --prompt "Describe this image" \
    --max-tokens 200

# Point to objects
python -m mlx_vlm.generate \
    --model allenai/MolmoPoint-8B \
    --image path/to/image.jpg \
    --prompt "Point to the cats" \
    --max-tokens 50

Python — point extraction and visualization

import mlx.core as mx
from mlx_vlm.utils import load, prepare_inputs
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.generate import generate_step
from mlx_vlm.models.molmo_point.point_utils import (
    extract_points_from_text,
    draw_points_on_image,
)

model, processor = load("allenai/MolmoPoint-8B")
mx.eval(model.parameters())

image_path = "path/to/image.jpg"
prompt = apply_chat_template(
    processor, model.config, "Point to the cats", num_images=1
)
inputs = prepare_inputs(processor, images=image_path, prompts=prompt)

input_ids = inputs["input_ids"]
pixel_values = inputs.get("pixel_values", None)
mask = inputs.get("attention_mask", None)
kwargs = {
    k: v
    for k, v in inputs.items()
    if k not in ["input_ids", "pixel_values", "attention_mask"]
}

# Generate
tokens = []
for n, (token, _) in enumerate(
    generate_step(input_ids, model, pixel_values, mask, max_tokens=50, **kwargs)
):
    tokens.append(token)
    if n >= 49:
        break

output_text = processor.tokenizer.decode(tokens)

# Extract (x, y) coordinates from output tokens
if hasattr(processor, "_pointing_metadata") and processor._pointing_metadata:
    points = extract_points_from_text(
        output_text,
        processor._pointing_metadata,
        no_more_points_class=model.config.no_more_points_class,
        patch_location=model.config.patch_location,
    )
    for obj_id, img_num, x, y in points:
        print(f"Object {obj_id}: ({x:.1f}, {y:.1f})")

    # Save image with annotated points
    draw_points_on_image(image_path, points, "output_pointed.jpg")
Point extraction requires _pointing_metadata stored on the processor after calling prepare_inputs with images. The full bf16 model uses approximately 39 GB peak memory.
Moondream3 (moondream/moondream3-preview) is a Mixture-of-Experts vision-language model with 9.27B total parameters and approximately 2B active parameters per token. It uses a SigLIP-based vision encoder and an MoE text decoder.

Architecture summary

| Component | Details |
| --- | --- |
| Model ID | moondream/moondream3-preview |
| Total parameters | ~9.27B (~2B active per token) |
| Vision encoder | SigLIP ViT (27 layers, 1152 dim, 16 heads, patch size 14, crop size 378) |
| Language model | 24 layers; layers 0–3 dense, layers 4–23 MoE (64 experts, top-8) |
| Tokenizer | moondream/starmie-v1 (SuperBPE) |
The model processes images via multi-crop (up to 12 crops), producing 729 tokens per crop from 14×14 patches on 378×378 crops. Global and local crop features are projected to 2048 dimensions via a 2-layer MLP.

Moondream3 uses a prompt-only format (PROMPT_ONLY); apply_chat_template is not required for basic use.
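The 729-tokens-per-crop figure follows directly from the crop geometry, which is worth keeping in mind when budgeting context for multi-crop inputs:

```python
CROP_SIZE = 378   # pixels per crop side
PATCH_SIZE = 14   # ViT patch size
MAX_CROPS = 12

patches_per_side = CROP_SIZE // PATCH_SIZE   # 378 / 14 = 27
tokens_per_crop = patches_per_side ** 2      # 27^2 = 729
print(tokens_per_crop, MAX_CROPS * tokens_per_crop)  # 729 8748
```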

CLI

python -m mlx_vlm.generate \
    --model moondream/moondream3-preview \
    --image path/to/image.jpg \
    --prompt "Describe this image" \
    --max-tokens 200

Python

from mlx_vlm import load, generate

model, processor = load("moondream/moondream3-preview")

output = generate(
    model,
    processor,
    "Describe this image",
    ["path/to/image.jpg"],
    max_tokens=200,
)
print(output)
The model may emit a thinking token (<|md_reserved_4|>) before the answer — this is expected behavior. Peak memory for the full bf16 model is approximately 24 GB.
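If you post-process Moondream3 output, you may want to drop the reserved thinking token before displaying the answer. A minimal sketch, assuming the token appears verbatim in the decoded text:

```python
THINK_TOKEN = "<|md_reserved_4|>"

def strip_thinking(text: str) -> str:
    """Remove Moondream3's reserved thinking token from generated text."""
    return text.replace(THINK_TOKEN, "").strip()

print(strip_thinking("<|md_reserved_4|>The cat is gray."))
# The cat is gray.
```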
