apply_chat_template, pass images, and call generate. A subset of models have special prompt tokens, unique capabilities, or non-standard generation output that require extra care.
For every model covered here,
apply_chat_template still handles prompt construction automatically. The special tokens and formats described below are the underlying format that the template builds for you — you only need to write them manually if you bypass apply_chat_template.

DeepSeek-OCR and DeepSeek-OCR-2

DeepSeek-OCR (deepseekocr) and DeepSeek-OCR-2 (deepseekocr_2) are SAM + Qwen2 encoder models optimized for document understanding, text extraction, and visual grounding. OCR-2 uses a wider projection dimension (1280 vs. 1024) but is otherwise identical in usage.

Both models use a dynamic-resolution image pipeline: images are split into a global view (1024×1024 → 256 tokens) and up to six local patches (768×768 → 144 tokens each), giving a total of up to 1,121 visual tokens.

Prompt formats
| Task | Prompt |
|---|---|
| General OCR | <|grounding|>OCR this image. |
| Document to Markdown | <|grounding|>Convert the document to markdown. |
| Free OCR (no layout) | Free OCR. |
| Parse figures/charts | Parse the figure. |
| Image description | Describe this image in detail. |
| Text localization | Locate <|ref|>your text here<|/ref|> in the image. |
Localization output format:

<|ref|>...<|/ref|><|det|>[[x1, y1, x2, y2]]<|/det|> — coordinates are normalized to the 0–1000 range.

Special tokens
| Token | Purpose |
|---|---|
| <|grounding|> | Enables structured output mode (OCR, markdown, tables) |
| <|ref|>...<|/ref|> | Wraps the text to locate in the image |
| <|det|>...<|/det|> | Wraps the bounding-box output |
CLI examples
- General OCR
- Document to Markdown
- Text localization
Python — text localization with bounding-box parsing
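A minimal sketch of the parsing step, assuming the generated text follows the localization format described above; the raw string and image dimensions below are hypothetical:

```python
import json
import re

# Matches <|ref|>text<|/ref|><|det|>[[x1, y1, x2, y2], ...]<|/det|> pairs.
PATTERN = re.compile(r"<\|ref\|>(.*?)<\|/ref\|><\|det\|>(\[\[.*?\]\])<\|/det\|>")

def parse_localizations(text, image_width, image_height):
    """Parse localization output and rescale 0-1000-normalized boxes to pixels."""
    results = []
    for ref, det in PATTERN.findall(text):
        boxes = [
            (
                round(x1 / 1000 * image_width),
                round(y1 / 1000 * image_height),
                round(x2 / 1000 * image_width),
                round(y2 / 1000 * image_height),
            )
            for x1, y1, x2, y2 in json.loads(det)
        ]
        results.append({"text": ref, "boxes": boxes})
    return results

# Hypothetical model output for a 2000x1000 image:
raw = "<|ref|>Total<|/ref|><|det|>[[100, 900, 300, 950]]<|/det|>"
print(parse_localizations(raw, 2000, 1000))
```

The bounding-box payload is valid JSON, so `json.loads` is safer than ad-hoc string splitting and handles multiple boxes per reference.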
Controlling dynamic resolution
By default, the model uses 1–6 local patches. You can trade speed for detail by limiting the patch count:

| Configuration | Patches | Visual tokens |
|---|---|---|
| cropping=False | 0 | 257 |
| max_patches=1 | 1 | 401 |
| max_patches=3 | 1–3 | 401–689 |
| max_patches=6 (default) | 1–6 | 401–1121 |
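The token counts in the table follow a simple linear rule — a 257-token base for the global view (its 256 tokens plus, presumably, one separator) plus 144 tokens per local patch — which can be checked directly:

```python
def visual_tokens(num_patches: int) -> int:
    # 257 base tokens for the global view, 144 per local patch.
    return 257 + 144 * num_patches

# Reproduces the table's endpoints for 0, 1, 3, and 6 patches.
print([visual_tokens(n) for n in (0, 1, 3, 6)])  # [257, 401, 689, 1121]
```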
DOTS-OCR (dots.ocr / dots.mocr)

DOTS models (dots_ocr) are vision-language models from rednote-hilab designed for document parsing, layout analysis, table and formula extraction, and structured JSON output. dots.mocr extends dots.ocr with stronger multilingual parsing.

Layout JSON extraction

The most powerful prompt instructs the model to output a structured JSON with bounding boxes, element categories, and formatted text:

Python — layout JSON extraction
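As a sketch of consuming that output, assuming the model returns a JSON array of elements — the field names "bbox", "category", and "text" are assumptions about the schema, and raw is a hypothetical response:

```python
import json

# Hypothetical layout-JSON response; field names are assumed.
raw = """[
  {"bbox": [10, 20, 500, 60], "category": "Title", "text": "Quarterly Report"},
  {"bbox": [10, 80, 500, 400], "category": "Table", "text": "| Q1 | Q2 |"}
]"""

elements = json.loads(raw)

# Group extracted text by element category for downstream processing.
by_category = {}
for el in elements:
    by_category.setdefault(el["category"], []).append(el["text"])

print(sorted(by_category))  # ['Table', 'Title']
```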
Phi-4 Reasoning Vision (phi4_siglip)

phi4_siglip is Microsoft’s Phi-4 Reasoning Vision model (microsoft/Phi-4-reasoning-vision-15B), a ~15B-parameter model that combines a Phi-3 language backbone with a SigLIP2 NaFlex vision encoder.

The SigLIP2 NaFlex encoder processes images at variable resolution, producing 256–3600 patches depending on input size and aspect ratio, making it efficient across different image dimensions.

Key properties
| Property | Value |
|---|---|
| HF model ID | microsoft/Phi-4-reasoning-vision-15B |
| Architecture | Phi-3 language + SigLIP2 NaFlex vision + 2-layer MLP GELU projector |
| Parameters | ~15B |
| Image token | <image> (index -200) |
| Vision patches | 256–3600 (variable, NaFlex dynamic resolution) |
CLI
Python
NaFlex produces variable-length image feature sequences, so effective context length depends on input resolution. High-resolution images consume significantly more context.
Phi-4 Multimodal (phi4mm)

phi4mm is Microsoft’s Phi-4 Multimodal Instruct model (microsoft/Phi-4-multimodal-instruct) — a tri-modal model supporting text, image, and audio inputs simultaneously.

Architecture highlights
| Component | Details |
|---|---|
| Language model | Phi-4 (32 layers, 3072 hidden, 24 heads, 8 KV heads) |
| Vision encoder | SigLIP-2 (27 layers, 1152 hidden, 16 heads) |
| Audio encoder | Cascaded Conformer (24 blocks, 1024 dim, 16 heads) |
| Vision projector | 2-layer MLP (1152 → 3072 → 3072, GELU) |
set_modality() selects the correct adapter at runtime based on input types.

CLI examples
- Image
- Audio
- Image + audio
Python
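The processor resamples audio automatically, but if you want to prepare waveforms yourself, a numpy-only sketch of the 16 kHz mono conversion (linear interpolation; a dedicated resampler such as soxr or torchaudio avoids aliasing):

```python
import numpy as np

def to_16k_mono(waveform: np.ndarray, sample_rate: int) -> np.ndarray:
    """Downmix to mono and linearly resample to 16 kHz."""
    if waveform.ndim == 2:  # (channels, samples) -> mono
        waveform = waveform.mean(axis=0)
    target_rate = 16_000
    if sample_rate == target_rate:
        return waveform
    duration = waveform.shape[0] / sample_rate
    n_out = int(round(duration * target_rate))
    old_t = np.arange(waveform.shape[0]) / sample_rate
    new_t = np.arange(n_out) / target_rate
    return np.interp(new_t, old_t, waveform)

# One second of 44.1 kHz stereo noise -> 16,000 mono samples.
audio = np.random.randn(2, 44_100)
out = to_16k_mono(audio, 44_100)
print(out.shape)  # (16000,)
```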
Audio input must be a 16 kHz mono waveform. The processor handles resampling automatically. The <|image_1|> and <|audio_1|> placeholders are inserted by apply_chat_template — do not add them manually.

MiniCPM-o (minicpmo)
minicpmo is an omni model (openbmb/MiniCPM-o-4_5) that supports text, image, and audio understanding. The processor and tokenizer code is ported in-tree, so --trust-remote-code is not required.

MiniCPM-o includes a thinking mode that is enabled by default. You can disable it via chat_template_kwargs.

CLI examples
- Image
- Audio
- Image + audio
- Disable thinking
Python
Do not manually add <image> or <audio> markers when using apply_chat_template — placeholders are inserted automatically.

MolmoPoint (molmo_point)
molmo_point is Allen AI’s MolmoPoint model (allenai/MolmoPoint-8B), a vision-language model with pixel-precise pointing and grounding capabilities. In addition to standard VQA and description tasks, it can return exact (x, y) coordinates for objects named in the prompt.

How pointing works
MolmoPoint extends the standard vocabulary with three special token types:

- Patch tokens — select which image patch contains the target object
- Subpatch tokens — select a specific ViT sub-patch within that patch
- Location tokens — refine the point to a 3×3 grid cell within the sub-patch
A point is emitted as the sequence <POINT_patch> <POINT_subpatch> <POINT_location> object_id.

CLI
Python — point extraction and visualization
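Decoding the three token types into a pixel coordinate can be sketched as nested grid refinement. The grid sizes below (a 24×24 patch grid and 2×2 subpatches per patch) are illustrative assumptions, not the model's actual configuration — only the 3×3 location grid comes from the description above:

```python
# Assumed grid sizes for illustration; only LOC_GRID = 3 is from the docs.
PATCH_GRID = 24      # patches per image side (assumption)
SUBPATCH_GRID = 2    # subpatches per patch side (assumption)
LOC_GRID = 3         # 3x3 location cells per subpatch

def decode_point(patch_idx, subpatch_idx, loc_idx, width, height):
    """Map (patch, subpatch, location) indices to a pixel (x, y) center."""
    px, py = patch_idx % PATCH_GRID, patch_idx // PATCH_GRID
    sx, sy = subpatch_idx % SUBPATCH_GRID, subpatch_idx // SUBPATCH_GRID
    lx, ly = loc_idx % LOC_GRID, loc_idx // LOC_GRID
    # Fraction of the image covered up to the center of the location cell.
    fx = (px + (sx + (lx + 0.5) / LOC_GRID) / SUBPATCH_GRID) / PATCH_GRID
    fy = (py + (sy + (ly + 0.5) / LOC_GRID) / SUBPATCH_GRID) / PATCH_GRID
    return fx * width, fy * height

# Center location cell of the top-left subpatch of the top-left patch.
print(decode_point(0, 0, 4, 960, 960))
```

Each level narrows the search: the patch token picks a coarse region, the subpatch token a ViT sub-patch within it, and the location token one of nine cells inside that sub-patch.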
Moondream3 (moondream3)

Moondream3 (moondream/moondream3-preview) is a Mixture-of-Experts vision-language model with 9.27B total parameters and approximately 2B active parameters per token. It uses a SigLIP-based vision encoder and an MoE text decoder.

The model processes images via multi-crop (up to 12 crops), producing 729 tokens per crop via 14×14 patches on 378×378 crops. Global and local crop features are projected to 2048-dim via a 2-layer MLP.

Architecture summary
| Component | Details |
|---|---|
| Model ID | moondream/moondream3-preview |
| Total parameters | ~9.27B (2B active per token) |
| Vision encoder | SigLIP ViT — 27 layers, 1152 dim, 16 heads, patch size 14, crop size 378 |
| Language model | 24 layers; layers 0–3 dense, layers 4–23 MoE (64 experts, top-8) |
| Tokenizer | moondream/starmie-v1 (SuperBPE) |
Moondream3 uses a prompt-only format (PROMPT_ONLY) — apply_chat_template is not required for basic use.

CLI
Python
The model may emit a thinking token (<|md_reserved_4|>) before the answer — this is expected behavior. Peak memory for the full bf16 model is approximately 24 GB.
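If you post-process the output, the token can simply be removed; a one-line sketch (the sample string is hypothetical):

```python
THINK_TOKEN = "<|md_reserved_4|>"

def strip_thinking(text: str) -> str:
    # Remove the thinking token and trim surrounding whitespace.
    return text.replace(THINK_TOKEN, "").strip()

print(strip_thinking("<|md_reserved_4|> A red bicycle."))  # -> A red bicycle.
```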