apply_chat_template, pass images, and call generate. A subset of models have special prompt tokens, unique capabilities, or non-standard generation output that require extra care.
For every model covered here,
apply_chat_template still handles prompt construction automatically. The special tokens and formats described below are the underlying format that the template builds for you — you only need to write them manually if you bypass apply_chat_template.

DeepSeek-OCR and DeepSeek-OCR-2

DeepSeek-OCR (deepseekocr) and DeepSeek-OCR-2 (deepseekocr_2) are SAM + Qwen2 encoder models optimized for document understanding, text extraction, and visual grounding. OCR-2 uses a wider projection dimension (1280 vs. 1024) but is otherwise identical in usage.

Both models use a dynamic-resolution image pipeline: images are split into a global view (1024×1024 → 256 tokens) and up to six local patches (768×768 → 144 tokens each), giving a total of up to 1,121 visual tokens.

Prompt formats
| Task | Prompt |
|---|---|
| General OCR | <|grounding|>OCR this image. |
| Document to Markdown | <|grounding|>Convert the document to markdown. |
| Free OCR (no layout) | Free OCR. |
| Parse figures/charts | Parse the figure. |
| Image description | Describe this image in detail. |
| Text localization | Locate <|ref|>your text here<|/ref|> in the image. |
Localization output format:

<|ref|>...<|/ref|><|det|>[[x1, y1, x2, y2]]<|/det|> — coordinates are normalized to the 0–1000 range.

Special tokens
| Token | Purpose |
|---|---|
| <|grounding|> | Enables structured output mode (OCR, markdown, tables) |
| <|ref|>...<|/ref|> | Wraps the text to locate in the image |
| <|det|>...<|/det|> | Wraps the bounding-box output |
CLI examples
- General OCR
- Document to Markdown
- Text localization
Python — text localization with bounding-box parsing
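A minimal sketch of the parsing step, assuming the generated text follows the localization format described above; the raw string and image dimensions below are hypothetical:

```python
import json
import re

# Matches <|ref|>text<|/ref|><|det|>[[x1, y1, x2, y2], ...]<|/det|> pairs.
PATTERN = re.compile(r"<\|ref\|>(.*?)<\|/ref\|><\|det\|>(\[\[.*?\]\])<\|/det\|>")

def parse_localizations(text, image_width, image_height):
    """Parse localization output and rescale 0-1000-normalized boxes to pixels."""
    results = []
    for ref, det in PATTERN.findall(text):
        boxes = [
            (
                round(x1 / 1000 * image_width),
                round(y1 / 1000 * image_height),
                round(x2 / 1000 * image_width),
                round(y2 / 1000 * image_height),
            )
            for x1, y1, x2, y2 in json.loads(det)
        ]
        results.append({"text": ref, "boxes": boxes})
    return results

# Hypothetical model output for a 2000x1000 image:
raw = "<|ref|>Total<|/ref|><|det|>[[100, 900, 300, 950]]<|/det|>"
print(parse_localizations(raw, 2000, 1000))
```

The bounding-box payload is valid JSON, so `json.loads` is safer than ad-hoc string splitting and handles multiple boxes per reference.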
Controlling dynamic resolution
By default, the model uses 1–6 local patches. You can trade speed for detail by limiting the patch count:

| Configuration | Patches | Visual tokens |
|---|---|---|
| cropping=False | 0 | 257 |
| max_patches=1 | 1 | 401 |
| max_patches=3 | 1–3 | 401–689 |
| max_patches=6 (default) | 1–6 | 401–1121 |
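The token counts in the table follow a simple linear rule — a 257-token base for the global view (its 256 tokens plus, presumably, one separator) plus 144 tokens per local patch — which can be checked directly:

```python
def visual_tokens(num_patches: int) -> int:
    # 257 base tokens for the global view, 144 per local patch.
    return 257 + 144 * num_patches

# Reproduces the table's endpoints for 0, 1, 3, and 6 patches.
print([visual_tokens(n) for n in (0, 1, 3, 6)])  # [257, 401, 689, 1121]
```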
DOTS-OCR (dots.ocr / dots.mocr)

DOTS models (dots_ocr) are vision-language models from rednote-hilab designed for document parsing, layout analysis, table and formula extraction, and structured JSON output. dots.mocr extends dots.ocr with stronger multilingual parsing.

Layout JSON extraction

The most powerful prompt instructs the model to output a structured JSON with bounding boxes, element categories, and formatted text:

Python — layout JSON extraction
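As a sketch of consuming that output, assuming the model returns a JSON array of elements — the field names "bbox", "category", and "text" are assumptions about the schema, and raw is a hypothetical response:

```python
import json

# Hypothetical layout-JSON response; field names are assumed.
raw = """[
  {"bbox": [10, 20, 500, 60], "category": "Title", "text": "Quarterly Report"},
  {"bbox": [10, 80, 500, 400], "category": "Table", "text": "| Q1 | Q2 |"}
]"""

elements = json.loads(raw)

# Group extracted text by element category for downstream processing.
by_category = {}
for el in elements:
    by_category.setdefault(el["category"], []).append(el["text"])

print(sorted(by_category))  # ['Table', 'Title']
```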
Phi-4 Reasoning Vision (phi4_siglip)

phi4_siglip is Microsoft’s Phi-4 Reasoning Vision model (microsoft/Phi-4-reasoning-vision-15B), a ~15B-parameter model that combines a Phi-3 language backbone with a SigLIP2 NaFlex vision encoder.

The SigLIP2 NaFlex encoder processes images at variable resolution, producing 256–3600 patches depending on input size and aspect ratio, making it efficient across different image dimensions.

Key properties
| Property | Value |
|---|---|
| HF model ID | microsoft/Phi-4-reasoning-vision-15B |
| Architecture | Phi-3 language + SigLIP2 NaFlex vision + 2-layer MLP GELU projector |
| Parameters | ~15B |
| Image token | <image> (index -200) |
| Vision patches | 256–3600 (variable, NaFlex dynamic resolution) |
CLI
Python
NaFlex produces variable-length image feature sequences, so effective context length depends on input resolution. High-resolution images consume significantly more context.
Phi-4 Multimodal (phi4mm)

phi4mm is Microsoft’s Phi-4 Multimodal Instruct model (microsoft/Phi-4-multimodal-instruct) — a tri-modal model supporting text, image, and audio inputs simultaneously.

Architecture highlights
| Component | Details |
|---|---|
| Language model | Phi-4 (32 layers, 3072 hidden, 24 heads, 8 KV heads) |
| Vision encoder | SigLIP-2 (27 layers, 1152 hidden, 16 heads) |
| Audio encoder | Cascaded Conformer (24 blocks, 1024 dim, 16 heads) |
| Vision projector | 2-layer MLP (1152 → 3072 → 3072, GELU) |
set_modality() selects the correct adapter at runtime based on input types.

CLI examples
- Image
- Audio
- Image + audio
Python
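The processor resamples audio automatically, but if you want to prepare waveforms yourself, a numpy-only sketch of the 16 kHz mono conversion (linear interpolation; a dedicated resampler such as soxr or torchaudio avoids aliasing):

```python
import numpy as np

def to_16k_mono(waveform: np.ndarray, sample_rate: int) -> np.ndarray:
    """Downmix to mono and linearly resample to 16 kHz."""
    if waveform.ndim == 2:  # (channels, samples) -> mono
        waveform = waveform.mean(axis=0)
    target_rate = 16_000
    if sample_rate == target_rate:
        return waveform
    duration = waveform.shape[0] / sample_rate
    n_out = int(round(duration * target_rate))
    old_t = np.arange(waveform.shape[0]) / sample_rate
    new_t = np.arange(n_out) / target_rate
    return np.interp(new_t, old_t, waveform)

# One second of 44.1 kHz stereo noise -> 16,000 mono samples.
audio = np.random.randn(2, 44_100)
out = to_16k_mono(audio, 44_100)
print(out.shape)  # (16000,)
```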
Audio input must be a 16 kHz mono waveform. The processor handles resampling automatically. The <|image_1|> and <|audio_1|> placeholders are inserted by apply_chat_template — do not add them manually.

MiniCPM-o (minicpmo)
minicpmo is an omni model (openbmb/MiniCPM-o-4_5) that supports text, image, and audio understanding. The processor and tokenizer code is ported in-tree, so --trust-remote-code is not required.

MiniCPM-o includes a thinking mode that is enabled by default. You can disable it via chat_template_kwargs.

CLI examples
- Image
- Audio
- Image + audio
- Disable thinking
Python
Do not manually add <image> or <audio> markers when using apply_chat_template — placeholders are inserted automatically.

MolmoPoint (molmo_point)
molmo_point is Allen AI’s MolmoPoint model (allenai/MolmoPoint-8B), a vision-language model with pixel-precise pointing and grounding capabilities. In addition to standard VQA and description tasks, it can return exact (x, y) coordinates for objects named in the prompt.

How pointing works
MolmoPoint extends the standard vocabulary with three special token types:

- Patch tokens — select which image patch contains the target object
- Subpatch tokens — select a specific ViT sub-patch within that patch
- Location tokens — refine the point to a 3×3 grid cell within the sub-patch
A point is emitted as the sequence <POINT_patch> <POINT_subpatch> <POINT_location> object_id.

CLI
Python — point extraction and visualization
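Decoding the three token types into a pixel coordinate can be sketched as nested grid refinement. The grid sizes below (a 24×24 patch grid and 2×2 subpatches per patch) are illustrative assumptions, not the model's actual configuration — only the 3×3 location grid comes from the description above:

```python
# Assumed grid sizes for illustration; only LOC_GRID = 3 is from the docs.
PATCH_GRID = 24      # patches per image side (assumption)
SUBPATCH_GRID = 2    # subpatches per patch side (assumption)
LOC_GRID = 3         # 3x3 location cells per subpatch

def decode_point(patch_idx, subpatch_idx, loc_idx, width, height):
    """Map (patch, subpatch, location) indices to a pixel (x, y) center."""
    px, py = patch_idx % PATCH_GRID, patch_idx // PATCH_GRID
    sx, sy = subpatch_idx % SUBPATCH_GRID, subpatch_idx // SUBPATCH_GRID
    lx, ly = loc_idx % LOC_GRID, loc_idx // LOC_GRID
    # Fraction of the image covered up to the center of the location cell.
    fx = (px + (sx + (lx + 0.5) / LOC_GRID) / SUBPATCH_GRID) / PATCH_GRID
    fy = (py + (sy + (ly + 0.5) / LOC_GRID) / SUBPATCH_GRID) / PATCH_GRID
    return fx * width, fy * height

# Center location cell of the top-left subpatch of the top-left patch.
print(decode_point(0, 0, 4, 960, 960))
```

Each level narrows the search: the patch token picks a coarse region, the subpatch token a ViT sub-patch within it, and the location token one of nine cells inside that sub-patch.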
Moondream3 (moondream3)

Moondream3 (moondream/moondream3-preview) is a Mixture-of-Experts vision-language model with 9.27B total parameters and approximately 2B active parameters per token. It uses a SigLIP-based vision encoder and an MoE text decoder.

The model processes images via multi-crop (up to 12 crops), producing 729 tokens per crop via 14×14 patches on 378×378 crops. Global and local crop features are projected to 2048-dim via a 2-layer MLP.

Architecture summary
| Component | Details |
|---|---|
| Model ID | moondream/moondream3-preview |
| Total parameters | ~9.27B (2B active per token) |
| Vision encoder | SigLIP ViT — 27 layers, 1152 dim, 16 heads, patch size 14, crop size 378 |
| Language model | 24 layers; layers 0–3 dense, layers 4–23 MoE (64 experts, top-8) |
| Tokenizer | moondream/starmie-v1 (SuperBPE) |
Moondream3 uses a prompt-only format (PROMPT_ONLY) — apply_chat_template is not required for basic use.

CLI
Python
The model may emit a thinking token (<|md_reserved_4|>) before the answer — this is expected behavior. Peak memory for the full bf16 model is approximately 24 GB.
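If you post-process the output, the token can simply be removed; a one-line sketch (the sample string is hypothetical):

```python
THINK_TOKEN = "<|md_reserved_4|>"

def strip_thinking(text: str) -> str:
    # Remove the thinking token and trim surrounding whitespace.
    return text.replace(THINK_TOKEN, "").strip()

print(strip_thinking("<|md_reserved_4|> A red bicycle."))  # -> A red bicycle.
```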