Models that require a custom processor or non-standard tokenizer are loaded automatically. A small number of models require `--trust-remote-code` on first download; see their individual sections in the model-specific guides.

## Qwen series
Alibaba's Qwen visual models. Qwen2-VL and Qwen2.5-VL are among the most capable models in the library and support multi-image and video inputs.

| Architecture | Notes |
|---|---|
| qwen2_vl | Multi-image and video support |
| qwen2_5_vl | Multi-image and video support |
| qwen3_vl | — |
| qwen3_vl_moe | Mixture-of-Experts variant |
| qwen3_5 | Supports thinking/reasoning budget |
| qwen3_5_moe | MoE variant, supports thinking budget |
| qwen3_omni_moe | Omni MoE — image and audio support |
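To illustrate how multi-image input is typically expressed for architectures like qwen2_vl, here is a minimal sketch of building a chat message in the common VLM chat-template convention (one `{"type": "image"}` entry per attached image). The helper name is illustrative, not part of the MLX-VLM API; in practice the resulting messages would be run through the model processor's chat template before generation.

```python
# Sketch: build a multi-image chat message for a Qwen2-VL-style model.
# Assumption: the processor's chat template accepts the common
# role/content message layout with one image slot per attached image.

def build_multi_image_message(prompt: str, num_images: int) -> list[dict]:
    """Return a user message whose content holds image slots followed by text."""
    content: list[dict] = [{"type": "image"} for _ in range(num_images)]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]

messages = build_multi_image_message("Compare these two photos.", num_images=2)
```

The number of image slots in the message must match the number of images passed alongside the prompt, which is why the helper takes `num_images` explicitly.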
## LLaVA family

LLaVA-style models share a vision encoder + language model connector pattern.

| Architecture | Notes |
|---|---|
| llava | Multi-image support |
| llava_next | Single-image only |
| llava_bunny | Remapped from llava-qwen2 / bunny-llama |
| fastvlm | Apple's FastVLM; remapped from llava_qwen2 |
## Meta Llama
| Architecture | Notes |
|---|---|
| mllama | Llama-3.2-Vision; single-image only |
| llama4 | Multi-image support |
## Google Gemma

| Architecture | Notes |
|---|---|
| gemma3 | — |
| gemma3n | Audio support (image + audio inputs) |
| paligemma | Single-image only |
## Microsoft Phi
| Architecture | Notes |
|---|---|
| phi3_v | Numbered image tokens (`<\|image_1\|>` format) |
| phi4_siglip | Phi-4 Reasoning Vision with SigLIP2 NaFlex encoder |
| phi4mm | Phi-4 Multimodal — audio support (image + audio + text) |
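Since phi3_v's numbered image tokens differ from the slot-based convention most other architectures use, a short sketch of the prompt format may help. The helper name is illustrative; the token format (`<|image_1|>`, `<|image_2|>`, ...) follows the table above.

```python
# Sketch: phi3_v expects numbered image placeholders inserted directly
# into the prompt text, one per attached image, numbered from 1.

def phi3v_prompt(question: str, num_images: int) -> str:
    """Prefix the question with <|image_N|> tags for each attached image."""
    tags = "".join(f"<|image_{i}|>\n" for i in range(1, num_images + 1))
    return f"{tags}{question}"

print(phi3v_prompt("What differs between these images?", 2))
# <|image_1|>
# <|image_2|>
# What differs between these images?
```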
## DeepSeek
| Architecture | Notes |
|---|---|
| deepseek_vl_v2 | — |
| deepseekocr | OCR-specialized; uses `<\|grounding\|>` prompt tokens |
| deepseekocr_2 | Second-generation OCR model |
## OCR-specialized

Models purpose-built for document parsing, text extraction, and layout analysis.

| Architecture | Notes |
|---|---|
| dots_ocr | dots.ocr / dots.mocr — layout JSON, table/formula extraction |
| glm_ocr | GLM-based OCR |
| paddleocr_vl | PaddleOCR vision-language model |
## Other architectures

| Architecture | Notes |
|---|---|
| idefics2 | IDEFICS2 multi-image model; video support |
| idefics3 | IDEFICS3 multi-image model; video support |
| internvl_chat | InternVL chat model |
| kimi_vl | Kimi VL from Moonshot AI |
| minicpmo | MiniCPM-o omni model — image and audio support |
| mistral3 | Mistral 3 vision model |
| mistral4 | Mistral 4 vision model |
| molmo | Allen AI Molmo (prompt-only image format) |
| molmo2 | Molmo second generation |
| molmo_point | MolmoPoint — pixel-precise pointing and grounding |
| moondream3 | Moondream3 — 9.27B MoE with ~2B active params |
| pixtral | Mistral's Pixtral vision model |
| smolvlm | SmolVLM — compact vision-language model |
| aya_vision | Cohere Aya Vision; remapped from cohere2_vision |
| jina_vlm | Jina VLM; remapped from jvlm |
| florence2 | Florence-2 (prompt-only format) |
| hunyuan_vl | Tencent Hunyuan VL |
| lfm2_vl | LFM2-VL; remapped from lfm2-vl |
| ernie4_5_moe_vl | Baidu ERNIE 4.5 MoE VL |
| glm4v / glm4v_moe | GLM-4V and its MoE variant |
| sam3 / sam3_1 | SAM3 and SAM3.1 visual models |
| multi_modality | Generic multi-modality architecture (single-image only) |
## Capability summary
| Capability | Supported architectures |
|---|---|
| Audio input | gemma3n, qwen3_omni_moe, minicpmo, phi4mm |
| Video input | qwen2_vl, qwen2_5_vl, idefics3, llava |
| Multi-image | All except llava_next, llava-qwen2, bunny-llama, paligemma, multi_modality, mllama |
| Fine-tuning (LoRA/QLoRA) | All except gemma3n, qwen3_omni_moe |
| Thinking/reasoning budget | qwen3_5, qwen3_5_moe, and any model that emits `<think>` tokens |
| Pointing/grounding | molmo_point, deepseekocr, deepseekocr_2 |
## Finding pre-quantized models
The mlx-community organization on Hugging Face hosts quantized versions of most supported models. Pass the repo ID directly to any MLX-VLM command. If a model is not available on mlx-community, you can convert it yourself using `mlx_vlm.convert`; see the model conversion guide for details.
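As an example, a generation call with a pre-quantized repo might look like the following sketch. The repo ID and image path are placeholders, and the exact flag set should be checked against your installed mlx-vlm version:

```shell
# Run generation against a quantized model pulled straight from mlx-community.
# Placeholder repo ID and image path; requires mlx-vlm to be installed.
python -m mlx_vlm.generate \
  --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --prompt "Describe this image." \
  --image path/to/photo.jpg
```

The first invocation downloads the weights from Hugging Face; subsequent runs use the local cache.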