MLX-VLM supports 50+ vision-language model architectures. Pre-quantized weights for most models are available on the mlx-community Hugging Face organization and are downloaded automatically on first use — no manual conversion needed.
Models that require a custom processor or non-standard tokenizer are handled automatically at load time. A small number of models require passing --trust-remote-code on first download; see their entries in the model-specific guides.
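As a sketch, the flag composes with an ordinary generate call; the repo ID below is a placeholder, not a real checkpoint:

```shell
# --trust-remote-code is only needed on first download of affected models
mlx_vlm.generate \
  --model mlx-community/<model-requiring-remote-code> \
  --prompt "Describe this image." \
  --image path/to/image.jpg \
  --trust-remote-code
```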

Qwen series

Alibaba’s Qwen vision-language models. Qwen2-VL and Qwen2.5-VL are among the most capable models in the library and support multi-image and video inputs.
| Architecture | Notes |
|---|---|
| qwen2_vl | Multi-image and video support |
| qwen2_5_vl | Multi-image and video support |
| qwen3_vl | |
| qwen3_vl_moe | Mixture-of-Experts variant |
| qwen3_5 | Supports thinking/reasoning budget |
| qwen3_5_moe | MoE variant, supports thinking budget |
| qwen3_omni_moe | Omni MoE; image and audio support |
Qwen3.5 and Qwen3.5-MoE support a configurable thinking budget. Pass --enable-thinking and --thinking-budget <N> to the CLI to limit tokens spent in the reasoning block.
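A minimal sketch of the thinking-budget flags combined with the usual generate options; the model path is a placeholder for whichever Qwen3.5 checkpoint you use:

```shell
# Cap the reasoning block at 512 tokens
mlx_vlm.generate \
  --model <path-or-repo-of-a-qwen3.5-checkpoint> \
  --prompt "Solve the puzzle in this image." \
  --image path/to/image.jpg \
  --enable-thinking \
  --thinking-budget 512
```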

LLaVA family

LLaVA-style models share a vision encoder + language model connector pattern.
| Architecture | Notes |
|---|---|
| llava | Multi-image support |
| llava_next | Single-image only |
| llava_bunny | Remapped from llava-qwen2 / bunny-llama |
| fastvlm | Apple’s FastVLM; remapped from llava_qwen2 |

Meta Llama

| Architecture | Notes |
|---|---|
| mllama | Llama-3.2-Vision; single-image only |
| llama4 | Multi-image support |

Google

| Architecture | Notes |
|---|---|
| gemma3 | |
| gemma3n | Audio support (image + audio inputs) |
| paligemma | Single-image only |

Microsoft Phi

| Architecture | Notes |
|---|---|
| phi3_v | Numbered image tokens (`<\|image_1\|>` format) |
| phi4_siglip | Phi-4 Reasoning Vision with SigLIP2 NaFlex encoder |
| phi4mm | Phi-4 Multimodal; audio support (image + audio + text) |
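For phi3_v, the prompt references images by position using the numbered tokens noted above. A sketch with a placeholder model path:

```shell
# phi3_v prompts reference images as <|image_1|>, <|image_2|>, ...
mlx_vlm.generate \
  --model <path-or-repo-of-a-phi3_v-checkpoint> \
  --prompt "<|image_1|> What is shown in this image?" \
  --image path/to/image.jpg
```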

DeepSeek

| Architecture | Notes |
|---|---|
| deepseek_vl_v2 | |
| deepseekocr | OCR-specialized; uses `<\|grounding\|>` prompt tokens |
| deepseekocr_2 | Second-generation OCR model |
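A sketch of a deepseekocr call using the grounding token mentioned above; the model path is a placeholder and the prompt wording is illustrative, not a required format:

```shell
# The <|grounding|> token prefixes OCR-style prompts for deepseekocr
mlx_vlm.generate \
  --model <path-or-repo-of-a-deepseekocr-checkpoint> \
  --prompt "<|grounding|>Extract the text from this document." \
  --image path/to/document.png
```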

OCR-specialized

Models purpose-built for document parsing, text extraction, and layout analysis.
| Architecture | Notes |
|---|---|
| dots_ocr | dots.ocr / dots.mocr; layout JSON, table/formula extraction |
| glm_ocr | GLM-based OCR |
| paddleocr_vl | PaddleOCR vision-language model |

Other architectures

idefics2

IDEFICS2 multi-image model. Video support.

idefics3

IDEFICS3 multi-image model. Video support.

internvl_chat

InternVL chat model.

kimi_vl

Kimi VL from Moonshot AI.

minicpmo

MiniCPM-o omni model — image and audio support.

mistral3

Mistral 3 vision model.

mistral4

Mistral 4 vision model.

molmo

Allen AI Molmo (prompt-only image format).

molmo2

Molmo second generation.

molmo_point

MolmoPoint — pixel-precise pointing and grounding.

moondream3

Moondream3 — 9.27B MoE with ~2B active params.

pixtral

Mistral’s Pixtral vision model.

smolvlm

SmolVLM — compact vision-language model.

aya_vision

Cohere Aya Vision; remapped from cohere2_vision.

jina_vlm

Jina VLM; remapped from jvlm.

florence2

Florence-2 (prompt-only format).

hunyuan_vl

Tencent Hunyuan VL.

lfm2_vl

LFM2-VL; remapped from lfm2-vl.

ernie4_5_moe_vl

Baidu ERNIE 4.5 MoE VL.

glm4v / glm4v_moe

GLM-4V and its MoE variant.

sam3 / sam3_1

SAM3 and SAM3.1 visual models.

multi_modality

Generic multi-modality architecture (single-image only).

Capability summary

| Capability | Supported architectures |
|---|---|
| Audio input | gemma3n, qwen3_omni_moe, minicpmo, phi4mm |
| Video input | qwen2_vl, qwen2_5_vl, idefics2, idefics3, llava |
| Multi-image | All except llava_next, llava_bunny (llava-qwen2 / bunny-llama), paligemma, multi_modality, mllama |
| Fine-tuning (LoRA/QLoRA) | All except gemma3n, qwen3_omni_moe |
| Thinking/reasoning budget | qwen3_5, qwen3_5_moe, and any model that emits `<think>` tokens |
| Pointing/grounding | molmo_point, deepseekocr, deepseekocr_2 |
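To illustrate the multi-image row, a sketch of a generate call with two images. Whether --image accepts several paths in one flag, as shown, or must be repeated per image depends on the CLI version; check mlx_vlm.generate --help:

```shell
mlx_vlm.generate \
  --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --prompt "Compare these two images." \
  --image path/to/first.jpg path/to/second.jpg
```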

Finding pre-quantized models

The mlx-community organization on Hugging Face hosts quantized versions of most supported models. Pass the repo ID directly to any MLX-VLM command:
```shell
mlx_vlm.generate \
  --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --prompt "Describe this image." \
  --image path/to/image.jpg
```
If a model is not yet available in mlx-community, you can convert it yourself using mlx_vlm.convert. See model conversion for details.
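A hedged sketch of a conversion, assuming mlx_vlm.convert follows the convert flags common to MLX tooling (--hf-path, --mlx-path, and -q for quantization); the repo and output path are examples, and the model conversion guide is authoritative:

```shell
# Flag names are assumptions based on typical MLX convert tooling
mlx_vlm.convert \
  --hf-path Qwen/Qwen2-VL-2B-Instruct \
  --mlx-path ./qwen2-vl-2b-4bit \
  -q
```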
