Models that require a custom processor or non-standard tokenizer are loaded automatically. A small number of models require `--trust-remote-code` on first download; see their individual sections in the model-specific guides.

## Qwen series
Alibaba's Qwen visual models. Qwen2-VL and Qwen2.5-VL are among the most capable models in the library and support multi-image and video inputs.

| Architecture | Notes |
|---|---|
| qwen2_vl | Multi-image and video support |
| qwen2_5_vl | Multi-image and video support |
| qwen3_vl | — |
| qwen3_vl_moe | Mixture-of-Experts variant |
| qwen3_5 | Supports thinking/reasoning budget |
| qwen3_5_moe | MoE variant, supports thinking budget |
| qwen3_omni_moe | Omni MoE — image and audio support |
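To illustrate how multi-image input is typically expressed for architectures like qwen2_vl, here is a minimal sketch of building a chat message in the common VLM chat-template convention (one `{"type": "image"}` entry per attached image). The helper name is illustrative, not part of the MLX-VLM API; in practice the resulting messages would be run through the model processor's chat template before generation.

```python
# Sketch: build a multi-image chat message for a Qwen2-VL-style model.
# Assumption: the processor's chat template accepts the common
# role/content message layout with one image slot per attached image.

def build_multi_image_message(prompt: str, num_images: int) -> list[dict]:
    """Return a user message whose content holds image slots followed by text."""
    content: list[dict] = [{"type": "image"} for _ in range(num_images)]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]

messages = build_multi_image_message("Compare these two photos.", num_images=2)
```

The number of image slots in the message must match the number of images passed alongside the prompt, which is why the helper takes `num_images` explicitly.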
## LLaVA family

LLaVA-style models share a vision encoder + language model connector pattern.

| Architecture | Notes |
|---|---|
| llava | Multi-image support |
| llava_next | Single-image only |
| llava_bunny | Remapped from llava-qwen2 / bunny-llama |
| fastvlm | Apple's FastVLM; remapped from llava_qwen2 |
## Meta Llama
| Architecture | Notes |
|---|---|
| mllama | Llama-3.2-Vision; single-image only |
| llama4 | Multi-image support |
## Google Gemma

| Architecture | Notes |
|---|---|
| gemma3 | — |
| gemma3n | Audio support (image + audio inputs) |
| paligemma | Single-image only |
## Microsoft Phi
| Architecture | Notes |
|---|---|
| phi3_v | Numbered image tokens (`<\|image_1\|>` format) |
| phi4_siglip | Phi-4 Reasoning Vision with SigLIP2 NaFlex encoder |
| phi4mm | Phi-4 Multimodal — audio support (image + audio + text) |
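Since phi3_v's numbered image tokens differ from the slot-based convention most other architectures use, a short sketch of the prompt format may help. The helper name is illustrative; the token format (`<|image_1|>`, `<|image_2|>`, ...) follows the table above.

```python
# Sketch: phi3_v expects numbered image placeholders inserted directly
# into the prompt text, one per attached image, numbered from 1.

def phi3v_prompt(question: str, num_images: int) -> str:
    """Prefix the question with <|image_N|> tags for each attached image."""
    tags = "".join(f"<|image_{i}|>\n" for i in range(1, num_images + 1))
    return f"{tags}{question}"

print(phi3v_prompt("What differs between these images?", 2))
# <|image_1|>
# <|image_2|>
# What differs between these images?
```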
## DeepSeek
| Architecture | Notes |
|---|---|
| deepseek_vl_v2 | — |
| deepseekocr | OCR-specialized; uses `<\|grounding\|>` prompt tokens |
| deepseekocr_2 | Second-generation OCR model |
## OCR-specialized

Models purpose-built for document parsing, text extraction, and layout analysis.

| Architecture | Notes |
|---|---|
| dots_ocr | dots.ocr / dots.mocr — layout JSON, table/formula extraction |
| glm_ocr | GLM-based OCR |
| paddleocr_vl | PaddleOCR vision-language model |
## Other architectures

| Architecture | Notes |
|---|---|
| idefics2 | IDEFICS2 multi-image model; video support |
| idefics3 | IDEFICS3 multi-image model; video support |
| internvl_chat | InternVL chat model |
| kimi_vl | Kimi VL from Moonshot AI |
| minicpmo | MiniCPM-o omni model — image and audio support |
| mistral3 | Mistral 3 vision model |
| mistral4 | Mistral 4 vision model |
| molmo | Allen AI Molmo (prompt-only image format) |
| molmo2 | Molmo second generation |
| molmo_point | MolmoPoint — pixel-precise pointing and grounding |
| moondream3 | Moondream3 — 9.27B MoE with ~2B active params |
| pixtral | Mistral's Pixtral vision model |
| smolvlm | SmolVLM — compact vision-language model |
| aya_vision | Cohere Aya Vision; remapped from cohere2_vision |
| jina_vlm | Jina VLM; remapped from jvlm |
| florence2 | Florence-2 (prompt-only format) |
| hunyuan_vl | Tencent Hunyuan VL |
| lfm2_vl | LFM2-VL; remapped from lfm2-vl |
| ernie4_5_moe_vl | Baidu ERNIE 4.5 MoE VL |
| glm4v / glm4v_moe | GLM-4V and its MoE variant |
| sam3 / sam3_1 | SAM3 and SAM3.1 visual models |
| multi_modality | Generic multi-modality architecture (single-image only) |
## Capability summary
| Capability | Supported architectures |
|---|---|
| Audio input | gemma3n, qwen3_omni_moe, minicpmo, phi4mm |
| Video input | qwen2_vl, qwen2_5_vl, idefics3, llava |
| Multi-image | All except llava_next, llava-qwen2, bunny-llama, paligemma, multi_modality, mllama |
| Fine-tuning (LoRA/QLoRA) | All except gemma3n, qwen3_omni_moe |
| Thinking/reasoning budget | qwen3_5, qwen3_5_moe, and any model that emits `<think>` tokens |
| Pointing/grounding | molmo_point, deepseekocr, deepseekocr_2 |
## Finding pre-quantized models
The mlx-community organization on Hugging Face hosts quantized versions of most supported models. Pass the repo ID directly to any MLX-VLM command. If a model is not available on mlx-community, you can convert it yourself using `mlx_vlm.convert`; see the model conversion guide for details.
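As an example, a generation call with a pre-quantized repo might look like the following sketch. The repo ID and image path are placeholders, and the exact flag set should be checked against your installed mlx-vlm version:

```shell
# Run generation against a quantized model pulled straight from mlx-community.
# Placeholder repo ID and image path; requires mlx-vlm to be installed.
python -m mlx_vlm.generate \
  --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --prompt "Describe this image." \
  --image path/to/photo.jpg
```

The first invocation downloads the weights from Hugging Face; subsequent runs use the local cache.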