# What is MLX-VLM?

MLX-VLM is a Python library that wraps the MLX framework to make it easy to load, run, and fine-tune multi-modal models. It supports models that understand images, audio, and video alongside text, covering everything from quick CLI one-liners to a production-ready OpenAI-compatible REST server. The library is designed to feel familiar: if you’ve used the Hugging Face `transformers` API or OpenAI’s chat completions format, MLX-VLM follows the same conventions.
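Because the server speaks the OpenAI chat completions format, a request body can be assembled with nothing but the standard library. A minimal sketch — the `/v1/chat/completions` path, port, and `image_url` content part follow OpenAI's conventions; exactly which fields the server honors is an assumption to verify against the server docs:

```python
import json

# An OpenAI-style chat completion request with one text part and one image part.
payload = {
    "model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
            ],
        }
    ],
    "max_tokens": 100,
    "stream": False,
}

# Serialize and POST to http://localhost:8080/v1/chat/completions
# with any HTTP client (curl, urllib, the openai SDK, ...).
body = json.dumps(payload)
```

Any existing OpenAI client library should be able to produce the same shape by pointing its base URL at the local server.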
## Key capabilities

- CLI inference: generate text from images, audio, and video directly from your terminal with `mlx_vlm.generate`
- Python API: a simple `load`/`generate` interface for embedding inference in your own code
- OpenAI-compatible REST server: serve models over HTTP with streaming support via `mlx_vlm.server`
- LoRA and QLoRA fine-tuning: adapt any supported model to your own dataset on-device
- Model conversion and quantization: convert Hugging Face checkpoints to 4-bit, 8-bit, and mixed-precision MLX format
- Omni model support: process images, audio clips, and video frames in a single prompt with supported models
- Multi-image reasoning: pass multiple images in one request for comparison or analysis tasks
- Thinking model support: configurable token budgets for reasoning models like Qwen3.5
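The main entry points above map onto `python -m` invocations. A sketch of typical usage — the module names come from the list above, but the exact flags reflect the project README at the time of writing and may change between versions:

```shell
# Generate text from an image via the CLI
python -m mlx_vlm.generate \
  --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --prompt "Describe this image." \
  --image cat.png \
  --max-tokens 100

# Start the OpenAI-compatible REST server
python -m mlx_vlm.server --port 8080

# Convert a Hugging Face checkpoint to quantized MLX format
python -m mlx_vlm.convert --hf-path Qwen/Qwen2-VL-2B-Instruct -q
```

Run any of the modules with `--help` to see the authoritative flag list for your installed version.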
## Supported model families

MLX-VLM includes implementations for over 50 model architectures. The table below lists the currently supported families.

| Family | Notes |
|---|---|
| Qwen2 VL / Qwen2.5 VL | Multi-image and video support |
| Qwen3 VL / Qwen3.5 / Qwen3 Omni | Thinking budget support |
| LLaVA / LLaVA-Next / LLaVA-Bunny | Classic open-source VLMs |
| Gemma 3 / Gemma 3n | Google’s vision-language models with audio support |
| Phi-3 Vision / Phi-4 Multimodal / Phi-4 Reasoning Vision | Microsoft Phi series |
| DeepSeek VL v2 / DeepSeek OCR / DeepSeek OCR-2 | DeepSeek visual understanding |
| Mllama | Meta’s multi-modal Llama |
| Llama 4 | Meta Llama 4 series |
| Mistral 3 / Mistral 4 / Pixtral | Mistral vision models |
| Idefics2 / Idefics3 | HuggingFace IDEFICS series |
| Paligemma | Google PaLI-Gemma |
| MiniCPM-o | Omni model with image and audio |
| Molmo / Molmo 2 / MolmoPoint | AI2 Molmo family |
| InternVL Chat | InternLM visual models |
| Florence2 | Microsoft Florence |
| Moondream3 | Efficient edge VLM |
| SmolVLM | Small, fast vision-language model |
| FastVLM | High-throughput VLM |
| Kimi VL | Moonshot visual model |
| Jina VLM | Jina AI visual model |
| GLM-4V / GLM-4V MoE / GLM-OCR | Zhipu AI vision models |
| Aya Vision | Cohere Aya vision model |
| HunyuanVL | Tencent Hunyuan VL |
| DOTS OCR / DOTS MOCR | Document OCR models |
| PaddleOCR VL | PaddlePaddle OCR |
| SAM3 / SAM3.1 | Segment Anything Model 3 |
| Ernie 4.5 MoE VL | Baidu Ernie vision model |
| LFM2 VL | LFM-2 visual model |
New architectures are added regularly. Check the `mlx_vlm/models` directory in the repository for the most up-to-date list.
## The mlx-community organization

Pre-quantized, ready-to-use models are published to the mlx-community organization on Hugging Face. When you pass a model identifier like `mlx-community/Qwen2-VL-2B-Instruct-4bit`, MLX-VLM downloads it automatically on first use and caches it locally.
Using mlx-community models means you skip the conversion step entirely — just install the library and start generating.
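In Python, that looks like a `load` call followed by `generate`. A minimal sketch — `load`, `generate`, and `apply_chat_template` are the documented entry points, but the exact keyword arguments vary between releases, and running it requires Apple silicon with mlx-vlm installed:

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

# First use downloads the checkpoint from Hugging Face and caches it locally.
model, processor = load("mlx-community/Qwen2-VL-2B-Instruct-4bit")

# Wrap the question in the model's chat template for one image.
prompt = apply_chat_template(processor, model.config, "Describe this image.", num_images=1)

# Run inference against a local image file.
output = generate(model, processor, prompt, image=["cat.png"], max_tokens=100)
print(output)
```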
## Where to go next

- Installation: install MLX-VLM with pip and verify your environment
- Quickstart: run your first VLM inference in minutes
- Python API: integrate VLM inference into your Python code
- REST server: serve models via an OpenAI-compatible HTTP API