MLX-VLM lets you run state-of-the-art Vision Language Models locally on your Mac. Built on Apple’s MLX framework, it delivers fast, memory-efficient inference on Apple Silicon — no cloud account or GPU server required.

What is MLX-VLM?

MLX-VLM is a Python library that wraps the MLX framework to make it easy to load, run, and fine-tune multi-modal models. It supports models that understand images, audio, and video alongside text, and covers workflows ranging from quick CLI one-liners to a production-ready OpenAI-compatible REST server. The library is designed to feel familiar: if you’ve used the Hugging Face transformers API or OpenAI’s chat completions format, MLX-VLM follows the same conventions.

Key capabilities

  • CLI inference — generate text from images, audio, and video directly from your terminal with mlx_vlm.generate
  • Python API — a simple load / generate interface for embedding inference in your own code
  • OpenAI-compatible REST server — serve models over HTTP with streaming support via mlx_vlm.server
  • LoRA and QLoRA fine-tuning — adapt any supported model to your own dataset on-device
  • Model conversion and quantization — convert Hugging Face checkpoints to 4-bit, 8-bit, and mixed-precision MLX format
  • Omni model support — process images, audio clips, and video frames in a single prompt with supported models
  • Multi-image reasoning — pass multiple images in one request for comparison or analysis tasks
  • Thinking model support — configurable token budgets for reasoning models like Qwen3.5
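Because the REST server speaks the OpenAI chat completions format, a client request is just a JSON body in that shape. The sketch below shows what such a payload might look like; the exact set of supported fields (and the image-content encoding) is an assumption here, so check the REST server page for the authoritative schema.

```python
import json

# Sketch of an OpenAI-style chat completions request body for mlx_vlm.server.
# Field support (stream, max_tokens, image_url content parts) is assumed, not
# confirmed against a specific server version.
payload = {
    "model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
            ],
        }
    ],
    "stream": True,       # token-by-token streaming, as supported by the server
    "max_tokens": 256,
}

body = json.dumps(payload)  # ready to POST to the server's chat completions endpoint
```

Any OpenAI-compatible client library should be able to produce this shape for you by pointing its base URL at the local server.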

Supported model families

MLX-VLM includes implementations for over 50 model architectures. The table below lists the currently supported families.
| Family | Notes |
| --- | --- |
| Qwen2 VL / Qwen2.5 VL | Multi-image and video support |
| Qwen3 VL / Qwen3.5 / Qwen3 Omni | Thinking budget support |
| LLaVA / LLaVA-Next / LLaVA-Bunny | Classic open-source VLMs |
| Gemma 3 / Gemma 3n | Google’s vision-language models with audio support |
| Phi-3 Vision / Phi-4 Multimodal / Phi-4 Reasoning Vision | Microsoft Phi series |
| DeepSeek VL v2 / DeepSeek OCR / DeepSeek OCR-2 | DeepSeek visual understanding |
| Mllama | Meta’s multi-modal Llama |
| Llama 4 | Meta Llama 4 series |
| Mistral 3 / Mistral 4 / Pixtral | Mistral vision models |
| Idefics2 / Idefics3 | HuggingFace IDEFICS series |
| Paligemma | Google PaLI-Gemma |
| MiniCPM-o | Omni model with image and audio |
| Molmo / Molmo 2 / MolmoPoint | AI2 Molmo family |
| InternVL Chat | InternLM visual models |
| Florence2 | Microsoft Florence |
| Moondream3 | Efficient edge VLM |
| SmolVLM | Small, fast vision-language model |
| FastVLM | High-throughput VLM |
| Kimi VL | Moonshot visual model |
| Jina VLM | Jina AI visual model |
| GLM-4V / GLM-4V MoE / GLM-OCR | Zhipu AI vision models |
| Aya Vision | Cohere Aya vision model |
| HunyuanVL | Tencent Hunyuan VL |
| DOTS OCR / DOTS MOCR | Document OCR models |
| PaddleOCR VL | PaddlePaddle OCR |
| SAM3 / SAM3.1 | Segment Anything Model 3 |
| Ernie 4.5 MoE VL | Baidu Ernie vision model |
| LFM2 VL | LFM-2 visual model |
New architectures are added regularly. Check the mlx_vlm/models directory in the repository for the most up-to-date list.

The mlx-community organization

Pre-quantized, ready-to-use models are published to the mlx-community organization on Hugging Face. When you pass a model identifier like mlx-community/Qwen2-VL-2B-Instruct-4bit, MLX-VLM downloads it automatically on first use and caches it locally. Using mlx-community models means you skip the conversion step entirely — just install the library and start generating.
Model names in mlx-community follow the pattern <ModelName>-<Size>-<Quantization>, for example Qwen2-VL-2B-Instruct-4bit (2B parameters, 4-bit quantized). Smaller quantizations use less memory and run faster at some cost to quality.
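Since the quantization level is encoded as the final suffix of the repo name, you can read it off programmatically. The helper below is a hypothetical illustration of the naming pattern, not part of the MLX-VLM API:

```python
def quantization_of(model_id: str) -> str:
    """Return the quantization suffix of an mlx-community model id.

    Hypothetical helper: assumes the <ModelName>-<Size>-<Quantization>
    pattern described above, where the quantization suffix ends in "bit".
    Returns "full" when no such suffix is present.
    """
    suffix = model_id.rsplit("-", 1)[-1]  # take everything after the last dash
    return suffix if suffix.endswith("bit") else "full"


quantization_of("mlx-community/Qwen2-VL-2B-Instruct-4bit")  # → "4bit"
```

This is only a convention, not a guarantee; a few community repos deviate from the pattern, so treat the suffix as a hint rather than ground truth.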

Where to go next

Installation

Install MLX-VLM with pip and verify your environment

Quickstart

Run your first VLM inference in minutes

Python API

Integrate VLM inference into your Python code

REST server

Serve models via an OpenAI-compatible HTTP API
