MLX-VLM brings state-of-the-art Vision Language Models (VLMs) to your Mac. Built on Apple’s MLX framework, it delivers fast, efficient inference on Apple Silicon — no cloud required.

Installation

Install MLX-VLM with pip and get your environment ready

Quickstart

Run your first VLM inference in under 5 minutes

Python API

Integrate VLM inference directly into your Python code

REST API Server

Serve models via an OpenAI-compatible HTTP API

What you can do

CLI Inference

Generate text from image, audio, and video inputs at the command line

Multi-image analysis

Analyze multiple images simultaneously in one prompt

Fine-tune with LoRA

Adapt models to your task using LoRA and QLoRA on your own data

Model conversion

Convert and quantize Hugging Face models for MLX

Get started in 3 steps

1

Install the package

pip install -U mlx-vlm
2

Run inference from the CLI

mlx_vlm.generate \
  --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --max-tokens 200 \
  --prompt "Describe this image." \
  --image http://images.cocodataset.org/val2017/000000039769.jpg
3

Or use the Python API

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model, processor = load("mlx-community/Qwen2-VL-2B-Instruct-4bit")
config = load_config("mlx-community/Qwen2-VL-2B-Instruct-4bit")

prompt = apply_chat_template(processor, config, "Describe this image.", num_images=1)
output = generate(model, processor, prompt, image=["image.jpg"])
print(output.text)

Key features

  • 50+ model architectures — Qwen2/3 VL, LLaVA, Gemma 3, Phi-4, DeepSeek, Mllama, and more
  • Omni model support — images, audio, and video inputs in a single model
  • OpenAI-compatible server — drop-in replacement for OpenAI API calls with streaming
  • LoRA & QLoRA fine-tuning — train on your own data directly on Apple Silicon
  • Model quantization — convert Hugging Face models to 4-bit, 8-bit, and mixed-precision formats
  • Thinking model support — configurable token budgets for reasoning models
MLX-VLM downloads models from Hugging Face Hub automatically on first use. Pre-quantized models in the mlx-community organization are ready to use with no conversion needed.
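The server speaks the standard OpenAI chat-completion protocol, so any OpenAI-style client can talk to it. Below is a minimal sketch of a request body, assuming the conventional /v1/chat/completions endpoint and OpenAI's multimodal message format (the exact host, port, and endpoint are assumptions; check the server's --help output or the REST API docs for the actual values):

```python
import json

# Hypothetical request body for the OpenAI-compatible server.
# The message schema below follows OpenAI's multimodal convention:
# a "content" list mixing text parts and image_url parts.
payload = {
    "model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "http://images.cocodataset.org/val2017/000000039769.jpg"
                    },
                },
            ],
        }
    ],
    "max_tokens": 200,
    "stream": False,
}

# Serialize to JSON, ready to POST to the (assumed) local endpoint,
# e.g. http://localhost:8080/v1/chat/completions
body = json.dumps(payload)
print(body[:50])
```

You could POST this body with any HTTP client, or point the official openai SDK at the local server by setting its base_url, since the request and response shapes are drop-in compatible.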
