### Installation

Install MLX-VLM with pip and get your environment ready.
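MLX-VLM is published on PyPI and, like MLX itself, runs on Apple Silicon Macs. A typical install:

```bash
# Installs mlx-vlm plus its MLX and Hugging Face dependencies.
pip install mlx-vlm
```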
### Quickstart

Run your first VLM inference in under 5 minutes.
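A minimal first run from the terminal; the model ID below is one quantized checkpoint from the mlx-community Hub organization, and any supported model works in its place:

```bash
# Downloads the model on first use, then captions a local image.
python -m mlx_vlm.generate \
  --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --max-tokens 100 \
  --prompt "Describe this image." \
  --image path/to/image.jpg
```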
### Python API

Integrate VLM inference directly into your Python code.
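A minimal sketch of the load/template/generate pattern from the project README; the model ID and image path are placeholders:

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load a quantized model and its processor from the Hugging Face Hub.
model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)

# Image paths or URLs; the chat template needs the image count.
images = ["path/to/image.jpg"]
prompt = "Describe this image."

# Wrap the prompt in the model's chat template before generating.
formatted_prompt = apply_chat_template(processor, config, prompt, num_images=len(images))
output = generate(model, processor, formatted_prompt, images, verbose=False)
print(output)
```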
### REST API Server

Serve models via an OpenAI-compatible HTTP API.
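A hedged sketch of serving and querying a model; the launch command, port, and route below are assumptions based on the OpenAI-compatibility claim, so check the server docs for the exact defaults:

```bash
# Start the HTTP server (assumed defaults; see --help for host/port).
python -m mlx_vlm.server
```

Because the API is OpenAI-compatible, the official `openai` client can point at it:

```python
from openai import OpenAI

# base_url, port, and model name are assumptions; adjust to your setup.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="mlx-community/Qwen2-VL-2B-Instruct-4bit",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```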
## What you can do
### CLI Inference

Generate text from images, audio, and video, straight from the command line.
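Single-image generation works as in the Quickstart above. For video, a hedged sketch; the module name and flags are assumptions that may differ by version, so consult the project README for the exact interface:

```bash
# Assumed video entry point; verify with the project README.
python -m mlx_vlm.video_generate \
  --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --max-tokens 100 \
  --prompt "Describe this video." \
  --video path/to/video.mp4
```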
### Multi-image analysis

Analyze multiple images simultaneously in one prompt.
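A sketch reusing the Python API from above: pass a list of images and tell the chat template how many image placeholders to insert.

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)

# Two images in a single prompt; num_images controls the placeholder count.
images = ["photos/before.jpg", "photos/after.jpg"]
prompt = "Compare these two images. What changed between them?"

formatted_prompt = apply_chat_template(processor, config, prompt, num_images=len(images))
output = generate(model, processor, formatted_prompt, images, verbose=False)
print(output)
```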
### Fine-tune with LoRA

Adapt models to your task using LoRA and QLoRA on your own data.
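A hedged sketch of a LoRA run; the module name, flags, and dataset layout below are assumptions, so run the trainer with `--help` to see the actual interface and expected data format:

```bash
# Assumed flags, for illustration only; verify with:
#   python -m mlx_vlm.lora --help
python -m mlx_vlm.lora \
  --model-path mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --dataset path/to/your_dataset \
  --epochs 1
```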
### Model conversion

Convert and quantize Hugging Face models for MLX.
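A sketch of converting and quantizing a checkpoint; the flags mirror the familiar `mlx_lm.convert` interface and should be treated as assumptions to verify with `--help`:

```bash
# Fetch a Hugging Face checkpoint, convert it to MLX, quantize to 4-bit.
python -m mlx_vlm.convert \
  --hf-path Qwen/Qwen2-VL-2B-Instruct \
  -q \
  --mlx-path ./qwen2-vl-2b-4bit
```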
## Get started in 3 steps

1. Install MLX-VLM with `pip install mlx-vlm`.
2. Run your first inference from the command line, as in the Quickstart above.
3. Integrate generation into your own code with the Python API, or serve models over the REST API.
## Key features
- 50+ model architectures — Qwen2/3 VL, LLaVA, Gemma 3, Phi-4, DeepSeek, Mllama, and more
- Omni model support — images, audio, and video inputs in a single model
- OpenAI-compatible server — drop-in replacement for OpenAI API calls, with streaming (see the sketch after this list)
- LoRA & QLoRA fine-tuning — train on your own data directly on Apple Silicon
- Model quantization — convert Hugging Face models to 4-bit, 8-bit, and mixed-precision formats
- Thinking model support — configurable token budgets for reasoning models
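A hedged sketch of the streaming path named in the feature list; the base URL, port, and model name are assumptions, as in the server example above:

```python
from openai import OpenAI

# Assumed local endpoint; adjust to your server's host and port.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Request a streamed completion and print tokens as they arrive.
stream = client.chat.completions.create(
    model="mlx-community/Qwen2-VL-2B-Instruct-4bit",
    messages=[{"role": "user", "content": "Write a haiku about Apple Silicon."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```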