## Requirements
- Python 3.10 or later
- macOS with Apple Silicon (M1 / M2 / M3 / M4) for native MLX acceleration
- pip (included with Python)
MLX-VLM also runs on Intel Macs and Linux via the `mlx-cpu` backend, and on NVIDIA GPUs via the `mlx-cuda` backend. Performance and feature coverage are best on Apple Silicon.

## Install
- pip (recommended)
- From source
### Install the package
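MLX-VLM is published on PyPI under the name `mlx-vlm`, so the base install is a single pip command:

```shell
pip install mlx-vlm
```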
Installing MLX-VLM from PyPI pulls in the core library along with all required dependencies: `mlx`, `transformers`, `mlx-lm`, `Pillow`, `fastapi`, `opencv-python`, and more.

### Optional extras

Install extras alongside the base package using the bracket syntax:

| Extra | Package added | When to use |
|---|---|---|
| `ui` | `gradio>=5.19.0` | You want the built-in `mlx_vlm.chat_ui` web interface |
| `cuda` | `mlx-cuda` | You are running on a machine with an NVIDIA GPU |
| `cpu` | `mlx-cpu` | You need a pure-CPU fallback (Intel Mac, or Linux without a GPU) |
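For example, to add the Gradio UI extra on top of the base package (the quotes keep the shell from interpreting the brackets):

```shell
pip install "mlx-vlm[ui]"
```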
### MLX CUDA support
MLX-VLM can run on NVIDIA GPUs through the experimental MLX CUDA backend. Models quantized in `mxfp8` or `nvfp4` mode require activation quantization when running on CUDA: pass the `-qa` flag to `mlx_vlm.generate`, or set `quantize_activations=True` in the Python API.
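A sketch of the CLI invocation, assuming the standard `--model`, `--image`, and `--prompt` flags of `mlx_vlm.generate`; the model id is a placeholder, not a specific recommended checkpoint:

```shell
# Run an mxfp8/nvfp4 model on the CUDA backend with activation
# quantization enabled via -qa. <mxfp8-model-id> is a placeholder
# for a Hugging Face model id or a local path.
python -m mlx_vlm.generate \
  --model <mxfp8-model-id> \
  --image photo.jpg \
  --prompt "Describe this image." \
  -qa
```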
On Apple Silicon the `-qa` flag is not required; Metal-backed MLX handles `mxfp8` and `nvfp4` models natively.

## Installed CLI commands
After installation the following commands are available on your `PATH`:
| Command | Description |
|---|---|
| `mlx_vlm.generate` | Run inference from the command line |
| `mlx_vlm.chat_ui` | Launch the Gradio chat interface (requires the `[ui]` extra) |
| `mlx_vlm.server` | Start the FastAPI REST server |
| `mlx_vlm.convert` | Convert and quantize Hugging Face models to MLX format |
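These entry points can also be invoked as Python modules; for example, a minimal way to start the REST server (no flags shown, since the defaults depend on your install):

```shell
# Start the FastAPI REST server shipped with MLX-VLM.
python -m mlx_vlm.server
```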