
Requirements

  • Python 3.10 or later
  • macOS with Apple Silicon (M1 / M2 / M3 / M4) for native MLX acceleration
  • pip (included with Python)

MLX-VLM also runs on Intel Macs and Linux via the mlx-cpu backend, and on NVIDIA GPUs via the mlx-cuda backend. Performance and feature coverage are best on Apple Silicon.

Install
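
Install the latest release from PyPI (the extras below use the same package name):

```shell
pip install -U mlx-vlm
```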

Optional extras

Install extras alongside the base package using the bracket syntax:
pip install -U "mlx-vlm[ui]"        # Gradio chat UI
pip install -U "mlx-vlm[cuda]"      # NVIDIA GPU support via MLX CUDA
pip install -U "mlx-vlm[cpu]"       # CPU-only backend (no Metal/CUDA)
Extra   Package added      When to use
ui      gradio>=5.19.0     You want the built-in mlx_vlm.chat_ui web interface
cuda    mlx-cuda           You are running on a machine with an NVIDIA GPU
cpu     mlx-cpu            You need a pure-CPU fallback (Intel Mac, Linux without GPU)
The cuda and cpu extras replace the default Metal-accelerated mlx backend. Do not install both cuda and cpu in the same environment.

MLX CUDA support

MLX-VLM can run on NVIDIA GPUs through the experimental MLX CUDA backend.
pip install -U "mlx-vlm[cuda]"
Models quantized with mxfp8 or nvfp4 modes require activation quantization when running on CUDA. Pass the -qa flag to mlx_vlm.generate, or set quantize_activations=True in the Python API:
mlx_vlm.generate \
  --model /path/to/mxfp8-model \
  --prompt "Describe this image." \
  --image /path/to/image.jpg \
  -qa
On Apple Silicon the -qa flag is not required — Metal-backed MLX handles mxfp8 and nvfp4 models natively.
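
The Python-API equivalent can be sketched as follows. This is a hypothetical example: the model path is a placeholder, and the exact placement of quantize_activations is an assumption — check the load/generate signatures in your installed mlx_vlm version.

```python
# Sketch of activation quantization from Python (equivalent of the -qa CLI flag).
# The model path and keyword placement below are assumptions, not verified API.
from mlx_vlm import load, generate

model, processor = load(
    "/path/to/mxfp8-model",       # local mxfp8- or nvfp4-quantized model
    quantize_activations=True,    # required for mxfp8/nvfp4 models on CUDA
)
output = generate(
    model,
    processor,
    prompt="Describe this image.",
    image="/path/to/image.jpg",
)
print(output)
```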

Installed CLI commands

After installation the following commands are available on your PATH:
Command            Description
mlx_vlm.generate   Run inference from the command line
mlx_vlm.chat_ui    Launch the Gradio chat interface ([ui] extra required)
mlx_vlm.server     Start the FastAPI REST server
mlx_vlm.convert    Convert and quantize Hugging Face models to MLX format
