This guide takes you from a fresh install to running inference in four steps.
1. Install MLX-VLM

Install the package from PyPI:
pip install -U mlx-vlm
This pulls in all required dependencies. See the installation guide for extras and source install instructions.
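As a quick sanity check before moving on, you can confirm the package is importable; the `is_installed` helper below is our own sketch, not part of mlx-vlm:

```python
import importlib.util

def is_installed(package: str) -> bool:
    """Return True if the named package can be imported."""
    return importlib.util.find_spec(package) is not None

# After `pip install -U mlx-vlm`, this should print True
print("mlx_vlm available:", is_installed("mlx_vlm"))
```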
2. Run text generation from the CLI

Use mlx_vlm.generate to send a text prompt to a model. The first run downloads the model from Hugging Face automatically.
mlx_vlm.generate \
  --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --max-tokens 100 \
  --prompt "Hello, how are you?"
The model is cached locally after the first download, so subsequent runs skip the download step.
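If you want to see where the weights land, Hugging Face caches downloads under `~/.cache/huggingface/hub` by default, overridable with the `HF_HOME` environment variable. A small sketch (the `hf_cache_dir` helper is ours, not part of mlx-vlm):

```python
import os

def hf_cache_dir() -> str:
    """Default Hugging Face hub cache directory, honoring HF_HOME."""
    base = os.environ.get(
        "HF_HOME",
        os.path.join(os.path.expanduser("~"), ".cache", "huggingface"),
    )
    return os.path.join(base, "hub")

print(hf_cache_dir())
```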
3. Generate text from an image

Pass an image URL or local file path with --image:
mlx_vlm.generate \
  --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --max-tokens 100 \
  --temperature 0.0 \
  --image http://images.cocodataset.org/val2017/000000039769.jpg
The model receives the image and generates a description. You can also pass a local path:
mlx_vlm.generate \
  --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --max-tokens 100 \
  --prompt "What objects are in this image?" \
  --image /path/to/image.jpg
To analyze multiple images in one prompt, pass --image multiple times: --image image1.jpg --image image2.jpg
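When scripting the CLI from Python, it helps to assemble the argument list programmatically. A sketch, assuming the command can be invoked as a module via `python -m` (the `build_generate_cmd` helper is hypothetical):

```python
def build_generate_cmd(model: str, prompt: str, images: list[str],
                       max_tokens: int = 100) -> list[str]:
    """Assemble an argv list for subprocess.run; repeats --image per file."""
    cmd = [
        "python", "-m", "mlx_vlm.generate",
        "--model", model,
        "--max-tokens", str(max_tokens),
        "--prompt", prompt,
    ]
    for image in images:
        cmd += ["--image", image]
    return cmd

cmd = build_generate_cmd(
    "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    "Compare these two images.",
    ["image1.jpg", "image2.jpg"],
)
# subprocess.run(cmd) would then invoke the CLI
```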
4. Use the Python API

For scripting and application integration, use the Python API directly:
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the model and processor
model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)

# Prepare the image and prompt
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image."

# Apply the model's chat template
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(image)
)

# Generate and print output
output = generate(model, processor, formatted_prompt, image, verbose=False)
print(output.text)
load downloads and caches the model on first call. apply_chat_template formats the prompt correctly for the model’s expected input structure. generate returns a GenerationResult object; access the text via .text.
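The steps above can be folded into a small reusable helper. This is our sketch, not part of mlx-vlm's API: it uses the same calls as the quickstart, and defers the imports so the function can be defined before the package is installed. Note that it reloads the model on every call; a real application would load once and reuse the model and processor.

```python
def describe_image(
    image: str,
    prompt: str = "Describe this image.",
    model_path: str = "mlx-community/Qwen2-VL-2B-Instruct-4bit",
) -> str:
    """Run one image + prompt through an MLX-VLM model and return the text."""
    # Deferred imports: the function can be defined without mlx-vlm present
    from mlx_vlm import load, generate
    from mlx_vlm.prompt_utils import apply_chat_template
    from mlx_vlm.utils import load_config

    model, processor = load(model_path)
    config = load_config(model_path)
    formatted = apply_chat_template(processor, config, prompt, num_images=1)
    result = generate(model, processor, formatted, [image], verbose=False)
    return result.text

# text = describe_image("http://images.cocodataset.org/val2017/000000039769.jpg")
```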

Try the chat UI

If you installed the [ui] extra, launch an interactive Gradio chat interface:
pip install -U "mlx-vlm[ui]"
mlx_vlm.chat_ui --model mlx-community/Qwen2-VL-2B-Instruct-4bit
Open the URL printed to your terminal (typically http://127.0.0.1:7860) to chat with the model in your browser; the interface supports image uploads.
The chat UI is the fastest way to explore a model’s capabilities interactively without writing any code.

Next steps

- CLI reference: all flags for mlx_vlm.generate, video generation, and thinking budget options
- Python API: full reference for load, generate, stream_generate, and batch_generate
- REST server: serve models via an OpenAI-compatible HTTP API with streaming
- Fine-tuning: adapt models to your data using LoRA and QLoRA on Apple Silicon
