This guide takes you from a fresh install to running inference in four steps.
1. Install MLX-VLM

Install the package from PyPI:
pip install -U mlx-vlm
This pulls in all required dependencies. See the installation guide for extras and source install instructions.
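As a quick sanity check before moving on, you can confirm the package is importable; the `is_installed` helper below is our own sketch, not part of mlx-vlm:

```python
import importlib.util

def is_installed(package: str) -> bool:
    """Return True if the named package can be imported."""
    return importlib.util.find_spec(package) is not None

# After `pip install -U mlx-vlm`, this should print True
print("mlx_vlm available:", is_installed("mlx_vlm"))
```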
2. Run text generation from the CLI

Use mlx_vlm.generate to send a text prompt to a model. The first run downloads the model from Hugging Face automatically.
mlx_vlm.generate \
  --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --max-tokens 100 \
  --prompt "Hello, how are you?"
The model is cached locally after the first download, so subsequent runs skip the download step.
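If you want to see where the weights land, Hugging Face caches downloads under `~/.cache/huggingface/hub` by default, overridable with the `HF_HOME` environment variable. A small sketch (the `hf_cache_dir` helper is ours, not part of mlx-vlm):

```python
import os

def hf_cache_dir() -> str:
    """Default Hugging Face hub cache directory, honoring HF_HOME."""
    base = os.environ.get(
        "HF_HOME",
        os.path.join(os.path.expanduser("~"), ".cache", "huggingface"),
    )
    return os.path.join(base, "hub")

print(hf_cache_dir())
```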
3. Generate text from an image

Pass an image URL or local file path with --image:
mlx_vlm.generate \
  --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --max-tokens 100 \
  --temperature 0.0 \
  --image http://images.cocodataset.org/val2017/000000039769.jpg
The model receives the image and generates a description. You can also pass a local path:
mlx_vlm.generate \
  --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --max-tokens 100 \
  --prompt "What objects are in this image?" \
  --image /path/to/image.jpg
To analyze multiple images in one prompt, pass --image multiple times: --image image1.jpg --image image2.jpg
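When scripting the CLI from Python, it helps to assemble the argument list programmatically. A sketch, assuming the command can be invoked as a module via `python -m` (the `build_generate_cmd` helper is hypothetical):

```python
def build_generate_cmd(model: str, prompt: str, images: list[str],
                       max_tokens: int = 100) -> list[str]:
    """Assemble an argv list for subprocess.run; repeats --image per file."""
    cmd = [
        "python", "-m", "mlx_vlm.generate",
        "--model", model,
        "--max-tokens", str(max_tokens),
        "--prompt", prompt,
    ]
    for image in images:
        cmd += ["--image", image]
    return cmd

cmd = build_generate_cmd(
    "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    "Compare these two images.",
    ["image1.jpg", "image2.jpg"],
)
# subprocess.run(cmd) would then invoke the CLI
```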
4. Use the Python API

For scripting and application integration, use the Python API directly:
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the model and processor
model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)

# Prepare the image and prompt
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image."

# Apply the model's chat template
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(image)
)

# Generate and print output
output = generate(model, processor, formatted_prompt, image, verbose=False)
print(output.text)
load downloads and caches the model on first call. apply_chat_template formats the prompt correctly for the model’s expected input structure. generate returns a GenerationResult object; access the text via .text.
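The steps above can be folded into a small reusable helper. This is our sketch, not part of mlx-vlm's API: it uses the same calls as the quickstart, and defers the imports so the function can be defined before the package is installed. Note that it reloads the model on every call; a real application would load once and reuse the model and processor.

```python
def describe_image(
    image: str,
    prompt: str = "Describe this image.",
    model_path: str = "mlx-community/Qwen2-VL-2B-Instruct-4bit",
) -> str:
    """Run one image + prompt through an MLX-VLM model and return the text."""
    # Deferred imports: the function can be defined without mlx-vlm present
    from mlx_vlm import load, generate
    from mlx_vlm.prompt_utils import apply_chat_template
    from mlx_vlm.utils import load_config

    model, processor = load(model_path)
    config = load_config(model_path)
    formatted = apply_chat_template(processor, config, prompt, num_images=1)
    result = generate(model, processor, formatted, [image], verbose=False)
    return result.text

# text = describe_image("http://images.cocodataset.org/val2017/000000039769.jpg")
```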

Try the chat UI

If you installed the [ui] extra, launch an interactive Gradio chat interface:
pip install -U "mlx-vlm[ui]"
mlx_vlm.chat_ui --model mlx-community/Qwen2-VL-2B-Instruct-4bit
Open the URL printed to your terminal (typically http://127.0.0.1:7860) to chat with the model in your browser; the interface supports image uploads.
The chat UI is the fastest way to explore a model’s capabilities interactively without writing any code.

Next steps

- CLI reference: all flags for mlx_vlm.generate, video generation, and thinking budget options
- Python API: full reference for load, generate, stream_generate, and batch_generate
- REST server: serve models via an OpenAI-compatible HTTP API with streaming
- Fine-tuning: adapt models to your data using LoRA and QLoRA on Apple Silicon
