MLX-VLM supports sending multiple images in one request. This lets you ask questions that span several images — for example, comparing two photos, describing a sequence of frames, or referencing multiple charts in a single prompt.

Supported models

Not all architectures support multi-image inputs. The following model families accept more than one image per request:
| Model family | Notes |
| --- | --- |
| Qwen2-VL | Full multi-image support |
| Qwen2.5-VL | Full multi-image support |
| LLaVA / LLaVA-Next | Multi-image support |
| Idefics2 / Idefics3 | Multi-image support |
| Phi-4 Multimodal | Multi-image support |
Models that do not natively support multi-image inputs will process only the first image in the list, or may produce degraded results. Check the model card on Hugging Face to confirm support.

Python example

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)

images = ["path/to/image1.jpg", "path/to/image2.jpg"]
prompt = "Compare these two images."

formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(images)
)

output = generate(model, processor, formatted_prompt, images, verbose=False)
print(output.text)
Pass the full list of images to both apply_chat_template (via num_images) and to generate. The num_images argument tells the template how many image tokens to insert into the prompt; the list passed to generate provides the actual pixel data.
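Conceptually, the template repeats an image placeholder once per image before your text. A minimal sketch of that expansion, with a made-up `<|image|>` token for illustration (the real placeholder and surrounding role markup depend on the model's chat template, which `apply_chat_template` handles for you):

```python
# Hypothetical sketch of how a chat template expands num_images into
# placeholder tokens. The token strings here are illustrative only,
# not the actual Qwen2-VL template output.
def format_multi_image_prompt(prompt, num_images, image_token="<|image|>"):
    # One placeholder per image, so the model knows where each image
    # attaches in the token stream.
    placeholders = image_token * num_images
    return f"<|user|>\n{placeholders}\n{prompt}\n<|assistant|>\n"

formatted = format_multi_image_prompt("Compare these two images.", num_images=2)
print(formatted)
```

If the number of placeholders does not match the number of images passed to generate, most processors raise an error or silently drop the extras, which is why the two must agree.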

CLI example

Pass multiple paths or URLs to --image as space-separated values:
mlx_vlm.generate \
  --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --max-tokens 100 \
  --prompt "Compare these images" \
  --image path/to/image1.jpg path/to/image2.jpg
You can mix local paths and URLs:
mlx_vlm.generate \
  --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --max-tokens 200 \
  --prompt "What do these two scenes have in common?" \
  --image ./local_photo.jpg http://images.cocodataset.org/val2017/000000039769.jpg

Using PIL images

You can pass PIL.Image.Image objects directly in the Python API instead of file paths:
from PIL import Image
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = model.config

images = [
    Image.open("path/to/image1.jpg"),
    Image.open("path/to/image2.jpg"),
]
prompt = "What differences do you notice between these two images?"

formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(images)
)

output = generate(model, processor, formatted_prompt, images, verbose=False)
print(output.text)
Using PIL.Image.Image objects is convenient when images are already loaded in memory (for example, from a preprocessing pipeline) and avoids writing temporary files to disk.
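As a self-contained illustration, the images can be constructed entirely in memory (here with `Image.new`, purely for demonstration; a real pipeline would supply decoded frames or crops) and the resulting list passed to generate exactly like a list of file paths:

```python
from PIL import Image

# Build two small in-memory images; in practice these might come from a
# video decoder, a screenshot tool, or an augmentation pipeline.
images = [
    Image.new("RGB", (224, 224), color="red"),
    Image.new("RGB", (224, 224), color="blue"),
]

# No files are written to disk; these objects go straight to generate().
print([im.size for im in images])
```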
