The core workflow is three steps: load the model, format the prompt, then generate output.
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model, processor = load("mlx-community/Qwen2-VL-2B-Instruct-4bit")
config = load_config("mlx-community/Qwen2-VL-2B-Instruct-4bit")

formatted_prompt = apply_chat_template(processor, config, "Describe this image.", num_images=1)
output = generate(model, processor, formatted_prompt, image=["image.jpg"], verbose=False)
print(output.text)  # generate() returns a GenerationResult; the text is in .text

Step-by-step walkthrough

1. Load the model

load() downloads the model from Hugging Face (or reads from the local cache) and returns the model and processor objects.
from mlx_vlm import load
from mlx_vlm.utils import load_config

model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)
Pass quantize_activations=True when loading mxfp8- or nvfp4-quantized models on NVIDIA GPUs:
model, processor = load(model_path, quantize_activations=True)
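
One way to keep this flag out of individual call sites is to build the keyword arguments first. The sketch below is a hypothetical helper, not part of mlx_vlm: load_kwargs, its quant_mode argument, and the on_nvidia_gpu flag are names introduced here for illustration, and the function only encodes the rule stated above.

```python
from typing import Optional

# Quantization modes for which the text above recommends quantize_activations=True.
ACTIVATION_QUANT_MODES = {"mxfp8", "nvfp4"}


def load_kwargs(quant_mode: Optional[str], on_nvidia_gpu: bool) -> dict:
    """Return extra keyword arguments for load() (hypothetical helper)."""
    if (
        on_nvidia_gpu
        and quant_mode is not None
        and quant_mode.lower() in ACTIVATION_QUANT_MODES
    ):
        return {"quantize_activations": True}
    return {}


# Both arguments are supplied by the caller; nothing is detected automatically.
print(load_kwargs("nvfp4", on_nvidia_gpu=True))
print(load_kwargs(None, on_nvidia_gpu=False))
```

The result would then be unpacked into the call, for example load(model_path, **load_kwargs("nvfp4", on_nvidia_gpu=True)).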

2. Format the prompt

apply_chat_template() wraps your prompt in the correct chat template for the loaded model. Pass num_images and/or num_audios to insert the right number of media tokens.
from mlx_vlm.prompt_utils import apply_chat_template

formatted_prompt = apply_chat_template(
    processor,
    config,
    "Describe this image.",
    num_images=1,
)
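
Because generate() accepts an image either as a single string or as a list, calling len() on the argument is only safe when it is a list: len("image.jpg") counts characters, not files. The sketch below shows one way to derive num_images and num_audios safely. count_media is a hypothetical helper introduced for illustration, and it only handles the input forms listed in this document (None, a single string, or a list).

```python
from typing import Any, Sequence, Union

Media = Union[str, Sequence[Any], None]


def count_media(media: Media) -> int:
    """Count media items: None -> 0, a single path/URL string -> 1, a list -> its length."""
    if media is None:
        return 0
    if isinstance(media, str):
        # len() on a string would count characters, not files.
        return 1
    return len(media)


images = "image.jpg"  # a single path given as a plain string
audio = None          # no audio input

template_kwargs = {
    "num_images": count_media(images),
    "num_audios": count_media(audio),
}
print(template_kwargs)  # {'num_images': 1, 'num_audios': 0}
```

These values can then be unpacked into the call as apply_chat_template(processor, config, prompt, **template_kwargs).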

3. Generate output

Pass the model, processor, formatted prompt, and any media files to generate(). The function returns a GenerationResult object; read the generated text from its .text attribute. Pass verbose=True to also print tokens and timing to stdout.
from mlx_vlm import generate

result = generate(
    model,
    processor,
    formatted_prompt,
    image=["http://images.cocodataset.org/val2017/000000039769.jpg"],
    verbose=False,
)
print(result.text)

Usage examples

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)

image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image."

formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(image)
)

output = generate(model, processor, formatted_prompt, image=image, verbose=False)
print(output.text)

You can also pass PIL.Image.Image objects directly instead of file paths or URLs:
from PIL import Image

image = [Image.open("photo.jpg")]

Streaming with stream_generate

Use stream_generate() when you want to process tokens as they are produced rather than waiting for the full response. It yields GenerationResult objects, each containing the latest text segment and running statistics.
from mlx_vlm import load, stream_generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)

image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
formatted_prompt = apply_chat_template(processor, config, "What is in this image?", num_images=1)

for result in stream_generate(model, processor, formatted_prompt, image=image, max_tokens=200):
    print(result.text, end="", flush=True)
print()  # newline after stream ends
GenerationResult fields available on each yielded object:
| Field | Type | Description |
| --- | --- | --- |
| text | str | Text segment generated since the last yield |
| token | int | Most recently generated token ID |
| prompt_tokens | int | Number of tokens in the prompt |
| generation_tokens | int | Number of tokens generated so far |
| prompt_tps | float | Prompt processing speed (tokens/sec) |
| generation_tps | float | Generation speed (tokens/sec) |
| peak_memory | float | Peak memory usage in GB |
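
Since each yielded object carries only the newest text segment, the full response has to be assembled by the caller. The sketch below shows one way to do that and to report the final statistics. So that it runs without a model, it uses FakeResult and fake_stream as stand-ins for GenerationResult and stream_generate(); both stand-ins, the collect helper, and the sample values are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Tuple


@dataclass
class FakeResult:
    """Stand-in with a subset of the fields listed above (not the real class)."""
    text: str
    generation_tokens: int
    generation_tps: float
    peak_memory: float


def fake_stream() -> Iterator[FakeResult]:
    """Hypothetical replacement for stream_generate(...) so the sketch runs offline."""
    for i, segment in enumerate(["Two ", "cats ", "on a couch."], start=1):
        yield FakeResult(text=segment, generation_tokens=i,
                         generation_tps=50.0, peak_memory=2.5)


def collect(stream: Iterable) -> Tuple[str, Optional[object]]:
    """Join the per-yield text segments and keep the last result for its statistics."""
    parts = []
    last = None
    for result in stream:
        parts.append(result.text)  # each .text holds only the newest segment
        last = result
    return "".join(parts), last


full_text, last = collect(fake_stream())
print(full_text)
if last is not None:
    print(f"{last.generation_tokens} tokens, "
          f"{last.generation_tps:.1f} tok/s, "
          f"{last.peak_memory:.2f} GB peak")
```

With the real library, the same collect() function would be called on the iterator returned by stream_generate(...), assuming the yielded objects expose the fields in the table above.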

generate() parameters

model (nn.Module, required)
The loaded model returned by load().

processor (PreTrainedTokenizer, required)
The processor returned by load().

prompt (string, required)
The formatted prompt string. Use apply_chat_template() to produce this value.

image (string | list[string] | list[PIL.Image.Image], default: None)
One or more images passed as file paths, URLs, or PIL.Image.Image objects.

audio (string | list[string], default: None)
One or more audio files passed as local paths or URLs.

verbose (boolean, default: False)
When True, prints each token and a timing summary to stdout as generation proceeds.

max_tokens (integer, default: 256)
Maximum number of tokens to generate.

temperature (float, default: 0.0)
Sampling temperature. Set to 0 for deterministic greedy decoding.

top_p (float, default: 1.0)
Nucleus sampling threshold. The model samples from the smallest set of tokens whose cumulative probability exceeds this value.

top_k (integer, default: 0)
Restrict sampling to the top-k most probable tokens. 0 disables top-k filtering.

min_p (float, default: 0.0)
Minimum probability threshold relative to the top token. Tokens below this threshold are discarded.

repetition_penalty (float, default: None)
Penalty applied to tokens that have already appeared in the context. Values above 1.0 discourage repetition.

repetition_context_size (integer, default: 20)
Number of preceding tokens to consider when applying the repetition penalty.

max_kv_size (integer, default: None)
Cap the KV cache to this number of tokens. Useful for very long prompts.

kv_bits (integer, default: None)
Quantize the KV cache to this many bits to reduce memory usage.

prefill_step_size (integer, default: 2048)
Number of tokens processed per prefill chunk. Lower values reduce peak memory at the cost of slower prefill.

resize_shape (integer | tuple[int, int], default: None)
Resize images to this shape before processing. Pass a single integer for a square crop or a (height, width) tuple.
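
To make the sampling parameters concrete, the sketch below applies top_k, top_p, and min_p filtering to a toy probability distribution in plain Python. It follows the descriptions above and the common definitions of these filters; it is not mlx_vlm's implementation. The function names, the example distribution, and the choice to treat the top_p boundary as "reaches or exceeds" are assumptions made for illustration.

```python
from typing import Dict


def renormalize(probs: Dict[str, float]) -> Dict[str, float]:
    """Rescale the kept probabilities so they sum to 1."""
    total = sum(probs.values())
    return {tok: p / total for tok, p in probs.items()}


def top_k_filter(probs: Dict[str, float], k: int) -> Dict[str, float]:
    """Keep the k most probable tokens; k <= 0 disables the filter."""
    if k <= 0:
        return dict(probs)
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return renormalize(dict(kept))


def top_p_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    kept = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    return renormalize(kept)


def min_p_filter(probs: Dict[str, float], min_p: float) -> Dict[str, float]:
    """Drop tokens whose probability is below min_p times the top token's probability."""
    threshold = min_p * max(probs.values())
    return renormalize({tok: p for tok, p in probs.items() if p >= threshold})


# Toy next-token distribution (illustrative values only).
probs = {"cat": 0.5, "dog": 0.3, "fox": 0.15, "owl": 0.05}

print(sorted(top_k_filter(probs, 2)))    # ['cat', 'dog']
print(sorted(top_p_filter(probs, 0.7)))  # ['cat', 'dog']
print(sorted(min_p_filter(probs, 0.2)))  # ['cat', 'dog', 'fox']
```

With the defaults listed above (top_k=0, top_p=1.0, min_p=0.0), none of these filters removes any token, which matches the descriptions of the default values.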
