The core workflow is three steps: load the model, format the prompt, then generate output.
```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)

formatted_prompt = apply_chat_template(processor, config, "Describe this image.", num_images=1)
output = generate(model, processor, formatted_prompt, image=["image.jpg"], verbose=False)
print(output.text)
```
Step-by-step walkthrough
Load the model
`load()` downloads the model from Hugging Face (or reads it from the local cache) and returns the model and processor objects.

```python
from mlx_vlm import load
from mlx_vlm.utils import load_config

model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)
```

Pass `quantize_activations=True` when working with mxfp8 or nvfp4 quantized models on NVIDIA GPUs:

```python
model, processor = load(model_path, quantize_activations=True)
```
Format the prompt
`apply_chat_template()` wraps your prompt in the correct chat template for the loaded model. Pass `num_images` and/or `num_audios` to insert the right number of media tokens.

```python
from mlx_vlm.prompt_utils import apply_chat_template

formatted_prompt = apply_chat_template(
    processor,
    config,
    "Describe this image.",
    num_images=1,
)
```
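Conceptually, the template inserts one media placeholder token per image or audio input and wraps the prompt in a chat-style turn. A toy sketch of that behavior (the `<image>`/`<audio>` placeholders and turn markers here are illustrative, not the exact tokens any real model uses):

```python
def toy_chat_template(prompt: str, num_images: int = 0, num_audios: int = 0) -> str:
    """Illustrative only: prepend one placeholder token per media input,
    then wrap the prompt in a chat-style user/assistant turn."""
    media = "<image>" * num_images + "<audio>" * num_audios
    return f"<|user|>\n{media}{prompt}\n<|assistant|>\n"

print(toy_chat_template("Describe this image.", num_images=1))
```

This is why `num_images` must match the number of images you later pass to `generate()`: the model expects exactly one placeholder per media input.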
Generate output
Pass the formatted prompt, model, processor, and any media files to `generate()`. The function returns a `GenerationResult` object; access the text via `.text`, or pass `verbose=True` to print tokens and timing to stdout.

```python
from mlx_vlm import generate

result = generate(
    model,
    processor,
    formatted_prompt,
    image=["http://images.cocodataset.org/val2017/000000039769.jpg"],
    verbose=False,
)
print(result.text)
```
Usage examples
Describe an image with a vision-language model:

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)

image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image."

formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(image)
)

output = generate(model, processor, formatted_prompt, image, verbose=False)
print(output.text)
```
You can also pass `PIL.Image.Image` objects directly instead of file paths or URLs:

```python
from PIL import Image

image = [Image.open("photo.jpg")]
```
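This is useful for images that never touch disk, such as frames decoded from bytes. A minimal sketch (the image here is synthetic, just to show that any `PIL.Image.Image` works):

```python
import io
from PIL import Image

# Build an in-memory JPEG, then reopen it as a PIL image — no file path involved.
buf = io.BytesIO()
Image.new("RGB", (64, 64), color=(200, 30, 30)).save(buf, format="JPEG")
buf.seek(0)

image = [Image.open(buf)]  # pass this list straight to generate()
print(image[0].size)  # → (64, 64)
```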
Describe audio with an audio-capable model such as Gemma 3n:

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model_path = "mlx-community/gemma-3n-E2B-it-4bit"
model, processor = load(model_path)
config = model.config

audio = ["/path/to/audio1.wav", "/path/to/audio2.mp3"]
prompt = "Describe what you hear in these audio files."

formatted_prompt = apply_chat_template(
    processor, config, prompt, num_audios=len(audio)
)

output = generate(model, processor, formatted_prompt, audio=audio, verbose=False)
print(output.text)
```
Combine image and audio inputs in a single prompt:

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model_path = "mlx-community/gemma-3n-E2B-it-4bit"
model, processor = load(model_path)
config = model.config

image = ["/path/to/image.jpg"]
audio = ["/path/to/audio.wav"]
prompt = "Describe what you see and hear."

formatted_prompt = apply_chat_template(
    processor, config, prompt,
    num_images=len(image),
    num_audios=len(audio),
)

output = generate(model, processor, formatted_prompt, image, audio=audio, verbose=False)
print(output.text)
```
Streaming with stream_generate
Use stream_generate() when you want to process tokens as they are produced rather than waiting for the full response. It yields GenerationResult objects, each containing the latest text segment and running statistics.
```python
from mlx_vlm import load, stream_generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)

image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
formatted_prompt = apply_chat_template(processor, config, "What is in this image?", num_images=1)

for result in stream_generate(model, processor, formatted_prompt, image=image, max_tokens=200):
    print(result.text, end="", flush=True)
print()  # newline after stream ends
```
GenerationResult fields available on each yielded object:
| Field | Type | Description |
|---|---|---|
| `text` | `str` | Text segment generated since the last yield |
| `token` | `int` | Most recently generated token ID |
| `prompt_tokens` | `int` | Number of tokens in the prompt |
| `generation_tokens` | `int` | Number of tokens generated so far |
| `prompt_tps` | `float` | Prompt processing speed (tokens/sec) |
| `generation_tps` | `float` | Generation speed (tokens/sec) |
| `peak_memory` | `float` | Peak memory usage in GB |
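Because each yield carries only the newest text segment, a consumer typically concatenates `.text` across yields and reads the running statistics from the last result. A stand-in dataclass (hypothetical, mirroring the fields above) makes the pattern easy to prototype without loading a model:

```python
from dataclasses import dataclass

@dataclass
class FakeGenerationResult:
    """Stand-in for mlx_vlm's GenerationResult, mirroring the documented fields."""
    text: str
    token: int
    prompt_tokens: int
    generation_tokens: int
    prompt_tps: float
    generation_tps: float
    peak_memory: float

def fake_stream():
    # Simulates stream_generate(): each yield carries only the newest segment.
    for i, seg in enumerate(["Two ", "cats ", "on a couch."], start=1):
        yield FakeGenerationResult(seg, 100 + i, 12, i, 350.0, 41.5, 1.8)

parts = []
last = None
for result in fake_stream():
    parts.append(result.text)  # .text is the newest segment, not the full response
    last = result

full_text = "".join(parts)
print(full_text)  # → Two cats on a couch.
print(f"{last.generation_tokens} tokens at {last.generation_tps:.1f} tok/s")
```

Swapping `fake_stream()` for a real `stream_generate(...)` call gives the same accumulation loop.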
`generate()` parameters

- `model` (required) — The loaded model returned by `load()`.
- `processor` (`PreTrainedTokenizer`, required) — The processor returned by `load()`.
- `prompt` (`str`, required) — The formatted prompt string. Use `apply_chat_template()` to produce this value.
- `image` (`str | list[str] | list[PIL.Image.Image]`, default `None`) — One or more images passed as file paths, URLs, or `PIL.Image.Image` objects.
- `audio` (`str | list[str]`, default `None`) — One or more audio files passed as local paths or URLs.
- `verbose` (`bool`) — When `True`, prints each token and a timing summary to stdout as generation proceeds.
- `max_tokens` (`int`) — Maximum number of tokens to generate.
- `temperature` (`float`) — Sampling temperature. Set to `0` for deterministic greedy decoding.
- `top_p` (`float`) — Nucleus sampling threshold. The model samples from the smallest set of tokens whose cumulative probability exceeds this value.
- `top_k` (`int`) — Restrict sampling to the top-k most probable tokens. `0` disables top-k filtering.
- `min_p` (`float`) — Minimum probability threshold relative to the top token. Tokens below this threshold are discarded.
- `repetition_penalty` (`float`) — Penalty applied to tokens that have already appeared in the context. Values above `1.0` discourage repetition.
- `repetition_context_size` (`int`) — Number of preceding tokens to consider when applying the repetition penalty.
- `max_kv_size` (`int`) — Cap the KV cache to this number of tokens. Useful for very long prompts.
- `kv_bits` (`int`) — Quantize the KV cache to this many bits to reduce memory usage.
- `prefill_step_size` (`int`) — Number of tokens processed per prefill chunk. Lower values reduce peak memory at the cost of slower prefill.
- `resize_shape` (`int | tuple[int, int]`, default `None`) — Resize images to this shape before processing. Pass a single integer for a square crop or a `(height, width)` tuple.
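The sampling parameters each filter the next-token distribution before a token is drawn. A toy implementation of the nucleus (top-p) rule described above (pure Python, not the library's actual sampler):

```python
def top_p_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the smallest set of tokens whose cumulative probability
    exceeds top_p, then renormalize. Illustrative only."""
    total = 0.0
    kept = {}
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        total += p
        if total >= top_p:
            break
    return {tok: p / total for tok, p in kept.items()}

dist = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "car": 0.05}
print(top_p_filter(dist, top_p=0.9))  # keeps cat, dog, fish; drops the low-probability tail
```

With `top_p=0.9`, the cumulative mass reaches 0.95 after `fish`, so `car` is discarded and the surviving probabilities are renormalized. `top_k` and `min_p` apply analogous cutoffs by rank and by ratio to the top token, respectively.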