The core workflow is three steps: load the model, format the prompt, then generate output.
```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)

formatted_prompt = apply_chat_template(processor, config, "Describe this image.", num_images=1)
output = generate(model, processor, formatted_prompt, image=["image.jpg"], verbose=False)
print(output.text)
```
Step-by-step walkthrough
Load the model
`load()` downloads the model from Hugging Face (or reads it from the local cache) and returns the model and processor objects.

```python
from mlx_vlm import load
from mlx_vlm.utils import load_config

model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)
```

Pass `quantize_activations=True` when working with mxfp8 or nvfp4 quantized models on NVIDIA GPUs:

```python
model, processor = load(model_path, quantize_activations=True)
```
Format the prompt
`apply_chat_template()` wraps your prompt in the correct chat template for the loaded model. Pass `num_images` and/or `num_audios` to insert the right number of media tokens.

```python
from mlx_vlm.prompt_utils import apply_chat_template

formatted_prompt = apply_chat_template(
    processor,
    config,
    "Describe this image.",
    num_images=1,
)
```
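Conceptually, the template inserts one media placeholder token per image or audio input and wraps the prompt in a chat-style turn. A toy sketch of that behavior (the `<image>`/`<audio>` placeholders and turn markers here are illustrative, not the exact tokens any real model uses):

```python
def toy_chat_template(prompt: str, num_images: int = 0, num_audios: int = 0) -> str:
    """Illustrative only: prepend one placeholder token per media input,
    then wrap the prompt in a chat-style user/assistant turn."""
    media = "<image>" * num_images + "<audio>" * num_audios
    return f"<|user|>\n{media}{prompt}\n<|assistant|>\n"

print(toy_chat_template("Describe this image.", num_images=1))
```

This is why `num_images` must match the number of images you later pass to `generate()`: the model expects exactly one placeholder per media input.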
Generate output
Pass the formatted prompt, model, processor, and any media files to `generate()`. The function returns a `GenerationResult` object; access the text via `.text`, or pass `verbose=True` to print tokens and timing to stdout.

```python
from mlx_vlm import generate

result = generate(
    model,
    processor,
    formatted_prompt,
    image=["http://images.cocodataset.org/val2017/000000039769.jpg"],
    verbose=False,
)
print(result.text)
```
Usage examples
Describe an image with a vision-language model:

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)

image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image."

formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(image)
)

output = generate(model, processor, formatted_prompt, image, verbose=False)
print(output.text)
```
You can also pass `PIL.Image.Image` objects directly instead of file paths or URLs:

```python
from PIL import Image

image = [Image.open("photo.jpg")]
```
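This is useful for images that never touch disk, such as frames decoded from bytes. A minimal sketch (the image here is synthetic, just to show that any `PIL.Image.Image` works):

```python
import io
from PIL import Image

# Build an in-memory JPEG, then reopen it as a PIL image — no file path involved.
buf = io.BytesIO()
Image.new("RGB", (64, 64), color=(200, 30, 30)).save(buf, format="JPEG")
buf.seek(0)

image = [Image.open(buf)]  # pass this list straight to generate()
print(image[0].size)  # → (64, 64)
```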
Describe audio with an audio-capable model such as Gemma 3n:

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model_path = "mlx-community/gemma-3n-E2B-it-4bit"
model, processor = load(model_path)
config = model.config

audio = ["/path/to/audio1.wav", "/path/to/audio2.mp3"]
prompt = "Describe what you hear in these audio files."

formatted_prompt = apply_chat_template(
    processor, config, prompt, num_audios=len(audio)
)

output = generate(model, processor, formatted_prompt, audio=audio, verbose=False)
print(output.text)
```
Combine image and audio inputs in a single prompt:

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model_path = "mlx-community/gemma-3n-E2B-it-4bit"
model, processor = load(model_path)
config = model.config

image = ["/path/to/image.jpg"]
audio = ["/path/to/audio.wav"]
prompt = "Describe what you see and hear."

formatted_prompt = apply_chat_template(
    processor, config, prompt,
    num_images=len(image),
    num_audios=len(audio),
)

output = generate(model, processor, formatted_prompt, image, audio=audio, verbose=False)
print(output.text)
```
Streaming with stream_generate
Use stream_generate() when you want to process tokens as they are produced rather than waiting for the full response. It yields GenerationResult objects, each containing the latest text segment and running statistics.
```python
from mlx_vlm import load, stream_generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)

image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
formatted_prompt = apply_chat_template(processor, config, "What is in this image?", num_images=1)

for result in stream_generate(model, processor, formatted_prompt, image=image, max_tokens=200):
    print(result.text, end="", flush=True)
print()  # newline after stream ends
```
GenerationResult fields available on each yielded object:
| Field | Type | Description |
|---|---|---|
| `text` | `str` | Text segment generated since the last yield |
| `token` | `int` | Most recently generated token ID |
| `prompt_tokens` | `int` | Number of tokens in the prompt |
| `generation_tokens` | `int` | Number of tokens generated so far |
| `prompt_tps` | `float` | Prompt processing speed (tokens/sec) |
| `generation_tps` | `float` | Generation speed (tokens/sec) |
| `peak_memory` | `float` | Peak memory usage in GB |
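Because each yield carries only the newest text segment, a consumer typically concatenates `.text` across yields and reads the running statistics from the last result. A stand-in dataclass (hypothetical, mirroring the fields above) makes the pattern easy to prototype without loading a model:

```python
from dataclasses import dataclass

@dataclass
class FakeGenerationResult:
    """Stand-in for mlx_vlm's GenerationResult, mirroring the documented fields."""
    text: str
    token: int
    prompt_tokens: int
    generation_tokens: int
    prompt_tps: float
    generation_tps: float
    peak_memory: float

def fake_stream():
    # Simulates stream_generate(): each yield carries only the newest segment.
    for i, seg in enumerate(["Two ", "cats ", "on a couch."], start=1):
        yield FakeGenerationResult(seg, 100 + i, 12, i, 350.0, 41.5, 1.8)

parts = []
last = None
for result in fake_stream():
    parts.append(result.text)  # .text is the newest segment, not the full response
    last = result

full_text = "".join(parts)
print(full_text)  # → Two cats on a couch.
print(f"{last.generation_tokens} tokens at {last.generation_tps:.1f} tok/s")
```

Swapping `fake_stream()` for a real `stream_generate(...)` call gives the same accumulation loop.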
`generate()` parameters

- `model` (required) — The loaded model returned by `load()`.
- `processor` (`PreTrainedTokenizer`, required) — The processor returned by `load()`.
- `prompt` (`str`, required) — The formatted prompt string. Use `apply_chat_template()` to produce this value.
- `image` (`str | list[str] | list[PIL.Image.Image]`, default `None`) — One or more images passed as file paths, URLs, or `PIL.Image.Image` objects.
- `audio` (`str | list[str]`, default `None`) — One or more audio files passed as local paths or URLs.
- `verbose` (`bool`) — When `True`, prints each token and a timing summary to stdout as generation proceeds.
- `max_tokens` (`int`) — Maximum number of tokens to generate.
- `temperature` (`float`) — Sampling temperature. Set to `0` for deterministic greedy decoding.
- `top_p` (`float`) — Nucleus sampling threshold. The model samples from the smallest set of tokens whose cumulative probability exceeds this value.
- `top_k` (`int`) — Restrict sampling to the top-k most probable tokens. `0` disables top-k filtering.
- `min_p` (`float`) — Minimum probability threshold relative to the top token. Tokens below this threshold are discarded.
- `repetition_penalty` (`float`) — Penalty applied to tokens that have already appeared in the context. Values above `1.0` discourage repetition.
- `repetition_context_size` (`int`) — Number of preceding tokens to consider when applying the repetition penalty.
- `max_kv_size` (`int`) — Cap the KV cache to this number of tokens. Useful for very long prompts.
- `kv_bits` (`int`) — Quantize the KV cache to this many bits to reduce memory usage.
- `prefill_step_size` (`int`) — Number of tokens processed per prefill chunk. Lower values reduce peak memory at the cost of slower prefill.
- `resize_shape` (`int | tuple[int, int]`, default `None`) — Resize images to this shape before processing. Pass a single integer for a square crop or a `(height, width)` tuple.
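The sampling parameters each filter the next-token distribution before a token is drawn. A toy implementation of the nucleus (top-p) rule described above (pure Python, not the library's actual sampler):

```python
def top_p_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the smallest set of tokens whose cumulative probability
    exceeds top_p, then renormalize. Illustrative only."""
    total = 0.0
    kept = {}
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        total += p
        if total >= top_p:
            break
    return {tok: p / total for tok, p in kept.items()}

dist = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "car": 0.05}
print(top_p_filter(dist, top_p=0.9))  # keeps cat, dog, fish; drops the low-probability tail
```

With `top_p=0.9`, the cumulative mass reaches 0.95 after `fish`, so `car` is discarded and the surviving probabilities are renormalized. `top_k` and `min_p` apply analogous cutoffs by rank and by ratio to the top token, respectively.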