load()
Load a model and processor from a local directory or Hugging Face Hub repo. Downloads the model automatically if it is not already cached locally.
Signature
Parameters
- Local directory path or Hugging Face repository ID (e.g. "mlx-community/Qwen2-VL-2B-Instruct-4bit").
- Path to LoRA adapter weights. When provided, LoRA layers are applied on top of the base model.
- When False, all model parameters are evaluated (loaded into memory) before the function returns; when True, parameters are loaded on first use. Set to True for large models when you want to defer memory allocation.
- A Hugging Face revision identifier: a branch name, tag, or commit hash. Defaults to "main".
- Convert QuantizedLinear layers to QQLinear layers for activation quantization. Required when running mxfp8 or nvfp4 quantized models on NVIDIA CUDA; has no effect on Apple Silicon (Metal).
- Force re-download from Hugging Face Hub even if the model is already cached.
- Allow execution of custom model code from the repository. Required for some models that ship non-standard architectures.
Returns
A tuple of (model, processor):
- model — The loaded MLX model, ready for inference.
- processor — The model's processor (tokenizer + image processor). Includes a detokenizer attribute added by MLX-VLM for streaming decoding.

Raises
- FileNotFoundError — Config file or .safetensors weight files not found at the given path.
- ValueError — Model class or model args class cannot be found or instantiated.
Examples
quantize_activations=True is only needed for models quantized with mxfp8 or nvfp4 modes when running on CUDA. On Apple Silicon (Metal), those models work without the flag.

load_config()
Load the model configuration from a local directory or Hugging Face repo.
Signature
Parameters
Local path to a model directory or a Hugging Face repository ID. If a string is passed that is not a local path, the model is first downloaded.
Returns
The parsed config.json for the model, with any eos_token_id overrides from generation_config.json applied.

Example
prepare_inputs()
Tokenize prompts and preprocess images/audio into model-ready tensors.
Signature
Parameters
- The model processor returned by load().
- A list of image paths, URLs, or PIL.Image.Image objects.
- A list of audio file paths or URLs.
- The formatted prompt string or list of prompt strings.
- Token index used to mark image positions in the input IDs. Read from model.config.image_token_index.
- Resize all images to (height, width) before processing.
- Pass add_special_tokens to the tokenizer.
- Pad tokenized inputs to the same length when processing a batch.
- Side to apply padding ("left" or "right"). Left padding is standard for causal generation.
- When True, resize all images to a uniform size derived from the image processor configuration. Used for batched image inputs.
- Tensor format to return. Always "mlx" for MLX arrays.

Returns
A dictionary containing:
- Token IDs of shape (batch, seq_len).
- Attention mask of shape (batch, seq_len).
- Preprocessed image pixels (present when images are provided).
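The padding behavior described above can be illustrated with a stdlib-only sketch (this is not MLX-VLM code, just the left-vs-right padding semantics for a batch of token-ID lists):

```python
def pad_batch(seqs, pad_id=0, side="left"):
    """Pad lists of token IDs to a common length on the given side."""
    width = max(len(s) for s in seqs)
    padded = []
    for s in seqs:
        pad = [pad_id] * (width - len(s))
        padded.append(pad + s if side == "left" else s + pad)
    return padded

batch = [[5, 6], [7, 8, 9]]
print(pad_batch(batch, side="left"))   # [[0, 5, 6], [7, 8, 9]]
print(pad_batch(batch, side="right"))  # [[5, 6, 0], [7, 8, 9]]
```

Left padding keeps every sequence's newest tokens at the right edge, where causal generation continues, which is why it is the standard choice for batched decoding.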
process_image()
Load and optionally resize a single image.
Signature
Parameters
- An image URL, local file path, or a PIL.Image.Image object.
- Target (max_width, max_height) to resize the image to. Pass None to skip resizing.
- The model's image processor. When a custom BaseImageProcessor is provided, resizing is handled by that processor instead.

Returns
The loaded, RGB-converted, and optionally resized image.