MLX-VLM works with models stored in MLX format. To use a model from Hugging Face that hasn’t been converted yet, run the mlx_vlm.convert command. Conversion downloads the model weights, casts them to the target dtype, optionally quantizes them, and writes the result to a local directory.
The mlx-community organization on Hugging Face hosts many pre-converted models. Check there before converting a model yourself.

Basic conversion

1. Install mlx-vlm

pip install -U mlx-vlm
2. Convert the model

mlx_vlm.convert --hf-path mistral-community/pixtral-12b --mlx-path ./pixtral-12b-mlx
This downloads the model from Hugging Face and saves the converted weights to ./pixtral-12b-mlx.
3. Use the converted model

mlx_vlm.generate \
  --model ./pixtral-12b-mlx \
  --prompt "What is in this image?" \
  --image /path/to/image.jpg

CLI reference

mlx_vlm.convert [OPTIONS]
| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| `--hf-path`, `--model` | string | (required) | Hugging Face repo ID or local path to the source model |
| `--mlx-path` | string | `mlx_model` | Directory to write the converted MLX model |
| `-q`, `--quantize` | flag | false | Quantize the model weights |
| `--q-bits` | int | 4 | Bits per weight for quantization |
| `--q-group-size` | int | 64 | Group size for quantization |
| `--q-mode` | string | `affine` | Quantization mode: `affine`, `mxfp4`, `nvfp4`, `mxfp8` |
| `--quant-predicate` | string | (none) | Mixed-bit quantization recipe (see Mixed quantization) |
| `--dtype` | string | from config | Cast weights to `float16`, `bfloat16`, or `float32` |
| `--upload-repo` | string | (none) | Hugging Face repo to upload the converted model to |
| `--revision` | string | (none) | Branch, tag, or commit to use from the Hugging Face Hub |
| `-d`, `--dequantize` | flag | false | Dequantize a previously quantized model |
| `--trust-remote-code` | flag | false | Allow running custom model code from the repository |
--quantize and --dequantize are mutually exclusive. Using both at once raises an error.
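To build intuition for what the default `affine` mode with a group size does, here is a minimal sketch of group-wise affine quantization: each group of weights shares a scale and a minimum (zero-point), and each weight is stored as a small integer. This is an illustration of the general technique only, not MLX's actual kernel or storage layout.

```python
import numpy as np

def affine_quantize(w, bits=4, group_size=64):
    """Illustrative group-wise affine quantization: each group of
    `group_size` weights shares a scale and a minimum, and each weight
    is stored as an integer in [0, 2**bits - 1]."""
    w = w.reshape(-1, group_size)
    lo = w.min(axis=1, keepdims=True)
    hi = w.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2**bits - 1)
    scale = np.where(scale == 0, 1.0, scale)  # guard all-constant groups
    q = np.round((w - lo) / scale).astype(np.int64)
    return q, scale, lo

def affine_dequantize(q, scale, lo):
    """Reconstruct approximate weights from the stored integers."""
    return q * scale + lo

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 64)).astype(np.float32)
q, scale, lo = affine_quantize(w, bits=4, group_size=64)
w_hat = affine_dequantize(q, scale, lo).reshape(w.shape)
max_err = float(np.abs(w - w_hat).max())
```

Round-to-nearest bounds the per-weight error by half a scale step, which is why smaller groups (or more bits) trade storage for accuracy.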

Common examples

mlx_vlm.convert \
  --hf-path mlx-community/Qwen2-VL-7B-Instruct \
  --mlx-path ./qwen2-vl-7b-4bit \
  --quantize \
  --q-bits 4 \
  --q-group-size 64
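A quick back-of-the-envelope for the resulting model size: each weight costs `q_bits`, plus a per-group scale and zero-point amortized across the group. The 16-bit-per-parameter metadata figure below is an assumption for illustration, not something stated on this page.

```python
def quantized_bits_per_weight(q_bits=4, group_size=64, meta_bits=16):
    """Effective storage per weight: the q_bits payload plus one scale
    and one zero-point (assumed meta_bits each) shared by the group."""
    return q_bits + 2 * meta_bits / group_size

params = 7e9  # roughly 7B parameters, e.g. a Qwen2-VL-7B-class model
fp16_gb = params * 16 / 8 / 1e9
q4_gb = params * quantized_bits_per_weight(4, 64) / 8 / 1e9
# 4-bit with group size 64 costs about 4.5 effective bits per weight,
# so the 4-bit model is roughly 3.6x smaller than fp16.
```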

Python API

You can also run conversion from Python:
from mlx_vlm import convert

convert(
    hf_path="mlx-community/Qwen2-VL-7B-Instruct",
    mlx_path="./qwen2-vl-7b-4bit",
    quantize=True,
    q_bits=4,
    q_group_size=64,
)
All parameters from the CLI map directly to keyword arguments in convert().
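The flag-to-keyword mapping follows the usual convention: drop the leading dashes and turn the remaining dashes into underscores. A one-line sketch of that convention (illustrative, not a function exported by mlx-vlm):

```python
def flag_to_kwarg(flag: str) -> str:
    """Map a CLI flag to its Python keyword argument name by stripping
    leading dashes and replacing interior dashes with underscores."""
    return flag.lstrip("-").replace("-", "_")

# e.g. "--q-bits" becomes q_bits, "--hf-path" becomes hf_path
```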

Mixed quantization

Mixed quantization assigns different bit widths to different layers in the model. Layers near the input and output (where precision matters most) receive more bits; middle layers receive fewer. This follows the same strategy as formats like Q4_K_M in llama.cpp. The --quant-predicate flag accepts one of the following recipes:
| Recipe | Low bits | High bits |
| --- | --- | --- |
| `mixed_2_6` | 2 | 6 |
| `mixed_3_4` | 3 | 4 |
| `mixed_3_5` | 3 | 5 |
| `mixed_3_6` | 3 | 6 |
| `mixed_3_8` | 3 | 8 |
| `mixed_4_6` | 4 | 6 |
| `mixed_4_8` | 4 | 8 |
The high-bit setting applies to v_proj and down_proj layers in the first and last eighth of the model, as well as lm_head and embed_tokens. All other quantizable layers use the low-bit setting.
By default, the vision encoder is excluded from quantization. The skip_multimodal_module predicate skips any path containing vision_model, vision_tower, vl_connector, audio_model, or audio_tower.
mlx_vlm.convert \
  --hf-path mlx-community/Qwen2-VL-7B-Instruct \
  --mlx-path ./qwen2-vl-7b-mixed-4-8 \
  --quant-predicate mixed_4_8
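The recipe and skip rules described above can be sketched as plain predicates. The function names and signatures here are illustrative, not the library's actual internals:

```python
def mixed_bits(path: str, layer_idx: int, num_layers: int,
               low: int = 4, high: int = 8) -> int:
    """Sketch of a mixed_4_8-style recipe: high bits for v_proj and
    down_proj layers in the first and last eighth of the model, and for
    lm_head and embed_tokens; low bits everywhere else."""
    if any(name in path for name in ("lm_head", "embed_tokens")):
        return high
    eighth = num_layers / 8
    edge = layer_idx < eighth or layer_idx >= num_layers - eighth
    if edge and any(name in path for name in ("v_proj", "down_proj")):
        return high
    return low

def skip_multimodal_module(path: str) -> bool:
    """Exclude vision and audio submodules from quantization."""
    skipped = ("vision_model", "vision_tower", "vl_connector",
               "audio_model", "audio_tower")
    return any(name in path for name in skipped)
```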

Uploading to Hugging Face Hub

After conversion, you can push the model directly to your Hugging Face account:
mlx_vlm.convert \
  --hf-path mlx-community/Qwen2-VL-7B-Instruct \
  --mlx-path ./qwen2-vl-7b-4bit \
  --quantize \
  --q-bits 4 \
  --upload-repo your-username/Qwen2-VL-7B-Instruct-4bit-mlx
The --upload-repo value should be the target Hugging Face repo in owner/name format. The CLI will upload all files in --mlx-path to that repository after conversion completes.
