Models are converted with the `mlx_vlm.convert` command. Conversion downloads the model weights, casts them to the target dtype, optionally quantizes them, and writes the result to a local directory.
## Basic conversion
Convert the model and write the result to `./pixtral-12b-mlx`.
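A minimal invocation might look like the following (the Pixtral repo ID here is illustrative; substitute the source model you want to convert):

```shell
mlx_vlm.convert --hf-path mistral-community/pixtral-12b \
    --mlx-path ./pixtral-12b-mlx
```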
## CLI reference
| Flag | Type | Default | Description |
|---|---|---|---|
| `--hf-path`, `--model` | string | — | Hugging Face repo ID or local path to the source model |
| `--mlx-path` | string | `mlx_model` | Directory to write the converted MLX model |
| `-q`, `--quantize` | flag | false | Quantize the model weights |
| `--q-bits` | int | 4 | Bits per weight for quantization |
| `--q-group-size` | int | 64 | Group size for quantization |
| `--q-mode` | string | `affine` | Quantization mode: `affine`, `mxfp4`, `nvfp4`, `mxfp8` |
| `--quant-predicate` | string | — | Mixed-bit quantization recipe (see Mixed quantization) |
| `--dtype` | string | from config | Cast weights to `float16`, `bfloat16`, or `float32` |
| `--upload-repo` | string | — | Hugging Face repo to upload the converted model to |
| `--revision` | string | — | Branch, tag, or commit to use from the Hugging Face Hub |
| `-d`, `--dequantize` | flag | false | Dequantize a previously quantized model |
| `--trust-remote-code` | flag | false | Allow running custom model code from the repository |
`--quantize` and `--dequantize` are mutually exclusive; passing both at once raises an error.

## Common examples
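A few typical invocations, combining the flags from the table above (the Pixtral repo ID is illustrative):

```shell
# 4-bit affine quantization (the defaults when -q is passed)
mlx_vlm.convert --hf-path mistral-community/pixtral-12b \
    --mlx-path ./pixtral-12b-4bit -q

# Mixed 3/6-bit quantization
mlx_vlm.convert --hf-path mistral-community/pixtral-12b \
    --mlx-path ./pixtral-12b-mixed -q --quant-predicate mixed_3_6

# Cast to bfloat16 without quantizing
mlx_vlm.convert --hf-path mistral-community/pixtral-12b \
    --mlx-path ./pixtral-12b-bf16 --dtype bfloat16
```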
## Python API

You can also run conversion from Python by calling `convert()`.
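A sketch of the Python call, assuming the keyword arguments mirror the CLI flags (check the signature in your installed version; the repo ID is illustrative):

```python
from mlx_vlm import convert

# Convert and 4-bit-quantize the model, writing the result locally.
convert(
    "mistral-community/pixtral-12b",
    mlx_path="./pixtral-12b-mlx",
    quantize=True,
    q_bits=4,
    q_group_size=64,
)
```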
## Mixed quantization
Mixed quantization assigns different bit widths to different layers of the model. Layers near the input and output, where precision matters most, receive more bits; middle layers receive fewer. This follows the same strategy as formats like `Q4_K_M` in llama.cpp.
The `--quant-predicate` flag accepts one of the following recipes:
| Recipe | Low bits | High bits |
|---|---|---|
| `mixed_2_6` | 2 | 6 |
| `mixed_3_4` | 3 | 4 |
| `mixed_3_5` | 3 | 5 |
| `mixed_3_6` | 3 | 6 |
| `mixed_3_8` | 3 | 8 |
| `mixed_4_6` | 4 | 6 |
| `mixed_4_8` | 4 | 8 |
The high-bit setting applies to `v_proj` and `down_proj` layers in the first and last eighth of the model, as well as to `lm_head` and `embed_tokens`. All other quantizable layers use the low-bit setting.
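The recipe logic can be sketched in plain Python. This is an illustration of the scheme described above (shown for `mixed_4_6`), not the library's exact implementation:

```python
# Layer types that get the high-bit setting near the model's edges.
HIGH_BIT_LAYERS = ("v_proj", "down_proj")

def mixed_quant_predicate(path, layer_index, num_layers, low_bits=4, high_bits=6):
    """Return the bit width for the weight at `path`.

    High bits go to lm_head, embed_tokens, and v_proj/down_proj layers in the
    first and last eighth of the model; everything else gets low bits.
    """
    if "lm_head" in path or "embed_tokens" in path:
        return high_bits
    eighth = num_layers / 8
    in_edge = layer_index < eighth or layer_index >= num_layers - eighth
    if in_edge and any(name in path for name in HIGH_BIT_LAYERS):
        return high_bits
    return low_bits

print(mixed_quant_predicate("model.layers.0.self_attn.v_proj", 0, 32))    # edge layer -> 6
print(mixed_quant_predicate("model.layers.16.self_attn.v_proj", 16, 32))  # middle layer -> 4
```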
By default, the vision encoder is excluded from quantization. The `skip_multimodal_module` predicate skips any path containing `vision_model`, `vision_tower`, `vl_connector`, `audio_model`, or `audio_tower`.

## Uploading to Hugging Face Hub
After conversion, you can push the model directly to your Hugging Face account. The `--upload-repo` value should be the target Hugging Face repo in `owner/name` format. The CLI uploads all files in `--mlx-path` to that repository after conversion completes.
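For example, converting with quantization and uploading in one step (both the source repo ID and the `your-username/pixtral-12b-4bit` target are hypothetical placeholders):

```shell
mlx_vlm.convert --hf-path mistral-community/pixtral-12b \
    --mlx-path ./pixtral-12b-4bit -q \
    --upload-repo your-username/pixtral-12b-4bit
```

Uploading requires that you are logged in to the Hub (for example via `huggingface-cli login`) with write access to the target repository.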