convert()
Download a model from Hugging Face (or load from a local path), convert it to MLX .safetensors format, optionally quantize the weights, and save the result to disk. Processor files, configuration, and a model card are written alongside the weights.
Signature
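The following is a sketch of the call signature, reconstructed from the parameters documented on this page; parameter names that do not appear verbatim elsewhere on the page (e.g. `hf_path`, `q_mode`, `quant_predicate`) and all default values are assumptions to verify against the installed version:

```python
from typing import Optional

def convert(
    hf_path: str,
    mlx_path: str = "mlx_model",
    quantize: bool = False,
    q_group_size: int = 64,
    q_bits: int = 4,
    q_mode: str = "affine",
    dtype: Optional[str] = None,
    upload_repo: Optional[str] = None,
    revision: Optional[str] = None,
    dequantize: bool = False,
    trust_remote_code: bool = False,
    quant_predicate: Optional[str] = None,
) -> None:
    ...
```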
Parameters
- `hf_path`: Hugging Face repository ID (e.g. `"Qwen/Qwen2-VL-2B-Instruct"`) or local directory path containing the model.
- `mlx_path`: Destination directory for the converted MLX model. Created if it does not exist.
- `quantize`: Quantize the model weights after conversion. Use `q_bits` and `q_group_size` to control quantization granularity. Cannot be combined with `dequantize=True`.
- `q_group_size`: Group size for weight quantization. A smaller value gives finer-grained quantization at the cost of slightly larger files.
- `q_bits`: Number of bits per weight for quantization. Common values: 4 (4-bit), 8 (8-bit).
- `q_mode`: Quantization mode. One of `"affine"`, `"mxfp4"`, `"nvfp4"`, or `"mxfp8"`. `"affine"` is the standard integer quantization mode.
- `dtype`: Cast floating-point weights to this dtype before saving. One of `"float16"`, `"bfloat16"`, or `"float32"`. When `None`, the dtype is read from `config.json`'s `torch_dtype` field.
- `upload_repo`: Hugging Face repository ID to upload the converted model to (e.g. `"my-org/Qwen2-VL-2B-Instruct-4bit-mlx"`). Creates the repo if it does not exist.
- `revision`: Hugging Face revision (branch name, tag, or commit hash) to download. Defaults to `main`.
- `dequantize`: Dequantize a previously quantized model back to full precision. Cannot be combined with `quantize=True`.
- `trust_remote_code`: Allow execution of custom model code included in the repository.
- `quant_predicate`: Named mixed-bit quantization recipe. When provided, different layers receive different bit-widths. Available recipes:

| Recipe | Description |
|---|---|
| `mixed_2_6` | 2-bit low / 6-bit high |
| `mixed_3_4` | 3-bit low / 4-bit high |
| `mixed_3_5` | 3-bit low / 5-bit high |
| `mixed_3_6` | 3-bit low / 6-bit high |
| `mixed_3_8` | 3-bit low / 8-bit high |
| `mixed_4_6` | 4-bit low / 6-bit high |
| `mixed_4_8` | 4-bit low / 8-bit high |

"High" bits are applied to the `v_proj`, `down_proj`, `lm_head`, and `embed_tokens` layers and to the first and last ⅛ of the transformer layers; all other linear layers receive "low" bits. Vision and audio modules are skipped.

Output
After `convert()` completes, `mlx_path` contains:

- `model.safetensors` (or sharded `model-00001-of-NNNNN.safetensors` files)
- `model.safetensors.index.json`
- `config.json`
- Processor files (`tokenizer.json`, `preprocessor_config.json`, etc.)
- `README.md` (model card with provenance info)
Examples
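Converting and quantizing in one call; the output directory name here is hypothetical, and the call is wrapped in a function because running it downloads the model:

```python
from pathlib import Path

def convert_and_quantize() -> Path:
    """Convert Qwen2-VL to MLX format with 4-bit affine quantization."""
    from mlx_vlm import convert  # import path assumed; requires mlx-vlm installed

    out = Path("qwen2-vl-2b-instruct-4bit")  # hypothetical output directory
    convert(
        hf_path="Qwen/Qwen2-VL-2B-Instruct",
        mlx_path=str(out),
        quantize=True,
        q_bits=4,         # 4-bit weights
        q_group_size=64,  # group size for quantization
    )
    return out
```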
CLI equivalent
`convert()` is also available from the command line:
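The module entry point and flag spellings below are assumed to mirror the Python parameters; running the command downloads and converts the model:

```shell
python -m mlx_vlm.convert \
    --hf-path Qwen/Qwen2-VL-2B-Instruct \
    --mlx-path qwen2-vl-2b-instruct-4bit \
    -q --q-bits 4
```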
Vision and audio encoder layers are always skipped during quantization. Only the language model’s linear layers are quantized.
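To make the `q_bits` / `q_group_size` trade-off concrete, here is a minimal group-wise affine quantization round-trip in plain Python. This is an illustration of the technique, not mlx-vlm's implementation:

```python
from typing import List, Tuple

def quantize_group(w: List[float], bits: int) -> Tuple[List[int], float, float]:
    """Affine-quantize one group: map [min(w), max(w)] onto 0 .. 2**bits - 1."""
    zero = min(w)
    scale = (max(w) - zero) / (2**bits - 1) or 1.0  # guard constant groups
    return [round((x - zero) / scale) for x in w], scale, zero

def dequantize_group(q: List[int], scale: float, zero: float) -> List[float]:
    return [v * scale + zero for v in q]

# An 8-element weight "row" split into groups of 4 (cf. q_group_size);
# each group stores its own scale and zero-point alongside the integers,
# which is why smaller groups cost slightly more file size.
row = [0.1, -0.4, 0.25, 0.9, -1.2, 0.0, 0.7, -0.3]
group_size = 4
recon: List[float] = []
for i in range(0, len(row), group_size):
    q, scale, zero = quantize_group(row[i:i + group_size], bits=4)
    recon.extend(dequantize_group(q, scale, zero))

# Round-trip error is bounded by half a quantization step per group.
max_err = max(abs(a - b) for a, b in zip(row, recon))
```

Smaller groups shrink each group's dynamic range, so the per-step error bound tightens; more bits shrink the step itself.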