
convert()

Download a model from Hugging Face (or load from a local path), convert it to MLX .safetensors format, optionally quantize the weights, and save the result to disk. Processor files, configuration, and a model card are written alongside the weights.
from mlx_vlm.convert import convert

convert("mlx-community/Qwen2-VL-2B-Instruct", mlx_path="./qwen2-vl-mlx")

Signature

def convert(
    hf_path: str,
    mlx_path: str = "mlx_model",
    quantize: bool = False,
    q_group_size: int = 64,
    q_bits: int = 4,
    q_mode: str = "affine",
    dtype: Optional[str] = None,
    upload_repo: Optional[str] = None,
    revision: Optional[str] = None,
    dequantize: bool = False,
    trust_remote_code: bool = True,
    quant_predicate: Optional[str] = None,
):

Parameters

hf_path
str
required
Hugging Face repository ID (e.g. "Qwen/Qwen2-VL-2B-Instruct") or local directory path containing the model.
mlx_path
str
default: "mlx_model"
Destination directory for the converted MLX model. Created if it does not exist.
quantize
bool
default: False
Quantize the model weights after conversion. Use q_bits and q_group_size to control quantization granularity. Cannot be combined with dequantize=True.
q_group_size
int
default: 64
Group size for weight quantization. A smaller value gives finer-grained quantization at the cost of slightly larger files.
q_bits
int
default: 4
Number of bits per weight for quantization. Common values: 4 (4-bit), 8 (8-bit).
q_mode
str
default: "affine"
Quantization mode. One of "affine", "mxfp4", "nvfp4", or "mxfp8". "affine" is the standard integer quantization mode.
dtype
str | None
default: None
Cast floating-point weights to this dtype before saving. One of "float16", "bfloat16", or "float32". When None, the dtype is read from config.json’s torch_dtype field.
upload_repo
str | None
default: None
Hugging Face repository ID to upload the converted model to (e.g. "my-org/Qwen2-VL-2B-Instruct-4bit-mlx"). Creates the repo if it does not exist.
revision
str | None
default: None
Hugging Face revision (branch name, tag, or commit hash) to download. Defaults to main.
dequantize
bool
default: False
Dequantize a previously quantized model back to full precision. Cannot be combined with quantize=True.
trust_remote_code
bool
default: True
Allow execution of custom model code included in the repository.
quant_predicate
str | None
default: None
Named mixed-bit quantization recipe. When provided, different layers receive different bit-widths. Available recipes:
Recipe       Description
mixed_2_6    2-bit low / 6-bit high
mixed_3_4    3-bit low / 4-bit high
mixed_3_5    3-bit low / 5-bit high
mixed_3_6    3-bit low / 6-bit high
mixed_3_8    3-bit low / 8-bit high
mixed_4_6    4-bit low / 6-bit high
mixed_4_8    4-bit low / 8-bit high
“High” bits are applied to v_proj, down_proj, lm_head, and embed_tokens layers and to the first and last ⅛ of transformer layers. All other linear layers receive “low” bits. Vision and audio modules are skipped.
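The high/low assignment rule can be sketched as a small predicate. This is an illustrative reimplementation of the rule described above, not mlx_vlm's actual source; the function name and arguments are hypothetical:

```python
# Layers whose names always receive "high" bits under a mixed-bit recipe.
HIGH_BIT_NAMES = ("v_proj", "down_proj", "lm_head", "embed_tokens")

def bits_for_layer(path: str, layer_index: int, num_layers: int,
                   low_bits: int, high_bits: int) -> int:
    """Illustrative sketch of the bit-width a mixed-bit recipe assigns.

    `path` is the parameter path (e.g. "model.layers.10.self_attn.v_proj"),
    `layer_index` its transformer-block index, `num_layers` the block count.
    """
    # Named layers always get "high" bits.
    if any(name in path for name in HIGH_BIT_NAMES):
        return high_bits
    # The first and last eighth of transformer blocks get "high" bits.
    eighth = num_layers // 8
    if layer_index < eighth or layer_index >= num_layers - eighth:
        return high_bits
    # Every other linear layer gets "low" bits.
    return low_bits

# mixed_3_6 on a 32-block model:
assert bits_for_layer("model.layers.0.mlp.gate_proj", 0, 32, 3, 6) == 6
assert bits_for_layer("model.layers.16.mlp.gate_proj", 16, 32, 3, 6) == 3
assert bits_for_layer("model.layers.16.self_attn.v_proj", 16, 32, 3, 6) == 6
```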

Output

After convert() completes, mlx_path contains:
  • model.safetensors (or sharded model-00001-of-NNNNN.safetensors files)
  • model.safetensors.index.json
  • config.json
  • Processor files (tokenizer.json, preprocessor_config.json, etc.)
  • README.md (model card with provenance info)

Examples

from mlx_vlm.convert import convert

convert(
    "Qwen/Qwen2-VL-2B-Instruct",
    mlx_path="./qwen2-vl-mlx",
)
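For planning disk usage before quantizing, a back-of-envelope estimate helps: each group of q_group_size weights also stores per-group quantization parameters. The sketch below assumes the affine layout keeps a float16 scale and a float16 bias per group (an assumption about the storage format, so treat the numbers as approximate):

```python
def effective_bits_per_weight(q_bits: int, group_size: int) -> float:
    # q_bits per weight, plus 32 bits (fp16 scale + fp16 bias) shared
    # across each group of `group_size` weights.
    return q_bits + 32 / group_size

def quantized_size_gb(num_params: float, q_bits: int, group_size: int) -> float:
    """Approximate weight-file size in GB for num_params parameters."""
    return num_params * effective_bits_per_weight(q_bits, group_size) / 8 / 1e9

# A 2B-parameter model at 4-bit, group size 64: 4.5 effective bits/weight.
assert effective_bits_per_weight(4, 64) == 4.5
assert quantized_size_gb(2e9, 4, 64) == 1.125  # roughly 1.1 GB of weights
```

Shrinking q_group_size from 64 to 32 raises the overhead from 0.5 to 1 bit per weight, which is why finer-grained quantization produces slightly larger files.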

CLI equivalent

convert() is also available from the command line:
# Basic conversion
python -m mlx_vlm convert --hf-path Qwen/Qwen2-VL-2B-Instruct --mlx-path ./qwen2-vl-mlx

# 4-bit quantization
python -m mlx_vlm convert \
  --hf-path Qwen/Qwen2-VL-2B-Instruct \
  --mlx-path ./qwen2-vl-4bit \
  -q --q-bits 4 --q-group-size 64

# Mixed-bit recipe
python -m mlx_vlm convert \
  --hf-path Qwen/Qwen2-VL-7B-Instruct \
  --mlx-path ./qwen2-vl-7b-mixed \
  -q --quant-predicate mixed_3_6

# Upload to Hub after conversion
python -m mlx_vlm convert \
  --hf-path Qwen/Qwen2-VL-2B-Instruct \
  --mlx-path ./qwen2-vl-4bit \
  -q --upload-repo my-org/Qwen2-VL-2B-Instruct-4bit-mlx
Vision and audio encoder layers are always skipped during quantization. Only the language model’s linear layers are quantized.
quantize=True and dequantize=True are mutually exclusive. Passing both raises a ValueError.
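The mutual-exclusion check behaves like this sketch (illustrative, not the library's source; the function name is hypothetical):

```python
def check_quant_flags(quantize: bool, dequantize: bool) -> None:
    # Mirrors convert()'s documented behavior: the two flags are
    # mutually exclusive.
    if quantize and dequantize:
        raise ValueError("quantize and dequantize are mutually exclusive")

check_quant_flags(quantize=True, dequantize=False)   # fine
check_quant_flags(quantize=False, dequantize=True)   # fine
```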
