
convert()

Download a model from Hugging Face (or load from a local path), convert it to MLX .safetensors format, optionally quantize the weights, and save the result to disk. Processor files, configuration, and a model card are written alongside the weights.
from mlx_vlm.convert import convert

convert("mlx-community/Qwen2-VL-2B-Instruct", mlx_path="./qwen2-vl-mlx")

Signature

def convert(
    hf_path: str,
    mlx_path: str = "mlx_model",
    quantize: bool = False,
    q_group_size: int = 64,
    q_bits: int = 4,
    q_mode: str = "affine",
    dtype: Optional[str] = None,
    upload_repo: Optional[str] = None,
    revision: Optional[str] = None,
    dequantize: bool = False,
    trust_remote_code: bool = True,
    quant_predicate: Optional[str] = None,
):

Parameters

hf_path
str
required
Hugging Face repository ID (e.g. "Qwen/Qwen2-VL-2B-Instruct") or local directory path containing the model.
mlx_path
str
default: "mlx_model"
Destination directory for the converted MLX model. Created if it does not exist.
quantize
bool
default: False
Quantize the model weights after conversion. Use q_bits and q_group_size to control quantization granularity. Cannot be combined with dequantize=True.
q_group_size
int
default: 64
Group size for weight quantization. A smaller value gives finer-grained quantization at the cost of slightly larger files.
q_bits
int
default: 4
Number of bits per weight for quantization. Common values: 4 (4-bit), 8 (8-bit).
q_mode
str
default: "affine"
Quantization mode. One of "affine", "mxfp4", "nvfp4", or "mxfp8". "affine" is the standard integer quantization mode.
dtype
str | None
default: None
Cast floating-point weights to this dtype before saving. One of "float16", "bfloat16", or "float32". When None, the dtype is read from config.json’s torch_dtype field.
upload_repo
str | None
default: None
Hugging Face repository ID to upload the converted model to (e.g. "my-org/Qwen2-VL-2B-Instruct-4bit-mlx"). Creates the repo if it does not exist.
revision
str | None
default: None
Hugging Face revision (branch name, tag, or commit hash) to download. Defaults to main.
dequantize
bool
default: False
Dequantize a previously quantized model back to full precision. Cannot be combined with quantize=True.
trust_remote_code
bool
default: True
Allow execution of custom model code included in the repository.
quant_predicate
str | None
default: None
Named mixed-bit quantization recipe. When provided, different layers receive different bit-widths. Available recipes:
Recipe       Description
mixed_2_6    2-bit low / 6-bit high
mixed_3_4    3-bit low / 4-bit high
mixed_3_5    3-bit low / 5-bit high
mixed_3_6    3-bit low / 6-bit high
mixed_3_8    3-bit low / 8-bit high
mixed_4_6    4-bit low / 6-bit high
mixed_4_8    4-bit low / 8-bit high
“High” bits are applied to v_proj, down_proj, lm_head, and embed_tokens layers and to the first and last ⅛ of transformer layers. All other linear layers receive “low” bits. Vision and audio modules are skipped.
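The high/low assignment rule can be sketched as a small predicate. This is an illustrative reimplementation of the rule described above, not mlx_vlm's actual source; the function name and arguments are hypothetical:

```python
# Layers whose names always receive "high" bits under a mixed-bit recipe.
HIGH_BIT_NAMES = ("v_proj", "down_proj", "lm_head", "embed_tokens")

def bits_for_layer(path: str, layer_index: int, num_layers: int,
                   low_bits: int, high_bits: int) -> int:
    """Illustrative sketch of the bit-width a mixed-bit recipe assigns.

    `path` is the parameter path (e.g. "model.layers.10.self_attn.v_proj"),
    `layer_index` its transformer-block index, `num_layers` the block count.
    """
    # Named layers always get "high" bits.
    if any(name in path for name in HIGH_BIT_NAMES):
        return high_bits
    # The first and last eighth of transformer blocks get "high" bits.
    eighth = num_layers // 8
    if layer_index < eighth or layer_index >= num_layers - eighth:
        return high_bits
    # Every other linear layer gets "low" bits.
    return low_bits

# mixed_3_6 on a 32-block model:
assert bits_for_layer("model.layers.0.mlp.gate_proj", 0, 32, 3, 6) == 6
assert bits_for_layer("model.layers.16.mlp.gate_proj", 16, 32, 3, 6) == 3
assert bits_for_layer("model.layers.16.self_attn.v_proj", 16, 32, 3, 6) == 6
```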

Output

After convert() completes, mlx_path contains:
  • model.safetensors (or sharded model-00001-of-NNNNN.safetensors files)
  • model.safetensors.index.json
  • config.json
  • Processor files (tokenizer.json, preprocessor_config.json, etc.)
  • README.md (model card with provenance info)

Examples

from mlx_vlm.convert import convert

convert(
    "Qwen/Qwen2-VL-2B-Instruct",
    mlx_path="./qwen2-vl-mlx",
)
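For planning disk usage before quantizing, a back-of-envelope estimate helps: each group of q_group_size weights also stores per-group quantization parameters. The sketch below assumes the affine layout keeps a float16 scale and a float16 bias per group (an assumption about the storage format, so treat the numbers as approximate):

```python
def effective_bits_per_weight(q_bits: int, group_size: int) -> float:
    # q_bits per weight, plus 32 bits (fp16 scale + fp16 bias) shared
    # across each group of `group_size` weights.
    return q_bits + 32 / group_size

def quantized_size_gb(num_params: float, q_bits: int, group_size: int) -> float:
    """Approximate weight-file size in GB for num_params parameters."""
    return num_params * effective_bits_per_weight(q_bits, group_size) / 8 / 1e9

# A 2B-parameter model at 4-bit, group size 64: 4.5 effective bits/weight.
assert effective_bits_per_weight(4, 64) == 4.5
assert quantized_size_gb(2e9, 4, 64) == 1.125  # roughly 1.1 GB of weights
```

Shrinking q_group_size from 64 to 32 raises the overhead from 0.5 to 1 bit per weight, which is why finer-grained quantization produces slightly larger files.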

CLI equivalent

convert() is also available from the command line:
# Basic conversion
python -m mlx_vlm convert --hf-path Qwen/Qwen2-VL-2B-Instruct --mlx-path ./qwen2-vl-mlx

# 4-bit quantization
python -m mlx_vlm convert \
  --hf-path Qwen/Qwen2-VL-2B-Instruct \
  --mlx-path ./qwen2-vl-4bit \
  -q --q-bits 4 --q-group-size 64

# Mixed-bit recipe
python -m mlx_vlm convert \
  --hf-path Qwen/Qwen2-VL-7B-Instruct \
  --mlx-path ./qwen2-vl-7b-mixed \
  -q --quant-predicate mixed_3_6

# Upload to Hub after conversion
python -m mlx_vlm convert \
  --hf-path Qwen/Qwen2-VL-2B-Instruct \
  --mlx-path ./qwen2-vl-4bit \
  -q --upload-repo my-org/Qwen2-VL-2B-Instruct-4bit-mlx
Vision and audio encoder layers are always skipped during quantization. Only the language model’s linear layers are quantized.
quantize=True and dequantize=True are mutually exclusive. Passing both raises a ValueError.
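The mutual-exclusion check behaves like this sketch (illustrative, not the library's source; the function name is hypothetical):

```python
def check_quant_flags(quantize: bool, dequantize: bool) -> None:
    # Mirrors convert()'s documented behavior: the two flags are
    # mutually exclusive.
    if quantize and dequantize:
        raise ValueError("quantize and dequantize are mutually exclusive")

check_quant_flags(quantize=True, dequantize=False)   # fine
check_quant_flags(quantize=False, dequantize=True)   # fine
```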
