float32 or bfloat16) to lower bit widths such as 4-bit or 8-bit integers. This reduces memory usage and increases inference throughput, often with minimal impact on output quality.
MLX-VLM supports two types of quantization:
- Weight quantization — applied at conversion time, stored in the model weights on disk.
- Activation quantization — applied at inference time, required for certain quantized models running on NVIDIA GPUs.
Activation quantization
Activation quantization is required when running models that were quantized with `mxfp8` or `nvfp4` modes on NVIDIA GPUs via MLX CUDA. It converts `QuantizedLinear` layers to `QQLinear` layers, which quantize activations in addition to weights during the forward pass.
On Apple Silicon (Metal), models quantized with `mxfp8` or `nvfp4` run correctly without enabling activation quantization. This flag is only needed for NVIDIA GPUs with MLX CUDA.
Supported modes
| Mode | Description |
|---|---|
| `mxfp8` | 8-bit MX floating point |
| `nvfp4` | 4-bit NVIDIA floating point |
Using the CLI
Pass the `-qa` or `--quantize-activations` flag to `mlx_vlm.generate`:
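A minimal sketch of such an invocation; the model path, prompt, and image are placeholders, and the other flags shown are the standard `mlx_vlm.generate` options:

```shell
# -qa enables activation quantization, required for mxfp8/nvfp4 models
# on NVIDIA GPUs with MLX CUDA. Model path and inputs are placeholders.
python -m mlx_vlm.generate \
  --model <path-to-mxfp8-or-nvfp4-model> \
  --prompt "Describe this image." \
  --image example.jpg \
  -qa
```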
Using the Python API
Pass `quantize_activations=True` to the `load` function:
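A minimal sketch, assuming `load` is imported from the `mlx_vlm` package and the model path points to an `mxfp8`- or `nvfp4`-quantized checkpoint (the path below is a placeholder):

```python
from mlx_vlm import load

# quantize_activations=True converts QuantizedLinear layers to QQLinear,
# so activations are quantized alongside weights in the forward pass.
# The model path is a placeholder, not a real checkpoint.
model, processor = load(
    "<path-to-mxfp8-or-nvfp4-model>",
    quantize_activations=True,
)
```

Running this requires the quantized model weights and, for `quantize_activations` to matter, an NVIDIA GPU with MLX CUDA; on Apple Silicon the flag can be left off.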