float32 or bfloat16) to lower bit widths such as 4-bit or 8-bit integers. This reduces memory usage and increases inference throughput, often with minimal impact on output quality.
MLX-VLM supports two types of quantization:
- Weight quantization — applied at conversion time, stored in the model weights on disk.
- Activation quantization — applied at inference time, required for certain quantized models running on NVIDIA GPUs.
Activation quantization
Activation quantization is required when running models that were quantized with `mxfp8` or `nvfp4` modes on NVIDIA GPUs via MLX CUDA. It converts `QuantizedLinear` layers to `QQLinear` layers, which quantize activations in addition to weights during the forward pass.
On Apple Silicon (Metal), models quantized with `mxfp8` or `nvfp4` run correctly without enabling activation quantization. This flag is only needed for NVIDIA GPUs with MLX CUDA.
Supported modes
| Mode | Description |
|---|---|
| `mxfp8` | 8-bit MX floating point |
| `nvfp4` | 4-bit NVIDIA floating point |
Using the CLI
Pass the `-qa` or `--quantize-activations` flag to `mlx_vlm.generate`:
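A minimal sketch of such an invocation; the model path, prompt, and image are placeholders, and the other flags shown are the standard `mlx_vlm.generate` options:

```shell
# -qa enables activation quantization, required for mxfp8/nvfp4 models
# on NVIDIA GPUs with MLX CUDA. Model path and inputs are placeholders.
python -m mlx_vlm.generate \
  --model <path-to-mxfp8-or-nvfp4-model> \
  --prompt "Describe this image." \
  --image example.jpg \
  -qa
```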
Using the Python API
Pass `quantize_activations=True` to the `load` function:
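A minimal sketch, assuming `load` is imported from the `mlx_vlm` package and the model path points to an `mxfp8`- or `nvfp4`-quantized checkpoint (the path below is a placeholder):

```python
from mlx_vlm import load

# quantize_activations=True converts QuantizedLinear layers to QQLinear,
# so activations are quantized alongside weights in the forward pass.
# The model path is a placeholder, not a real checkpoint.
model, processor = load(
    "<path-to-mxfp8-or-nvfp4-model>",
    quantize_activations=True,
)
```

Running this requires the quantized model weights and, for `quantize_activations` to matter, an NVIDIA GPU with MLX CUDA; on Apple Silicon the flag can be left off.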