Fine-tuning lets you specialize a pre-trained VLM on your own data — teaching it domain-specific vocabulary, image types, output formats, or reasoning patterns that a general-purpose model wouldn’t handle well out of the box. MLX-VLM provides a lora.py script backed by the MLX trainer, making fine-tuning fast and memory-efficient on Apple Silicon Macs.

Approaches

LoRA (Low-Rank Adaptation)

LoRA freezes the original model weights and injects small trainable rank-decomposition matrices into the attention layers. You train only these adapter parameters — typically less than 1% of total weights — and save the result as a lightweight .safetensors adapter file. This is the default and recommended approach for most use cases.
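The idea can be sketched in a few lines. This is an illustrative NumPy toy, not the mlx-vlm implementation: the frozen weight `W` is combined with a trainable low-rank update `B @ A`, scaled by `alpha / r`, and `B` starts at zero so training begins from the unmodified model.

```python
import numpy as np

class LoRALinear:
    """Toy LoRA-augmented linear layer (illustrative only)."""

    def __init__(self, W, r=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                       # frozen pre-trained weight, shape (out, in)
        out_dim, in_dim = W.shape
        self.A = rng.normal(0, 0.01, size=(r, in_dim))   # trainable, rank r
        self.B = np.zeros((out_dim, r))                  # trainable, zero-initialized
        self.scale = alpha / r

    def __call__(self, x):
        # Frozen projection plus the scaled low-rank correction.
        return x @ self.W.T + self.scale * (x @ self.A.T @ self.B.T)

    def trainable_fraction(self):
        lora = self.A.size + self.B.size
        return lora / (self.W.size + lora)

W = np.zeros((64, 64))
layer = LoRALinear(W, r=4)
x = np.ones((2, 64))
y = layer(x)   # B is zero, so the output still matches the frozen layer
```

Only `A` and `B` receive gradients; for realistic layer sizes their combined parameter count is a small fraction of `W`, which is what makes the saved adapter file lightweight.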

QLoRA (Quantized LoRA)

QLoRA combines LoRA with a quantized base model (e.g., a 4-bit checkpoint). The base model weights stay quantized and frozen; the LoRA adapters are trained in full precision. This significantly reduces memory usage, making it practical to fine-tune larger models on Mac hardware with limited unified memory. To use QLoRA, point --model-path at a quantized model checkpoint and run the training script normally:
python lora.py \
    --model-path mlx-community/Qwen3-VL-2B-Instruct-4bit \
    --dataset your-dataset-id \
    --batch-size 4 \
    --epochs 2
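To see why quantization helps, a back-of-the-envelope estimate of weight storage for a 2B-parameter base model (illustrative arithmetic only; it ignores quantization scales, activations, optimizer state, and KV caches):

```python
# Approximate weight-storage cost at two precisions for 2e9 parameters.
params = 2e9
fp16_gb = params * 2 / 1024**3    # fp16: 2 bytes per parameter
q4_gb = params * 0.5 / 1024**3    # 4-bit: ~0.5 bytes per parameter
print(f"fp16 weights: ~{fp16_gb:.1f} GB, 4-bit weights: ~{q4_gb:.1f} GB")
```

The roughly 4x reduction in frozen-weight memory is what frees up room to train the full-precision LoRA adapters on machines with limited unified memory.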

Full fine-tuning

Pass --full-finetune to update all model weights instead of inserting LoRA adapters. This requires substantially more memory and is slower, but gives the model maximum capacity to adapt. You can optionally add --train-vision to also update the vision encoder weights.
Full fine-tuning with large models requires significant unified memory. Enable gradient checkpointing (--grad-checkpoint) and use a batch size of 1 to reduce peak memory usage.
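Putting those flags together, a memory-conscious full fine-tuning run might look like the following (the flag names match those described above; the model and dataset IDs are placeholders):

```shell
python lora.py \
    --model-path your-model-id \
    --dataset your-dataset-id \
    --full-finetune \
    --train-vision \
    --grad-checkpoint \
    --batch-size 1 \
    --epochs 1
```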

MLX trainer backend

The training script uses the MLX trainer, which provides:
  • Efficient execution on Apple Silicon — MLX runs operations on the M-series GPU via Metal, with unified memory shared between CPU and GPU.
  • Automatic mixed precision — reduces memory footprint without sacrificing training quality.
  • Gradient checkpointing — recomputes activations during the backward pass to trade compute for memory; enable with --grad-checkpoint.
  • Gradient accumulation — simulate larger batch sizes by accumulating gradients over multiple steps before updating weights; set with --gradient-accumulation-steps.
  • Hugging Face dataset integration — load any dataset directly by its Hub identifier or local path.
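Gradient accumulation in particular is easy to misread, so here is a minimal sketch of the mechanism (an illustrative NumPy toy, not the MLX trainer): gradients from several micro-batches are summed, and one averaged optimizer step is applied, emulating a larger effective batch size.

```python
import numpy as np

def sgd_with_accumulation(w, micro_batches, grad_fn, lr=0.1, accum_steps=4):
    """Apply one SGD update per `accum_steps` micro-batches."""
    acc = np.zeros_like(w)
    for step, batch in enumerate(micro_batches, start=1):
        acc += grad_fn(w, batch)                  # accumulate, no update yet
        if step % accum_steps == 0:
            w = w - lr * acc / accum_steps        # average over accumulated steps
            acc = np.zeros_like(w)
    return w

# Toy quadratic loss L = (w - target)^2 with gradient 2 * (w - target).
grad = lambda w, b: 2 * (w - np.mean(b))
w0 = np.array(10.0)
batches = [np.array([0.0])] * 8                   # 8 micro-batches -> 2 updates
w_final = sgd_with_accumulation(w0, batches, grad)
```

Peak memory stays at the micro-batch size, while the update direction reflects all accumulated micro-batches.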

Supported models

Fine-tuning is supported for all models except Gemma3n and Qwen3 Omni. This includes:
  • Qwen2-VL, Qwen2.5-VL, Qwen3-VL
  • LLaVA and LLaVA-Next variants
  • Deepseek-VL and Deepseek-VL-V2
  • Mllama (Llama-3.2-Vision)
  • Pixtral
  • Idefics3
  • SmolVLM

Requirements

  • Python 3.9+
  • mlx-vlm
  • mlx
  • numpy
  • transformers
  • datasets
  • PIL (Pillow)
Install all dependencies with:
pip install -U mlx-vlm

Next steps

LoRA & QLoRA training

Complete CLI reference, training examples, and Python API for running LoRA and QLoRA jobs.

Dataset preparation

Required dataset format, per-model message structures, and how to build a dataset programmatically.
