LoRA (Low-Rank Adaptation) injects small trainable matrices into the model’s attention layers while keeping the original weights frozen. Because only the adapter parameters are updated, training requires far less memory and time than full fine-tuning, and the resulting adapter file is a fraction of the full model size. QLoRA extends this by loading the base model in a quantized format (e.g., 4-bit), keeping it frozen in low precision while training the LoRA adapters in full precision — trading a small amount of accuracy for a large reduction in memory usage.
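The low-rank update described above can be sketched in a few lines. This is an illustration only, not mlx_vlm's internals: a frozen weight W is augmented with trainable factors A and B scaled by alpha / rank (matching the --lora-alpha and --lora-rank defaults below).

```python
# Minimal LoRA sketch (illustration, not the mlx_vlm implementation):
#   W_effective = W + (alpha / r) * (B @ A)
import numpy as np

d_out, d_in, r, alpha = 64, 64, 8, 16      # r, alpha match the script defaults
W = np.random.randn(d_out, d_in)           # frozen base weight (never updated)
A = np.random.randn(r, d_in) * 0.01        # trainable, small random init
B = np.zeros((d_out, r))                   # trainable, zero init so training starts at W

def lora_forward(x):
    # Only A and B receive gradients; W stays frozen.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = np.random.randn(2, d_in)
y = lora_forward(x)
# With B zero-initialized, the adapter contributes nothing at step 0:
assert np.allclose(y, x @ W.T)
```

Because only A and B (r × d_in + d_out × r parameters per layer) are trained, the optimizer state and gradients are a small fraction of a full fine-tune's.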

Basic usage

1. Prepare your dataset

Create or identify a Hugging Face dataset with images and messages columns in the format your target model expects. See Dataset preparation for details.

2. Run the training script

Call lora.py with at minimum --model-path and --dataset:

python lora.py \
    --model-path mlx-community/Qwen3-VL-2B-Instruct-bf16 \
    --dataset your-huggingface-dataset-id

3. Use your adapter

After training, load the saved adapter alongside the base model for inference:

mlx_vlm.generate \
    --model mlx-community/Qwen3-VL-2B-Instruct-bf16 \
    --adapter-path ./adapters.safetensors \
    --prompt "Describe this image" \
    --image /path/to/image.jpg

CLI reference

Model arguments

| Argument | Default | Description |
| --- | --- | --- |
| `--model-path` | `mlx-community/Qwen2-VL-2B-Instruct-bf16` | Path or Hub ID of the base model to fine-tune. |
| `--full-finetune` | off | Update all model weights instead of using LoRA adapters. |
| `--train-vision` | off | Unfreeze and train the vision encoder alongside the language model. |

Dataset arguments

| Argument | Default | Description |
| --- | --- | --- |
| `--dataset` | (required) | Local path or Hugging Face dataset identifier. |
| `--split` | `train` | Dataset split to use. |
| `--dataset-config` | (none) | Dataset configuration name (for datasets with multiple configs). |
| `--image-resize-shape` | (none) | Resize all images to a fixed shape, e.g. `768 768`. |
| `--custom-prompt-format` | (none) | JSON template for datasets with question/answer columns instead of `messages`. |

Training arguments

| Argument | Default | Description |
| --- | --- | --- |
| `--learning-rate` | `2e-5` | Optimizer learning rate. |
| `--batch-size` | `4` | Number of samples per training step. |
| `--iters` | `1000` | Total training iterations. Ignored if `--epochs` is set. |
| `--epochs` | (none) | Number of full passes over the dataset. Overrides `--iters`. |
| `--steps-per-report` | `10` | Log loss and throughput every N steps. |
| `--steps-per-eval` | `200` | Run validation every N steps. |
| `--steps-per-save` | `100` | Save a checkpoint every N steps. |
| `--val-batches` | `25` | Number of batches used for each validation run. |
| `--max-seq-length` | `2048` | Maximum token sequence length; longer sequences are truncated. |
| `--grad-checkpoint` | off | Enable gradient checkpointing to reduce peak memory (slightly slower). |
| `--grad-clip` | (none) | Clip gradients to this maximum norm. |
| `--train-on-completions` | off | Compute loss only on assistant responses, not on the prompt. |
| `--gradient-accumulation-steps` | `1` | Accumulate gradients over N batches before updating weights. |
| `--assistant-id` | `77091` | Token ID used to identify the start of assistant turns (for completion masking). |
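Conceptually, --train-on-completions zeroes the loss on every token up to and including the assistant marker identified by --assistant-id, so only response tokens drive the gradient. A hedged sketch (token IDs here are made up for illustration; this is not the script's actual masking code):

```python
# Sketch of completion masking: zero loss weight on prompt tokens,
# full weight on everything after the assistant marker.
import numpy as np

ASSISTANT_ID = 77091                          # matches the --assistant-id default
tokens = np.array([5, 9, ASSISTANT_ID, 42, 17, 3])   # toy sequence

start = int(np.argmax(tokens == ASSISTANT_ID)) + 1   # first response token
mask = np.zeros(len(tokens))
mask[start:] = 1.0

per_token_loss = np.ones(len(tokens))         # stand-in for cross-entropy values
masked_loss = (per_token_loss * mask).sum() / mask.sum()
print(mask)                                   # [0. 0. 0. 1. 1. 1.]
```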

LoRA arguments

| Argument | Default | Description |
| --- | --- | --- |
| `--lora-rank` | `8` | Rank of the LoRA decomposition matrices. Higher values increase adapter expressiveness. |
| `--lora-alpha` | `16` | Scaling factor applied to the LoRA updates. Effective learning rate scales with `lora-alpha / lora-rank`. |
| `--lora-dropout` | `0.0` | Dropout probability applied to LoRA layers during training. |
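The alpha / rank interaction is easy to overlook when tuning: raising the rank without raising alpha shrinks the update scale. A quick check with the defaults:

```python
# The adapter update is scaled by alpha / rank, so doubling --lora-rank
# without changing --lora-alpha halves the update magnitude.
rank, alpha = 8, 16                 # script defaults
print(alpha / rank)                 # 2.0

# Keeping alpha = 2 * rank preserves the default scale at higher ranks:
for r in (8, 16, 32):
    assert (2 * r) / r == 2.0
```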

Output arguments

| Argument | Default | Description |
| --- | --- | --- |
| `--output-path` | `adapters.safetensors` | File path where the trained adapter is saved. |
| `--adapter-path` | (none) | Path to an existing adapter to resume training from. |

Training examples

python lora.py \
    --model-path mlx-community/Qwen3-VL-2B-Instruct-bf16 \
    --dataset your-huggingface-dataset-id \
    --batch-size 2 \
    --epochs 2 \
    --learning-rate 2e-5 \
    --output-path ./qwen3-lora-adapter.safetensors

Python API

You can drive training programmatically by constructing an argparse.Namespace and calling main:
import argparse
from mlx_vlm.lora import main

args = argparse.Namespace(
    model_path="mlx-community/Qwen3-VL-2B-Instruct-bf16",
    dataset="your-huggingface-dataset-id",
    split="train",
    dataset_config=None,
    batch_size=2,
    epochs=2,
    learning_rate=2e-5,
    iters=1000,
    steps_per_report=10,
    steps_per_eval=200,
    steps_per_save=100,
    val_batches=25,
    max_seq_length=2048,
    lora_rank=8,
    lora_alpha=16,
    lora_dropout=0.0,
    output_path="./qwen3-lora-adapter.safetensors",
    adapter_path=None,
    full_finetune=False,
    train_vision=False,
    grad_checkpoint=False,
    grad_clip=None,
    train_on_completions=False,
    gradient_accumulation_steps=1,
    assistant_id=77091,
    image_resize_shape=None,
    custom_prompt_format=None,
)

main(args)

Training output

The script logs progress at the interval set by --steps-per-report. Each report includes:
  • Current step and total steps
  • Loss at the current step
  • Running average loss
  • Throughput in tokens/sec
  • Estimated time remaining
After training completes, the adapter is saved to --output-path.

Training tips

  • Enable --grad-checkpoint to reduce peak memory at the cost of slightly longer training time.
  • Reduce --batch-size to 1 or 2 if you run out of memory.
  • Use --gradient-accumulation-steps to maintain an equivalent effective batch size without holding more activations in memory (e.g., --batch-size 1 --gradient-accumulation-steps 8 approximates a batch size of 8).
  • Use QLoRA with a 4-bit model checkpoint for the lowest memory footprint.
  • Start with learning rates in the range 1e-5 to 2e-5 for LoRA. QLoRA often benefits from substantially higher rates (around 2e-4) because the base model is already compressed.
  • Increase --lora-rank to 16 or 32 for more expressive adapters on complex tasks. Higher rank increases parameter count and memory use.
  • Use --train-on-completions to mask the prompt tokens from the loss — this focuses training on the model’s output quality and often improves convergence on instruction-following tasks.
  • Add --train-vision only when the task requires the model to understand visual features it wasn’t trained on (e.g., domain-specific imagery like medical scans or satellite data).
  • Monitor --steps-per-eval validation loss to detect overfitting early.
  • On Apple Silicon, MLX automatically utilizes the unified memory architecture and maps operations to the GPU and Neural Engine. No additional configuration is needed.
  • For models larger than 11B parameters, always enable --grad-checkpoint.
  • Use --image-resize-shape to cap image resolution and reduce the sequence length fed into the vision encoder, which directly reduces memory and speeds up training.
  • Larger batch sizes improve GPU utilization up to a point; if you have memory headroom, increasing --batch-size is often more effective than increasing --gradient-accumulation-steps.
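The gradient-accumulation tip above can be sketched as a loop (a generic illustration, not the lora.py internals): gradients from N micro-batches are averaged before a single optimizer update, so the effective batch size is N times the micro-batch size while only one micro-batch of activations is live at a time.

```python
# Sketch: --batch-size 1 --gradient-accumulation-steps 8
# approximates a batch size of 8.
accumulation_steps = 8
grad_buffer = 0.0

def micro_batch_grad(step):
    return float(step)              # stand-in for a real gradient

for step in range(accumulation_steps):
    grad_buffer += micro_batch_grad(step) / accumulation_steps

# One optimizer update is then applied with the averaged gradient:
print(grad_buffer)                  # 3.5 = mean of gradients 0..7
```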
