## Basic usage

### Prepare your dataset

Create or identify a Hugging Face dataset with `images` and `messages` columns in the format your target model expects. See Dataset preparation for details.
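As a sketch, a single row pairing an image with a chat-style conversation might look like the following. The `images`/`messages` column names come from the requirement above; the exact message layout is an assumption and depends on the target model's chat template.

```python
# Hypothetical example of one dataset row; the file path, question, and
# answer are illustrative only.
row = {
    "images": ["photos/receipt_001.jpg"],  # image path(s) for this sample
    "messages": [
        {"role": "user", "content": "What is the total on this receipt?"},
        {"role": "assistant", "content": "The total is $42.17."},
    ],
}
```

A Hugging Face dataset is simply a collection of such rows, one per training sample.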
## CLI reference

### Model arguments

| Argument | Default | Description |
|---|---|---|
| `--model-path` | `mlx-community/Qwen2-VL-2B-Instruct-bf16` | Path or Hub ID of the base model to fine-tune. |
| `--full-finetune` | — | Update all model weights instead of using LoRA adapters. |
| `--train-vision` | — | Unfreeze and train the vision encoder alongside the language model. |
### Dataset arguments

| Argument | Default | Description |
|---|---|---|
| `--dataset` | (required) | Local path or Hugging Face dataset identifier. |
| `--split` | `train` | Dataset split to use. |
| `--dataset-config` | — | Dataset configuration name (for datasets with multiple configs). |
| `--image-resize-shape` | — | Resize all images to a fixed shape, e.g. `768 768`. |
| `--custom-prompt-format` | — | JSON template for datasets with question/answer columns instead of `messages`. |
### Training arguments

| Argument | Default | Description |
|---|---|---|
| `--learning-rate` | `2e-5` | Optimizer learning rate. |
| `--batch-size` | `4` | Number of samples per training step. |
| `--iters` | `1000` | Total training iterations. Ignored if `--epochs` is set. |
| `--epochs` | — | Number of full passes over the dataset. Overrides `--iters`. |
| `--steps-per-report` | `10` | Log loss and throughput every N steps. |
| `--steps-per-eval` | `200` | Run validation every N steps. |
| `--steps-per-save` | `100` | Save a checkpoint every N steps. |
| `--val-batches` | `25` | Number of batches used for each validation run. |
| `--max-seq-length` | `2048` | Maximum token sequence length; longer sequences are truncated. |
| `--grad-checkpoint` | — | Enable gradient checkpointing to reduce peak memory (slightly slower). |
| `--grad-clip` | — | Clip gradients to this maximum norm. |
| `--train-on-completions` | — | Compute loss only on assistant responses, not on the prompt. |
| `--gradient-accumulation-steps` | `1` | Accumulate gradients over N batches before updating weights. |
| `--assistant-id` | `77091` | Token ID used to identify the start of assistant turns (for completion masking). |
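Since `--epochs` overrides `--iters`, the number of optimizer steps an `--epochs` run performs can be estimated from the dataset size, batch size, and accumulation steps. A rough sketch; the exact rounding the trainer applies to partial batches is an assumption:

```python
import math

def estimated_iters(num_samples: int, batch_size: int, epochs: int,
                    grad_accum: int = 1) -> int:
    """Rough estimate of optimizer steps when --epochs is set.

    Assumes partial batches round up and nothing is dropped; the real
    trainer's rounding may differ.
    """
    steps_per_epoch = math.ceil(num_samples / (batch_size * grad_accum))
    return steps_per_epoch * epochs

# e.g. 10,000 samples with --batch-size 4 for 2 epochs -> 5000 steps
```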
### LoRA arguments

| Argument | Default | Description |
|---|---|---|
| `--lora-rank` | `8` | Rank of the LoRA decomposition matrices. Higher values increase adapter expressiveness. |
| `--lora-alpha` | `16` | Scaling factor applied to the LoRA updates. The effective learning rate scales with `lora-alpha / lora-rank`. |
| `--lora-dropout` | `0.0` | Dropout probability applied to LoRA layers during training. |
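The `lora-alpha / lora-rank` scaling above can be seen directly in how a LoRA update is applied at inference time. A minimal numpy sketch of the standard LoRA forward pass, not the trainer's actual code:

```python
import numpy as np

rank, alpha = 8, 16            # defaults from the table above
d_in, d_out = 32, 32

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))        # frozen base weight
A = rng.normal(size=(rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))               # trainable up-projection, zero-init

def lora_forward(x):
    # Base output plus the low-rank update, scaled by alpha / rank:
    # doubling alpha doubles the adapter's contribution.
    scale = alpha / rank
    return W @ x + scale * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B initialized to zero, the adapter starts as an exact no-op:
assert np.allclose(lora_forward(x), W @ x)
```

This is why raising `--lora-rank` without adjusting `--lora-alpha` shrinks the effective update scale.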
### Output arguments

| Argument | Default | Description |
|---|---|---|
| `--output-path` | `adapters.safetensors` | File path where the trained adapter is saved. |
| `--adapter-path` | — | Path to an existing adapter to resume training from. |
## Training examples

### Python API

You can drive training programmatically by constructing an `argparse.Namespace` and calling `main`:

- Basic LoRA
- QLoRA
- Full fine-tuning
- Resume training
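As a sketch, a basic LoRA run might be configured like this. The `Namespace` field names mirror the CLI flags above; the dataset ID is hypothetical, and the `main` import path is an assumption, so check the package for the actual entry point:

```python
from argparse import Namespace

# Mirror the CLI defaults from the tables above; attribute names follow the
# usual argparse convention of dashes becoming underscores.
args = Namespace(
    model_path="mlx-community/Qwen2-VL-2B-Instruct-bf16",
    dataset="my-org/receipts-vqa",   # hypothetical dataset ID
    split="train",
    learning_rate=2e-5,
    batch_size=4,
    iters=1000,
    lora_rank=8,
    lora_alpha=16,
    lora_dropout=0.0,
    output_path="adapters.safetensors",
    full_finetune=False,
    train_vision=False,
)

# The import path below is an assumption; substitute the real module:
# from <package> import main
# main(args)
```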
## Training output

The script logs progress at the interval set by `--steps-per-report`. Each report includes:

- Current step and total steps
- Loss at the current step
- Running average loss
- Throughput in tokens/sec
- Estimated time remaining

The trained adapter is saved to the path given by `--output-path`.
## Training tips

### Memory optimization

- Enable `--grad-checkpoint` to reduce peak memory at the cost of slightly longer training time.
- Reduce `--batch-size` to 1 or 2 if you run out of memory.
- Use `--gradient-accumulation-steps` to maintain an equivalent effective batch size without holding more activations in memory (e.g., `--batch-size 1 --gradient-accumulation-steps 8` approximates a batch size of 8).
- Use QLoRA with a 4-bit model checkpoint for the lowest memory footprint.
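The accumulation equivalence above can be checked numerically: averaging gradients over eight micro-batches of one sample yields the same update as one batch of eight. A framework-agnostic numpy sketch with a simple squared-error loss:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(8, 4))   # 8 samples, 4 features
w = np.zeros(4)                  # toy linear model weights

def grad(batch, w):
    # Gradient of mean squared error toward a target of 1.0 per sample.
    preds = batch @ w
    return batch.T @ (preds - 1.0) / len(batch)

# One step with batch size 8:
g_full = grad(data, w)

# Eight accumulated micro-batches of size 1, averaged before the update:
g_accum = np.mean([grad(data[i:i + 1], w) for i in range(8)], axis=0)

assert np.allclose(g_full, g_accum)  # identical effective gradient
```

The equivalence holds exactly for losses averaged over the batch; only peak activation memory differs.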
### Convergence and quality

- Start with learning rates in the range 1e-5 to 2e-5 for LoRA. QLoRA often benefits from slightly higher rates (2e-4) because the base model is already compressed.
- Increase `--lora-rank` to 16 or 32 for more expressive adapters on complex tasks. Higher rank increases parameter count and memory use.
- Use `--train-on-completions` to mask the prompt tokens from the loss; this focuses training on the model's output quality and often improves convergence on instruction-following tasks.
- Add `--train-vision` only when the task requires the model to understand visual features it wasn't trained on (e.g., domain-specific imagery like medical scans or satellite data).
- Monitor validation loss (reported every `--steps-per-eval` steps) to detect overfitting early.
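Completion masking amounts to zeroing the loss weight on prompt tokens. A minimal sketch over a stand-in token stream: the marker ID is the `--assistant-id` default from the tables above, and masking everything up to and including the first marker is an assumption about the exact boundary the trainer uses:

```python
ASSISTANT_ID = 77091  # --assistant-id default

def completion_mask(token_ids):
    """Return 1.0 for tokens that contribute to the loss, 0.0 for the rest.

    Sketch only: masks everything up to and including the first
    assistant-marker token; the trainer's exact rule may differ.
    """
    mask = [0.0] * len(token_ids)
    try:
        start = token_ids.index(ASSISTANT_ID) + 1
    except ValueError:
        return mask  # no assistant turn found; nothing to train on
    for i in range(start, len(token_ids)):
        mask[i] = 1.0
    return mask

# prompt tokens, then the marker, then assistant response tokens:
tokens = [101, 5, 9, ASSISTANT_ID, 42, 43, 44]
# completion_mask(tokens) -> [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
```

Multiplying the per-token loss by this mask leaves only the assistant response contributing to the gradient.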
### Hardware-specific guidance

- On Apple Silicon, MLX automatically takes advantage of the unified memory architecture and runs operations on the GPU. No additional configuration is needed.
- For models larger than 11B parameters, always enable `--grad-checkpoint`.
- Use `--image-resize-shape` to cap image resolution and reduce the sequence length fed into the vision encoder, which directly reduces memory use and speeds up training.
- Larger batch sizes improve GPU utilization up to a point; if you have memory headroom, increasing `--batch-size` is often more effective than increasing `--gradient-accumulation-steps`.
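The effect of `--image-resize-shape` on sequence length can be approximated with patch arithmetic: a ViT-style encoder emits roughly one token per patch. The 14-pixel patch size below is a common choice but is an assumption for any given model:

```python
def vision_tokens(height: int, width: int, patch: int = 14) -> int:
    """Approximate image-token count from a ViT-style encoder.

    patch=14 is typical of many vision transformers, but not
    necessarily this model's value; merging steps in some encoders
    reduce the count further.
    """
    return (height // patch) * (width // patch)

# Halving each image dimension cuts the vision token count roughly 4x,
# e.g. vision_tokens(1536, 1536) vs vision_tokens(768, 768).
```

Because attention cost grows with sequence length, this reduction compounds into substantially lower memory use per step.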