Overview
sdxl_train_network.py is the LoRA training script for Stable Diffusion XL (SDXL). It shares most arguments with train_network.py but adds SDXL-specific features, including separate learning rates for the two text encoders and options to cache outputs and reduce VRAM usage.
Read LoRA Training for SD 1.x/2.x first for an explanation of the shared arguments. This page focuses on SDXL differences.
Key differences from SD 1.x/2.x training
| Aspect | SD 1.x/2.x | SDXL |
|---|---|---|
| Script | train_network.py | sdxl_train_network.py |
| Text encoders | 1 (CLIP ViT-L) | 2 (CLIP ViT-L/14 + OpenCLIP ViT-bigG/14) |
| Text encoder LR arg | --text_encoder_lr | --text_encoder_lr1 + --text_encoder_lr2 |
| Recommended precision | fp16 | bf16 preferred, fp16 with --no_half_vae |
| Typical resolution | 512px | 1024px |
| VAE stability | Generally stable | May be unstable in float16 |
| --v2 / --v_parameterization | Required for v2.x | Not used |
| --clip_skip | Optional | Not used |
Prerequisites
- A Stable Diffusion XL base model. You can use stabilityai/stable-diffusion-xl-base-1.0 from Hugging Face or a local .safetensors file.
- A dataset prepared at a higher resolution (1024×1024 is standard). Enable aspect ratio bucketing in your TOML config with enable_bucket = true.
Training command
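A minimal invocation, as a sketch — the model path, dataset config path, and output names below are placeholders, and the hyperparameter values are illustrative rather than prescriptive:

```bash
accelerate launch sdxl_train_network.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
  --dataset_config="my_dataset.toml" \
  --output_dir="output" --output_name="my_sdxl_lora" \
  --network_module=networks.lora \
  --network_dim=32 --network_alpha=16 \
  --learning_rate=1e-4 --text_encoder_lr1=1e-5 --text_encoder_lr2=1e-5 \
  --max_train_epochs=10 --save_every_n_epochs=1 \
  --mixed_precision=bf16 --save_precision=bf16 \
  --optimizer_type=AdamW8bit \
  --cache_latents
```

With --mixed_precision=fp16 instead of bf16, add --no_half_vae (see below).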
SDXL-specific arguments
Dual text encoder learning rates
--text_encoder_lr1
Learning rate for LoRA modules in Text Encoder 1 (CLIP ViT-L/14). Defaults to --learning_rate when omitted. A lower value than the U-Net rate (e.g., 1e-5) is recommended.

--text_encoder_lr2
Learning rate for LoRA modules in Text Encoder 2 (OpenCLIP ViT-bigG/14). Defaults to --learning_rate when omitted. A lower value than the U-Net rate (e.g., 1e-5) is recommended.

VAE stability
--no_half_vae
Runs the VAE in float32 even when mixed precision is fp16 or bf16. The SDXL VAE can produce NaN values in float16. Always add this flag when using --mixed_precision=fp16.

Caching
--cache_latents
Pre-encodes all training images with the VAE and stores the latents in memory. Speeds up training and reduces VRAM because the VAE is not run during each step. Disables color augmentation and random crop.

--cache_latents_to_disk
Like --cache_latents, but writes the cache to disk. On subsequent runs, the script loads the cache instead of re-encoding. Useful for large datasets.

--cache_text_encoder_outputs
Pre-computes text encoder outputs and stores them in memory. Significantly reduces VRAM. Disables caption augmentations and text encoder LoRA training. Requires --network_train_unet_only.

--cache_text_encoder_outputs_to_disk
Like --cache_text_encoder_outputs, but writes to disk.

Experimental options
Fuses gradient computation with the optimizer step to save VRAM. Currently only supported with Adafactor. Cannot be combined with gradient accumulation.

Dataset TOML for SDXL
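As a sketch, a dataset config with aspect ratio bucketing enabled might look like the following — the image directory, repeat count, and batch size are placeholder values to adapt to your dataset:

```toml
[general]
enable_bucket = true
bucket_reso_steps = 64
min_bucket_reso = 512
max_bucket_reso = 2048

[[datasets]]
resolution = 1024
batch_size = 2

  [[datasets.subsets]]
  image_dir = "/path/to/images"
  num_repeats = 10
  caption_extension = ".txt"
```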
Enable aspect ratio bucketing so the trainer can handle non-square image sizes at 1024px. bucket_reso_steps must be a multiple of 32 for SDXL; 64 is the recommended default.

Recommended settings by VRAM
| GPU VRAM | Suggested adjustments |
|---|---|
| 24 GB | Default settings above work |
| 16 GB | Add --cache_latents --cache_text_encoder_outputs, reduce batch size to 1 |
| 12 GB | Add --gradient_accumulation_steps=2, use Adafactor optimizer |
| 8 GB | Add --full_bf16, reduce --network_dim to 16, increase gradient accumulation |
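For example, the 12 GB row translates into flags along these lines (a sketch — the remaining arguments, elided with `...`, are your usual model, dataset, and network options; the Adafactor optimizer_args shown are the commonly used fixed-learning-rate settings):

```bash
accelerate launch sdxl_train_network.py \
  --optimizer_type=Adafactor \
  --optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" \
  --gradient_accumulation_steps=2 \
  --cache_latents --cache_text_encoder_outputs --network_train_unet_only \
  --train_batch_size=1 \
  ...
```

Note that --cache_text_encoder_outputs requires --network_train_unet_only, so this configuration trains only the U-Net LoRA modules.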
Using the trained LoRA
When training completes, load the .safetensors file in any SDXL-compatible tool:
- ComfyUI — use a LoraLoader node with the SDXL base checkpoint loaded.
- AUTOMATIC1111 — place it in models/Lora/ and reference it with <lora:my_sdxl_lora:1> in your prompt. Make sure to load the SDXL checkpoint first.
