
Overview

sdxl_train_network.py is the LoRA training script for Stable Diffusion XL (SDXL). It shares most arguments with train_network.py but adds SDXL-specific features, including separate learning rates for the two text encoders and options to cache outputs and reduce VRAM usage.
Read LoRA Training for SD 1.x/2.x first for an explanation of the shared arguments. This page focuses on SDXL differences.

Key differences from SD 1.x/2.x training

| Aspect | SD 1.x/2.x | SDXL |
|---|---|---|
| Script | train_network.py | sdxl_train_network.py |
| Text encoders | 1 (CLIP ViT-L) | 2 (CLIP ViT-L/14 + OpenCLIP ViT-bigG/14) |
| Text encoder LR arg | --text_encoder_lr | --text_encoder_lr1 + --text_encoder_lr2 |
| Recommended precision | fp16 | bf16 preferred; fp16 with --no_half_vae |
| Typical resolution | 512px | 1024px |
| VAE stability | Generally stable | May be unstable in float16 |
| --v2 / --v_parameterization | Required for v2.x | Not used |
| --clip_skip | Optional | Not used |

Prerequisites

  • A Stable Diffusion XL base model. You can use stabilityai/stable-diffusion-xl-base-1.0 from Hugging Face or a local .safetensors file.
  • A dataset prepared at a higher resolution (1024×1024 is standard). Enable aspect ratio bucketing in your TOML config with enable_bucket = true.

Training command

accelerate launch --num_cpu_threads_per_process 1 sdxl_train_network.py \
  --pretrained_model_name_or_path="/path/to/sdxl-base.safetensors" \
  --dataset_config="my_sdxl_dataset_config.toml" \
  --output_dir="./output" \
  --output_name="my_sdxl_lora" \
  --save_model_as=safetensors \
  --network_module=networks.lora \
  --network_dim=32 \
  --network_alpha=16 \
  --learning_rate=1e-4 \
  --unet_lr=1e-4 \
  --text_encoder_lr1=1e-5 \
  --text_encoder_lr2=1e-5 \
  --optimizer_type="AdamW8bit" \
  --lr_scheduler="cosine_with_restarts" \
  --lr_warmup_steps=100 \
  --max_train_epochs=10 \
  --save_every_n_epochs=1 \
  --mixed_precision="bf16" \
  --gradient_checkpointing \
  --cache_latents
If you add --cache_text_encoder_outputs, caption augmentations (such as --shuffle_caption) are disabled and text encoder LoRA modules cannot be trained; in that case also pass --network_train_unet_only, or the script exits with an error.
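For U-Net-only training with cached text encoder outputs, the command could look like the following sketch (only the changed and essential arguments are shown; paths are illustrative):

```shell
# Sketch: U-Net-only SDXL LoRA training with text encoder output caching.
# Text encoder LR flags are omitted because the text encoders are not trained.
accelerate launch --num_cpu_threads_per_process 1 sdxl_train_network.py \
  --pretrained_model_name_or_path="/path/to/sdxl-base.safetensors" \
  --dataset_config="my_sdxl_dataset_config.toml" \
  --network_module=networks.lora \
  --network_train_unet_only \
  --cache_latents \
  --cache_text_encoder_outputs \
  --mixed_precision="bf16"
```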

SDXL-specific arguments

Dual text encoder learning rates

--text_encoder_lr1
number
Learning rate for LoRA modules in Text Encoder 1 (CLIP ViT-L/14). Defaults to --learning_rate when omitted. A value lower than the U-Net learning rate (e.g., 1e-5) is recommended.
--text_encoder_lr2
number
Learning rate for LoRA modules in Text Encoder 2 (OpenCLIP ViT-bigG/14). Defaults to --learning_rate when omitted. A value lower than the U-Net learning rate (e.g., 1e-5) is recommended.

VAE stability

--no_half_vae
boolean
Runs the VAE in float32 even when mixed precision is fp16 or bf16. The SDXL VAE can produce NaN values in float16. Always add this flag when using --mixed_precision=fp16.
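If you must use fp16 (for example, on a GPU without bf16 support), the relevant flags could be combined as in this sketch (other arguments as in the full example above):

```shell
# Sketch: fp16 mixed precision with the VAE kept in float32.
# Without --no_half_vae, the SDXL VAE may produce NaNs in float16.
accelerate launch --num_cpu_threads_per_process 1 sdxl_train_network.py \
  ...(other arguments as in the example above)... \
  --mixed_precision="fp16" \
  --no_half_vae
```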

Caching

--cache_latents
boolean
Pre-encodes all training images with the VAE and stores the latents in memory. Speeds up training and reduces VRAM because the VAE is not run during each step. Disables color augmentation and random crop; flip augmentation still works because flipped latents are cached as well.
--cache_latents_to_disk
boolean
Like --cache_latents, but writes the cache to disk. On subsequent runs, the script loads the cache instead of re-encoding. Useful for large datasets.
--cache_text_encoder_outputs
boolean
Pre-computes text encoder outputs and stores them in memory. Significantly reduces VRAM. Disables caption augmentations and text encoder LoRA training. Requires --network_train_unet_only.
--cache_text_encoder_outputs_to_disk
boolean
Like --cache_text_encoder_outputs, but writes to disk.
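For large datasets, both disk-caching flags can be combined so that repeated runs reuse the cache instead of re-encoding everything. A hedged sketch (base arguments abbreviated):

```shell
# Sketch: cache latents and text encoder outputs to disk; later runs load
# the cache instead of re-running the VAE and text encoders.
accelerate launch --num_cpu_threads_per_process 1 sdxl_train_network.py \
  ...(base arguments as above)... \
  --cache_latents --cache_latents_to_disk \
  --cache_text_encoder_outputs --cache_text_encoder_outputs_to_disk \
  --network_train_unet_only
```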

Experimental options

--fused_backward_pass
boolean
Fuses gradient computation with the optimizer step to save VRAM. Currently only supported with Adafactor. Cannot be combined with gradient accumulation.
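A sketch of pairing --fused_backward_pass with Adafactor; the optimizer_args shown are commonly used Adafactor settings and an assumption on our part, not something this page mandates:

```shell
# Sketch: Adafactor with a fused backward pass to save VRAM.
# Do not combine with --gradient_accumulation_steps.
accelerate launch --num_cpu_threads_per_process 1 sdxl_train_network.py \
  ...(base arguments as above)... \
  --optimizer_type="Adafactor" \
  --optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" \
  --lr_scheduler="constant" \
  --fused_backward_pass
```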

Dataset TOML for SDXL

Enable aspect ratio bucketing so the trainer can handle non-square crops at 1024px:
[general]
shuffle_caption = true
caption_extension = ".txt"

[[datasets]]
resolution = 1024
batch_size = 2
enable_bucket = true
bucket_reso_steps = 64
min_bucket_reso = 512
max_bucket_reso = 2048

  [[datasets.subsets]]
  image_dir = "/path/to/images"
  num_repeats = 10
bucket_reso_steps must be a multiple of 32 for SDXL. Using 64 is the recommended default.
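As a rough guide to run length with this config, total optimizer steps per run are images × num_repeats ÷ batch_size × epochs. A small sketch with a hypothetical image count:

```shell
# Hypothetical estimate: 50 images, num_repeats=10, batch_size=2, 10 epochs.
images=50; repeats=10; batch=2; epochs=10
steps=$(( images * repeats / batch * epochs ))
echo "$steps"  # 2500
```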

| GPU VRAM | Suggested adjustments |
|---|---|
| 24 GB | Default settings above work |
| 16 GB | Add --cache_latents and --cache_text_encoder_outputs (with --network_train_unet_only); reduce batch size to 1 |
| 12 GB | Add --gradient_accumulation_steps=2; use the Adafactor optimizer |
| 8 GB | Add --full_bf16; reduce --network_dim to 16; increase gradient accumulation |
bf16 precision is preferred over fp16 for SDXL because it has a wider dynamic range and avoids the VAE NaN issue without needing --no_half_vae.
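Putting the 8 GB row together, one possible combination looks like the sketch below (the network_alpha value is our assumption, scaled down with the smaller dim):

```shell
# Sketch: low-VRAM (8 GB) settings combining the table's suggestions.
accelerate launch --num_cpu_threads_per_process 1 sdxl_train_network.py \
  ...(base arguments as above)... \
  --network_dim=16 --network_alpha=8 \
  --mixed_precision="bf16" --full_bf16 \
  --gradient_checkpointing \
  --cache_latents --cache_text_encoder_outputs --network_train_unet_only \
  --gradient_accumulation_steps=4
```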

Using the trained LoRA

When training completes, load the .safetensors file in any SDXL-compatible tool:
  • ComfyUI — use a LoraLoader node with the SDXL base checkpoint loaded.
  • AUTOMATIC1111 — place in models/Lora/ and reference with <lora:my_sdxl_lora:1> in your prompt. Make sure to load the SDXL checkpoint first.
