Overview
sdxl_train_network.py is the LoRA training script for Stable Diffusion XL (SDXL). It shares most arguments with train_network.py but adds SDXL-specific features, including separate learning rates for the two text encoders and options to cache outputs and reduce VRAM usage.
Read LoRA Training for SD 1.x/2.x first for an explanation of the shared arguments. This page focuses on SDXL differences.
Key differences from SD 1.x/2.x training
| Aspect | SD 1.x/2.x | SDXL |
|---|---|---|
| Script | train_network.py | sdxl_train_network.py |
| Text encoders | 1 (CLIP ViT-L) | 2 (CLIP ViT-L/14 + OpenCLIP ViT-bigG/14) |
| Text encoder LR arg | --text_encoder_lr | --text_encoder_lr1 + --text_encoder_lr2 |
| Recommended precision | fp16 | bf16 preferred, fp16 with --no_half_vae |
| Typical resolution | 512px | 1024px |
| VAE stability | Generally stable | May be unstable in float16 |
| --v2 / --v_parameterization | Required for v2.x | Not used |
| --clip_skip | Optional | Not used |
Prerequisites
- A Stable Diffusion XL base model. You can use stabilityai/stable-diffusion-xl-base-1.0 from Hugging Face or a local .safetensors file.
- A dataset prepared at a higher resolution (1024×1024 is standard). Enable aspect ratio bucketing in your TOML config with enable_bucket = true.
Training command
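A minimal invocation, as a sketch — the model path, dataset config path, and output names below are placeholders, and the hyperparameter values are illustrative rather than prescriptive:

```bash
accelerate launch sdxl_train_network.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
  --dataset_config="my_dataset.toml" \
  --output_dir="output" --output_name="my_sdxl_lora" \
  --network_module=networks.lora \
  --network_dim=32 --network_alpha=16 \
  --learning_rate=1e-4 --text_encoder_lr1=1e-5 --text_encoder_lr2=1e-5 \
  --max_train_epochs=10 --save_every_n_epochs=1 \
  --mixed_precision=bf16 --save_precision=bf16 \
  --optimizer_type=AdamW8bit \
  --cache_latents
```

With --mixed_precision=fp16 instead of bf16, add --no_half_vae (see below).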
SDXL-specific arguments
Dual text encoder learning rates
--text_encoder_lr1
Learning rate for LoRA modules in Text Encoder 1 (CLIP ViT-L/14). Defaults to --learning_rate when omitted. A lower value than the U-Net rate (e.g., 1e-5) is recommended.

--text_encoder_lr2
Learning rate for LoRA modules in Text Encoder 2 (OpenCLIP ViT-bigG/14). Defaults to --learning_rate when omitted. A lower value than the U-Net rate (e.g., 1e-5) is recommended.

VAE stability
--no_half_vae
Runs the VAE in float32 even when mixed precision is fp16 or bf16. The SDXL VAE can produce NaN values in float16. Always add this flag when using --mixed_precision=fp16.

Caching
--cache_latents
Pre-encodes all training images with the VAE and stores the latents in memory. Speeds up training and reduces VRAM because the VAE is not run during each step. Disables color augmentation and random crop.

--cache_latents_to_disk
Like --cache_latents, but writes the cache to disk. On subsequent runs, the script loads the cache instead of re-encoding. Useful for large datasets.

--cache_text_encoder_outputs
Pre-computes text encoder outputs and stores them in memory. Significantly reduces VRAM. Disables caption augmentations and text encoder LoRA training. Requires --network_train_unet_only.

--cache_text_encoder_outputs_to_disk
Like --cache_text_encoder_outputs, but writes to disk.

Experimental options
Fuses gradient computation with the optimizer step to save VRAM. Currently only supported with Adafactor. Cannot be combined with gradient accumulation.

Dataset TOML for SDXL
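As a sketch, a dataset config with aspect ratio bucketing enabled might look like the following — the image directory, repeat count, and batch size are placeholder values to adapt to your dataset:

```toml
[general]
enable_bucket = true
bucket_reso_steps = 64
min_bucket_reso = 512
max_bucket_reso = 2048

[[datasets]]
resolution = 1024
batch_size = 2

  [[datasets.subsets]]
  image_dir = "/path/to/images"
  num_repeats = 10
  caption_extension = ".txt"
```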
Enable aspect ratio bucketing so the trainer can handle non-square image sizes at 1024px. bucket_reso_steps must be a multiple of 32 for SDXL; 64 is the recommended default.

Recommended settings by VRAM
| GPU VRAM | Suggested adjustments |
|---|---|
| 24 GB | Default settings above work |
| 16 GB | Add --cache_latents --cache_text_encoder_outputs, reduce batch size to 1 |
| 12 GB | Add --gradient_accumulation_steps=2, use Adafactor optimizer |
| 8 GB | Add --full_bf16, reduce --network_dim to 16, increase gradient accumulation |
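For example, the 12 GB row translates into flags along these lines (a sketch — the remaining arguments, elided with `...`, are your usual model, dataset, and network options; the Adafactor optimizer_args shown are the commonly used fixed-learning-rate settings):

```bash
accelerate launch sdxl_train_network.py \
  --optimizer_type=Adafactor \
  --optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" \
  --gradient_accumulation_steps=2 \
  --cache_latents --cache_text_encoder_outputs --network_train_unet_only \
  --train_batch_size=1 \
  ...
```

Note that --cache_text_encoder_outputs requires --network_train_unet_only, so this configuration trains only the U-Net LoRA modules.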
Using the trained LoRA
When training completes, load the .safetensors file in any SDXL-compatible tool:
- ComfyUI — use a LoraLoader node with the SDXL base checkpoint loaded.
- AUTOMATIC1111 — place it in models/Lora/ and reference it with <lora:my_sdxl_lora:1> in your prompt. Make sure to load the SDXL checkpoint first.
