Stable Diffusion XL (SDXL) is a higher-resolution successor to SD 1.x/2.x. It introduces a dual text encoder pipeline, a larger UNet, and a native resolution of 1024 × 1024. sd-scripts provides dedicated scripts for both LoRA training and full fine-tuning.
Architecture
SDXL uses the same latent diffusion pipeline as SD 1.x/2.x but with several key upgrades:
- UNet — significantly larger than SD 1.x; operates at higher resolutions.
- Dual text encoders
  - Text Encoder 1: CLIP ViT-L/14 (768-dimensional embeddings).
  - Text Encoder 2: OpenCLIP ViT-bigG/14 (1280-dimensional embeddings).
  - Both encoders run in parallel; their outputs are concatenated before being fed to the UNet.
- VAE — improved over SD 1.x, but numerically unstable in float16. Use --no_half_vae when training with fp16.
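As a dimensional sketch of the dual-encoder setup (the numbers come from the architecture above; the variable names are illustrative):

```python
# SDXL concatenates the per-token hidden states of both text encoders
# along the feature axis to form the UNet's cross-attention context.
clip_l_dim = 768        # Text Encoder 1: CLIP ViT-L/14
openclip_g_dim = 1280   # Text Encoder 2: OpenCLIP ViT-bigG/14

context_dim = clip_l_dim + openclip_g_dim
print(context_dim)  # 2048 — the cross-attention context width of the SDXL UNet
```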
Key differences from SD 1.x
| Feature | SD 1.x | SDXL |
|---|---|---|
| Text encoders | 1 (CLIP ViT-L/14) | 2 (CLIP ViT-L/14 + OpenCLIP ViT-bigG/14) |
| Native resolution | 512 × 512 | 1024 × 1024 |
| VAE stability in fp16 | Stable | Unstable — use --no_half_vae |
| LoRA text encoder LRs | Single --text_encoder_lr | Separate --text_encoder_lr1 / --text_encoder_lr2 |
Available training methods
| Method | Script | Notes |
|---|---|---|
| LoRA | sdxl_train_network.py | Primary training method |
| Fine-tuning (native / DreamBooth) | sdxl_train.py | Full model or UNet-only |
| Textual Inversion | sdxl_train_textual_inversion.py | Trains new token embeddings |
| ControlNet-LLLite | sdxl_train_control_net_lllite.py | Lightweight ControlNet for SDXL |
| LECO | train_network.py | Concept editing via LoRA |
| LoKr / LoHa | sdxl_train_network.py with LyCORIS (lycoris.kohya) | Alternative network architectures |
LoRA training
Use sdxl_train_network.py with --network_module=networks.lora:
```bash
accelerate launch --num_cpu_threads_per_process 1 sdxl_train_network.py \
  --pretrained_model_name_or_path="<SDXL base model path>" \
  --dataset_config="my_sdxl_dataset_config.toml" \
  --output_dir="<output directory>" \
  --output_name="my_sdxl_lora" \
  --save_model_as=safetensors \
  --network_module=networks.lora \
  --network_dim=32 \
  --network_alpha=16 \
  --learning_rate=1e-4 \
  --unet_lr=1e-4 \
  --text_encoder_lr1=1e-5 \
  --text_encoder_lr2=1e-5 \
  --optimizer_type="AdamW8bit" \
  --lr_scheduler="constant" \
  --max_train_epochs=10 \
  --save_every_n_epochs=1 \
  --mixed_precision="bf16" \
  --gradient_checkpointing \
  --no_half_vae \
  --cache_text_encoder_outputs \
  --cache_latents
```
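The file passed to --dataset_config is a TOML file. A minimal sketch (the paths and repeat counts are placeholders; see sd-scripts' dataset configuration documentation for the full schema):

```toml
[general]
enable_bucket = true         # aspect-ratio bucketing
resolution = 1024            # SDXL native resolution
caption_extension = ".txt"

[[datasets]]
batch_size = 2

  [[datasets.subsets]]
  image_dir = "/path/to/train/images"   # placeholder path
  num_repeats = 10
```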
Note that --cache_text_encoder_outputs disables LoRA training on the text encoders, so the --text_encoder_lr1/--text_encoder_lr2 values above have no effect while it is set. To train text encoder LoRA modules, remove this flag and do not pass --network_train_unet_only.
When using --mixed_precision="fp16", always add --no_half_vae. SDXL’s VAE produces NaNs in float16, which corrupts training.
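The fp16 failure mode is ordinary numeric overflow: float16 tops out near 65504, and activations inside the SDXL VAE can exceed that, becoming inf and then NaN. A minimal illustration of the dtype behavior (not the VAE itself):

```python
import numpy as np

# float16 overflows just above 65504; float32 has ample headroom.
activation = 70000.0                 # magnitude the VAE can internally reach
print(np.float16(activation))        # overflows to inf
print(np.float32(activation))        # represented exactly fine

fp16_ok = bool(np.isfinite(np.float16(activation)))
fp32_ok = bool(np.isfinite(np.float32(activation)))
print(fp16_ok, fp32_ok)  # False True
```

This is why --no_half_vae (which keeps the VAE in float32) is mandatory with fp16 but optional with bf16, whose exponent range matches float32.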
Fine-tuning
sdxl_train.py supports both native fine-tuning and DreamBooth-style training:
```bash
accelerate launch --num_cpu_threads_per_process 1 sdxl_train.py \
  --pretrained_model_name_or_path="<SDXL base model path>" \
  --dataset_config="my_sdxl_dataset_config.toml" \
  --output_dir="<output directory>" \
  --output_name="my_sdxl_finetuned" \
  --save_model_as=safetensors \
  --optimizer_type="Adafactor" \
  --optimizer_args "scale_parameter=False" "relative_step=False" "warmup_init=False" \
  --lr_scheduler="constant_with_warmup" \
  --lr_warmup_steps=100 \
  --learning_rate=4e-7 \
  --mixed_precision="bf16" \
  --gradient_checkpointing \
  --cache_text_encoder_outputs \
  --cache_latents \
  --no_half_vae
```
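The constant_with_warmup schedule above ramps the learning rate linearly over --lr_warmup_steps and then holds it flat. A sketch of the multiplier (the standard shape of this schedule, not sd-scripts' own code):

```python
def lr_multiplier(step, warmup_steps=100):
    """Constant-with-warmup: linear ramp to 1.0, then flat."""
    if step < warmup_steps:
        return step / warmup_steps
    return 1.0

print(lr_multiplier(50))    # 0.5  — halfway through warmup
print(lr_multiplier(100))   # 1.0  — warmup complete
print(lr_multiplier(5000))  # 1.0  — stays constant thereafter
```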
For fine-tuning on a 24 GB GPU, train the UNet only (do not pass --train_text_encoder; the --network_train_unet_only flag belongs to the LoRA scripts), enable gradient checkpointing, cache text encoder outputs and latents, and use the Adafactor optimizer.
VRAM requirements
| GPU VRAM | LoRA training | Fine-tuning |
|---|---|---|
| 8 GB | Possible with --cache_text_encoder_outputs, --cache_latents, 8-bit optimizer, low rank (4–8) | Not practical |
| 10 GB | Recommended minimum for LoRA | Not practical |
| 16 GB | Comfortable for LoRA (rank 16–32) | Very limited |
| 24 GB | Full LoRA training; fine-tuning with UNet-only + caching | Fine-tuning (batch size 1) |
Key parameters
| Parameter | Description | Recommendation |
|---|---|---|
| --network_dim | LoRA rank | 16–32 |
| --network_alpha | LoRA alpha | Half of network_dim |
| --unet_lr | UNet learning rate | 1e-4 |
| --text_encoder_lr1 | CLIP ViT-L/14 LR | 1e-5 |
| --text_encoder_lr2 | OpenCLIP ViT-bigG/14 LR | 1e-5 |
| --no_half_vae | Run the VAE in float32 | Required with fp16 |
| --bucket_reso_steps | Bucket resolution step size | 32 or 64 |
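network_alpha interacts with network_dim: the LoRA delta is scaled by alpha / dim, so the "half of network_dim" recommendation pins the scale at 0.5. A quick check (values from the table; the scaling rule is standard LoRA behavior):

```python
# LoRA applies its update as (alpha / dim) * (B @ A); sd-scripts records
# alpha in the checkpoint metadata so the scale survives merging.
network_dim = 32
network_alpha = 16

scale = network_alpha / network_dim
print(scale)  # 0.5 — raising dim without raising alpha shrinks this scale
```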