Stable Diffusion XL (SDXL) is a higher-resolution successor to SD 1.x/2.x. It introduces a dual text encoder pipeline, a larger UNet, and a native resolution of 1024 × 1024. sd-scripts provides dedicated scripts for both LoRA training and full fine-tuning.

Architecture

SDXL uses the same latent diffusion pipeline as SD 1.x/2.x but with several key upgrades:
  • UNet — significantly larger than SD 1.x; operates at higher resolutions.
  • Dual text encoders
    • Text Encoder 1: CLIP ViT-L/14 (768-dimensional embeddings).
    • Text Encoder 2: OpenCLIP ViT-bigG/14 (1280-dimensional embeddings).
    • Both encoders run in parallel; their outputs are concatenated along the channel axis before being fed to the UNet.
  • VAE — improved compared to SD 1.x, but can be numerically unstable in float16. Use --no_half_vae when training with fp16.
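The concatenation step can be sketched shape-wise with plain arrays (a toy illustration, not sd-scripts code; the 77-token sequence length and random values are placeholders):

```python
import numpy as np

seq_len = 77  # standard CLIP token sequence length

# Per-token embeddings from each encoder (random placeholders)
te1 = np.random.randn(seq_len, 768)    # Text Encoder 1: CLIP ViT-L/14
te2 = np.random.randn(seq_len, 1280)   # Text Encoder 2: OpenCLIP ViT-bigG/14

# The two outputs are joined along the channel axis before reaching the UNet
context = np.concatenate([te1, te2], axis=-1)
print(context.shape)  # (77, 2048)
```

This is why the UNet's cross-attention context dimension is 2048 rather than the 768 of SD 1.x.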

Key differences from SD 1.x

| Feature | SD 1.x | SDXL |
| --- | --- | --- |
| Text encoders | 1 (CLIP ViT-L/14) | 2 (CLIP ViT-L/14 + OpenCLIP ViT-bigG/14) |
| Native resolution | 512 × 512 | 1024 × 1024 |
| VAE stability in fp16 | Stable | Unstable; use --no_half_vae |
| LoRA text encoder LRs | Single --text_encoder_lr | Separate --text_encoder_lr1 / --text_encoder_lr2 |
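The fp16 VAE instability comes down to float16's limited range: its largest finite value is 65504, so activations beyond that overflow to infinity, and subsequent arithmetic turns the infinities into NaNs. A minimal numpy illustration:

```python
import numpy as np

print(np.finfo(np.float16).max)  # 65504.0 -- largest finite float16 value

x = np.float16(70000.0)          # exceeds the float16 range
print(x)                         # inf

print(x - x)                     # nan -- how NaNs enter the pipeline
```

Running the VAE in float32 (--no_half_vae) sidesteps the overflow entirely.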

Available training methods

| Method | Script | Notes |
| --- | --- | --- |
| LoRA | sdxl_train_network.py | Primary training method |
| Fine-tuning (native / DreamBooth) | sdxl_train.py | Full model or UNet-only |
| Textual Inversion | sdxl_train_textual_inversion.py | Trains new token embeddings |
| ControlNet-LLLite | sdxl_train_control_net_lllite.py | Lightweight ControlNet for SDXL |
| LECO | train_network.py | Concept editing via LoRA |
| LoKr / LoHa | sdxl_train_network.py with networks.lokr / networks.loha | Alternative network architectures |

LoRA training

Use sdxl_train_network.py with --network_module=networks.lora:
```bash
accelerate launch --num_cpu_threads_per_process 1 sdxl_train_network.py \
  --pretrained_model_name_or_path="<SDXL base model path>" \
  --dataset_config="my_sdxl_dataset_config.toml" \
  --output_dir="<output directory>" \
  --output_name="my_sdxl_lora" \
  --save_model_as=safetensors \
  --network_module=networks.lora \
  --network_dim=32 \
  --network_alpha=16 \
  --learning_rate=1e-4 \
  --unet_lr=1e-4 \
  --text_encoder_lr1=1e-5 \
  --text_encoder_lr2=1e-5 \
  --optimizer_type="AdamW8bit" \
  --lr_scheduler="constant" \
  --max_train_epochs=10 \
  --save_every_n_epochs=1 \
  --mixed_precision="bf16" \
  --gradient_checkpointing \
  --no_half_vae \
  --cache_text_encoder_outputs \
  --cache_latents
```
--cache_text_encoder_outputs disables LoRA training on the text encoders. If you want to train text encoder LoRA modules, remove this flag and omit --network_train_unet_only.
When using --mixed_precision="fp16", always add --no_half_vae. SDXL’s VAE produces NaNs in float16, which corrupts training.
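The command above references a dataset config file. A minimal sketch of one, in sd-scripts' TOML format (the image directory, repeat count, and caption extension are placeholder assumptions to adapt to your data):

```toml
[general]
enable_bucket = true          # aspect-ratio bucketing
caption_extension = ".txt"

[[datasets]]
resolution = 1024             # SDXL native resolution
batch_size = 1

  [[datasets.subsets]]
  image_dir = "/path/to/train/images"
  num_repeats = 10
```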

Fine-tuning

sdxl_train.py supports both native fine-tuning and DreamBooth-style training:
```bash
accelerate launch --num_cpu_threads_per_process 1 sdxl_train.py \
  --pretrained_model_name_or_path="<SDXL base model path>" \
  --dataset_config="my_sdxl_dataset_config.toml" \
  --output_dir="<output directory>" \
  --output_name="my_sdxl_finetuned" \
  --save_model_as=safetensors \
  --optimizer_type="Adafactor" \
  --optimizer_args "scale_parameter=False" "relative_step=False" "warmup_init=False" \
  --lr_scheduler="constant_with_warmup" \
  --lr_warmup_steps=100 \
  --learning_rate=4e-7 \
  --mixed_precision="bf16" \
  --gradient_checkpointing \
  --cache_text_encoder_outputs \
  --cache_latents \
  --no_half_vae
```
For fine-tuning on a 24 GB GPU, train the UNet only, enable gradient checkpointing, cache text encoder outputs and latents, and use the Adafactor optimizer. (--network_train_unet_only applies to the LoRA script, not to sdxl_train.py.)

VRAM requirements

| GPU VRAM | LoRA training | Fine-tuning |
| --- | --- | --- |
| 8 GB | Possible with --cache_text_encoder_outputs, --cache_latents, an 8-bit optimizer, and low rank (4–8) | Not practical |
| 10 GB | Recommended minimum for LoRA | Not practical |
| 16 GB | Comfortable for LoRA (rank 16–32) | Very limited |
| 24 GB | Full LoRA training; fine-tuning with UNet-only + caching | Fine-tuning (batch size 1) |
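A back-of-envelope calculation shows why full fine-tuning is so demanding and why the example above uses Adafactor: with AdamW's two fp32 moment buffers, optimizer state alone pushes the total past 24 GB. The ~2.6B figure is the approximate SDXL UNet parameter count, and the sketch ignores activations:

```python
# Rough VRAM estimate for full SDXL UNet fine-tuning with AdamW
# (illustrative approximation, not an exact measurement)
unet_params = 2.6e9                  # approximate SDXL UNet parameter count

weights_bf16 = 2 * unet_params       # bf16 weights: 2 bytes/param
grads_bf16 = 2 * unet_params         # bf16 gradients: 2 bytes/param
adamw_states = 8 * unet_params       # two fp32 moments: 8 bytes/param

total_gib = (weights_bf16 + grads_bf16 + adamw_states) / 2**30
print(f"{total_gib:.1f} GiB before activations")  # ~29 GiB: over a 24 GB budget
```

Adafactor stores factored (much smaller) second-moment statistics instead of full per-parameter moments, which is what brings UNet-only fine-tuning back under 24 GB.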

Key parameters

| Parameter | Description | Recommendation |
| --- | --- | --- |
| --network_dim | LoRA rank | 16–32 |
| --network_alpha | LoRA alpha | Half of network_dim |
| --unet_lr | UNet learning rate | 1e-4 |
| --text_encoder_lr1 | CLIP ViT-L/14 learning rate | 1e-5 |
| --text_encoder_lr2 | OpenCLIP ViT-bigG/14 learning rate | 1e-5 |
| --no_half_vae | Run the VAE in float32 | Required with fp16 |
| --bucket_reso_steps | Bucket resolution step size | 32 or 64 |
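How network_dim and network_alpha interact: the LoRA update to a weight matrix is scaled by alpha / dim, so the recommended alpha = dim / 2 halves the effective magnitude of the learned update. A shape-level numpy sketch (the layer dimensions are illustrative placeholders, not SDXL's actual sizes):

```python
import numpy as np

d_out, d_in = 320, 768          # illustrative layer dimensions
dim, alpha = 32, 16             # --network_dim / --network_alpha

A = np.random.randn(dim, d_in) * 0.01   # "down" projection, small random init
B = np.zeros((d_out, dim))              # "up" projection, zero init

W = np.random.randn(d_out, d_in)        # frozen base weight
delta = (alpha / dim) * (B @ A)         # low-rank update, scaled by alpha/dim
W_adapted = W + delta

print(delta.shape)  # (320, 768) -- same shape as W, but only rank-32 expressive
```

Because B starts at zero, the adapted model is identical to the base model at step 0; training then learns the low-rank deviation.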
