Stable Diffusion XL (SDXL) is a higher-resolution successor to SD 1.x/2.x. It introduces a dual text encoder pipeline, a larger UNet, and a native resolution of 1024 × 1024. sd-scripts provides dedicated scripts for both LoRA training and full fine-tuning.
Architecture
SDXL uses the same latent diffusion pipeline as SD 1.x/2.x but with several key upgrades:
- UNet — significantly larger than SD 1.x; operates at higher resolutions.
- Dual text encoders
  - Text Encoder 1: CLIP ViT-L/14 (768-dimensional embeddings).
  - Text Encoder 2: OpenCLIP ViT-bigG/14 (1280-dimensional embeddings).
  - Both encoders run in parallel; their outputs are concatenated before being fed to the UNet.
- VAE — improved over SD 1.x, but numerically unstable in float16. Use --no_half_vae when training with fp16.
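As a dimensional sketch of the dual-encoder setup (the numbers come from the architecture above; the variable names are illustrative):

```python
# SDXL concatenates the per-token hidden states of both text encoders
# along the feature axis to form the UNet's cross-attention context.
clip_l_dim = 768        # Text Encoder 1: CLIP ViT-L/14
openclip_g_dim = 1280   # Text Encoder 2: OpenCLIP ViT-bigG/14

context_dim = clip_l_dim + openclip_g_dim
print(context_dim)  # 2048 — the cross-attention context width of the SDXL UNet
```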
Key differences from SD 1.x
| Feature | SD 1.x | SDXL |
|---|---|---|
| Text encoders | 1 (CLIP ViT-L/14) | 2 (CLIP ViT-L/14 + OpenCLIP ViT-bigG/14) |
| Native resolution | 512 × 512 | 1024 × 1024 |
| VAE stability in fp16 | Stable | Unstable — use --no_half_vae |
| LoRA text encoder LRs | Single --text_encoder_lr | Separate --text_encoder_lr1 / --text_encoder_lr2 |
Available training methods
| Method | Script | Notes |
|---|---|---|
| LoRA | sdxl_train_network.py | Primary training method |
| Fine-tuning (native / DreamBooth) | sdxl_train.py | Full model or UNet-only |
| Textual Inversion | sdxl_train_textual_inversion.py | Trains new token embeddings |
| ControlNet-LLLite | sdxl_train_control_net_lllite.py | Lightweight ControlNet for SDXL |
| LECO | train_network.py | Concept editing via LoRA |
| LoKr / LoHa | sdxl_train_network.py with LyCORIS (lycoris.kohya) | Alternative network architectures |
LoRA training
Use sdxl_train_network.py with --network_module=networks.lora:
```bash
accelerate launch --num_cpu_threads_per_process 1 sdxl_train_network.py \
  --pretrained_model_name_or_path="<SDXL base model path>" \
  --dataset_config="my_sdxl_dataset_config.toml" \
  --output_dir="<output directory>" \
  --output_name="my_sdxl_lora" \
  --save_model_as=safetensors \
  --network_module=networks.lora \
  --network_dim=32 \
  --network_alpha=16 \
  --learning_rate=1e-4 \
  --unet_lr=1e-4 \
  --text_encoder_lr1=1e-5 \
  --text_encoder_lr2=1e-5 \
  --optimizer_type="AdamW8bit" \
  --lr_scheduler="constant" \
  --max_train_epochs=10 \
  --save_every_n_epochs=1 \
  --mixed_precision="bf16" \
  --gradient_checkpointing \
  --no_half_vae \
  --cache_text_encoder_outputs \
  --cache_latents
```
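The file passed to --dataset_config is a TOML file. A minimal sketch (the paths and repeat counts are placeholders; see sd-scripts' dataset configuration documentation for the full schema):

```toml
[general]
enable_bucket = true         # aspect-ratio bucketing
resolution = 1024            # SDXL native resolution
caption_extension = ".txt"

[[datasets]]
batch_size = 2

  [[datasets.subsets]]
  image_dir = "/path/to/train/images"   # placeholder path
  num_repeats = 10
```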
Note that --cache_text_encoder_outputs disables LoRA training on the text encoders, so the --text_encoder_lr1/--text_encoder_lr2 values above have no effect while it is set. To train text encoder LoRA modules, remove this flag and do not pass --network_train_unet_only.
When using --mixed_precision="fp16", always add --no_half_vae. SDXL’s VAE produces NaNs in float16, which corrupts training.
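The fp16 failure mode is ordinary numeric overflow: float16 tops out near 65504, and activations inside the SDXL VAE can exceed that, becoming inf and then NaN. A minimal illustration of the dtype behavior (not the VAE itself):

```python
import numpy as np

# float16 overflows just above 65504; float32 has ample headroom.
activation = 70000.0                 # magnitude the VAE can internally reach
print(np.float16(activation))        # overflows to inf
print(np.float32(activation))        # represented exactly fine

fp16_ok = bool(np.isfinite(np.float16(activation)))
fp32_ok = bool(np.isfinite(np.float32(activation)))
print(fp16_ok, fp32_ok)  # False True
```

This is why --no_half_vae (which keeps the VAE in float32) is mandatory with fp16 but optional with bf16, whose exponent range matches float32.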
Fine-tuning
sdxl_train.py supports both native fine-tuning and DreamBooth-style training:
```bash
accelerate launch --num_cpu_threads_per_process 1 sdxl_train.py \
  --pretrained_model_name_or_path="<SDXL base model path>" \
  --dataset_config="my_sdxl_dataset_config.toml" \
  --output_dir="<output directory>" \
  --output_name="my_sdxl_finetuned" \
  --save_model_as=safetensors \
  --optimizer_type="Adafactor" \
  --optimizer_args "scale_parameter=False" "relative_step=False" "warmup_init=False" \
  --lr_scheduler="constant_with_warmup" \
  --lr_warmup_steps=100 \
  --learning_rate=4e-7 \
  --mixed_precision="bf16" \
  --gradient_checkpointing \
  --cache_text_encoder_outputs \
  --cache_latents \
  --no_half_vae
```
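The constant_with_warmup schedule above ramps the learning rate linearly over --lr_warmup_steps and then holds it flat. A sketch of the multiplier (the standard shape of this schedule, not sd-scripts' own code):

```python
def lr_multiplier(step, warmup_steps=100):
    """Constant-with-warmup: linear ramp to 1.0, then flat."""
    if step < warmup_steps:
        return step / warmup_steps
    return 1.0

print(lr_multiplier(50))    # 0.5  — halfway through warmup
print(lr_multiplier(100))   # 1.0  — warmup complete
print(lr_multiplier(5000))  # 1.0  — stays constant thereafter
```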
For fine-tuning on a 24 GB GPU, train the UNet only (do not pass --train_text_encoder; the --network_train_unet_only flag belongs to the LoRA scripts), enable gradient checkpointing, cache text encoder outputs and latents, and use the Adafactor optimizer.
VRAM requirements
| GPU VRAM | LoRA training | Fine-tuning |
|---|---|---|
| 8 GB | Possible with --cache_text_encoder_outputs, --cache_latents, 8-bit optimizer, low rank (4–8) | Not practical |
| 10 GB | Recommended minimum for LoRA | Not practical |
| 16 GB | Comfortable for LoRA (rank 16–32) | Very limited |
| 24 GB | Full LoRA training; fine-tuning with UNet-only + caching | Fine-tuning (batch size 1) |
Key parameters
| Parameter | Description | Recommendation |
|---|---|---|
| --network_dim | LoRA rank | 16–32 |
| --network_alpha | LoRA alpha | Half of network_dim |
| --unet_lr | UNet learning rate | 1e-4 |
| --text_encoder_lr1 | CLIP ViT-L/14 LR | 1e-5 |
| --text_encoder_lr2 | OpenCLIP ViT-bigG/14 LR | 1e-5 |
| --no_half_vae | Run the VAE in float32 | Required with fp16 |
| --bucket_reso_steps | Bucket resolution step size | 32 or 64 |
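network_alpha interacts with network_dim: the LoRA delta is scaled by alpha / dim, so the "half of network_dim" recommendation pins the scale at 0.5. A quick check (values from the table; the scaling rule is standard LoRA behavior):

```python
# LoRA applies its update as (alpha / dim) * (B @ A); sd-scripts records
# alpha in the checkpoint metadata so the scale survives merging.
network_dim = 32
network_alpha = 16

scale = network_alpha / network_dim
print(scale)  # 0.5 — raising dim without raising alpha shrinks this scale
```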