Stable Diffusion 1.x and 2.x are the foundational image generation models supported by sd-scripts. Both share the same UNet + VAE + CLIP pipeline architecture but differ in resolution targets and text encoder configuration.
## Architecture
SD 1.x and 2.x use the classic latent diffusion architecture:
- UNet — the denoising backbone that operates on compressed latent representations.
- VAE — encodes images into latent space and decodes latents back to pixel space.
- CLIP text encoder — conditions generation on text prompts.
  - SD 1.x uses OpenAI CLIP ViT-L/14.
  - SD 2.x uses OpenCLIP ViT-H/14 with a 1024-dimensional embedding.
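The data flow through these components can be sketched as a shape walk-through (a toy NumPy illustration using the SD 1.x defaults; the arrays are placeholders, not real model outputs):

```python
import numpy as np

# Toy shape walk-through of the SD 1.x pipeline at 512x512.
# The VAE compresses images 8x spatially into 4 latent channels; the UNet
# denoises in that latent space, conditioned on CLIP text embeddings
# (77 tokens x 768 dims for ViT-L/14).
image = np.zeros((1, 3, 512, 512))             # pixel space: (batch, RGB, H, W)
latent = np.zeros((1, 4, 512 // 8, 512 // 8))  # VAE latent space: (1, 4, 64, 64)
text_cond = np.zeros((1, 77, 768))             # CLIP ViT-L/14 hidden states

print(latent.shape)  # (1, 4, 64, 64)
```

Working in the 64 × 64 latent space rather than on 512 × 512 pixels is what makes training and inference tractable on consumer GPUs.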
## Supported versions
| Version | Default resolution | Notes |
|---|---|---|
| SD 1.x | 512 × 512 | Standard CLIP ViT-L/14 text encoder |
| SD 2.x | 768 × 768 | OpenCLIP ViT-H/14; supports v-parameterization |
SD 2.x models require `--v2` and, for v-prediction checkpoints, `--v_parameterization`. Omitting these flags when training against an SD 2.x checkpoint produces incorrect results.
## Available training methods
| Method | Script | Notes |
|---|---|---|
| LoRA | `train_network.py` | Recommended starting point |
| DreamBooth fine-tuning | `train_db.py` | Full model or UNet-only |
| Native fine-tuning | `fine_tune.py` | Requires pre-cached latents |
| Textual Inversion | `train_textual_inversion.py` | Trains new token embeddings only |
| ControlNet-LLLite | `train_network.py` with control module | Lightweight ControlNet variant |
## LoRA training

Use `train_network.py` with `--network_module=networks.lora`:
```bash
accelerate launch --num_cpu_threads_per_process 1 train_network.py \
  --pretrained_model_name_or_path="<path to SD model>" \
  --dataset_config="dataset_config.toml" \
  --output_dir="./output" \
  --output_name="my_sd_lora" \
  --save_model_as=safetensors \
  --network_module=networks.lora \
  --network_dim=16 \
  --network_alpha=8 \
  --learning_rate=1e-4 \
  --optimizer_type="AdamW8bit" \
  --lr_scheduler="constant" \
  --max_train_epochs=10 \
  --save_every_n_epochs=1 \
  --mixed_precision="fp16" \
  --gradient_checkpointing \
  --cache_latents
```
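The command reads its dataset definition from the file passed to `--dataset_config`. A minimal sketch of that TOML file (the path, repeat count, and batch size are placeholder values to adapt to your data):

```toml
[general]
enable_bucket = true          # group images into aspect-ratio buckets

[[datasets]]
resolution = 512              # match the base model (768 for SD 2.x 768-px models)
batch_size = 2

  [[datasets.subsets]]
  image_dir = "/path/to/train/images"   # placeholder path
  caption_extension = ".txt"            # one caption file per image
  num_repeats = 10
```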
A LoRA rank of 16–32 (`--network_dim`) is a good starting point for most subjects. Lower ranks (4–8) reduce file size at the cost of expressiveness; higher ranks (64+) can overfit with small datasets.
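To make the rank/alpha relationship concrete, here is a small NumPy sketch (illustrative, not sd-scripts code) of how a LoRA weight update is formed: a rank-`network_dim` matrix product scaled by `network_alpha / network_dim`.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out = d_in = 320          # size of one attention projection in SD's UNet
dim, alpha = 16, 8          # --network_dim and --network_alpha from above

down = rng.normal(size=(dim, d_in)) * 0.01   # "lora_down" matrix
up = rng.normal(size=(d_out, dim)) * 0.01    # "lora_up" matrix (zero-init in practice)
delta_w = (alpha / dim) * (up @ down)        # update added to the frozen weight

# The update can never exceed rank `dim`: low ranks are compact but less
# expressive, which is the trade-off described above.
print(np.linalg.matrix_rank(delta_w))  # 16
print(alpha / dim)                     # scaling factor: 0.5
```

Note that `alpha` acts as a scale on the learned update, which is why halving `network_alpha` relative to `network_dim` (as in the example command) effectively damps the update strength.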
## SD 2.x flags

When training against an SD 2.x checkpoint, you must add the following flags:
```bash
--v2 \
--v_parameterization  # only for v-prediction checkpoints (e.g., stabilityai/stable-diffusion-2-1)
```
Not all SD 2.x checkpoints use v-parameterization. Check the model card before adding `--v_parameterization`. Applying it to an epsilon-prediction checkpoint degrades quality.
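The difference between the two prediction objectives can be shown with the standard diffusion quantities (a toy scalar sketch of the v-parameterization of Salimans & Ho; the numeric values are arbitrary):

```python
import math

# Noising: x_t = a * x0 + s * eps, with a = sqrt(alpha_bar_t), s = sqrt(1 - alpha_bar_t).
alpha_bar_t = 0.7
a, s = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
x0, eps = 1.0, -0.5            # toy "image" and noise values

x_t = a * x0 + s * eps
target_eps = eps               # epsilon-prediction target (SD 1.x, epsilon SD 2.x models)
target_v = a * eps - s * x0    # v-prediction target (e.g. SD 2.1 at 768 px)

# Identity: x0 = a * x_t - s * v, so both targets carry the same information;
# they differ in how the loss is weighted across timesteps. Training with the
# wrong target is what produces the incorrect results mentioned above.
assert abs((a * x_t - s * target_v) - x0) < 1e-9
```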
## Textual Inversion

Textual Inversion trains new token embeddings without modifying the model weights. Use `train_textual_inversion.py`:
```bash
accelerate launch --num_cpu_threads_per_process 1 train_textual_inversion.py \
  --pretrained_model_name_or_path="<path to SD model>" \
  --dataset_config="dataset_config.toml" \
  --output_dir="./output" \
  --output_name="my_embedding" \
  --save_model_as=safetensors \
  --max_train_steps=3000 \
  --learning_rate=5e-4 \
  --mixed_precision="fp16"
```
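Conceptually, the only trainable parameters are the embedding vectors of the new placeholder token; everything else stays frozen. A toy NumPy sketch of that idea (not the script's actual implementation; the token names and update loop are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim = 768   # CLIP ViT-L/14 token embedding size (1024 for SD 2.x)

# Miniature embedding table; the real CLIP vocabulary has ~49k entries.
embeddings = {w: rng.normal(size=embed_dim) for w in ("photo", "of", "cat")}

# New placeholder token, initialized from a semantically close existing word.
embeddings["<my-token>"] = embeddings["cat"].copy()
frozen = {w: v.copy() for w, v in embeddings.items() if w != "<my-token>"}

for _ in range(100):                              # stand-in for the training loop
    fake_grad = rng.normal(size=embed_dim)        # placeholder gradient
    embeddings["<my-token>"] -= 5e-4 * fake_grad  # only this row is updated

# The base vocabulary is untouched; only the new embedding moved.
assert all(np.array_equal(embeddings[w], frozen[w]) for w in frozen)
```

Because only a few hundred floats are learned, the resulting embedding file is tiny and can be shared independently of the base model.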
## Key training parameters
| Parameter | SD 1.x recommendation | SD 2.x recommendation |
|---|---|---|
| Resolution | 512 px | 768 px |
| `--network_dim` (LoRA rank) | 16–32 | 16–32 |
| `--mixed_precision` | fp16 | fp16 |
| `--v2` | not required | required |
| `--v_parameterization` | not required | required for v-pred models |
| `--clip_skip` | 1 or 2 for community models | not used |
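`--clip_skip` selects which CLIP hidden layer conditions the UNet: `clip_skip=1` uses the final layer, `clip_skip=2` the penultimate one (common for anime-style community models). A minimal sketch of the selection logic (illustrative; layer names and the helper function are made up, not sd-scripts internals):

```python
# CLIP ViT-L/14 has 12 transformer layers; each produces a hidden state.
hidden_states = [f"layer_{i}_output" for i in range(1, 13)]

def select_text_features(hidden_states, clip_skip=1):
    # clip_skip counts back from the end: 1 -> final layer, 2 -> penultimate.
    return hidden_states[-clip_skip]

print(select_text_features(hidden_states, clip_skip=2))  # layer_11_output
```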