Stable Diffusion 1.x and 2.x are the foundational image generation models supported by sd-scripts. Both share the same UNet + VAE + CLIP pipeline architecture but differ in resolution targets and text encoder configuration.

Architecture

SD 1.x and 2.x use the classic latent diffusion architecture:
  • UNet — the denoising backbone that operates on compressed latent representations.
  • VAE — encodes images into latent space and decodes latents back to pixel space.
  • CLIP text encoder — conditions generation on text prompts.
    • SD 1.x uses OpenAI CLIP ViT-L/14.
    • SD 2.x uses OpenCLIP ViT-H/14 with a 1024-dimensional embedding.
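The VAE compresses images by a factor of 8 in each spatial dimension into a 4-channel latent, and the UNet denoises in that space. A quick sketch of the shapes involved (pure Python, illustrative only):

```python
# Latent-space shapes for the SD 1.x/2.x VAE (8x spatial downsampling, 4 latent channels).
def latent_shape(height, width, channels=4, factor=8):
    """Return the (C, H, W) latent tensor shape for a given pixel resolution."""
    assert height % factor == 0 and width % factor == 0, "resolution must be divisible by 8"
    return (channels, height // factor, width // factor)

print(latent_shape(512, 512))  # SD 1.x default -> (4, 64, 64)
print(latent_shape(768, 768))  # SD 2.x default -> (4, 96, 96)
```

This is why training resolutions must be multiples of 8 (in practice, of 64 when aspect-ratio bucketing is enabled).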

Supported versions

| Version | Default resolution | Notes |
| --- | --- | --- |
| SD 1.x | 512 × 512 | Standard CLIP ViT-L/14 text encoder |
| SD 2.x | 768 × 768 | OpenCLIP ViT-H/14; supports v-parameterization |
SD 2.x models require --v2 and, for v-prediction checkpoints, --v_parameterization. Omitting these flags when training against a v2 checkpoint produces incorrect results.

Available training methods

| Method | Script | Notes |
| --- | --- | --- |
| LoRA | train_network.py | Recommended starting point |
| DreamBooth fine-tuning | train_db.py | Full model or UNet-only |
| Native fine-tuning | fine_tune.py | Requires pre-cached latents |
| Textual Inversion | train_textual_inversion.py | Trains new token embeddings only |
| ControlNet-LLLite | train_network.py with control module | Lightweight ControlNet variant |

LoRA training

Use train_network.py with --network_module=networks.lora:
accelerate launch --num_cpu_threads_per_process 1 train_network.py \
  --pretrained_model_name_or_path="<path to SD model>" \
  --dataset_config="dataset_config.toml" \
  --output_dir="./output" \
  --output_name="my_sd_lora" \
  --save_model_as=safetensors \
  --network_module=networks.lora \
  --network_dim=16 \
  --network_alpha=8 \
  --learning_rate=1e-4 \
  --optimizer_type="AdamW8bit" \
  --lr_scheduler="constant" \
  --max_train_epochs=10 \
  --save_every_n_epochs=1 \
  --mixed_precision="fp16" \
  --gradient_checkpointing \
  --cache_latents
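The command references a dataset_config.toml. A minimal sketch is shown below; the field names follow sd-scripts' TOML dataset schema, but the paths and values are placeholders, so verify against the dataset config documentation for your version:

```toml
[general]
enable_bucket = true           # bucket images by aspect ratio
caption_extension = ".txt"     # caption files live next to the images

[[datasets]]
resolution = 512               # use 768 for SD 2.x checkpoints
batch_size = 2

  [[datasets.subsets]]
  image_dir = "/path/to/images"
  num_repeats = 10             # how many times each image is seen per epoch
```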
A LoRA rank of 16–32 (--network_dim) is a good starting point for most subjects. Lower ranks (4–8) reduce file size at the cost of expressiveness; higher ranks (64+) can overfit with small datasets.
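Rank scales the adapter size linearly: for a linear layer with input dimension d_in and output dimension d_out, a LoRA adapter adds r × (d_in + d_out) parameters. A rough single-layer sketch of why doubling --network_dim roughly doubles the saved file size (the 320 × 320 projection is illustrative of the SD UNet's smaller attention blocks):

```python
def lora_params(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: a down-projection
    (rank x d_in) plus an up-projection (d_out x rank)."""
    return rank * (d_in + d_out)

# Example: a 320x320 attention projection.
for r in (4, 16, 64):
    print(r, lora_params(320, 320, r))  # 4 -> 2560, 16 -> 10240, 64 -> 40960
```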

SD 2.x flags

When training against an SD 2.x checkpoint you must add the following flags:
--v2 \
--v_parameterization   # only for v-prediction checkpoints (e.g., stabilityai/stable-diffusion-2-1)
Not all SD 2.x checkpoints use v-parameterization. Check the model card before adding --v_parameterization. Applying it to an epsilon-prediction checkpoint degrades quality.

Textual Inversion

Textual Inversion trains new token embeddings without modifying the model weights. Use train_textual_inversion.py:
accelerate launch --num_cpu_threads_per_process 1 train_textual_inversion.py \
  --pretrained_model_name_or_path="<path to SD model>" \
  --dataset_config="dataset_config.toml" \
  --output_dir="./output" \
  --output_name="my_embedding" \
  --save_model_as=safetensors \
  --max_train_steps=3000 \
  --learning_rate=5e-4 \
  --mixed_precision="fp16"
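Because only the new token embeddings are trained, the resulting artifact is tiny. For SD 1.x, CLIP ViT-L/14 token embeddings are 768-dimensional (1024 for SD 2.x's OpenCLIP ViT-H/14); with sd-scripts' --num_vectors_per_token controlling how many embedding vectors the token expands to, the trainable parameter count is just their product (a back-of-the-envelope sketch):

```python
def ti_trainable_params(num_vectors, embed_dim=768):
    """Trainable parameters for a Textual Inversion embedding:
    num_vectors token vectors, each of the text encoder's embedding dim."""
    return num_vectors * embed_dim

print(ti_trainable_params(1))        # 768  -- single-vector SD 1.x embedding
print(ti_trainable_params(4, 1024))  # 4096 -- four vectors on SD 2.x
```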

Key training parameters

| Parameter | SD 1.x recommendation | SD 2.x recommendation |
| --- | --- | --- |
| Resolution | 512 px | 768 px |
| --network_dim (LoRA rank) | 16–32 | 16–32 |
| --mixed_precision | fp16 | fp16 |
| --v2 | not required | required |
| --v_parameterization | not required | required for v-pred models |
| --clip_skip | 1 or 2 for community models | not used |
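--clip_skip counts hidden layers back from the end of the CLIP text encoder: 1 uses the final layer's output for conditioning, 2 uses the penultimate layer (common for anime-style community models). A pure-Python sketch of the indexing, not the actual sd-scripts implementation:

```python
def select_hidden_state(hidden_states, clip_skip=1):
    """Pick the CLIP text-encoder layer output used for conditioning.
    clip_skip=1 -> final hidden state, clip_skip=2 -> penultimate, etc."""
    return hidden_states[-clip_skip]

# Stand-in for the 12 transformer layer outputs of CLIP ViT-L/14.
layers = [f"layer_{i}" for i in range(1, 13)]
print(select_hidden_state(layers, clip_skip=1))  # layer_12 (final)
print(select_hidden_state(layers, clip_skip=2))  # layer_11 (penultimate)
```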
