Stable Diffusion 1.x and 2.x are the foundational image generation models supported by sd-scripts. Both share the same UNet + VAE + CLIP pipeline architecture but differ in resolution targets and text encoder configuration.
## Architecture
SD 1.x and 2.x use the classic latent diffusion architecture:
- UNet — the denoising backbone that operates on compressed latent representations.
- VAE — encodes images into latent space and decodes latents back to pixel space.
- CLIP text encoder — conditions generation on text prompts.
  - SD 1.x uses OpenAI CLIP ViT-L/14.
  - SD 2.x uses OpenCLIP ViT-H/14 with a 1024-dimensional embedding.
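The data flow through these components can be sketched as a shape walk-through (a toy NumPy illustration using the SD 1.x defaults; the arrays are placeholders, not real model outputs):

```python
import numpy as np

# Toy shape walk-through of the SD 1.x pipeline at 512x512.
# The VAE compresses images 8x spatially into 4 latent channels; the UNet
# denoises in that latent space, conditioned on CLIP text embeddings
# (77 tokens x 768 dims for ViT-L/14).
image = np.zeros((1, 3, 512, 512))             # pixel space: (batch, RGB, H, W)
latent = np.zeros((1, 4, 512 // 8, 512 // 8))  # VAE latent space: (1, 4, 64, 64)
text_cond = np.zeros((1, 77, 768))             # CLIP ViT-L/14 hidden states

print(latent.shape)  # (1, 4, 64, 64)
```

Working in the 64 × 64 latent space rather than on 512 × 512 pixels is what makes training and inference tractable on consumer GPUs.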
## Supported versions
| Version | Default resolution | Notes |
|---|---|---|
| SD 1.x | 512 × 512 | Standard CLIP ViT-L/14 text encoder |
| SD 2.x | 768 × 768 | OpenCLIP ViT-H/14; supports v-parameterization |
SD 2.x models require `--v2` and, for v-prediction checkpoints, `--v_parameterization`. Omitting these flags when training against an SD 2.x checkpoint produces incorrect results.
## Available training methods
| Method | Script | Notes |
|---|---|---|
| LoRA | `train_network.py` | Recommended starting point |
| DreamBooth fine-tuning | `train_db.py` | Full model or UNet-only |
| Native fine-tuning | `fine_tune.py` | Requires pre-cached latents |
| Textual Inversion | `train_textual_inversion.py` | Trains new token embeddings only |
| ControlNet-LLLite | `train_network.py` with control module | Lightweight ControlNet variant |
## LoRA training

Use `train_network.py` with `--network_module=networks.lora`:
```bash
accelerate launch --num_cpu_threads_per_process 1 train_network.py \
  --pretrained_model_name_or_path="<path to SD model>" \
  --dataset_config="dataset_config.toml" \
  --output_dir="./output" \
  --output_name="my_sd_lora" \
  --save_model_as=safetensors \
  --network_module=networks.lora \
  --network_dim=16 \
  --network_alpha=8 \
  --learning_rate=1e-4 \
  --optimizer_type="AdamW8bit" \
  --lr_scheduler="constant" \
  --max_train_epochs=10 \
  --save_every_n_epochs=1 \
  --mixed_precision="fp16" \
  --gradient_checkpointing \
  --cache_latents
```
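The command reads its dataset definition from the file passed to `--dataset_config`. A minimal sketch of that TOML file (the path, repeat count, and batch size are placeholder values to adapt to your data):

```toml
[general]
enable_bucket = true          # group images into aspect-ratio buckets

[[datasets]]
resolution = 512              # match the base model (768 for SD 2.x 768-px models)
batch_size = 2

  [[datasets.subsets]]
  image_dir = "/path/to/train/images"   # placeholder path
  caption_extension = ".txt"            # one caption file per image
  num_repeats = 10
```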
A LoRA rank of 16–32 (`--network_dim`) is a good starting point for most subjects. Lower ranks (4–8) reduce file size at the cost of expressiveness; higher ranks (64+) can overfit with small datasets.
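To make the rank/alpha relationship concrete, here is a small NumPy sketch (illustrative, not sd-scripts code) of how a LoRA weight update is formed: a rank-`network_dim` matrix product scaled by `network_alpha / network_dim`.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out = d_in = 320          # size of one attention projection in SD's UNet
dim, alpha = 16, 8          # --network_dim and --network_alpha from above

down = rng.normal(size=(dim, d_in)) * 0.01   # "lora_down" matrix
up = rng.normal(size=(d_out, dim)) * 0.01    # "lora_up" matrix (zero-init in practice)
delta_w = (alpha / dim) * (up @ down)        # update added to the frozen weight

# The update can never exceed rank `dim`: low ranks are compact but less
# expressive, which is the trade-off described above.
print(np.linalg.matrix_rank(delta_w))  # 16
print(alpha / dim)                     # scaling factor: 0.5
```

Note that `alpha` acts as a scale on the learned update, which is why halving `network_alpha` relative to `network_dim` (as in the example command) effectively damps the update strength.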
## SD 2.x flags

When training against an SD 2.x checkpoint, you must add the following flags:
```bash
--v2 \
--v_parameterization  # only for v-prediction checkpoints (e.g., stabilityai/stable-diffusion-2-1)
```
Not all SD 2.x checkpoints use v-parameterization. Check the model card before adding `--v_parameterization`. Applying it to an epsilon-prediction checkpoint degrades quality.
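The difference between the two prediction objectives can be shown with the standard diffusion quantities (a toy scalar sketch of the v-parameterization of Salimans & Ho; the numeric values are arbitrary):

```python
import math

# Noising: x_t = a * x0 + s * eps, with a = sqrt(alpha_bar_t), s = sqrt(1 - alpha_bar_t).
alpha_bar_t = 0.7
a, s = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
x0, eps = 1.0, -0.5            # toy "image" and noise values

x_t = a * x0 + s * eps
target_eps = eps               # epsilon-prediction target (SD 1.x, epsilon SD 2.x models)
target_v = a * eps - s * x0    # v-prediction target (e.g. SD 2.1 at 768 px)

# Identity: x0 = a * x_t - s * v, so both targets carry the same information;
# they differ in how the loss is weighted across timesteps. Training with the
# wrong target is what produces the incorrect results mentioned above.
assert abs((a * x_t - s * target_v) - x0) < 1e-9
```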
## Textual Inversion

Textual Inversion trains new token embeddings without modifying the model weights. Use `train_textual_inversion.py`:
```bash
accelerate launch --num_cpu_threads_per_process 1 train_textual_inversion.py \
  --pretrained_model_name_or_path="<path to SD model>" \
  --dataset_config="dataset_config.toml" \
  --output_dir="./output" \
  --output_name="my_embedding" \
  --save_model_as=safetensors \
  --max_train_steps=3000 \
  --learning_rate=5e-4 \
  --mixed_precision="fp16"
```
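Conceptually, the only trainable parameters are the embedding vectors of the new placeholder token; everything else stays frozen. A toy NumPy sketch of that idea (not the script's actual implementation; the token names and update loop are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim = 768   # CLIP ViT-L/14 token embedding size (1024 for SD 2.x)

# Miniature embedding table; the real CLIP vocabulary has ~49k entries.
embeddings = {w: rng.normal(size=embed_dim) for w in ("photo", "of", "cat")}

# New placeholder token, initialized from a semantically close existing word.
embeddings["<my-token>"] = embeddings["cat"].copy()
frozen = {w: v.copy() for w, v in embeddings.items() if w != "<my-token>"}

for _ in range(100):                              # stand-in for the training loop
    fake_grad = rng.normal(size=embed_dim)        # placeholder gradient
    embeddings["<my-token>"] -= 5e-4 * fake_grad  # only this row is updated

# The base vocabulary is untouched; only the new embedding moved.
assert all(np.array_equal(embeddings[w], frozen[w]) for w in frozen)
```

Because only a few hundred floats are learned, the resulting embedding file is tiny and can be shared independently of the base model.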
## Key training parameters
| Parameter | SD 1.x recommendation | SD 2.x recommendation |
|---|---|---|
| Resolution | 512 px | 768 px |
| `--network_dim` (LoRA rank) | 16–32 | 16–32 |
| `--mixed_precision` | fp16 | fp16 |
| `--v2` | not required | required |
| `--v_parameterization` | not required | required for v-pred models |
| `--clip_skip` | 1 or 2 for community models | not used |
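`--clip_skip` selects which CLIP hidden layer conditions the UNet: `clip_skip=1` uses the final layer, `clip_skip=2` the penultimate one (common for anime-style community models). A minimal sketch of the selection logic (illustrative; layer names and the helper function are made up, not sd-scripts internals):

```python
# CLIP ViT-L/14 has 12 transformer layers; each produces a hidden state.
hidden_states = [f"layer_{i}_output" for i in range(1, 13)]

def select_text_features(hidden_states, clip_skip=1):
    # clip_skip counts back from the end: 1 -> final layer, 2 -> penultimate.
    return hidden_states[-clip_skip]

print(select_text_features(hidden_states, clip_skip=2))  # layer_11_output
```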