Stable Diffusion 1.x / 2.x
SD 1.x and 2.x are the foundational models for image generation, widely compatible and efficient.
SD 1.5 (SD15)
Architecture:
- Context dimension: 768
- Model channels: 320
- Latent format: SD15 (4 channels, 8x spatial downscaling)
- Memory usage factor: 1.0
- Supports regular generation and inpainting
- Compatible with LoRAs, hypernetworks, and embeddings
- Works with unCLIP models
- InstructPix2Pix variant available for image editing
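The latent-format numbers above translate directly into tensor shapes. A minimal sketch, assuming only the stated format (4 channels, 8x spatial downscaling); `sd15_latent_shape` is a hypothetical helper, not a ComfyUI API:

```python
# Hypothetical helper: derive the latent tensor shape implied by the
# SD15 latent format (4 channels, 8x spatial downscaling).
def sd15_latent_shape(batch: int, height: int, width: int) -> tuple[int, int, int, int]:
    channels, downscale = 4, 8
    # Image dimensions must be multiples of the downscale factor.
    assert height % downscale == 0 and width % downscale == 0
    return (batch, channels, height // downscale, width // downscale)

print(sd15_latent_shape(1, 512, 512))  # (1, 4, 64, 64)
```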
SD 2.0 / 2.1 (SD20)
Architecture:
- Context dimension: 1024
- Uses linear transformers
- Supports V-prediction and EPS prediction
- Higher precision (float32) attention for better quality
- SD21UnclipL: Uses CLIP-L vision encoder with 1536 ADM channels
- SD21UnclipH: Uses CLIP-H vision encoder with 2048 ADM channels
- SD_X4Upscaler: 4x upscaling model with 7 input channels
  - Location: `models/upscale_models/`
  - Latent format: SD_X4 (downscales by 1x)
SDXL (Stable Diffusion XL)
SDXL Base
Architecture:
- Model channels: 320
- Transformer depth: [0, 0, 2, 2, 10, 10]
- Context dimension: 2048
- ADM channels: 2816
- Latent format: SDXL (4 channels, 8x spatial downscaling)
- Memory usage factor: 0.8
Text encoders:
- CLIP-L (OpenAI CLIP ViT-L)
- CLIP-G (OpenCLIP ViT-bigG)

Prediction types:
- EPS (standard)
- V-prediction (with `v_pred` key)
- EDM (Playground v2.5 variant)
- V-prediction EDM (with configurable sigma)

Features:
- ZSNR support for anime checkpoints
- Inpainting model support
- InstructPix2Pix variant available
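The 2816 ADM channels are not arbitrary: per the SDXL micro-conditioning scheme, they are the pooled CLIP-G embedding (1280 dims) concatenated with six 256-dim Fourier embeddings (original-size height/width, crop coordinates, and target size). A quick arithmetic check:

```python
# SDXL ADM channel breakdown (per the SDXL micro-conditioning scheme):
pooled_clip_g = 1280     # pooled CLIP-G text embedding
size_embed_dim = 256     # Fourier embedding dim per conditioning value
micro_cond_values = 6    # original h/w, crop top/left, target h/w
adm_channels = pooled_clip_g + size_embed_dim * micro_cond_values
print(adm_channels)  # 2816
```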
SDXL Refiner
Purpose: Adds fine details to SDXL base generations

Architecture:
- Model channels: 384
- Transformer depth: [0, 0, 4, 4, 4, 4, 0, 0]
- Context dimension: 1280
- Uses only CLIP-G encoder
SDXL Optimized Variants
SSD-1B (Segmind Small):
- Transformer depth: [0, 0, 2, 2, 4, 4]
- Faster inference, smaller model size

Segmind-Vega:
- Transformer depth: [0, 0, 1, 1, 2, 2]
- Ultra-fast inference

KOALA:
- KOALA-700M: Transformer depth [0, 2, 5]
- KOALA-1B: Transformer depth [0, 2, 6]
Stable Diffusion 3 (SD3)
Architecture:
- In channels: 16
- DiT-based (Diffusion Transformer)
- Latent format: SD3 (16 channels, 8x spatial downscaling)
- Memory usage factor: 1.6
- Default shift: 3.0
Text encoders (each optional, usable in any combination):
- CLIP-L
- CLIP-G
- T5-XXL

Variants:
- SD3 Medium
- SD3.5 Large
- SD3.5 Medium
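The shift parameter that recurs across these flow-matching models (SD3's default 3.0, Flux Schnell's 1.0, Lumina's 6.0) warps the sigma schedule toward higher noise levels. A sketch of the commonly used timestep-shift formula; treat the exact form as an assumption about any particular implementation:

```python
def shift_sigma(sigma: float, shift: float = 3.0) -> float:
    """Warp a flow-matching sigma in [0, 1] by the model's shift factor:
    sigma' = shift * sigma / (1 + (shift - 1) * sigma).
    shift = 1.0 leaves the schedule unchanged."""
    return shift * sigma / (1 + (shift - 1) * sigma)

# Higher shift spends more of the schedule at high noise levels:
print([round(shift_sigma(s, 3.0), 3) for s in (0.25, 0.5, 0.75)])  # [0.5, 0.75, 0.9]
```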
Flux
Flux Dev
Architecture:
- Image model: flux
- Guidance embed: True
- Latent format: Flux (16 channels, 8x spatial downscaling)
- Memory usage factor: 3.1
- Supported dtypes: BF16, FP16, FP32
Text encoders:
- CLIP-L
- T5-XXL

Features:
- Guidance scale control (default 3.5)
- Guidance can be disabled completely for specific use cases
Flux Schnell
Optimizations:
- No guidance embed (faster inference)
- Flow matching model type
- Shift: 1.0, multiplier: 1.0
- Optimized for 4-step generation
Flux Inpaint
Specifics:
- In channels: 96 (to accommodate mask and reference image)
- Same architecture as Flux Dev otherwise
- Supported dtypes: BF16, FP32 (FP16 excluded)
Flux 2
Architecture:
- Updated model: flux2
- Latent format: Flux2 (16 channels)
- Shift: 2.02
- Scales memory usage by model size
Text encoder options:
- Qwen3-4B: Small, efficient
- Qwen3-8B: Balanced
- Mistral3-24B: Highest quality (can be pruned)

Flux Kontext:
- Image editing model with multi-reference support
- Use `FluxKontextImageScale` to optimize input resolution
- Preferred resolutions: 672x1568 to 1568x672
Stable Cascade
Architecture: Two-stage model with an innovative approach

Stage C (Prior)
Purpose: Generates a low-resolution prior

Architecture:
- Stable cascade stage: `c`
- Latent format: SC_Prior
- Shift: 2.0
- Supported dtypes: BF16, FP32
Components:
- Text encoder (CLIP)
- CLIP vision encoder
- VAE
Stage B (Decoder)
Purpose: Upscales the prior to full resolution

Architecture:
- Stable cascade stage: `b`
- Latent format: SC_B
- Shift: 1.0
- Supported dtypes: FP16, BF16, FP32
Workflow:
- Stage C generates a compressed latent
- Stage B decodes it to the full image
Specialized Image Models
PixArt Alpha / Sigma
PixArt Alpha:
- Image model: pixart_alpha
- Latent format: SD15
- Memory usage factor: 0.5
- Uses T5-XXL encoder
- Sampling: sqrt_linear beta schedule
PixArt Sigma:
- Image model: pixart_sigma
- Latent format: SDXL
- Improved quality over Alpha
AuraFlow
Architecture:
- Conditional sequence dimension: 2048
- Latent format: SDXL
- Shift: 1.73, multiplier: 1.0
- Uses custom AuraT5 encoder
HunyuanDiT / HunyuanDiT1
HunyuanDiT:
- Image model: hydit
- Latent format: SDXL
- Memory usage factor: 1.3
- Text encoders: CLIP + mT5 (multilingual)
- Linear start: 0.00085, end: 0.018
HunyuanDiT1:
- Updated architecture
- Linear end: 0.03 (extended range)

Features:
- Excellent Chinese language support
- Multilingual capabilities
Hunyuan Image 2.1
Architecture:
- Based on HunyuanVideo architecture
- Image model: hunyuan_video (adapted)
- Latent format: HunyuanImage21
- Memory usage factor: 8.7
- Shift: 5.0
- Supported dtypes: BF16, FP32
- Llama-based text encoder
Lumina Image 2.0
Architecture:
- Image model: lumina2
- Latent format: Flux (16 channels)
- Memory usage factor: 1.4
- Shift: 6.0, multiplier: 1.0
- Supported dtypes: BF16, FP32
- Text encoder: Gemma2-2B
Z Image
Standard Model:
- Dimension: 3840
- Memory usage factor: 2.8
- Shift: 3.0
- Supported dtypes: BF16, FP32 (FP16 with extended support)
- Text encoder: Qwen3-4B
Pixel-Space Variant:
- No VAE required; operates on raw RGB patches
- Memory usage factor: 0.03 (extremely efficient)
- Latent format: ZImagePixelSpace (no spatial compression)
Qwen Image
Architecture:
- Image model: qwen_image
- Latent format: Wan21
- Memory usage factor: 1.8
- Shift: 1.15, multiplier: 1.0
- Supported dtypes: BF16, FP32
Text encoder:
- Qwen2.5-7B

Variants:
- Standard generation model
- Edit model variant available
HiDream / HiDream E1.1
HiDream:
- Image model: hidream
- Latent format: Flux
- Shift: 3.0
- Supported dtypes: BF16, FP32
HiDream E1.1: Image editing variant
Image Editing Models
Omnigen 2
Architecture:
- Image model: omnigen2
- Latent format: Flux
- Memory usage factor: 1.95
- Shift: 2.6, multiplier: 1.0
- Supported dtypes: FP16 (with extended support), BF16, FP32
Text encoder:
- Qwen2.5-3B

Features:
- Multi-modal editing
- Instruction-based editing
3D View Synthesis
Stable Zero123
Architecture:
- Context dimension: 768
- In channels: 8 (includes conditioning)
- Uses CLIP vision encoder
- Latent format: SD15
- Novel view synthesis from single image
- 3D-consistent image generation
SV3D (Stable Video Diffusion 3D)
SV3D-u (Unguided):
- ADM channels: 256
- Generates orbital views
SV3D-p (Pose-guided):
- ADM channels: 1280
- Accepts camera pose conditioning
Model Files
Default Locations
Configuration
Customize model paths using `extra_model_paths.yaml`:
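A sketch of what the file can look like; the section layout mirrors the `extra_model_paths.yaml.example` shipped with ComfyUI, and all paths here are illustrative:

```yaml
# Illustrative example; adjust base_path and subfolders to your setup.
a111:
    base_path: /path/to/stable-diffusion-webui/
    checkpoints: models/Stable-diffusion
    vae: models/VAE
    loras: models/Lora

comfyui:
    base_path: /path/to/extra/models/
    checkpoints: checkpoints/
    upscale_models: upscale_models/
```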
Performance Tips
Memory Management
- Enable model offloading: Automatically moves models between GPU/CPU
- Use appropriate precision:
  - FP32: Highest quality, most VRAM
  - FP16: Good balance (not all models support it)
  - BF16: Best for modern GPUs (Ampere+)
- Tiled VAE: For high-resolution images
  - Use the `VAEEncodeTiled` and `VAEDecodeTiled` nodes
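For intuition, tiled VAE processing conceptually splits the latent into overlapping spans, processes each tile separately, and blends the overlaps to hide seams. A hypothetical sketch of the span computation (tile and overlap sizes are illustrative, not the nodes' actual defaults):

```python
def tile_coords(size: int, tile: int = 64, overlap: int = 16) -> list[tuple[int, int]]:
    """Return (start, end) spans covering `size` with `overlap` shared
    between neighbouring tiles, so seams can be blended away."""
    coords, start = [], 0
    while True:
        end = min(start + tile, size)
        coords.append((start, end))
        if end == size:
            break
        start += tile - overlap
    return coords

print(tile_coords(128))  # [(0, 64), (48, 112), (96, 128)]
```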
Speed Optimizations
- Torch compile: Faster inference (experimental)
- Attention optimization:
  - Default: Auto-selects the best method
  - Override with `--use-pytorch-cross-attention` or `--use-split-cross-attention`
- Preview methods:
  - `--preview-method auto`: Enables latent previews
  - Use TAESD for higher-quality previews