Stable Diffusion 3 (SD3) and SD3.5 use the Multimodal Diffusion Transformer (MMDiT) architecture, which replaces the UNet found in earlier models. They employ three text encoders in parallel, offering significantly improved prompt following compared to SD 1.x/2.x and SDXL.
## Architecture
SD3/SD3.5 use the MMDiT (Multimodal Diffusion Transformer) architecture:
- MMDiT — replaces the UNet. Processes both image and text representations jointly in a bidirectional attention framework, enabling tighter image-text alignment.
- Three text encoders — all three run in parallel; the pooled CLIP outputs are concatenated into a conditioning vector, while the hidden states form the context sequence fed to the MMDiT:
- CLIP-L — OpenAI CLIP ViT-L/14.
- CLIP-G — OpenCLIP ViT-G/14.
- T5-XXL — large language model encoder for long, complex prompts.
- VAE — a 16-channel VAE; not compatible with SDXL’s 4-channel VAE.
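As a rough sketch of how the three encoder outputs combine (shapes follow the SD3 report; the exact padding and concatenation here are illustrative, not the library's implementation):

```python
import numpy as np

# Hidden states from each encoder (batch dimension omitted for clarity).
clip_l_hidden = np.zeros((77, 768))    # CLIP-L: 77 tokens x 768 dims
clip_g_hidden = np.zeros((77, 1280))   # CLIP-G: 77 tokens x 1280 dims
t5_hidden = np.zeros((256, 4096))      # T5-XXL: 256 tokens x 4096 dims

# CLIP hidden states are concatenated channel-wise, then zero-padded
# to T5's width so every token shares one feature dimension.
clip_hidden = np.concatenate([clip_l_hidden, clip_g_hidden], axis=-1)  # (77, 2048)
clip_hidden = np.pad(clip_hidden, ((0, 0), (0, 4096 - 2048)))          # (77, 4096)

# The context sequence seen by the MMDiT joins the CLIP and T5 tokens.
context = np.concatenate([clip_hidden, t5_hidden], axis=0)             # (333, 4096)

# Pooled CLIP outputs are concatenated into a single conditioning vector.
pooled = np.zeros(768 + 1280)                                          # (2048,)
```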
## Versions

| Version | Notes |
|---|---|
| SD3 (Medium) | Base SD3 architecture |
| SD3.5 Medium | Improved quality; supports --pos_emb_random_crop_rate and --enable_scaled_pos_embed |
| SD3.5 Large | Larger model; higher VRAM requirement |
## Required model files
SD3 and SD3.5 models are distributed in two formats:
Single-file format (recommended): A single .safetensors file containing the MMDiT, all text encoders, and the VAE. When you provide this file via --pretrained_model_name_or_path, the individual component paths (--clip_l, --clip_g, --t5xxl, --vae) are detected automatically.
Separate files: If you have individual component files, specify each path explicitly:
| Component | Argument |
|---|---|
| MMDiT / base model | --pretrained_model_name_or_path |
| CLIP-L | --clip_l |
| CLIP-G | --clip_g |
| T5-XXL | --t5xxl |
| VAE | --vae (optional if included in base file) |
## Available training methods

| Method | Script | Notes |
|---|---|---|
| LoRA | sd3_train_network.py | Primary training method; uses networks.lora_sd3 |
| Fine-tuning | sd3_train.py | Full model training |
## LoRA training
Use sd3_train_network.py with --network_module=networks.lora_sd3:
```bash
accelerate launch --num_cpu_threads_per_process 1 sd3_train_network.py \
  --pretrained_model_name_or_path="<path to SD3 model>" \
  --clip_l="<path to CLIP-L model>" \
  --clip_g="<path to CLIP-G model>" \
  --t5xxl="<path to T5-XXL model>" \
  --dataset_config="my_sd3_dataset_config.toml" \
  --output_dir="<output directory>" \
  --output_name="my_sd3_lora" \
  --save_model_as=safetensors \
  --network_module=networks.lora_sd3 \
  --network_dim=16 \
  --network_alpha=1 \
  --learning_rate=1e-4 \
  --optimizer_type="AdamW8bit" \
  --lr_scheduler="constant" \
  --sdpa \
  --max_train_epochs=10 \
  --save_every_n_epochs=1 \
  --mixed_precision="fp16" \
  --gradient_checkpointing \
  --weighting_scheme="uniform" \
  --blocks_to_swap=32
```
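The command above references my_sd3_dataset_config.toml. A minimal dataset config might look like the following; the paths and values are placeholders to adapt to your data, not recommended settings:

```toml
[general]
enable_bucket = true
caption_extension = ".txt"

[[datasets]]
resolution = 1024
batch_size = 2

  [[datasets.subsets]]
  image_dir = "/path/to/train_images"  # placeholder path
  num_repeats = 1
```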
--cache_text_encoder_outputs is particularly effective for SD3 since three text encoders run simultaneously. Enable it whenever you are not training text encoder LoRA modules.
## SD3.5-specific options
SD3.5 adds two extra parameters for positional embedding handling during multi-resolution training:
- --pos_emb_random_crop_rate=<float> — probability of randomly cropping the positional embedding. Helps the model generalize across resolutions.
- --enable_scaled_pos_embed (experimental) — scales positional embeddings to match the training resolution. Use when training at multiple resolutions.
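For example, these can be appended to the sd3_train_network.py invocation shown earlier (the 0.5 rate here is an illustrative value, not a recommended default):

```bash
accelerate launch --num_cpu_threads_per_process 1 sd3_train_network.py \
  ... \
  --pos_emb_random_crop_rate=0.5 \
  --enable_scaled_pos_embed
```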
## Per-layer LoRA rank control
You can set different LoRA ranks for each component of the MMDiT using --network_args:
```bash
--network_args \
  "context_attn_dim=16" \
  "context_mlp_dim=8" \
  "context_mod_dim=4" \
  "x_attn_dim=16" \
  "x_mlp_dim=8" \
  "x_mod_dim=4" \
  "verbose=True"
```
Setting a value to 0 disables LoRA for that layer. The verbose=True flag prints the effective rank for each layer during training.
You can also apply LoRA to the conditioning layers with emb_dims (six values, one per layer):
```bash
--network_args "emb_dims=[4,4,4,4,4,4]"
```
The six positions correspond to: context_embedder, t_embedder, x_embedder, y_embedder, final_layer_adaLN_modulation, final_layer_linear.
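Conceptually, the emb_dims list is consumed positionally, one rank per conditioning layer; a small sketch of that mapping:

```python
# Order of the conditioning layers, per the documentation above.
emb_layers = [
    "context_embedder",
    "t_embedder",
    "x_embedder",
    "y_embedder",
    "final_layer_adaLN_modulation",
    "final_layer_linear",
]
emb_dims = [4, 4, 4, 4, 4, 4]

# Map each layer name to its LoRA rank; a rank of 0 disables LoRA there.
ranks = dict(zip(emb_layers, emb_dims))
```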
## Selective block training
Use train_block_indices to restrict which MMDiT blocks receive LoRA updates:
```bash
--network_args "train_block_indices=1,2,6-8"
```
Pass all to train every block (the default), or none to apply LoRA to no MMDiT blocks.
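The index syntax works like a typical page-range string: comma-separated entries, where a dash denotes an inclusive range. A sketch of the expansion (illustrative, not the library's actual parser):

```python
def expand_block_indices(spec: str) -> list[int]:
    """Expand a spec like "1,2,6-8" into [1, 2, 6, 7, 8]."""
    indices: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # Inclusive range, e.g. "6-8" -> 6, 7, 8.
            start, end = part.split("-")
            indices.extend(range(int(start), int(end) + 1))
        else:
            indices.append(int(part))
    return indices

# expand_block_indices("1,2,6-8") -> [1, 2, 6, 7, 8]
```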
## Memory and speed options

| Option | Effect |
|---|---|
| --blocks_to_swap=<n> | Offloads n Transformer blocks to CPU; reduces VRAM |
| --cache_text_encoder_outputs | Caches all three text encoder outputs |
| --cache_text_encoder_outputs_to_disk | Persists cache to disk across runs |
| --gradient_checkpointing | Reduces activation memory at a speed cost |
--blocks_to_swap and --cpu_offload_checkpointing cannot be used at the same time.
## Key training parameters

| Parameter | Description | Default | Recommendation |
|---|---|---|---|
| --network_module | Network module | — | networks.lora_sd3 |
| --weighting_scheme | Timestep loss weighting | uniform | uniform |
| --t5xxl_max_token_length | T5-XXL max tokens | 256 | 256 |
| --training_shift | Timestep distribution shift | 1.0 | 1.0 |
## Incompatible options
The following arguments are for SD 1.x/2.x and must not be used with SD3/SD3.5:
--v2, --v_parameterization, --clip_skip