## Overview

`sd3_train_network.py` trains LoRA adapters for Stable Diffusion 3 (SD3) and Stable Diffusion 3.5 (Medium and Large). SD3 uses the MMDiT (Multi-Modal Diffusion Transformer) architecture instead of a U-Net, and employs three text encoders — CLIP-L, CLIP-G, and T5-XXL — which makes it structurally distinct from earlier Stable Diffusion models.
This guide assumes you are familiar with basic LoRA training concepts. See LoRA Training Overview and LoRA Training for SD 1.x/2.x for background on shared arguments.
## Key differences from SD 1.x/2.x

| Feature | SD 1.x/2.x | SD3/SD3.5 |
|---|---|---|
| Script | `train_network.py` | `sd3_train_network.py` |
| Image model | U-Net | MMDiT (Transformer) |
| Text encoders | 1× CLIP | CLIP-L + CLIP-G + T5-XXL |
| VAE | SD 1.x/2.x VAE | SD3-specific (not compatible with SDXL) |
| Model file | Single `.safetensors` | Single file (encoders auto-split) or separate paths |
| `--v2`, `--clip_skip` | Supported | Not used |
## Prerequisites

- The `sd-scripts` repository cloned and the Python environment set up.
- A prepared training dataset and a TOML config file.
- An SD3 or SD3.5 model file in `.safetensors` format.
## About model files

SD3/3.5 models are typically distributed as a single `.safetensors` file that bundles the MMDiT weights, all three text encoders, and the VAE. When you pass this file to `--pretrained_model_name_or_path`, the script automatically separates each component.

If your text encoders are stored as separate files, specify them with `--clip_l`, `--clip_g`, and `--t5xxl`. The `--vae` argument is only needed when you want to use a different VAE than the one included in the base model.
## Training command

If the base model is a single `.safetensors` file that bundles all components, you can omit `--clip_l`, `--clip_g`, and `--t5xxl`; the script detects and extracts them automatically.
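As an illustration, a minimal launch command might look like the following sketch. The model and dataset paths, the output name, and the hyperparameters (rank, learning rate, epoch count) are placeholders to adapt to your setup:

```bash
accelerate launch sd3_train_network.py \
  --pretrained_model_name_or_path sd3.5_medium.safetensors \
  --dataset_config dataset.toml \
  --output_dir output --output_name sd3-lora \
  --network_module networks.lora_sd3 \
  --network_dim 16 --network_alpha 1 \
  --learning_rate 1e-4 --optimizer_type AdamW8bit \
  --max_train_epochs 10 --save_every_n_epochs 1 \
  --mixed_precision bf16 --cache_latents
```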
## SD3-specific arguments

### Model loading

- `--pretrained_model_name_or_path`: Path to the SD3/3.5 base model `.safetensors` file.
- `--clip_l`: Path to the CLIP-L text encoder `.safetensors`. Omit if using a bundled single-file model.
- `--clip_g`: Path to the CLIP-G text encoder `.safetensors`. Omit if using a bundled single-file model.
- `--t5xxl`: Path to the T5-XXL text encoder `.safetensors`. Omit if using a bundled single-file model.
- `--vae`: Path to an alternative VAE `.safetensors`. Not normally required; the bundled VAE is used by default.

### Text encoder options
- `--t5xxl_max_token_length`: Maximum token length for T5-XXL. Increasing this allows longer prompts but increases VRAM usage.
- `--apply_lg_attn_mask`: Applies a padding attention mask to CLIP-L and CLIP-G outputs.
- `--apply_t5_attn_mask`: Applies a padding attention mask to T5-XXL outputs.
- `--clip_l_dropout_rate`: Dropout rate for the CLIP-L text encoder during training.
- `--clip_g_dropout_rate`: Dropout rate for the CLIP-G text encoder during training.
- `--t5_dropout_rate`: Dropout rate for the T5-XXL text encoder during training.
### Timestep and loss weighting

- `--training_shift`: Shift applied to the timestep distribution to bias training toward certain noise levels.
- `--weighting_scheme`: Loss weighting method by timestep. Options: `uniform`, `logit_normal`, `mode`, `cosmap`.
- `--logit_mean`: Mean for the `logit_normal` weighting scheme.
- `--logit_std`: Standard deviation for the `logit_normal` weighting scheme.
- `--mode_scale`: Scale factor for the `mode` weighting scheme.

### SD3.5-specific options
- `--pos_emb_random_crop_rate`: Probability of randomly cropping the positional embedding. Intended for SD3.5 multi-resolution training.
- `--enable_scaled_pos_embed`: Scales positional embeddings based on resolution when training at multiple resolutions. Experimental; SD3.5 only.
### Memory optimization

- `--blocks_to_swap`: Swaps a number of MMDiT Transformer blocks between CPU and GPU to reduce VRAM. Higher values save more VRAM but slow training. Cannot be combined with `--cpu_offload_checkpointing`.
- `--cache_text_encoder_outputs`: Caches the outputs of all three text encoders to reduce VRAM and speed up training. Particularly effective for SD3 because three encoders run per step. Disables caption augmentations. Requires `--network_train_unet_only`.
- `--cache_text_encoder_outputs_to_disk`: Persists the text encoder output cache to disk for reuse across training runs.
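As a sketch, these options might be combined as follows on a VRAM-constrained GPU; the swap count of 16 is illustrative and should be tuned for your hardware:

```bash
accelerate launch sd3_train_network.py \
  ...other arguments... \
  --blocks_to_swap 16 \
  --network_train_unet_only \
  --cache_text_encoder_outputs \
  --cache_text_encoder_outputs_to_disk
```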
## Targeting specific LoRA layers

By default, LoRA targets the attention (qkv) matrices and output projection (proj_out) in the MMDiT attention blocks, plus the final output layer.

Use `--network_args` to apply different ranks to specific layer groups:

| `network_args` key | Target layer |
|---|---|
| `context_attn_dim` | Attention in `context_block` |
| `context_mlp_dim` | MLP in `context_block` |
| `context_mod_dim` | adaLN modulation in `context_block` |
| `x_attn_dim` | Attention in `x_block` |
| `x_mlp_dim` | MLP in `x_block` |
| `x_mod_dim` | adaLN modulation in `x_block` |

A value of `0` disables LoRA for that layer group.
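For example, the following (ranks are illustrative) gives the attention matrices rank 8, the MLPs rank 4, and disables LoRA on the modulation layers:

```bash
--network_args \
  "context_attn_dim=8" "context_mlp_dim=4" "context_mod_dim=0" \
  "x_attn_dim=8" "x_mlp_dim=4" "x_mod_dim=0"
```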
## Conditioning layer LoRA

To apply LoRA to conditioning (embedding) layers, pass `emb_dims` as a 6-element list. The elements correspond, in order, to `context_embedder`, `t_embedder`, `x_embedder`, `y_embedder`, `final_layer_adaLN_modulation`, and `final_layer_linear`. A value of `0` skips that layer.
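For instance, this (illustrative) setting skips the four embedders and trains only the two final-layer entries at rank 4:

```bash
--network_args "emb_dims=[0,0,0,0,4,4]"
```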
## Targeting specific blocks

To train only a subset of MMDiT blocks, pass `train_block_indices` via `--network_args`. The special values `all` (all blocks) and `none` (no blocks) are also accepted.
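As an illustration, assuming comma-separated zero-based block indices, the following trains only the first and last two blocks; check the repository documentation for the exact index syntax the script accepts:

```bash
--network_args "train_block_indices=0,1,22,23"
```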
## Incompatible arguments

These SD v1/v2 options are not used by `sd3_train_network.py`:

- `--v2`, `--v_parameterization`
- `--clip_skip`
## Using the trained LoRA

After training, load the resulting `.safetensors` file in an inference environment that supports SD3/3.5, such as ComfyUI.