Overview

sd3_train_network.py trains LoRA adapters for Stable Diffusion 3 (SD3) and Stable Diffusion 3.5 (Medium and Large). SD3 uses the MMDiT (Multi-Modal Diffusion Transformer) architecture instead of a U-Net, and employs three text encoders — CLIP-L, CLIP-G, and T5-XXL — which makes it structurally distinct from earlier Stable Diffusion models.
This guide assumes you are familiar with basic LoRA training concepts. See LoRA Training Overview and LoRA Training for SD 1.x/2.x for background on shared arguments.

Key differences from SD 1.x/2.x

| Feature | SD 1.x/2.x | SD3/SD3.5 |
| --- | --- | --- |
| Script | train_network.py | sd3_train_network.py |
| Image model | U-Net | MMDiT (Transformer) |
| Text encoders | 1× CLIP | CLIP-L + CLIP-G + T5-XXL |
| VAE | 4-channel | SD3-specific 16-channel (not compatible with SD 1.x/2.x or SDXL) |
| Model file | Single .safetensors | Single file (encoders auto-split) or separate paths |
| --v2, --clip_skip | Supported | Not used |

Prerequisites

  • The sd-scripts repository cloned and the Python environment set up.
  • A prepared training dataset and a TOML config file.
  • An SD3 or SD3.5 model file in .safetensors format.
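The dataset TOML passed to --dataset_config uses the same format as the other sd-scripts trainers. A minimal sketch for reference (directory paths and values are placeholders; SD3/3.5 models are commonly trained at 1024×1024):

```toml
[general]
resolution = 1024
caption_extension = ".txt"
enable_bucket = true

[[datasets]]
batch_size = 1

  [[datasets.subsets]]
  image_dir = "/path/to/images"
  num_repeats = 10
```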

About model files

SD3/3.5 models are typically distributed as a single .safetensors file that bundles the MMDiT weights, all three text encoders, and the VAE. When you pass this file to --pretrained_model_name_or_path, the script automatically separates each component. If your text encoders are stored as separate files, specify them with --clip_l, --clip_g, and --t5xxl. The --vae argument is only needed when you want to use a different VAE than the one included in the base model.

Training command

accelerate launch --num_cpu_threads_per_process 1 sd3_train_network.py \
  --pretrained_model_name_or_path="/path/to/sd3-model.safetensors" \
  --clip_l="/path/to/clip_l.safetensors" \
  --clip_g="/path/to/clip_g.safetensors" \
  --t5xxl="/path/to/t5xxl.safetensors" \
  --dataset_config="my_sd3_dataset_config.toml" \
  --output_dir="./output" \
  --output_name="my_sd3_lora" \
  --save_model_as=safetensors \
  --network_module=networks.lora \
  --network_dim=16 \
  --network_alpha=1 \
  --learning_rate=1e-4 \
  --optimizer_type="AdamW8bit" \
  --lr_scheduler="constant" \
  --sdpa \
  --max_train_epochs=10 \
  --save_every_n_epochs=1 \
  --mixed_precision="fp16" \
  --gradient_checkpointing \
  --weighting_scheme="uniform" \
  --blocks_to_swap=32
If the base model is a single .safetensors that bundles all components, you can omit --clip_l, --clip_g, and --t5xxl. The script detects and extracts them automatically.

SD3-specific arguments

Model loading

--pretrained_model_name_or_path
string
required
Path to the SD3/3.5 base model .safetensors file.
--clip_l
string
Path to the CLIP-L text encoder .safetensors. Omit if using a bundled single-file model.
--clip_g
string
Path to the CLIP-G text encoder .safetensors. Omit if using a bundled single-file model.
--t5xxl
string
Path to the T5-XXL text encoder .safetensors. Omit if using a bundled single-file model.
--vae
string
Path to an alternative VAE .safetensors. Not normally required; the bundled VAE is used by default.

Text encoder options

--t5xxl_max_token_length
integer
default:"256"
Maximum token length for T5-XXL. Increasing this allows longer prompts but increases VRAM usage.
--apply_lg_attn_mask
boolean
Applies a padding attention mask to CLIP-L and CLIP-G outputs.
--apply_t5_attn_mask
boolean
Applies a padding attention mask to T5-XXL outputs.
--clip_l_dropout_rate
number
default:"0.0"
Dropout rate for the CLIP-L text encoder during training.
--clip_g_dropout_rate
number
default:"0.0"
Dropout rate for the CLIP-G text encoder during training.
--t5_dropout_rate
number
default:"0.0"
Dropout rate for the T5-XXL text encoder during training.

Timestep and loss weighting

--training_shift
number
default:"1.0"
Shift applied to the timestep distribution to bias training toward certain noise levels.
--weighting_scheme
string
default:"uniform"
Loss weighting method by timestep. Options: uniform, logit_normal, mode, cosmap.
--logit_mean
number
default:"0.0"
Mean for the logit_normal weighting scheme.
--logit_std
number
default:"1.0"
Standard deviation for the logit_normal weighting scheme.
--mode_scale
number
default:"1.29"
Scale factor for the mode weighting scheme.
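The logit_normal scheme concentrates sampled timesteps around mid-range noise levels rather than spreading them uniformly. As an illustration only (not the script's actual implementation), sampling a timestep from a logit-normal distribution amounts to drawing a Gaussian value in logit space and squashing it through a sigmoid:

```python
import math
import random

def sample_timestep_logit_normal(logit_mean=0.0, logit_std=1.0):
    """Draw a timestep t in (0, 1) from a logit-normal distribution.

    Illustrative sketch of the logit_normal weighting scheme: values
    of u near logit_mean map to t near sigmoid(logit_mean), so most
    samples land at mid-range noise levels.
    """
    u = random.gauss(logit_mean, logit_std)  # sample in logit space
    return 1.0 / (1.0 + math.exp(-u))        # sigmoid back to (0, 1)
```

With --logit_mean=0.0 the distribution is centered on t = 0.5; raising --logit_std widens the spread toward the extremes.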

SD3.5-specific options

--pos_emb_random_crop_rate
number
Probability of randomly cropping the positional embedding. Intended for SD3.5 multi-resolution training.
--enable_scaled_pos_embed
boolean
Scales positional embeddings based on resolution when training at multiple resolutions. Experimental; SD3.5 only.

Memory optimization

--blocks_to_swap
integer
Swaps a number of MMDiT Transformer blocks between CPU and GPU to reduce VRAM. Higher values save more VRAM but slow training. Cannot be combined with --cpu_offload_checkpointing.
--cache_text_encoder_outputs
boolean
Caches the outputs of all three text encoders to reduce VRAM and speed up training. Particularly effective for SD3 because three encoders run per step. Disables caption augmentations. Requires --network_train_unet_only.
--cache_text_encoder_outputs_to_disk
boolean
Persists text encoder output cache to disk for reuse across training runs.

Targeting specific LoRA layers

By default, LoRA targets the attention (qkv) matrices and output projection (proj_out) in the MMDiT attention blocks, plus the final output layer. Use --network_args to apply different ranks to specific layer groups:
--network_args \
  "context_attn_dim=16" \
  "context_mlp_dim=8" \
  "context_mod_dim=4" \
  "x_attn_dim=16" \
  "x_mlp_dim=8" \
  "x_mod_dim=4" \
  "verbose=True"
| network_args key | Target layer |
| --- | --- |
| context_attn_dim | Attention in context_block |
| context_mlp_dim | MLP in context_block |
| context_mod_dim | adaLN modulation in context_block |
| x_attn_dim | Attention in x_block |
| x_mlp_dim | MLP in x_block |
| x_mod_dim | adaLN modulation in x_block |
Setting a value to 0 disables LoRA for that layer group.

Conditioning layer LoRA

To apply LoRA to conditioning (embedding) layers, pass emb_dims as a 6-element list:
--network_args "emb_dims=[4,0,0,4,0,0]"
The six positions correspond, in order, to: context_embedder, t_embedder, x_embedder, y_embedder, final_layer_adaLN_modulation, final_layer_linear. A value of 0 skips that layer.
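To make the positional mapping concrete, here is a hypothetical helper (not part of sd-scripts) that pairs each position of the list with its target layer:

```python
# Order of the six emb_dims positions, as documented above.
EMB_TARGETS = [
    "context_embedder",
    "t_embedder",
    "x_embedder",
    "y_embedder",
    "final_layer_adaLN_modulation",
    "final_layer_linear",
]

def emb_dims_to_ranks(emb_dims):
    """Map an emb_dims list to {layer_name: rank}, skipping zeros (no LoRA)."""
    return {name: dim for name, dim in zip(EMB_TARGETS, emb_dims) if dim > 0}
```

So "emb_dims=[4,0,0,4,0,0]" applies rank-4 LoRA to context_embedder and y_embedder only.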

Targeting specific blocks

To train only a subset of MMDiT blocks, use train_block_indices:
--network_args "train_block_indices=0,1,4-6,10"
Accepts individual indices and ranges. Use all for all blocks or none for no blocks.
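The index syntax can be sketched with a small parser (a hypothetical re-implementation for illustration; the actual parsing lives inside networks.lora and may differ in details, e.g. the total block count per model variant):

```python
def parse_block_indices(spec, num_blocks=24):
    """Parse a train_block_indices spec like "0,1,4-6,10" into a sorted list.

    Supports individual indices, inclusive ranges ("4-6"), and the
    keywords "all" / "none". num_blocks is an assumed default.
    """
    if spec == "all":
        return list(range(num_blocks))
    if spec == "none":
        return []
    indices = []
    for part in spec.split(","):
        if "-" in part:
            start, end = part.split("-")
            indices.extend(range(int(start), int(end) + 1))  # inclusive range
        else:
            indices.append(int(part))
    return sorted(set(indices))
```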

Incompatible arguments

These SD v1/v2 options are not used by sd3_train_network.py:
  • --v2, --v_parameterization
  • --clip_skip

Using the trained LoRA

After training, load the .safetensors file in an inference environment that supports SD3/3.5, such as ComfyUI.