## Overview

`sd3_train_network.py` trains LoRA adapters for Stable Diffusion 3 (SD3) and Stable Diffusion 3.5 (Medium and Large). SD3 uses the MMDiT (Multi-Modal Diffusion Transformer) architecture instead of a U-Net, and employs three text encoders — CLIP-L, CLIP-G, and T5-XXL — which makes it structurally distinct from earlier Stable Diffusion models.
This guide assumes you are familiar with basic LoRA training concepts. See LoRA Training Overview and LoRA Training for SD 1.x/2.x for background on shared arguments.
## Key differences from SD 1.x/2.x

| Feature | SD 1.x/2.x | SD3/SD3.5 |
|---|---|---|
| Script | `train_network.py` | `sd3_train_network.py` |
| Image model | U-Net | MMDiT (Transformer) |
| Text encoders | 1× CLIP | CLIP-L + CLIP-G + T5-XXL |
| VAE | SD 1.x/2.x VAE | SD3-specific (not compatible with SDXL) |
| Model file | Single `.safetensors` | Single file (encoders auto-split) or separate paths |
| `--v2`, `--clip_skip` | Supported | Not used |
## Prerequisites

- The `sd-scripts` repository cloned and the Python environment set up.
- A prepared training dataset and a TOML config file.
- An SD3 or SD3.5 model file in `.safetensors` format.
## About model files

SD3/3.5 models are typically distributed as a single `.safetensors` file that bundles the MMDiT weights, all three text encoders, and the VAE. When you pass this file to `--pretrained_model_name_or_path`, the script automatically separates each component.

If your text encoders are stored as separate files, specify them with `--clip_l`, `--clip_g`, and `--t5xxl`. The `--vae` argument is only needed when you want to use a different VAE than the one included in the base model.
## Training command

If the base model is a single `.safetensors` file that bundles all components, you can omit `--clip_l`, `--clip_g`, and `--t5xxl`; the script detects and extracts them automatically.
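As an illustration, a minimal launch command might look like the following sketch. The model and dataset paths, the output name, and the hyperparameters (rank, learning rate, epoch count) are placeholders to adapt to your setup:

```bash
accelerate launch sd3_train_network.py \
  --pretrained_model_name_or_path sd3.5_medium.safetensors \
  --dataset_config dataset.toml \
  --output_dir output --output_name sd3-lora \
  --network_module networks.lora_sd3 \
  --network_dim 16 --network_alpha 1 \
  --learning_rate 1e-4 --optimizer_type AdamW8bit \
  --max_train_epochs 10 --save_every_n_epochs 1 \
  --mixed_precision bf16 --cache_latents
```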
## SD3-specific arguments

### Model loading

- `--pretrained_model_name_or_path`: Path to the SD3/3.5 base model `.safetensors` file.
- `--clip_l`: Path to the CLIP-L text encoder `.safetensors`. Omit if using a bundled single-file model.
- `--clip_g`: Path to the CLIP-G text encoder `.safetensors`. Omit if using a bundled single-file model.
- `--t5xxl`: Path to the T5-XXL text encoder `.safetensors`. Omit if using a bundled single-file model.
- `--vae`: Path to an alternative VAE `.safetensors`. Not normally required; the bundled VAE is used by default.

### Text encoder options
- `--t5xxl_max_token_length`: Maximum token length for T5-XXL. Increasing this allows longer prompts but increases VRAM usage.
- `--apply_lg_attn_mask`: Applies a padding attention mask to CLIP-L and CLIP-G outputs.
- `--apply_t5_attn_mask`: Applies a padding attention mask to T5-XXL outputs.
- `--clip_l_dropout_rate`: Dropout rate for the CLIP-L text encoder during training.
- `--clip_g_dropout_rate`: Dropout rate for the CLIP-G text encoder during training.
- `--t5_dropout_rate`: Dropout rate for the T5-XXL text encoder during training.
### Timestep and loss weighting

- `--training_shift`: Shift applied to the timestep distribution to bias training toward certain noise levels.
- `--weighting_scheme`: Loss weighting method by timestep. Options: `uniform`, `logit_normal`, `mode`, `cosmap`.
- `--logit_mean`: Mean for the `logit_normal` weighting scheme.
- `--logit_std`: Standard deviation for the `logit_normal` weighting scheme.
- `--mode_scale`: Scale factor for the `mode` weighting scheme.

### SD3.5-specific options
- `--pos_emb_random_crop_rate`: Probability of randomly cropping the positional embedding. Intended for SD3.5 multi-resolution training.
- `--enable_scaled_pos_embed`: Scales positional embeddings based on resolution when training at multiple resolutions. Experimental; SD3.5 only.
### Memory optimization

- `--blocks_to_swap`: Swaps a number of MMDiT Transformer blocks between CPU and GPU to reduce VRAM. Higher values save more VRAM but slow training. Cannot be combined with `--cpu_offload_checkpointing`.
- `--cache_text_encoder_outputs`: Caches the outputs of all three text encoders to reduce VRAM and speed up training. Particularly effective for SD3 because three encoders run per step. Disables caption augmentations. Requires `--network_train_unet_only`.
- `--cache_text_encoder_outputs_to_disk`: Persists the text encoder output cache to disk for reuse across training runs.
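As a sketch, these options might be combined as follows on a VRAM-constrained GPU; the swap count of 16 is illustrative and should be tuned for your hardware:

```bash
accelerate launch sd3_train_network.py \
  ...other arguments... \
  --blocks_to_swap 16 \
  --network_train_unet_only \
  --cache_text_encoder_outputs \
  --cache_text_encoder_outputs_to_disk
```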
## Targeting specific LoRA layers

By default, LoRA targets the attention (qkv) matrices and output projection (proj_out) in the MMDiT attention blocks, plus the final output layer.

Use `--network_args` to apply different ranks to specific layer groups:

| `network_args` key | Target layer |
|---|---|
| `context_attn_dim` | Attention in `context_block` |
| `context_mlp_dim` | MLP in `context_block` |
| `context_mod_dim` | adaLN modulation in `context_block` |
| `x_attn_dim` | Attention in `x_block` |
| `x_mlp_dim` | MLP in `x_block` |
| `x_mod_dim` | adaLN modulation in `x_block` |

A value of `0` disables LoRA for that layer group.
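For example, the following (ranks are illustrative) gives the attention matrices rank 8, the MLPs rank 4, and disables LoRA on the modulation layers:

```bash
--network_args \
  "context_attn_dim=8" "context_mlp_dim=4" "context_mod_dim=0" \
  "x_attn_dim=8" "x_mlp_dim=4" "x_mod_dim=0"
```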
## Conditioning layer LoRA

To apply LoRA to conditioning (embedding) layers, pass `emb_dims` as a 6-element list. The elements correspond, in order, to `context_embedder`, `t_embedder`, `x_embedder`, `y_embedder`, `final_layer_adaLN_modulation`, and `final_layer_linear`. A value of `0` skips that layer.
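For instance, this (illustrative) setting skips the four embedders and trains only the two final-layer entries at rank 4:

```bash
--network_args "emb_dims=[0,0,0,0,4,4]"
```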
## Targeting specific blocks

To train only a subset of MMDiT blocks, pass `train_block_indices` via `--network_args`. The special values `all` (all blocks) and `none` (no blocks) are also accepted.
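As an illustration, assuming comma-separated zero-based block indices, the following trains only the first and last two blocks; check the repository documentation for the exact index syntax the script accepts:

```bash
--network_args "train_block_indices=0,1,22,23"
```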
## Incompatible arguments

These SD v1/v2 options are not used by `sd3_train_network.py`:

- `--v2`, `--v_parameterization`
- `--clip_skip`
## Using the trained LoRA

After training, load the resulting `.safetensors` file in an inference environment that supports SD3/3.5, such as ComfyUI.