Stable Diffusion 3 (SD3) and SD3.5 use the Multimodal Diffusion Transformer (MMDiT) architecture, which replaces the UNet found in earlier models. They employ three text encoders in parallel, offering significantly improved prompt following compared to SD 1.x/2.x and SDXL.

Architecture

SD3/SD3.5 use the MMDiT (Multimodal Diffusion Transformer) architecture:
  • MMDiT — replaces the UNet. Processes both image and text representations jointly in a bidirectional attention framework, enabling tighter image-text alignment.
  • Three text encoders — all three run in parallel; their pooled embeddings are concatenated, and their token sequences are combined into a single conditioning context:
    • CLIP-L — OpenAI CLIP ViT-L/14.
    • CLIP-G — OpenCLIP ViT-G/14.
    • T5-XXL — large language model encoder for long, complex prompts.
  • VAE — not compatible with SDXL’s VAE.
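To make the three-encoder setup concrete, here is a rough numpy sketch of how the encoder outputs are typically combined for SD3-style conditioning. The shapes come from the public model configurations; the exact padding and ordering in any given implementation may differ.

```python
import numpy as np

# Hypothetical per-prompt encoder outputs (batch of 1), using the
# hidden sizes of the public SD3 checkpoints.
clip_l_seq = np.zeros((1, 77, 768))     # CLIP-L token embeddings
clip_g_seq = np.zeros((1, 77, 1280))    # CLIP-G token embeddings
t5_seq = np.zeros((1, 256, 4096))       # T5-XXL token embeddings
clip_l_pool = np.zeros((1, 768))        # CLIP-L pooled embedding
clip_g_pool = np.zeros((1, 1280))       # CLIP-G pooled embedding

# Pooled vectors are concatenated along the feature axis -> (1, 2048).
pooled = np.concatenate([clip_l_pool, clip_g_pool], axis=-1)

# Token sequences: the two CLIP sequences are concatenated feature-wise
# (768 + 1280 = 2048), zero-padded up to T5's width (4096), then joined
# with the T5 sequence along the token axis.
clip_seq = np.concatenate([clip_l_seq, clip_g_seq], axis=-1)
clip_seq = np.pad(clip_seq, ((0, 0), (0, 0), (0, 4096 - clip_seq.shape[-1])))
context = np.concatenate([clip_seq, t5_seq], axis=1)

print(pooled.shape, context.shape)  # (1, 2048) (1, 333, 4096)
```

This is why caching text encoder outputs pays off so well here: all three encoders must otherwise be resident and run on every step.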

Versions

  • SD3 (Medium) — base SD3 architecture.
  • SD3.5 Medium — improved quality; supports --pos_emb_random_crop_rate and --enable_scaled_pos_embed.
  • SD3.5 Large — larger model; higher VRAM requirement.

Required model files

SD3 and SD3.5 models are distributed in two formats:
  • Single-file format (recommended) — a single .safetensors file containing the MMDiT, all text encoders, and the VAE. When you provide this file via --pretrained_model_name_or_path, the individual component paths (--clip_l, --clip_g, --t5xxl, --vae) are detected automatically.
  • Separate files — if you have individual component files, specify each path explicitly:
  • MMDiT / base model — --pretrained_model_name_or_path
  • CLIP-L — --clip_l
  • CLIP-G — --clip_g
  • T5-XXL — --t5xxl
  • VAE — --vae (optional if included in the base file)

Available training methods

  • LoRA — sd3_train_network.py; primary training method, uses networks.lora_sd3.
  • Fine-tuning — sd3_train.py; full model training.

LoRA training

Use sd3_train_network.py with --network_module=networks.lora_sd3:
accelerate launch --num_cpu_threads_per_process 1 sd3_train_network.py \
  --pretrained_model_name_or_path="<path to SD3 model>" \
  --clip_l="<path to CLIP-L model>" \
  --clip_g="<path to CLIP-G model>" \
  --t5xxl="<path to T5-XXL model>" \
  --dataset_config="my_sd3_dataset_config.toml" \
  --output_dir="<output directory>" \
  --output_name="my_sd3_lora" \
  --save_model_as=safetensors \
  --network_module=networks.lora_sd3 \
  --network_dim=16 \
  --network_alpha=1 \
  --learning_rate=1e-4 \
  --optimizer_type="AdamW8bit" \
  --lr_scheduler="constant" \
  --sdpa \
  --max_train_epochs=10 \
  --save_every_n_epochs=1 \
  --mixed_precision="fp16" \
  --gradient_checkpointing \
  --weighting_scheme="uniform" \
  --blocks_to_swap=32
--cache_text_encoder_outputs is particularly effective for SD3 since three text encoders run simultaneously. Enable it whenever you are not training text encoder LoRA modules.

SD3.5-specific options

SD3.5 adds two extra parameters for positional embedding handling during multi-resolution training:
  • --pos_emb_random_crop_rate=<float> — probability of randomly cropping the positional embedding. Helps the model generalize across resolutions.
  • --enable_scaled_pos_embed (experimental) — scales positional embeddings to match training resolution. Use when training at multiple resolutions.
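The idea behind positional-embedding random cropping can be sketched as follows. This is an illustrative toy, not the script's actual internals; the function name, grid size, and corner-anchored fallback are assumptions.

```python
import random
import numpy as np

def crop_pos_embed(pos_embed, h, w, crop_rate, rng=random):
    """Take an (h, w, D) window from an (H, W, D) positional-embedding grid.

    With probability crop_rate the window is placed at a random offset,
    so different training samples see different slices of the grid;
    otherwise it is taken from a fixed corner (the fallback placement
    here is an assumption).
    """
    H, W, _ = pos_embed.shape
    if rng.random() < crop_rate:
        top = rng.randint(0, H - h)
        left = rng.randint(0, W - w)
    else:
        top, left = 0, 0
    return pos_embed[top:top + h, left:left + w]

grid = np.arange(16 * 16 * 4, dtype=np.float32).reshape(16, 16, 4)
window = crop_pos_embed(grid, 8, 8, crop_rate=0.5)
print(window.shape)  # (8, 8, 4)
```

Exposing the model to many different windows of the embedding grid is what helps it generalize across resolutions.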

Per-layer LoRA rank control

You can set different LoRA ranks for each component of the MMDiT using --network_args:
--network_args \
  "context_attn_dim=16" \
  "context_mlp_dim=8" \
  "context_mod_dim=4" \
  "x_attn_dim=16" \
  "x_mlp_dim=8" \
  "x_mod_dim=4" \
  "verbose=True"
Setting a value to 0 disables LoRA for that layer. The verbose=True flag prints the effective rank for each layer during training. You can also apply LoRA to the conditioning layers with emb_dims (six values, one per layer):
--network_args "emb_dims=[4,4,4,4,4,4]"
The six positions correspond to: context_embedder, t_embedder, x_embedder, y_embedder, final_layer_adaLN_modulation, final_layer_linear.
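To make that mapping concrete, here is a small sketch (not the script's actual parser) that pairs the six emb_dims values with the conditioning-layer names listed above:

```python
EMB_LAYERS = [
    "context_embedder", "t_embedder", "x_embedder",
    "y_embedder", "final_layer_adaLN_modulation", "final_layer_linear",
]

def parse_emb_dims(arg):
    """Parse a string like "[4,4,4,4,4,4]" into {layer_name: rank}."""
    dims = [int(v) for v in arg.strip("[]").split(",")]
    assert len(dims) == len(EMB_LAYERS), "emb_dims needs exactly six values"
    # A rank of 0 means "no LoRA on this layer".
    return dict(zip(EMB_LAYERS, dims))

ranks = parse_emb_dims("[4,4,0,4,4,4]")
print(ranks["x_embedder"])  # 0
```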

Selective block training

Use train_block_indices to restrict which MMDiT blocks receive LoRA updates:
--network_args "train_block_indices=1,2,6-8"
Pass all to train all blocks (default) or none to train no blocks.
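The index syntax can be understood through a sketch like the following (an illustrative parser; the script's own handling may differ in detail, e.g. in how out-of-range indices are treated):

```python
def parse_block_indices(spec, num_blocks):
    """Expand a spec like "1,2,6-8" into a sorted list of block indices.

    "all" selects every block (the default); "none" selects no blocks.
    """
    if spec == "all":
        return list(range(num_blocks))
    if spec == "none":
        return []
    indices = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            indices.update(range(int(lo), int(hi) + 1))  # inclusive range
        else:
            indices.add(int(part))
    return sorted(indices)

print(parse_block_indices("1,2,6-8", 24))  # [1, 2, 6, 7, 8]
```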

Memory and speed options

  • --blocks_to_swap=<n> — offloads n Transformer blocks to CPU, reducing VRAM usage.
  • --cache_text_encoder_outputs — caches the outputs of all three text encoders.
  • --cache_text_encoder_outputs_to_disk — persists the cache to disk across runs.
  • --gradient_checkpointing — reduces activation memory at a speed cost.
--blocks_to_swap and --cpu_offload_checkpointing cannot be used at the same time.
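The block-swapping idea can be sketched as a toy loop (purely illustrative: the real implementation works on GPU tensors and overlaps transfers with compute, and which blocks are chosen for swapping is an assumption here):

```python
class Block:
    """Stand-in for a Transformer block; tracks which device it lives on."""
    def __init__(self, i):
        self.i, self.device = i, "cpu"

    def forward(self, x):
        return x + 1  # placeholder for the real computation

def forward_with_swap(blocks, x, blocks_to_swap):
    # The last `blocks_to_swap` blocks stay on CPU between steps; each is
    # moved to the GPU just before use and evicted right after, so only a
    # subset of blocks occupies VRAM at any moment.
    swapped = set(range(len(blocks) - blocks_to_swap, len(blocks)))
    for i, blk in enumerate(blocks):
        if i in swapped:
            blk.device = "cuda"   # host-to-device copy
        x = blk.forward(x)
        if i in swapped:
            blk.device = "cpu"    # evict to free VRAM
    return x

blocks = [Block(i) for i in range(4)]
print(forward_with_swap(blocks, 0, blocks_to_swap=2))  # 4
```

The trade-off is extra PCIe transfer time per step in exchange for a much smaller peak VRAM footprint.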

Key training parameters

  • --network_module — network module to use; set to networks.lora_sd3.
  • --weighting_scheme — timestep loss weighting (default: uniform; recommended: uniform).
  • --t5xxl_max_token_length — maximum token length for T5-XXL (default: 256; recommended: 256).
  • --training_shift — timestep distribution shift (default: 1.0; recommended: 1.0).
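For intuition on the timestep shift: rectified-flow models like SD3 commonly reweight sampling of the timestep t in [0, 1] with the shift function below; that this exact formula is what --training_shift controls is an assumption based on the common convention, not confirmed by this page.

```python
def shift_timestep(t, shift=1.0):
    """Shift a timestep t in [0, 1] toward the noisier end.

    shift=1.0 leaves the distribution unchanged; shift > 1 concentrates
    training on higher-noise timesteps.
    """
    return shift * t / (1.0 + (shift - 1.0) * t)

print(shift_timestep(0.5, shift=1.0))  # 0.5
print(shift_timestep(0.5, shift=3.0))  # 0.75
```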

Incompatible options

The following arguments are for SD 1.x/2.x and must not be used with SD3/SD3.5:
  • --v2, --v_parameterization, --clip_skip
