Stable Diffusion 3 (SD3) and SD3.5 use the Multimodal Diffusion Transformer (MMDiT) architecture, which replaces the UNet found in earlier models. They employ three text encoders in parallel, offering significantly improved prompt following compared to SD 1.x/2.x and SDXL.
## Architecture
SD3/SD3.5 use the MMDiT (Multimodal Diffusion Transformer) architecture:
- MMDiT — replaces the UNet. Processes both image and text representations jointly in a bidirectional attention framework, enabling tighter image-text alignment.
- Three text encoders — all three run in parallel; the pooled CLIP outputs are concatenated into a conditioning vector, while the hidden states form the context sequence fed to the MMDiT:
- CLIP-L — OpenAI CLIP ViT-L/14.
- CLIP-G — OpenCLIP ViT-G/14.
- T5-XXL — large language model encoder for long, complex prompts.
- VAE — a 16-channel VAE; not compatible with SDXL’s 4-channel VAE.
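As a rough sketch of how the three encoder outputs combine (shapes follow the SD3 report; the exact padding and concatenation here are illustrative, not the library's implementation):

```python
import numpy as np

# Hidden states from each encoder (batch dimension omitted for clarity).
clip_l_hidden = np.zeros((77, 768))    # CLIP-L: 77 tokens x 768 dims
clip_g_hidden = np.zeros((77, 1280))   # CLIP-G: 77 tokens x 1280 dims
t5_hidden = np.zeros((256, 4096))      # T5-XXL: 256 tokens x 4096 dims

# CLIP hidden states are concatenated channel-wise, then zero-padded
# to T5's width so every token shares one feature dimension.
clip_hidden = np.concatenate([clip_l_hidden, clip_g_hidden], axis=-1)  # (77, 2048)
clip_hidden = np.pad(clip_hidden, ((0, 0), (0, 4096 - 2048)))          # (77, 4096)

# The context sequence seen by the MMDiT joins the CLIP and T5 tokens.
context = np.concatenate([clip_hidden, t5_hidden], axis=0)             # (333, 4096)

# Pooled CLIP outputs are concatenated into a single conditioning vector.
pooled = np.zeros(768 + 1280)                                          # (2048,)
```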
## Versions

| Version | Notes |
|---|---|
| SD3 (Medium) | Base SD3 architecture |
| SD3.5 Medium | Improved quality; supports --pos_emb_random_crop_rate and --enable_scaled_pos_embed |
| SD3.5 Large | Larger model; higher VRAM requirement |
## Required model files
SD3 and SD3.5 models are distributed in two formats:
Single-file format (recommended): A single .safetensors file containing the MMDiT, all text encoders, and the VAE. When you provide this file via --pretrained_model_name_or_path, the individual component paths (--clip_l, --clip_g, --t5xxl, --vae) are detected automatically.
Separate files: If you have individual component files, specify each path explicitly:
| Component | Argument |
|---|---|
| MMDiT / base model | --pretrained_model_name_or_path |
| CLIP-L | --clip_l |
| CLIP-G | --clip_g |
| T5-XXL | --t5xxl |
| VAE | --vae (optional if included in base file) |
## Available training methods

| Method | Script | Notes |
|---|---|---|
| LoRA | sd3_train_network.py | Primary training method; uses networks.lora_sd3 |
| Fine-tuning | sd3_train.py | Full model training |
## LoRA training
Use sd3_train_network.py with --network_module=networks.lora_sd3:
```bash
accelerate launch --num_cpu_threads_per_process 1 sd3_train_network.py \
  --pretrained_model_name_or_path="<path to SD3 model>" \
  --clip_l="<path to CLIP-L model>" \
  --clip_g="<path to CLIP-G model>" \
  --t5xxl="<path to T5-XXL model>" \
  --dataset_config="my_sd3_dataset_config.toml" \
  --output_dir="<output directory>" \
  --output_name="my_sd3_lora" \
  --save_model_as=safetensors \
  --network_module=networks.lora_sd3 \
  --network_dim=16 \
  --network_alpha=1 \
  --learning_rate=1e-4 \
  --optimizer_type="AdamW8bit" \
  --lr_scheduler="constant" \
  --sdpa \
  --max_train_epochs=10 \
  --save_every_n_epochs=1 \
  --mixed_precision="fp16" \
  --gradient_checkpointing \
  --weighting_scheme="uniform" \
  --blocks_to_swap=32
```
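The command above references my_sd3_dataset_config.toml. A minimal dataset config might look like the following; the paths and values are placeholders to adapt to your data, not recommended settings:

```toml
[general]
enable_bucket = true
caption_extension = ".txt"

[[datasets]]
resolution = 1024
batch_size = 2

  [[datasets.subsets]]
  image_dir = "/path/to/train_images"  # placeholder path
  num_repeats = 1
```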
--cache_text_encoder_outputs is particularly effective for SD3 since three text encoders run simultaneously. Enable it whenever you are not training text encoder LoRA modules.
## SD3.5-specific options
SD3.5 adds two extra parameters for positional embedding handling during multi-resolution training:
- --pos_emb_random_crop_rate=<float> — probability of randomly cropping the positional embedding. Helps the model generalize across resolutions.
- --enable_scaled_pos_embed (experimental) — scales positional embeddings to match the training resolution. Use when training at multiple resolutions.
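For example, these can be appended to the sd3_train_network.py invocation shown earlier (the 0.5 rate here is an illustrative value, not a recommended default):

```bash
accelerate launch --num_cpu_threads_per_process 1 sd3_train_network.py \
  ... \
  --pos_emb_random_crop_rate=0.5 \
  --enable_scaled_pos_embed
```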
## Per-layer LoRA rank control
You can set different LoRA ranks for each component of the MMDiT using --network_args:
```bash
--network_args \
  "context_attn_dim=16" \
  "context_mlp_dim=8" \
  "context_mod_dim=4" \
  "x_attn_dim=16" \
  "x_mlp_dim=8" \
  "x_mod_dim=4" \
  "verbose=True"
```
Setting a value to 0 disables LoRA for that layer. The verbose=True flag prints the effective rank for each layer during training.
You can also apply LoRA to the conditioning layers with emb_dims (six values, one per layer):
```bash
--network_args "emb_dims=[4,4,4,4,4,4]"
```
The six positions correspond to: context_embedder, t_embedder, x_embedder, y_embedder, final_layer_adaLN_modulation, final_layer_linear.
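Conceptually, the emb_dims list is consumed positionally, one rank per conditioning layer; a small sketch of that mapping:

```python
# Order of the conditioning layers, per the documentation above.
emb_layers = [
    "context_embedder",
    "t_embedder",
    "x_embedder",
    "y_embedder",
    "final_layer_adaLN_modulation",
    "final_layer_linear",
]
emb_dims = [4, 4, 4, 4, 4, 4]

# Map each layer name to its LoRA rank; a rank of 0 disables LoRA there.
ranks = dict(zip(emb_layers, emb_dims))
```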
## Selective block training
Use train_block_indices to restrict which MMDiT blocks receive LoRA updates:
```bash
--network_args "train_block_indices=1,2,6-8"
```
Pass all to train every block (the default), or none to apply LoRA to no MMDiT blocks.
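The index syntax works like a typical page-range string: comma-separated entries, where a dash denotes an inclusive range. A sketch of the expansion (illustrative, not the library's actual parser):

```python
def expand_block_indices(spec: str) -> list[int]:
    """Expand a spec like "1,2,6-8" into [1, 2, 6, 7, 8]."""
    indices: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # Inclusive range, e.g. "6-8" -> 6, 7, 8.
            start, end = part.split("-")
            indices.extend(range(int(start), int(end) + 1))
        else:
            indices.append(int(part))
    return indices

# expand_block_indices("1,2,6-8") -> [1, 2, 6, 7, 8]
```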
## Memory and speed options

| Option | Effect |
|---|---|
| --blocks_to_swap=<n> | Offloads n Transformer blocks to CPU; reduces VRAM |
| --cache_text_encoder_outputs | Caches all three text encoder outputs |
| --cache_text_encoder_outputs_to_disk | Persists cache to disk across runs |
| --gradient_checkpointing | Reduces activation memory at a speed cost |
--blocks_to_swap and --cpu_offload_checkpointing cannot be used at the same time.
## Key training parameters

| Parameter | Description | Default | Recommendation |
|---|---|---|---|
| --network_module | Network module | — | networks.lora_sd3 |
| --weighting_scheme | Timestep loss weighting | uniform | uniform |
| --t5xxl_max_token_length | T5-XXL max tokens | 256 | 256 |
| --training_shift | Timestep distribution shift | 1.0 | 1.0 |
## Incompatible options
The following arguments are for SD 1.x/2.x and must not be used with SD3/SD3.5:
--v2, --v_parameterization, --clip_skip