
Overview

The sd-scripts training scripts expose a large set of advanced options beyond the basic --network_dim and --learning_rate flags. This page covers the most impactful options for users who want precise control over training behavior.
The examples on this page use sdxl_train_network.py for illustration, but most options also apply to train_network.py, flux_train_network.py, and sd3_train_network.py.

Block-wise LoRA dimensions and alphas

By default, every layer in the U-Net gets the same rank (--network_dim) and alpha (--network_alpha). Block-wise settings let you assign different ranks to different parts of the network, which is useful when you want to concentrate the adapter’s capacity in specific layers.
For SDXL, the U-Net is divided into 23 blocks. Pass a comma-separated list of 23 integers to block_dims and block_alphas via --network_args:
--network_args \
  "block_dims=2,2,2,2,4,4,4,4,8,8,8,8,8,8,8,8,4,4,4,4,2,2,2" \
  "block_alphas=1,1,1,1,2,2,2,2,4,4,4,4,4,4,4,4,2,2,2,2,1,1,1"
Any block not listed falls back to the global --network_dim / --network_alpha values.
To also control the 3×3 convolution layers block by block, add conv_block_dims and conv_block_alphas:
--network_args \
  "block_dims=2,2,2,2,4,4,4,4,8,8,8,8,8,8,8,8,4,4,4,4,2,2,2" \
  "block_alphas=1,1,1,1,2,2,2,2,4,4,4,4,4,4,4,4,2,2,2,2,1,1,1" \
  "conv_block_dims=2,2,2,2,2,2,2,2,4,4,4,4,4,4,4,4,2,2,2,2,2,2,2" \
  "conv_block_alphas=1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1"
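Note that the alpha values in the examples above track the dims, keeping the usual LoRA scaling of alpha ÷ dim constant across blocks. A quick sanity check, using the example values:

```python
# Per-block effective LoRA scale is alpha / dim (standard LoRA scaling).
# Values copied from the block_dims / block_alphas example above (23 blocks).
block_dims = [2] * 4 + [4] * 4 + [8] * 8 + [4] * 4 + [2] * 3
block_alphas = [1] * 4 + [2] * 4 + [4] * 8 + [2] * 4 + [1] * 3

scales = [a / d for a, d in zip(block_alphas, block_dims)]
# Every block keeps the same 0.5 scale; only capacity (rank) varies.
```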

LoRA+

LoRA+ sets the learning rate of the UP (B) weight matrices to a multiple of the DOWN (A) matrices’ learning rate. The paper argues this speeds up learning because the two matrices have different optimal learning rates, and recommends a ratio of 16.
--network_args "loraplus_lr_ratio=16"
You can also set separate ratios for U-Net and text encoders:
--network_args \
  "loraplus_unet_lr_ratio=16" \
  "loraplus_text_encoder_lr_ratio=4"
LoRA+ is not compatible with auto-LR optimizers such as DAdaptation or Prodigy.

DyLoRA

DyLoRA trains a range of ranks simultaneously, so you can select the effective rank at inference time without retraining. Use networks.dylora as the network module and specify the rank step with unit:
--network_module=networks.dylora \
--network_dim=64 \
--network_args "unit=4"
This trains ranks 4, 8, 12, …, 64 simultaneously. At inference you can use any multiple of unit up to network_dim by adjusting the LoRA multiplier.
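The set of selectable ranks can be enumerated directly; a sketch with the values from the example above:

```python
network_dim, unit = 64, 4

# DyLoRA trains every multiple of `unit` up to network_dim simultaneously.
selectable_ranks = list(range(unit, network_dim + 1, unit))
# [4, 8, 12, 16, ..., 64] — 16 selectable ranks in total.
```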

Learning rate schedulers

cosine: Decays the learning rate along a cosine curve from the initial value to zero over the full training run.
--lr_scheduler="cosine"
cosine_with_restarts: Like cosine, but restarts the cosine curve N times during training. Useful for escaping local minima.
--lr_scheduler="cosine_with_restarts" \
--lr_scheduler_num_cycles=3
polynomial: Decays the learning rate according to a polynomial function; control the curve’s shape with --lr_scheduler_power.
--lr_scheduler="polynomial" \
--lr_scheduler_power=2
constant_with_warmup: Keeps the learning rate constant after a warmup phase. Useful when you want the optimizer to stabilize before full-speed learning.
--lr_scheduler="constant_with_warmup" \
--lr_warmup_steps=500
When --lr_warmup_steps is less than 1, it is interpreted as a fraction of the total number of training steps:
--lr_warmup_steps=0.05
This sets the warmup to 5% of the total training steps.
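The resolution logic can be sketched as follows (the helper name is illustrative, not sd-scripts’ actual function):

```python
def resolve_warmup_steps(lr_warmup_steps: float, max_train_steps: int) -> int:
    # Values below 1 are treated as a fraction of the total training steps;
    # values of 1 or more are used as an absolute step count.
    if lr_warmup_steps < 1:
        return int(lr_warmup_steps * max_train_steps)
    return int(lr_warmup_steps)

print(resolve_warmup_steps(0.05, 10000))  # 500
print(resolve_warmup_steps(500, 10000))   # 500
```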

Optimizer options

Adafactor: Highly memory-efficient optimizer, useful when VRAM is critically limited. Recommended with relative_step=True and the adafactor scheduler.
--optimizer_type="Adafactor" \
--optimizer_args "relative_step=True" "scale_parameter=True" "warmup_init=True" \
--lr_scheduler="adafactor"
Lion: Sign-based gradient-update optimizer that can converge faster than AdamW on some tasks. Requires the lion-pytorch package.
--optimizer_type="Lion" \
--learning_rate=1e-5
Use a learning rate roughly 10× lower than you would with AdamW.
Prodigy: Auto-adjusting learning-rate optimizer. Set the initial learning rate to 1.0 and let Prodigy tune it during training.
--optimizer_type="Prodigy" \
--learning_rate=1.0 \
--lr_scheduler="constant"
Prodigy is not compatible with LoRA+.
Use --optimizer_args to pass key=value pairs to the optimizer:
--optimizer_args "weight_decay=0.01" "betas=0.9,0.999"

Mixed precision

Both fp16 and bf16 reduce VRAM usage compared to full float32 training.
| Format | Dynamic range | Precision | Best for |
|--------|---------------|-----------|----------|
| fp16 | Smaller | Higher | SD 1.x/2.x, older GPUs |
| bf16 | Larger | Lower | SDXL, FLUX.1, SD3; RTX 3000+, A100 |
--mixed_precision="bf16"
Use bf16 whenever your GPU supports it (Ampere and later, or any Tensor Core GPU). It avoids the NaN issues that can occur with the SDXL VAE under fp16.
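The difference comes down to bit layout: fp16 spends more bits on the mantissa, bf16 on the exponent. Computing each format’s largest finite value from its bit layout shows why fp16 can overflow into NaN where bf16 does not:

```python
# Largest finite value in each format, derived from the bit layouts:
# fp16: 5 exponent bits, 10 mantissa bits; bf16: 8 exponent bits, 7 mantissa bits.
fp16_max = (2 - 2**-10) * 2.0**15    # 65504.0
bf16_max = (2 - 2**-7) * 2.0**127    # ~3.39e38, same exponent range as float32

# Any activation above 65504 overflows to inf in fp16 but fits easily in bf16.
```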
When VRAM is critically limited, you can keep gradients entirely in half precision as well:
--full_bf16
# or
--full_fp16
This can cause training instability. Monitor your loss carefully and consider adding --max_grad_norm=1.0.
Load the base model in FP8 to save significant VRAM. Requires PyTorch 2.1+.
--fp8_base
--fp8_base_unet loads only the U-Net in FP8, leaving text encoders in the default precision.

Gradient checkpointing

Gradient checkpointing trades compute time for memory. Activations are not stored during the forward pass; instead they are recomputed during backpropagation.
--gradient_checkpointing
Reduces VRAM by roughly 30–50% for large models. Training becomes 10–20% slower.
Accumulate gradients over multiple steps before each optimizer update. The effective batch size becomes train_batch_size × gradient_accumulation_steps.
--gradient_accumulation_steps=4
Use this to simulate a larger batch size when VRAM is limited.
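The accumulation pattern can be sketched as a loop that steps the optimizer once per N micro-batches (an illustrative pseudo-training loop, not real gradient code):

```python
def training_steps(num_batches: int, accumulation_steps: int) -> int:
    """Count how many optimizer updates occur over num_batches micro-batches."""
    optimizer_steps = 0
    for batch_idx in range(1, num_batches + 1):
        # loss.backward() would accumulate gradients here on every micro-batch.
        if batch_idx % accumulation_steps == 0:
            optimizer_steps += 1  # optimizer.step(); optimizer.zero_grad()
    return optimizer_steps

print(training_steps(100, 4))  # 25 updates, each over an effective batch 4x larger
```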
Clip the gradient norm to prevent instability when the learning rate is high.
--max_grad_norm=1.0
Set to 0 to disable gradient clipping entirely.

Saving checkpoints

Save a checkpoint every N epochs or every N steps:
--save_every_n_epochs=2
--save_every_n_steps=500
Both can be specified simultaneously. Each triggers independently.
Prevent disk from filling up by keeping only the most recent M checkpoints:
--save_last_n_epochs=5
# or
--save_last_n_steps=3000
Save the full training state (optimizer, step counter) so you can resume later:
--save_state \
--save_last_n_epochs_state=2
Use --save_state_on_train_end to save the state only at the end of a run.

Resuming training

Use --resume to continue from a state directory saved by --save_state. This restores the optimizer state, step counter, and epoch counter.
--resume="./output/my_lora-state-epoch00005"
--resume restores the full training state. If you only want to start from existing LoRA weights (without restoring optimizer state), use --network_weights instead.
Load pre-trained LoRA weights and continue training from them without restoring optimizer state:
--network_weights="./output/my_lora.safetensors"
Add --dim_from_weights to automatically read the rank from the weight file:
--network_weights="./output/my_lora.safetensors" \
--dim_from_weights

Noise techniques

Noise offset: Adds a constant offset to the noise during training, improving the model’s ability to generate very bright or very dark images. SDXL base models are trained with a noise offset, so enabling it during LoRA training can help match the base model’s distribution. Adding --noise_offset_random_strength randomizes the strength between 0 and the given value at each step.
--noise_offset=0.0357 \
--noise_offset_random_strength
Multi-resolution (pyramid) noise: Adds noise at multiple frequency scales simultaneously. Can improve fine-detail reproduction.
--multires_noise_iterations=6 \
--multires_noise_discount=0.3
Min-SNR weighting: Re-weights the training loss across timesteps by scaling each timestep’s loss by min(SNR, γ)/SNR, so that no timestep range dominates the gradient. A gamma of 5 follows the paper’s recommendation.
--min_snr_gamma=5
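A sketch of the weighting rule from the Min-SNR paper, with γ = 5:

```python
def min_snr_weight(snr: float, gamma: float = 5.0) -> float:
    # Loss weight for epsilon-prediction: min(SNR, gamma) / SNR.
    # High-SNR (low-noise) timesteps are down-weighted; noisy timesteps keep weight 1.
    return min(snr, gamma) / snr

print(min_snr_weight(100.0))  # 0.05 — low-noise timestep, strongly down-weighted
print(min_snr_weight(1.0))    # 1.0  — noisy timestep, unchanged
```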
Input perturbation noise: Adds a small amount of extra noise to the input latents for regularization.
--ip_noise_gamma=0.1

Network training scope

By default, both the U-Net and the text encoder receive LoRA modules. You can restrict training to one part:
# Train only U-Net LoRA (required when --cache_text_encoder_outputs is set)
--network_train_unet_only

# Train only text encoder LoRA
--network_train_text_encoder_only

Weight norm scaling

Scale the magnitude of LoRA weights during training to help control overfitting:
--scale_weight_norms=1.0
A value of 1.0 is a reasonable starting point.
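The mechanism can be sketched as rescaling any weight vector whose norm exceeds the threshold (an illustration of the idea, not sd-scripts’ exact implementation):

```python
import math

def scale_weight_norm(weights: list[float], max_norm: float = 1.0) -> list[float]:
    # If the weight norm exceeds max_norm, scale the weights down so the
    # norm equals max_norm; otherwise leave them untouched.
    norm = math.sqrt(sum(w * w for w in weights))
    if norm > max_norm:
        ratio = max_norm / norm
        return [w * ratio for w in weights]
    return weights

scaled = scale_weight_norm([3.0, 4.0])  # norm 5 -> rescaled to norm 1
```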

Differential LoRA (merging existing weights)

Merge one or more existing LoRA files into the base model before starting a new training run. This lets you train the “difference” from an existing LoRA.
--base_weights="./existing_lora.safetensors" \
--base_weights_multiplier=1.0
Multiple weight files and multipliers can be specified by repeating the arguments.

Logging and tracking

To log to TensorBoard:
--logging_dir="./logs" \
--log_with="tensorboard"
Then launch TensorBoard:
tensorboard --logdir ./logs
To log to Weights & Biases instead:
--logging_dir="./logs" \
--log_with="wandb" \
--wandb_api_key="your_api_key" \
--wandb_run_name="my_experiment"
Install it with pip install wandb before use.
Record the full training configuration at the start of each run for reproducibility:
--log_config

Using a config file instead of command-line arguments

For long training commands, store all arguments in a TOML file and pass it with --config_file. Keys use the same names as the command-line flags, minus the leading dashes. An example training_config.toml:
pretrained_model_name_or_path = "/path/to/model.safetensors"
dataset_config = "my_dataset.toml"
output_dir = "./output"
output_name = "my_lora"
network_module = "networks.lora"
network_dim = 32
network_alpha = 16
learning_rate = 1e-4
optimizer_type = "AdamW8bit"
lr_scheduler = "cosine_with_restarts"
max_train_epochs = 10
mixed_precision = "bf16"
gradient_checkpointing = true
Then launch with:
accelerate launch sdxl_train_network.py --config_file="training_config.toml"
Use --output_config to dump the current command-line arguments to a TOML file you can reuse later.
