
Overview

flux_train_network.py trains LoRA adapters for the FLUX.1 model family (dev and schnell variants). FLUX.1 differs fundamentally from Stable Diffusion: it uses a Transformer-based architecture (DiT) instead of a U-Net, requires two separate text encoders (CLIP-L and T5-XXL), and uses a dedicated AutoEncoder (AE) instead of a standard VAE.
This guide assumes you are familiar with basic LoRA training concepts. See LoRA Training Overview and LoRA Training for SD 1.x/2.x for background.

Architecture differences

| Feature | SD 1.x/2.x | FLUX.1 |
| --- | --- | --- |
| Image model | U-Net | Transformer (DiT) |
| Text encoders | 1× CLIP | CLIP-L + T5-XXL |
| Latent encoder | VAE | AutoEncoder (AE) |
| Network module | networks.lora | networks.lora_flux |
| Model file arg | --pretrained_model_name_or_path | --pretrained_model_name_or_path |
| Additional model args | (none) | --clip_l, --t5xxl, --ae |

Required model files

You need four separate model files before training:
  • FLUX.1 model: download flux1-dev.safetensors from the black-forest-labs/FLUX.1-dev repository on Hugging Face. The weights inside the repository subfolders are in Diffusers format and are not compatible with this script.
  • AutoEncoder: download ae.safetensors from the same black-forest-labs/FLUX.1-dev repository.
  • T5-XXL text encoder: download t5xxl_fp16.safetensors from the comfyanonymous/flux_text_encoders repository on Hugging Face.
  • CLIP-L text encoder: download clip_l.safetensors from the same comfyanonymous/flux_text_encoders repository.

Training command

accelerate launch --num_cpu_threads_per_process 1 flux_train_network.py \
  --pretrained_model_name_or_path="/path/to/flux1-dev.safetensors" \
  --clip_l="/path/to/clip_l.safetensors" \
  --t5xxl="/path/to/t5xxl_fp16.safetensors" \
  --ae="/path/to/ae.safetensors" \
  --dataset_config="my_flux_dataset_config.toml" \
  --output_dir="./output" \
  --output_name="my_flux_lora" \
  --save_model_as=safetensors \
  --network_module=networks.lora_flux \
  --network_dim=16 \
  --network_alpha=1 \
  --learning_rate=1e-4 \
  --optimizer_type="AdamW8bit" \
  --lr_scheduler="constant" \
  --sdpa \
  --max_train_epochs=10 \
  --save_every_n_epochs=1 \
  --mixed_precision="fp16" \
  --gradient_checkpointing \
  --guidance_scale=1.0 \
  --timestep_sampling="flux_shift" \
  --model_prediction_type="raw" \
  --blocks_to_swap=18 \
  --cache_text_encoder_outputs \
  --cache_latents
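The file passed to --dataset_config uses the sd-scripts TOML dataset format. A minimal sketch, with placeholder paths and illustrative values to adapt to your data:

```toml
# Hypothetical my_flux_dataset_config.toml; paths and values are placeholders.
[general]
resolution = 1024          # FLUX.1 is typically trained around 1024x1024
batch_size = 1
enable_bucket = true       # aspect-ratio bucketing for mixed image sizes
caption_extension = ".txt" # one caption file per image

[[datasets]]

  [[datasets.subsets]]
  image_dir = "/path/to/images"
  num_repeats = 1
```

Multiple [[datasets.subsets]] entries can point at different image folders with their own repeat counts.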

FLUX.1-specific arguments

Model loading

--pretrained_model_name_or_path
string
required
Path to the FLUX.1 or Chroma .safetensors file. Diffusers-format directories are not supported.
--model_type
string
default:"flux"
Base model type. Use flux for FLUX.1 dev/schnell or chroma for Chroma models.
--clip_l
string
required for flux
Path to the CLIP-L text encoder .safetensors file. Required when --model_type=flux. Omit for Chroma.
--t5xxl
string
required
Path to the T5-XXL text encoder .safetensors file.
--ae
string
required
Path to the FLUX.1-compatible AutoEncoder .safetensors file.

Training behavior

--guidance_scale
number
default:"3.5"
Guidance scale for the distilled FLUX.1 dev model. Set to 1.0 during training to disable embedded guidance. For Chroma, set to 0.0. Usually ignored for the schnell variant.
--timestep_sampling
string
default:"sigma"
Method for sampling timesteps during training. Options: sigma, uniform, sigmoid, shift, flux_shift. Recommended value for FLUX.1 is flux_shift. For Chroma, use sigmoid.
--sigmoid_scale
number
default:"1.0"
Scale factor when --timestep_sampling is sigmoid, shift, or flux_shift.
--model_prediction_type
string
default:"sigma_scaled"
What the model predicts. Options: raw, additive, sigma_scaled. Recommended value is raw.
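The three prediction types can be illustrated with a short sketch. This is a simplified model, not the script's actual API: it assumes the rectified-flow noising used by FLUX.1 (noisy = (1 - sigma) * latents + sigma * noise, with the network nominally predicting the velocity noise - latents), and the function name and signature are hypothetical.

```python
def apply_model_prediction_type(pred, noisy_input, sigma, mode="raw"):
    """Map the raw network output to the quantity compared with the loss target.

    Sketch only. Assumes rectified-flow noising:
        noisy_input = (1 - sigma) * latents + sigma * noise
    where the "raw" prediction target is the velocity (noise - latents).
    """
    if mode == "raw":
        # use the network output as-is (the recommended setting)
        return pred
    if mode == "additive":
        # treat the output as a residual on the noisy input
        return pred + noisy_input
    if mode == "sigma_scaled":
        # recover an estimate of the clean latents:
        # -sigma * (noise - latents) + noisy_input == latents
        return pred * -sigma + noisy_input
    raise ValueError(f"unknown mode: {mode}")
```

With latents = 2.0, noise = 0.5, sigma = 0.3 (so noisy_input = 1.55), a perfect velocity prediction of -1.5 under sigma_scaled maps back to the clean latent value 2.0, which is why that mode pairs with a clean-latent target.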
--discrete_flow_shift
number
default:"3.0"
Scheduler shift value for Flow Matching. Only used when --timestep_sampling=shift.
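How --timestep_sampling, --sigmoid_scale, and --discrete_flow_shift interact can be sketched as follows. This is a simplified illustration based on common Flow Matching shift formulas; the function name, the sequence-length interpolation constants, and the omission of the sigma method are assumptions, and the script's exact implementation may differ.

```python
import math
import random

def sample_timestep(method="flux_shift", sigmoid_scale=1.0,
                    discrete_flow_shift=3.0, image_seq_len=1024):
    """Return a training timestep t in (0, 1); larger t means more noise.

    Simplified sketch; the "sigma" method is omitted here.
    """
    if method == "uniform":
        return random.random()
    # the remaining methods start from a logit-normal draw,
    # scaled by --sigmoid_scale before the sigmoid
    t = 1.0 / (1.0 + math.exp(-sigmoid_scale * random.gauss(0.0, 1.0)))
    if method == "sigmoid":
        return t
    if method == "shift":
        shift = discrete_flow_shift  # constant shift from --discrete_flow_shift
    elif method == "flux_shift":
        # resolution-dependent shift: mu interpolated over the latent
        # sequence length (constants here are illustrative)
        mu = 0.5 + (1.15 - 0.5) * (image_seq_len - 256) / (4096 - 256)
        shift = math.exp(mu)
    else:
        raise ValueError(f"unknown method: {method}")
    # shift > 1 pushes sampled timesteps toward the noisier end
    return shift * t / (1.0 + (shift - 1.0) * t)
```

Note that shift reduces to the identity when the shift factor is 1.0, so the shift and sigmoid methods coincide in that case.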
--apply_t5_attn_mask
boolean
Applies an attention mask to T5-XXL outputs. Required for Chroma models.

Memory optimization

FLUX.1 is a large model. Use the following options to fit training into limited VRAM:
--fp8_base
Loads the FLUX.1 DiT, CLIP-L, and T5-XXL in FP8 format to significantly reduce VRAM. Requires PyTorch 2.1 or later. Results may differ slightly from full-precision training.
--blocks_to_swap=18
Offloads the given number of Transformer blocks from GPU to CPU, reducing peak VRAM at the cost of training speed. FLUX.1 supports up to 35 blocks for swapping. Cannot be used together with --cpu_offload_checkpointing.
--cpu_offload_checkpointing
Offloads gradient checkpoints to CPU. Saves up to 1 GB of VRAM but reduces training speed by around 15%. Cannot be used with --blocks_to_swap. Not supported for Chroma models.
--cache_text_encoder_outputs, --cache_latents
Pre-compute and cache text encoder and AE outputs to skip those passes during each training step. Add --cache_text_encoder_outputs_to_disk and --cache_latents_to_disk to persist the cache between runs.
| GPU VRAM | Recommended settings |
| --- | --- |
| 24 GB | Basic settings, batch size 2 |
| 16 GB | Batch size 1, --blocks_to_swap=8 |
| 12 GB | --blocks_to_swap=16, 8-bit AdamW |
| 10 GB | --blocks_to_swap=22, fp8 T5-XXL recommended |
| 8 GB | --blocks_to_swap=28, fp8 T5-XXL recommended |

Incompatible arguments

The following SD-specific flags have no effect in FLUX.1 training and should be omitted:
  • --v2, --v_parameterization
  • --clip_skip
  • --max_token_length (use --t5xxl_max_token_length instead)
  • --split_mode (deprecated; use --blocks_to_swap)

Using the trained LoRA

After training, load the .safetensors file in an inference environment that supports FLUX.1, such as ComfyUI with the FLUX nodes, or any tool that supports the FLUX architecture.
