FLUX.1 is a Transformer-based image generation model from Black Forest Labs. Unlike Stable Diffusion 1.x/2.x and SDXL, FLUX.1 uses a Diffusion Transformer (DiT) architecture with two text encoders and its own AutoEncoder (AE), which is not interchangeable with the Stable Diffusion VAE.
## Architecture
FLUX.1 departs from the UNet-based pipeline used in SD 1.x/2.x and SDXL:
- DiT (Diffusion Transformer) — replaces the UNet. Operates on patchified latent representations using bidirectional attention.
- Dual text encoders:
  - CLIP-L — fast encoder for short-to-medium prompts.
  - T5-XXL — large language model encoder for long, complex prompts (up to 512 tokens by default).
- AutoEncoder (AE) — encodes and decodes between pixel and latent space. Not VAE-compatible with SD 1.x/2.x or SDXL.
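As a rough illustration of how the DiT consumes patchified latents, the sketch below computes the token shapes involved. The specific numbers (8x AE spatial downsampling, 16 latent channels, 2x2 patches) are assumptions for illustration, not values read from the training scripts:

```python
# Sketch of DiT input shapes. Assumed constants: 8x AE downsampling,
# 16 latent channels, 2x2 patchification (illustration only).
def dit_token_shape(height, width, ae_down=8, latent_ch=16, patch=2):
    """Return (sequence_length, token_dim) for a height x width image."""
    lat_h, lat_w = height // ae_down, width // ae_down   # latent grid
    seq_len = (lat_h // patch) * (lat_w // patch)        # one token per 2x2 patch
    token_dim = latent_ch * patch * patch                # channels folded into the token
    return seq_len, token_dim

print(dit_token_shape(1024, 1024))  # -> (4096, 64)
```

Under these assumptions a 1024x1024 image becomes a sequence of 4096 tokens, which is why attention cost grows quickly with resolution.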
## Versions
| Version | Guidance | Use case |
|---|---|---|
| FLUX.1-dev | Distilled guidance embedding | General-purpose, recommended for training |
| FLUX.1-schnell | Timestep-distilled for few-step sampling | Fast inference |
FLUX.1-dev is guidance-distilled, with specific guidance scale values baked into the model. During training, set --guidance_scale=1.0 to disable the guidance scale; the default value (3.5) is intended for inference, not training.
## Required model files
Download the following files before training:
- flux1-dev.safetensors — DiT model weights
- ae.safetensors — AutoEncoder weights
- clip_l.safetensors — CLIP-L text encoder
- t5xxl_fp16.safetensors — T5-XXL text encoder
Do not use the weights from the Diffusers-format subfolders inside the FLUX.1-dev repository; the training scripts cannot load them directly. Use the top-level flux1-dev.safetensors and ae.safetensors files.
## Available training methods
| Method | Script | Notes |
|---|---|---|
| LoRA | flux_train_network.py | Primary training method |
| Fine-tuning | flux_train.py | Full model training |
| ControlNet | flux_train_control_net.py | ControlNet training |
## LoRA training
Use flux_train_network.py with --network_module=networks.lora_flux:
```bash
accelerate launch --num_cpu_threads_per_process 1 flux_train_network.py \
  --pretrained_model_name_or_path="flux1-dev.safetensors" \
  --clip_l="clip_l.safetensors" \
  --t5xxl="t5xxl_fp16.safetensors" \
  --ae="ae.safetensors" \
  --dataset_config="my_flux_dataset_config.toml" \
  --output_dir="./output" \
  --output_name="my_flux_lora" \
  --save_model_as=safetensors \
  --network_module=networks.lora_flux \
  --network_dim=16 \
  --network_alpha=1 \
  --learning_rate=1e-4 \
  --optimizer_type="AdamW8bit" \
  --lr_scheduler="constant" \
  --sdpa \
  --max_train_epochs=10 \
  --save_every_n_epochs=1 \
  --mixed_precision="fp16" \
  --gradient_checkpointing \
  --guidance_scale=1.0 \
  --timestep_sampling="flux_shift" \
  --model_prediction_type="raw" \
  --blocks_to_swap=18 \
  --cache_text_encoder_outputs \
  --cache_latents
```
--timestep_sampling="flux_shift" and --model_prediction_type="raw" are the recommended settings for FLUX.1-dev LoRA training.
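The --dataset_config flag in the command above points at a dataset config file. A minimal sketch of such a TOML (directory paths, resolution, and repeat counts are placeholders you must adapt to your data):

```toml
# Minimal dataset config sketch for my_flux_dataset_config.toml.
# All values here are illustrative placeholders.
[general]
shuffle_caption = false
caption_extension = ".txt"

[[datasets]]
resolution = 1024
batch_size = 1
enable_bucket = true

  [[datasets.subsets]]
  image_dir = "/path/to/images"
  num_repeats = 1
```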
## Memory optimization
FLUX.1 is a large model. Use these options to reduce VRAM usage:
| GPU VRAM | Recommended settings |
|---|---|
| 24 GB | Standard settings (batch size 2) |
| 16 GB | Batch size 1 + --blocks_to_swap |
| 12 GB | --blocks_to_swap 16 + 8-bit AdamW |
| 10 GB | --blocks_to_swap 22 + fp8 T5-XXL |
| 8 GB | --blocks_to_swap 28 + fp8 T5-XXL |
### Key memory options
- --fp8_base — runs FLUX.1, CLIP-L, and T5-XXL in FP8 format. Significantly reduces VRAM at a potential quality cost.
- --blocks_to_swap <n> — offloads n Transformer blocks to CPU. FLUX.1 supports up to 35 blocks. Cannot be combined with --cpu_offload_checkpointing.
- --cache_text_encoder_outputs — caches CLIP-L and T5-XXL outputs; reduces memory usage but disables text encoder LoRA training.
- --cache_latents / --cache_latents_to_disk — caches AE outputs so images are encoded once instead of on every step.
## Key training parameters
| Parameter | Description | Recommendation |
|---|---|---|
| --network_module | Network module | networks.lora_flux |
| --network_dim | LoRA rank | 16 |
| --guidance_scale | Guidance scale during training | 1.0 for dev |
| --timestep_sampling | Timestep sampling method | flux_shift |
| --model_prediction_type | Prediction processing | raw |
| --t5xxl_max_token_length | T5-XXL max tokens | 512 (default) |
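To get a feel for what --network_dim (the LoRA rank) and --network_alpha mean in parameter terms, here is a back-of-the-envelope sketch. The 3072 feature width is a hypothetical layer size used purely for illustration:

```python
# Back-of-the-envelope LoRA sizing: a rank-r adapter stores two factors per
# adapted weight, down (r x in) and up (out x r). The 3072 width is a
# hypothetical layer size, not taken from the model definition.
def lora_param_count(in_features, out_features, rank):
    """Parameters in one LoRA pair: rank*in (down) + out*rank (up)."""
    return rank * in_features + out_features * rank

rank, alpha = 16, 1                         # matches --network_dim=16 --network_alpha=1
full = 3072 * 3072                          # full weight matrix: 9,437,184 params
lora = lora_param_count(3072, 3072, rank)   # 98,304 params, about 1% of full
scale = alpha / rank                        # LoRA update is scaled by alpha/rank
print(lora, full, scale)
```

This is why a low rank keeps LoRA files small, and why network_alpha=1 with rank 16 applies the learned update at a scale of 1/16.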
## Incompatible options
The following SD 1.x/2.x arguments are not used for FLUX.1 training and should not be specified:
- --v2, --v_parameterization, --clip_skip
- --max_token_length (use --t5xxl_max_token_length instead)
- --split_mode (deprecated; use --blocks_to_swap)