Lumina Image 2.0 is a Next-generation Diffusion Transformer (Next-DiT) model. It uses a single Gemma2 language model as its text encoder and a dedicated AutoEncoder, making it architecturally distinct from Stable Diffusion and FLUX.1.
Architecture
- Next-DiT — Next-generation Diffusion Transformer architecture. Replaces the UNet with a transformer that natively handles variable-resolution inputs.
- Gemma2 (2B) — single text encoder based on Google’s Gemma2 language model. Handles both prompt encoding and system prompt conditioning.
- AutoEncoder (AE) — the same AE used by FLUX.1 (ae.safetensors).
Required model files
Download the following files before training:
| Component | File | Source |
|---|---|---|
| Lumina Image 2.0 DiT | lumina-image-2.safetensors (full precision) or lumina_2_model_bf16.safetensors (bf16) | rockerBOO/lumina-image-2 / Comfy-Org/Lumina_Image_2.0_Repackaged |
| Gemma2 2B text encoder | gemma_2_2b_fp16.safetensors | Comfy-Org/Lumina_Image_2.0_Repackaged |
| AutoEncoder | ae.safetensors | Comfy-Org/Lumina_Image_2.0_Repackaged |
The AutoEncoder for Lumina Image 2.0 is the same file as the FLUX.1 AE. If you already have ae.safetensors from a FLUX.1 setup, you can reuse it here.
Available training methods
| Method | Script | Notes |
|---|---|---|
| LoRA | lumina_train_network.py | Uses networks.lora_lumina |
| Fine-tuning | lumina_train.py | Full model training |
LoRA training
Use lumina_train_network.py with --network_module=networks.lora_lumina:
```bash
accelerate launch --num_cpu_threads_per_process 1 lumina_train_network.py \
  --pretrained_model_name_or_path="lumina-image-2.safetensors" \
  --gemma2="gemma_2_2b_fp16.safetensors" \
  --ae="ae.safetensors" \
  --dataset_config="my_lumina_dataset_config.toml" \
  --output_dir="./output" \
  --output_name="my_lumina_lora" \
  --save_model_as=safetensors \
  --network_module=networks.lora_lumina \
  --network_dim=8 \
  --network_alpha=8 \
  --learning_rate=1e-4 \
  --optimizer_type="AdamW" \
  --lr_scheduler="constant" \
  --timestep_sampling="nextdit_shift" \
  --discrete_flow_shift=6.0 \
  --model_prediction_type="raw" \
  --system_prompt="You are an assistant designed to generate high-quality images based on user prompts." \
  --max_train_epochs=10 \
  --save_every_n_epochs=1 \
  --mixed_precision="bf16" \
  --gradient_checkpointing \
  --cache_latents \
  --cache_text_encoder_outputs
```
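The command above points --dataset_config at a TOML file. Below is a minimal sketch of such a file following sd-scripts' standard dataset configuration schema; the paths, resolution, and repeat counts are placeholders to adapt to your data:

```toml
[general]
resolution = 1024            # training resolution (bucketing adjusts per image)
caption_extension = ".txt"   # one caption file per image
enable_bucket = true         # allow variable aspect ratios

[[datasets]]
batch_size = 1

  [[datasets.subsets]]
  image_dir = "/path/to/images"  # placeholder path
  num_repeats = 1
```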
System prompts
Lumina Image 2.0 is conditioned on a system prompt in addition to the image caption. You must provide a system prompt during training to match the model's pre-training setup, for example:

```
--system_prompt="You are an assistant designed to generate high-quality images based on user prompts."
--system_prompt="You are an assistant designed to generate high-quality images with the highest degree of image-text alignment based on textual prompts."
```
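Conceptually, the system prompt is joined with the image caption into a single string before it reaches the Gemma2 encoder. A minimal sketch of that conditioning; the "<Prompt Start>" separator follows the Lumina Image 2.0 reference convention, and the exact spacing is an assumption rather than sd-scripts' verbatim implementation:

```python
def build_gemma2_prompt(system_prompt: str, caption: str) -> str:
    """Join the system prompt and per-image caption into one Gemma2 input.

    "<Prompt Start>" is the separator used by the Lumina Image 2.0
    reference pipeline; treat the exact formatting here as illustrative.
    """
    return f"{system_prompt} <Prompt Start> {caption}"
```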
Key training parameters
| Parameter | Description | Default | Recommendation |
|---|---|---|---|
| --network_module | Network module | — | networks.lora_lumina |
| --timestep_sampling | Timestep sampling method | shift | nextdit_shift |
| --discrete_flow_shift | Euler Discrete Scheduler shift | 6.0 | 6.0 |
| --model_prediction_type | Prediction processing | raw | raw |
| --mixed_precision | Mixed precision dtype | — | bf16 |
| --gemma2_max_token_length | Gemma2 max token length | 256 | 256 |
Use --mixed_precision="bf16" for Lumina training. The model was pre-trained in bfloat16, so fp16 can be less stable.
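To see what --discrete_flow_shift does, the sketch below applies the standard flow-matching shift t' = s·t / (1 + (s − 1)·t), which warps uniform timesteps toward the noisier end of the schedule. Whether "nextdit_shift" uses exactly this form internally is an assumption for illustration:

```python
def shift_timestep(t: float, shift: float = 6.0) -> float:
    """Map a uniform timestep t in [0, 1] to a shifted t'.

    Larger shift values concentrate sampled timesteps toward t' = 1
    (the high-noise end), which higher-resolution models tend to need.
    """
    return shift * t / (1.0 + (shift - 1.0) * t)
```

With the default shift of 6.0, a uniform t = 0.5 maps to roughly 0.857, so most training steps see heavily-noised latents.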
Per-component LoRA rank control
Use --network_args to set different ranks for each model component:
```
--network_args \
  "attn_dim=8" \
  "mlp_dim=4" \
  "mod_dim=4" \
  "refiner_dim=4" \
  "embedder_dims=[4,4,4]"
```
The three values in embedder_dims correspond to x_embedder, t_embedder, and caption_embedder, respectively.
Memory optimization
| Option | Effect |
|---|---|
| --blocks_to_swap=<n> | Offloads n transformer blocks to CPU |
| --cache_text_encoder_outputs | Caches Gemma2 outputs |
| --cache_latents / --cache_latents_to_disk | Caches AE outputs |
| --fp8_base | Trains the base model in FP8 precision |
| --use_flash_attn | Enables Flash Attention (requires pip install flash-attn) |
| --use_sage_attn | Enables Sage Attention |
Inference
After training, use lumina_minimal_inference.py to generate images with your LoRA:
```bash
python lumina_minimal_inference.py \
  --pretrained_model_name_or_path "lumina-image-2.safetensors" \
  --gemma2_path "gemma_2_2b_fp16.safetensors" \
  --ae_path "ae.safetensors" \
  --output_dir "./outputs" \
  --offload \
  --seed 1234 \
  --prompt "A mountain landscape at sunset" \
  --system_prompt "You are an assistant designed to generate high-quality images based on user prompts." \
  --lora_weights "my_lumina_lora.safetensors;1.0"
```
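The --lora_weights value uses a "path;multiplier" format, where the number after the semicolon scales the LoRA's effect. A hypothetical helper illustrating that syntax; this is a sketch of the convention, not the script's actual parser:

```python
def parse_lora_weights(spec: str) -> tuple[str, float]:
    """Split 'file.safetensors;0.8' into (path, multiplier).

    If no multiplier is given, default to 1.0 (full strength).
    This parsing is illustrative, not lumina_minimal_inference.py's code.
    """
    if ";" in spec:
        path, mult = spec.rsplit(";", 1)
        return path, float(mult)
    return spec, 1.0
```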
Incompatible options
The following arguments are for SD 1.x/2.x and must not be used for Lumina:
--v2, --v_parameterization, --clip_skip