Lumina Image 2.0 is a Next-generation Diffusion Transformer (Next-DiT) model. It uses a single Gemma2 language model as its text encoder and a dedicated AutoEncoder, making it architecturally distinct from Stable Diffusion and FLUX.1.
Architecture
- Next-DiT — Next-generation Diffusion Transformer architecture. Replaces the UNet with a transformer that natively handles variable-resolution inputs.
- Gemma2 (2B) — single text encoder based on Google’s Gemma2 language model. Handles both prompt encoding and system prompt conditioning.
- AutoEncoder (AE) — the same AE used by FLUX.1 (ae.safetensors).
Required model files
Download the following files before training:
| Component | File | Source |
|---|---|---|
| Lumina Image 2.0 DiT | lumina-image-2.safetensors (full precision) or lumina_2_model_bf16.safetensors (bf16) | rockerBOO/lumina-image-2 / Comfy-Org/Lumina_Image_2.0_Repackaged |
| Gemma2 2B text encoder | gemma_2_2b_fp16.safetensors | Comfy-Org/Lumina_Image_2.0_Repackaged |
| AutoEncoder | ae.safetensors | Comfy-Org/Lumina_Image_2.0_Repackaged |
The AutoEncoder for Lumina Image 2.0 is the same file as the FLUX.1 AE. If you already have ae.safetensors from a FLUX.1 setup, you can reuse it here.
Available training methods
| Method | Script | Notes |
|---|---|---|
| LoRA | lumina_train_network.py | Uses networks.lora_lumina |
| Fine-tuning | lumina_train.py | Full model training |
LoRA training
Use lumina_train_network.py with --network_module=networks.lora_lumina:
```bash
accelerate launch --num_cpu_threads_per_process 1 lumina_train_network.py \
  --pretrained_model_name_or_path="lumina-image-2.safetensors" \
  --gemma2="gemma_2_2b_fp16.safetensors" \
  --ae="ae.safetensors" \
  --dataset_config="my_lumina_dataset_config.toml" \
  --output_dir="./output" \
  --output_name="my_lumina_lora" \
  --save_model_as=safetensors \
  --network_module=networks.lora_lumina \
  --network_dim=8 \
  --network_alpha=8 \
  --learning_rate=1e-4 \
  --optimizer_type="AdamW" \
  --lr_scheduler="constant" \
  --timestep_sampling="nextdit_shift" \
  --discrete_flow_shift=6.0 \
  --model_prediction_type="raw" \
  --system_prompt="You are an assistant designed to generate high-quality images based on user prompts." \
  --max_train_epochs=10 \
  --save_every_n_epochs=1 \
  --mixed_precision="bf16" \
  --gradient_checkpointing \
  --cache_latents \
  --cache_text_encoder_outputs
```
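The command above points --dataset_config at a TOML file. Below is a minimal sketch of such a file following sd-scripts' standard dataset configuration schema; the paths, resolution, and repeat counts are placeholders to adapt to your data:

```toml
[general]
resolution = 1024            # training resolution (bucketing adjusts per image)
caption_extension = ".txt"   # one caption file per image
enable_bucket = true         # allow variable aspect ratios

[[datasets]]
batch_size = 1

  [[datasets.subsets]]
  image_dir = "/path/to/images"  # placeholder path
  num_repeats = 1
```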
System prompts
Lumina Image 2.0 is conditioned on a system prompt in addition to the image caption. You must provide a system prompt during training to match the model's pre-training setup, for example:

```
--system_prompt="You are an assistant designed to generate high-quality images based on user prompts."
--system_prompt="You are an assistant designed to generate high-quality images with the highest degree of image-text alignment based on textual prompts."
```
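Conceptually, the system prompt is joined with the image caption into a single string before it reaches the Gemma2 encoder. A minimal sketch of that conditioning; the "<Prompt Start>" separator follows the Lumina Image 2.0 reference convention, and the exact spacing is an assumption rather than sd-scripts' verbatim implementation:

```python
def build_gemma2_prompt(system_prompt: str, caption: str) -> str:
    """Join the system prompt and per-image caption into one Gemma2 input.

    "<Prompt Start>" is the separator used by the Lumina Image 2.0
    reference pipeline; treat the exact formatting here as illustrative.
    """
    return f"{system_prompt} <Prompt Start> {caption}"
```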
Key training parameters
| Parameter | Description | Default | Recommendation |
|---|---|---|---|
| --network_module | Network module | — | networks.lora_lumina |
| --timestep_sampling | Timestep sampling method | shift | nextdit_shift |
| --discrete_flow_shift | Euler Discrete Scheduler shift | 6.0 | 6.0 |
| --model_prediction_type | Prediction processing | raw | raw |
| --mixed_precision | Mixed precision dtype | — | bf16 |
| --gemma2_max_token_length | Gemma2 max token length | 256 | 256 |
Use --mixed_precision="bf16" for Lumina training. The model was pre-trained in bfloat16, so fp16 can be less stable.
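To see what --discrete_flow_shift does, the sketch below applies the standard flow-matching shift t' = s·t / (1 + (s − 1)·t), which warps uniform timesteps toward the noisier end of the schedule. Whether "nextdit_shift" uses exactly this form internally is an assumption for illustration:

```python
def shift_timestep(t: float, shift: float = 6.0) -> float:
    """Map a uniform timestep t in [0, 1] to a shifted t'.

    Larger shift values concentrate sampled timesteps toward t' = 1
    (the high-noise end), which higher-resolution models tend to need.
    """
    return shift * t / (1.0 + (shift - 1.0) * t)
```

With the default shift of 6.0, a uniform t = 0.5 maps to roughly 0.857, so most training steps see heavily-noised latents.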
Per-component LoRA rank control
Use --network_args to set different ranks for each model component:
```
--network_args \
  "attn_dim=8" \
  "mlp_dim=4" \
  "mod_dim=4" \
  "refiner_dim=4" \
  "embedder_dims=[4,4,4]"
```
The three values in embedder_dims correspond to x_embedder, t_embedder, and caption_embedder, respectively.
Memory optimization
| Option | Effect |
|---|---|
| --blocks_to_swap=<n> | Offloads n transformer blocks to CPU |
| --cache_text_encoder_outputs | Caches Gemma2 outputs |
| --cache_latents / --cache_latents_to_disk | Caches AE outputs |
| --fp8_base | Trains the base model in FP8 precision |
| --use_flash_attn | Enables Flash Attention (requires pip install flash-attn) |
| --use_sage_attn | Enables Sage Attention |
Inference
After training, use lumina_minimal_inference.py to generate images with your LoRA:
```bash
python lumina_minimal_inference.py \
  --pretrained_model_name_or_path "lumina-image-2.safetensors" \
  --gemma2_path "gemma_2_2b_fp16.safetensors" \
  --ae_path "ae.safetensors" \
  --output_dir "./outputs" \
  --offload \
  --seed 1234 \
  --prompt "A mountain landscape at sunset" \
  --system_prompt "You are an assistant designed to generate high-quality images based on user prompts." \
  --lora_weights "my_lumina_lora.safetensors;1.0"
```
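The --lora_weights value uses a "path;multiplier" format, where the number after the semicolon scales the LoRA's effect. A hypothetical helper illustrating that syntax; this is a sketch of the convention, not the script's actual parser:

```python
def parse_lora_weights(spec: str) -> tuple[str, float]:
    """Split 'file.safetensors;0.8' into (path, multiplier).

    If no multiplier is given, default to 1.0 (full strength).
    This parsing is illustrative, not lumina_minimal_inference.py's code.
    """
    if ";" in spec:
        path, mult = spec.rsplit(";", 1)
        return path, float(mult)
    return spec, 1.0
```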
Incompatible options
The following arguments are for SD 1.x/2.x and must not be used for Lumina:
--v2, --v_parameterization, --clip_skip