HunyuanImage-2.1 is a Diffusion Transformer (DiT) image generation model from Tencent. It uses two text encoders — a large vision-language model (Qwen2.5-VL) and a byte-level encoder (byT5) — alongside a dedicated VAE. sd-scripts supports LoRA training only; full model fine-tuning is not supported for this architecture.

Architecture

  • DiT (Diffusion Transformer) — replaces the UNet. Operates on patchified latents using transformer blocks with flow matching.
  • Qwen2.5-VL (7B) — a vision-language model used as the primary text encoder (bfloat16). Provides rich semantic understanding.
  • byT5 (small) — a byte-level T5 encoder used as an auxiliary text encoder (float16). Improves character-level and OCR fidelity.
  • VAE — HunyuanImage-2.1-specific VAE. Not compatible with SDXL, SD3, or FLUX.1 VAEs.
Full model fine-tuning is not supported for HunyuanImage-2.1. Only LoRA training is available. Additionally, LoRA modules for the text encoders are not supported — you must always pass --network_train_unet_only.

Required model files

Download the following files before training:
| Component | File | Source |
|---|---|---|
| DiT | dit/hunyuanimage2.1.safetensors | tencent/HunyuanImage-2.1 |
| Qwen2.5-VL (text encoder) | split_files/text_encoders/qwen_2.5_vl_7b.safetensors | Comfy-Org/HunyuanImage_2.1_ComfyUI |
| byT5 (text encoder) | split_files/text_encoders/byt5_small_glyphxl_fp16.safetensors | Comfy-Org/HunyuanImage_2.1_ComfyUI |
| VAE | split_files/vae/hunyuan_image_2.1_vae_fp16.safetensors | Comfy-Org/HunyuanImage_2.1_ComfyUI |
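The files in the table can be fetched with the Hugging Face CLI, for example. This is a sketch, not part of the official instructions: the target directory (models/) is an assumption, and huggingface-cli must already be installed (pip install huggingface_hub):

```shell
# Download the DiT weights from the Tencent repository (sketch; adjust --local-dir).
huggingface-cli download tencent/HunyuanImage-2.1 \
  dit/hunyuanimage2.1.safetensors --local-dir models

# Download both text encoders and the VAE from the Comfy-Org repackaged repository.
huggingface-cli download Comfy-Org/HunyuanImage_2.1_ComfyUI \
  split_files/text_encoders/qwen_2.5_vl_7b.safetensors \
  split_files/text_encoders/byt5_small_glyphxl_fp16.safetensors \
  split_files/vae/hunyuan_image_2.1_vae_fp16.safetensors \
  --local-dir models
```

The files land under models/ with the repository's directory structure preserved; pass those paths to the training options shown below.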

Available training methods

| Method | Script | Notes |
|---|---|---|
| LoRA | hunyuan_image_train_network.py | Only supported training method; uses networks.lora_hunyuan_image |

LoRA training

Use hunyuan_image_train_network.py with --network_module=networks.lora_hunyuan_image. The --network_train_unet_only flag is required because text encoder LoRA is not supported:
accelerate launch --num_cpu_threads_per_process 1 hunyuan_image_train_network.py \
  --pretrained_model_name_or_path="hunyuanimage2.1.safetensors" \
  --text_encoder="qwen_2.5_vl_7b.safetensors" \
  --byt5="byt5_small_glyphxl_fp16.safetensors" \
  --vae="hunyuan_image_2.1_vae_fp16.safetensors" \
  --dataset_config="my_hunyuan_dataset_config.toml" \
  --output_dir="./output" \
  --output_name="my_hunyuan_lora" \
  --save_model_as=safetensors \
  --network_module=networks.lora_hunyuan_image \
  --network_dim=16 \
  --network_alpha=1 \
  --network_train_unet_only \
  --learning_rate=1e-4 \
  --optimizer_type="AdamW8bit" \
  --lr_scheduler="constant" \
  --attn_mode="torch" \
  --split_attn \
  --max_train_epochs=10 \
  --save_every_n_epochs=1 \
  --mixed_precision="bf16" \
  --gradient_checkpointing \
  --model_prediction_type="raw" \
  --discrete_flow_shift=5.0 \
  --blocks_to_swap=18 \
  --cache_text_encoder_outputs \
  --cache_latents
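The --dataset_config option points to a TOML file in the sd-scripts dataset configuration format. A minimal sketch is shown below; the image directory, resolution, and batch size are placeholders to adapt to your data, and other dataset options from the sd-scripts documentation apply as usual:

```toml
[general]
enable_bucket = true           # bucket images by aspect ratio
caption_extension = ".txt"     # captions stored as .txt files next to each image

[[datasets]]
resolution = [1024, 1024]      # placeholder training resolution
batch_size = 1

[[datasets.subsets]]
image_dir = "/path/to/train/images"   # placeholder path
num_repeats = 1
```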

VRAM optimization

HunyuanImage-2.1 is a large model. Use the following combinations based on your GPU:
| GPU VRAM | Recommended settings |
|---|---|
| 40 GB+ | Standard settings (no special optimization needed) |
| 24 GB | --fp8_scaled --blocks_to_swap 9 |
| 12 GB | --fp8_scaled --blocks_to_swap 32 |
| 8 GB | --fp8_scaled --blocks_to_swap 37 |
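As an illustration, on a 24 GB GPU the flags from the table are simply appended to the training command shown in the LoRA training section (fragment only, not a complete command):

```shell
# Added to the accelerate launch invocation above for a 24 GB GPU:
  --fp8_scaled \
  --blocks_to_swap 9 \
```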

Key VRAM reduction options

  • --fp8_scaled — trains the DiT in scaled FP8 format. This is the recommended FP8 option for HunyuanImage-2.1 (replaces the unsupported --fp8_base). Essential for GPUs below 40 GB.
  • --fp8_vl — uses FP8 for the Qwen2.5-VL text encoder.
  • --blocks_to_swap <n> — offloads n DiT blocks to CPU. Up to 37 blocks can be swapped.
  • --text_encoder_cpu — runs both text encoders on CPU. Useful when VRAM is below 12 GB. Combine with --cache_text_encoder_outputs_to_disk to avoid re-encoding on every run. Also increase --num_cpu_threads_per_process in the accelerate launch command (e.g., 8 or 16) to speed up encoding.
  • --vae_chunk_size <n> — enables chunked VAE processing. A chunk size of 16 is recommended for low-VRAM environments.
  • --cpu_offload_checkpointing — offloads gradient checkpoints to CPU. Cannot be combined with --blocks_to_swap.
The Adafactor optimizer uses less VRAM than 8-bit AdamW:
--optimizer_type adafactor \
--optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" \
--lr_scheduler constant_with_warmup \
--max_grad_norm 0.0

Key training parameters

| Parameter | Description | Default | Recommendation |
|---|---|---|---|
| --network_module | Network module | — | networks.lora_hunyuan_image |
| --network_train_unet_only | Train DiT only | — | Required |
| --model_prediction_type | Prediction processing | raw | raw |
| --discrete_flow_shift | Flow matching shift value | 5.0 | 5.0 |
| --timestep_sampling | Timestep sampling method | sigma | sigma |
| --attn_mode | Attention implementation | torch | torch |
| --split_attn | Process attention one item at a time | disabled | Recommended with torch |

Attention modes

The --attn_mode option selects the attention implementation:
| Mode | Library | Notes |
|---|---|---|
| torch | PyTorch (built-in) | Default; no additional install required |
| xformers | xformers | Requires pip install xformers; use with --split_attn for batch size > 1 |
| flash | Flash Attention | Requires pip install flash-attn |
| sageattn | Sage Attention | Requires separate install |
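For example, to switch from the default to xformers (install command from the table above; the flag values are fragments appended to the training command, not a complete invocation):

```shell
# Install the optional dependency first:
pip install xformers

# Then select it at training time; --split_attn is advised for batch size > 1:
#   --attn_mode="xformers" \
#   --split_attn \
```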

ComfyUI format conversion

LoRAs trained with sd-scripts use a format that differs slightly from ComfyUI’s expected format. Convert before use in ComfyUI:
python networks/convert_hunyuan_image_lora_to_comfy.py \
  path/to/my_hunyuan_lora.safetensors \
  path/to/my_hunyuan_lora_comfy.safetensors
To convert back from ComfyUI format to sd-scripts format, add --reverse:
python networks/convert_hunyuan_image_lora_to_comfy.py \
  --reverse \
  path/to/my_hunyuan_lora_comfy.safetensors \
  path/to/my_hunyuan_lora_sdscripts.safetensors
Reverse conversion only works for LoRAs that were originally converted by this script. LoRAs created with other training tools cannot be converted back.

Inference

Use hunyuan_image_minimal_inference.py to generate images with your trained LoRA:
python hunyuan_image_minimal_inference.py \
  --dit "hunyuanimage2.1.safetensors" \
  --text_encoder "qwen_2.5_vl_7b.safetensors" \
  --byt5 "byt5_small_glyphxl_fp16.safetensors" \
  --vae "hunyuan_image_2.1_vae_fp16.safetensors" \
  --lora_weight "my_hunyuan_lora.safetensors" \
  --lora_multiplier 1.0 \
  --attn_mode "torch" \
  --prompt "A cute cartoon penguin in a snowy landscape" \
  --image_size 2048 2048 \
  --infer_steps 50 \
  --guidance_scale 3.5 \
  --flow_shift 5.0 \
  --seed 42 \
  --save_path "output_image.png"
The most stable inference resolutions are: 2560×1536, 2304×1792, 2048×2048, 1792×2304, and 1536×2560.
