HunyuanImage-2.1 is a Diffusion Transformer (DiT) image generation model from Tencent. It uses two text encoders — a large vision-language model (Qwen2.5-VL) and a byte-level encoder (byT5) — alongside a dedicated VAE. sd-scripts supports LoRA training only; full model fine-tuning is not supported for this architecture.

Architecture

  • DiT (Diffusion Transformer) — replaces the UNet. Operates on patchified latents using transformer blocks with flow matching.
  • Qwen2.5-VL (7B) — a vision-language model used as the primary text encoder (bfloat16). Provides rich semantic understanding.
  • byT5 (small) — a byte-level T5 encoder used as an auxiliary text encoder (float16). Improves character-level and OCR fidelity.
  • VAE — HunyuanImage-2.1-specific VAE. Not compatible with SDXL, SD3, or FLUX.1 VAEs.
Full model fine-tuning is not supported for HunyuanImage-2.1. Only LoRA training is available. Additionally, LoRA modules for the text encoders are not supported — you must always pass --network_train_unet_only.

Required model files

Download the following files before training:
| Component | File | Source |
|---|---|---|
| DiT | dit/hunyuanimage2.1.safetensors | tencent/HunyuanImage-2.1 |
| Qwen2.5-VL (text encoder) | split_files/text_encoders/qwen_2.5_vl_7b.safetensors | Comfy-Org/HunyuanImage_2.1_ComfyUI |
| byT5 (text encoder) | split_files/text_encoders/byt5_small_glyphxl_fp16.safetensors | Comfy-Org/HunyuanImage_2.1_ComfyUI |
| VAE | split_files/vae/hunyuan_image_2.1_vae_fp16.safetensors | Comfy-Org/HunyuanImage_2.1_ComfyUI |
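The files in the table can be fetched with the Hugging Face CLI, for example. This is a sketch, not part of the official instructions: the target directory (models/) is an assumption, and huggingface-cli must already be installed (pip install huggingface_hub):

```shell
# Download the DiT weights from the Tencent repository (sketch; adjust --local-dir).
huggingface-cli download tencent/HunyuanImage-2.1 \
  dit/hunyuanimage2.1.safetensors --local-dir models

# Download both text encoders and the VAE from the Comfy-Org repackaged repository.
huggingface-cli download Comfy-Org/HunyuanImage_2.1_ComfyUI \
  split_files/text_encoders/qwen_2.5_vl_7b.safetensors \
  split_files/text_encoders/byt5_small_glyphxl_fp16.safetensors \
  split_files/vae/hunyuan_image_2.1_vae_fp16.safetensors \
  --local-dir models
```

The files land under models/ with the repository's directory structure preserved; pass those paths to the training options shown below.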

Available training methods

| Method | Script | Notes |
|---|---|---|
| LoRA | hunyuan_image_train_network.py | Only supported training method; uses networks.lora_hunyuan_image |

LoRA training

Use hunyuan_image_train_network.py with --network_module=networks.lora_hunyuan_image. The --network_train_unet_only flag is required because text encoder LoRA is not supported:
accelerate launch --num_cpu_threads_per_process 1 hunyuan_image_train_network.py \
  --pretrained_model_name_or_path="hunyuanimage2.1.safetensors" \
  --text_encoder="qwen_2.5_vl_7b.safetensors" \
  --byt5="byt5_small_glyphxl_fp16.safetensors" \
  --vae="hunyuan_image_2.1_vae_fp16.safetensors" \
  --dataset_config="my_hunyuan_dataset_config.toml" \
  --output_dir="./output" \
  --output_name="my_hunyuan_lora" \
  --save_model_as=safetensors \
  --network_module=networks.lora_hunyuan_image \
  --network_dim=16 \
  --network_alpha=1 \
  --network_train_unet_only \
  --learning_rate=1e-4 \
  --optimizer_type="AdamW8bit" \
  --lr_scheduler="constant" \
  --attn_mode="torch" \
  --split_attn \
  --max_train_epochs=10 \
  --save_every_n_epochs=1 \
  --mixed_precision="bf16" \
  --gradient_checkpointing \
  --model_prediction_type="raw" \
  --discrete_flow_shift=5.0 \
  --blocks_to_swap=18 \
  --cache_text_encoder_outputs \
  --cache_latents
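The --dataset_config option points to a TOML file in the sd-scripts dataset configuration format. A minimal sketch is shown below; the image directory, resolution, and batch size are placeholders to adapt to your data, and other dataset options from the sd-scripts documentation apply as usual:

```toml
[general]
enable_bucket = true           # bucket images by aspect ratio
caption_extension = ".txt"     # captions stored as .txt files next to each image

[[datasets]]
resolution = [1024, 1024]      # placeholder training resolution
batch_size = 1

[[datasets.subsets]]
image_dir = "/path/to/train/images"   # placeholder path
num_repeats = 1
```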

VRAM optimization

HunyuanImage-2.1 is a large model. Use the following combinations based on your GPU:
| GPU VRAM | Recommended settings |
|---|---|
| 40 GB+ | Standard settings (no special optimization needed) |
| 24 GB | --fp8_scaled --blocks_to_swap 9 |
| 12 GB | --fp8_scaled --blocks_to_swap 32 |
| 8 GB | --fp8_scaled --blocks_to_swap 37 |
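As an illustration, on a 24 GB GPU the flags from the table are simply appended to the training command shown in the LoRA training section (fragment only, not a complete command):

```shell
# Added to the accelerate launch invocation above for a 24 GB GPU:
  --fp8_scaled \
  --blocks_to_swap 9 \
```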

Key VRAM reduction options

  • --fp8_scaled — trains the DiT in scaled FP8 format. This is the recommended FP8 option for HunyuanImage-2.1 (replaces the unsupported --fp8_base). Essential for GPUs below 40 GB.
  • --fp8_vl — uses FP8 for the Qwen2.5-VL text encoder.
  • --blocks_to_swap <n> — offloads n DiT blocks to CPU. Up to 37 blocks can be swapped.
  • --text_encoder_cpu — runs both text encoders on CPU. Useful when VRAM is below 12 GB. Combine with --cache_text_encoder_outputs_to_disk to avoid re-encoding on every run. Also increase --num_cpu_threads_per_process in the accelerate launch command (e.g., 8 or 16) to speed up encoding.
  • --vae_chunk_size <n> — enables chunked VAE processing. A chunk size of 16 is recommended for low-VRAM environments.
  • --cpu_offload_checkpointing — offloads gradient checkpoints to CPU. Cannot be combined with --blocks_to_swap.
The Adafactor optimizer uses less VRAM than 8-bit AdamW:
--optimizer_type adafactor \
--optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" \
--lr_scheduler constant_with_warmup \
--max_grad_norm 0.0

Key training parameters

| Parameter | Description | Default | Recommendation |
|---|---|---|---|
| --network_module | Network module | — | networks.lora_hunyuan_image |
| --network_train_unet_only | Train DiT only | — | Required |
| --model_prediction_type | Prediction processing | raw | raw |
| --discrete_flow_shift | Flow matching shift value | 5.0 | 5.0 |
| --timestep_sampling | Timestep sampling method | sigma | sigma |
| --attn_mode | Attention implementation | torch | torch |
| --split_attn | Process attention one item at a time | disabled | Recommended with torch |

Attention modes

The --attn_mode option selects the attention implementation:
| Mode | Library | Notes |
|---|---|---|
| torch | PyTorch (built-in) | Default; no additional install required |
| xformers | xformers | Requires pip install xformers; use with --split_attn for batch size > 1 |
| flash | Flash Attention | Requires pip install flash-attn |
| sageattn | Sage Attention | Requires separate install |
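For example, to switch from the default to xformers (install command from the table above; the flag values are fragments appended to the training command, not a complete invocation):

```shell
# Install the optional dependency first:
pip install xformers

# Then select it at training time; --split_attn is advised for batch size > 1:
#   --attn_mode="xformers" \
#   --split_attn \
```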

ComfyUI format conversion

LoRAs trained with sd-scripts use a format that differs slightly from ComfyUI’s expected format. Convert before use in ComfyUI:
python networks/convert_hunyuan_image_lora_to_comfy.py \
  path/to/my_hunyuan_lora.safetensors \
  path/to/my_hunyuan_lora_comfy.safetensors
To convert back from ComfyUI format to sd-scripts format, add --reverse:
python networks/convert_hunyuan_image_lora_to_comfy.py \
  --reverse \
  path/to/my_hunyuan_lora_comfy.safetensors \
  path/to/my_hunyuan_lora_sdscripts.safetensors
Reverse conversion only works for LoRAs that were originally converted by this script. LoRAs created with other training tools cannot be converted back.

Inference

Use hunyuan_image_minimal_inference.py to generate images with your trained LoRA:
python hunyuan_image_minimal_inference.py \
  --dit "hunyuanimage2.1.safetensors" \
  --text_encoder "qwen_2.5_vl_7b.safetensors" \
  --byt5 "byt5_small_glyphxl_fp16.safetensors" \
  --vae "hunyuan_image_2.1_vae_fp16.safetensors" \
  --lora_weight "my_hunyuan_lora.safetensors" \
  --lora_multiplier 1.0 \
  --attn_mode "torch" \
  --prompt "A cute cartoon penguin in a snowy landscape" \
  --image_size 2048 2048 \
  --infer_steps 50 \
  --guidance_scale 3.5 \
  --flow_shift 5.0 \
  --seed 42 \
  --save_path "output_image.png"
The most stable inference resolutions are: 2560×1536, 2304×1792, 2048×2048, 1792×2304, and 1536×2560.
