HunyuanImage-2.1 is a Diffusion Transformer (DiT) image generation model from Tencent. It uses two text encoders — a large vision-language model (Qwen2.5-VL) and a byte-level encoder (byT5) — alongside a dedicated VAE. sd-scripts supports LoRA training only; full model fine-tuning is not supported for this architecture.
## Architecture
- DiT (Diffusion Transformer) — replaces the UNet. Operates on patchified latents using transformer blocks with flow matching.
- Qwen2.5-VL (7B) — a vision-language model used as the primary text encoder (bfloat16). Provides rich semantic understanding.
- byT5 (small) — a byte-level T5 encoder used as an auxiliary text encoder (float16). Improves character-level and OCR fidelity.
- VAE — HunyuanImage-2.1-specific VAE. Not compatible with SDXL, SD3, or FLUX.1 VAEs.
Full model fine-tuning is not supported for HunyuanImage-2.1. Only LoRA training is available. Additionally, LoRA modules for the text encoders are not supported — you must always pass --network_train_unet_only.
## Required model files
Download the following files before training:
| Component | File | Source |
|---|---|---|
| DiT | dit/hunyuanimage2.1.safetensors | tencent/HunyuanImage-2.1 |
| Qwen2.5-VL (text encoder) | split_files/text_encoders/qwen_2.5_vl_7b.safetensors | Comfy-Org/HunyuanImage_2.1_ComfyUI |
| byT5 (text encoder) | split_files/text_encoders/byt5_small_glyphxl_fp16.safetensors | Comfy-Org/HunyuanImage_2.1_ComfyUI |
| VAE | split_files/vae/hunyuan_image_2.1_vae_fp16.safetensors | Comfy-Org/HunyuanImage_2.1_ComfyUI |
## Available training methods
| Method | Script | Notes |
|---|---|---|
| LoRA | hunyuan_image_train_network.py | Only supported training method; uses networks.lora_hunyuan_image |
## LoRA training
Use hunyuan_image_train_network.py with --network_module=networks.lora_hunyuan_image. The --network_train_unet_only flag is required because text encoder LoRA is not supported:
```bash
accelerate launch --num_cpu_threads_per_process 1 hunyuan_image_train_network.py \
  --pretrained_model_name_or_path="hunyuanimage2.1.safetensors" \
  --text_encoder="qwen_2.5_vl_7b.safetensors" \
  --byt5="byt5_small_glyphxl_fp16.safetensors" \
  --vae="hunyuan_image_2.1_vae_fp16.safetensors" \
  --dataset_config="my_hunyuan_dataset_config.toml" \
  --output_dir="./output" \
  --output_name="my_hunyuan_lora" \
  --save_model_as=safetensors \
  --network_module=networks.lora_hunyuan_image \
  --network_dim=16 \
  --network_alpha=1 \
  --network_train_unet_only \
  --learning_rate=1e-4 \
  --optimizer_type="AdamW8bit" \
  --lr_scheduler="constant" \
  --attn_mode="torch" \
  --split_attn \
  --max_train_epochs=10 \
  --save_every_n_epochs=1 \
  --mixed_precision="bf16" \
  --gradient_checkpointing \
  --model_prediction_type="raw" \
  --discrete_flow_shift=5.0 \
  --blocks_to_swap=18 \
  --cache_text_encoder_outputs \
  --cache_latents
```
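The file passed to --dataset_config uses sd-scripts' standard TOML dataset format. A minimal sketch follows; the paths, resolution, and repeat counts are placeholders, so consult the sd-scripts dataset configuration documentation for the full schema:

```toml
# Minimal example dataset config (illustrative values only)
[general]
resolution = [1024, 1024]   # training resolution; adjust for your data
caption_extension = ".txt"  # one caption file per image
batch_size = 1
enable_bucket = true        # aspect-ratio bucketing

[[datasets]]

  [[datasets.subsets]]
  image_dir = "/path/to/train_images"  # placeholder path
  num_repeats = 1
```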
## VRAM optimization
HunyuanImage-2.1 is a large model. Use the following combinations based on your GPU:
| GPU VRAM | Recommended settings |
|---|---|
| 40 GB+ | Standard settings (no special optimization needed) |
| 24 GB | --fp8_scaled --blocks_to_swap 9 |
| 12 GB | --fp8_scaled --blocks_to_swap 32 |
| 8 GB | --fp8_scaled --blocks_to_swap 37 |
### Key VRAM reduction options
- --fp8_scaled — trains the DiT in scaled FP8 format. This is the recommended FP8 option for HunyuanImage-2.1 (replaces the unsupported --fp8_base). Essential for GPUs below 40 GB.
- --fp8_vl — uses FP8 for the Qwen2.5-VL text encoder.
- --blocks_to_swap <n> — offloads n DiT blocks to CPU. Up to 37 blocks can be swapped.
- --text_encoder_cpu — runs both text encoders on CPU. Useful when VRAM is below 12 GB. Combine with --cache_text_encoder_outputs_to_disk to avoid re-encoding on every run. Also increase --num_cpu_threads_per_process in the accelerate launch command (e.g., 8 or 16) to speed up encoding.
- --vae_chunk_size <n> — enables chunked VAE processing. A chunk size of 16 is recommended for low-VRAM environments.
- --cpu_offload_checkpointing — offloads gradient checkpoints to CPU. Cannot be combined with --blocks_to_swap.
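Putting these options together, a plausible low-VRAM (roughly 12 GB class) variant adds the following flags to the training command above. The combination is illustrative rather than tuned, and --cache_latents_to_disk is the disk-backed counterpart of --cache_latents shown earlier:

```bash
# Additional flags for ~12 GB GPUs (illustrative combination)
  --fp8_scaled \
  --fp8_vl \
  --blocks_to_swap 32 \
  --vae_chunk_size 16 \
  --text_encoder_cpu \
  --cache_text_encoder_outputs_to_disk \
  --cache_latents_to_disk
```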
The Adafactor optimizer reduces VRAM consumption more than 8-bit AdamW:

```bash
--optimizer_type adafactor \
--optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" \
--lr_scheduler constant_with_warmup \
--max_grad_norm 0.0
```
## Key training parameters
| Parameter | Description | Default | Recommendation |
|---|---|---|---|
| --network_module | Network module | — | networks.lora_hunyuan_image |
| --network_train_unet_only | Train DiT only | — | Required |
| --model_prediction_type | Prediction processing | raw | raw |
| --discrete_flow_shift | Flow matching shift value | 5.0 | 5.0 |
| --timestep_sampling | Timestep sampling method | sigma | sigma |
| --attn_mode | Attention implementation | torch | torch |
| --split_attn | Process attention one item at a time | disabled | Recommended with torch |
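For intuition about --discrete_flow_shift: flow-matching trainers commonly remap a uniform timestep t in [0, 1] with the shift formula t' = s·t / (1 + (s − 1)·t), pushing training toward the high-noise end, which matters at high resolutions. A small sketch, assuming sd-scripts follows this common convention here:

```python
def shift_timestep(t: float, shift: float = 5.0) -> float:
    """Map a uniform timestep t in [0, 1] to its shifted value.

    Larger shift concentrates sampling near the high-noise end;
    shift=5.0 matches the recommended --discrete_flow_shift value.
    """
    return shift * t / (1.0 + (shift - 1.0) * t)

# Endpoints are preserved:
print(shift_timestep(0.0))  # 0.0
print(shift_timestep(1.0))  # 1.0
# The midpoint moves strongly toward 1 with shift=5.0:
print(shift_timestep(0.5))  # ≈0.833
```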
## Attention modes
The --attn_mode option selects the attention implementation:
| Mode | Library | Notes |
|---|---|---|
| torch | PyTorch (built-in) | Default; no additional install required |
| xformers | xformers | Requires pip install xformers; use with --split_attn for batch size > 1 |
| flash | Flash Attention | Requires pip install flash-attn |
| sageattn | Sage Attention | Requires separate install |
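Conceptually, --split_attn trades a little speed for memory by computing attention one batch item at a time, so the peak allocation holds a single (seq, seq) score matrix instead of (batch, seq, seq). A NumPy sketch of the idea, not the actual sd-scripts implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # q, k, v: (seq, dim); plain scaled dot-product attention
    scale = q.shape[-1] ** -0.5
    return softmax(q @ k.T * scale) @ v

def batched_attention(q, k, v, split=False):
    # q, k, v: (batch, seq, dim)
    if split:
        # One item at a time: smaller peak memory, same result.
        return np.stack([attention(qi, ki, vi) for qi, ki, vi in zip(q, k, v)])
    scale = q.shape[-1] ** -0.5
    scores = softmax(np.einsum("bqd,bkd->bqk", q, k) * scale)
    return np.einsum("bqk,bkd->bqd", scores, v)
```

Both paths produce identical outputs; only the peak memory profile differs.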
## LoRA format conversion for ComfyUI

LoRAs trained with sd-scripts use a format that differs slightly from ComfyUI's expected format. Convert before use in ComfyUI:
```bash
python networks/convert_hunyuan_image_lora_to_comfy.py \
  path/to/my_hunyuan_lora.safetensors \
  path/to/my_hunyuan_lora_comfy.safetensors
```
To convert back from ComfyUI format to sd-scripts format, add --reverse:
```bash
python networks/convert_hunyuan_image_lora_to_comfy.py \
  --reverse \
  path/to/my_hunyuan_lora_comfy.safetensors \
  path/to/my_hunyuan_lora_sdscripts.safetensors
```
Reverse conversion only works for LoRAs that were originally converted by this script. LoRAs created with other training tools cannot be converted back.
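Such conversions are essentially a key-renaming pass over the LoRA state dict. A hedged sketch of the general idea; the prefixes below are hypothetical examples, not the script's actual mapping, so use the provided script for real conversions:

```python
def rename_keys(state_dict: dict, src_prefix: str, dst_prefix: str) -> dict:
    """Rename LoRA weight keys from one naming scheme to another.

    Real converters may also rewrite separators and module names;
    the prefixes used below are hypothetical.
    """
    out = {}
    for key, value in state_dict.items():
        if key.startswith(src_prefix):
            key = dst_prefix + key[len(src_prefix):]
        out[key] = value
    return out

# Hypothetical sd-scripts-style key -> hypothetical ComfyUI-style key
sd = {"lora_unet_blocks_0_attn.lora_down.weight": "<tensor>"}
print(rename_keys(sd, "lora_unet_", "diffusion_model."))
```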
## Inference
Use hunyuan_image_minimal_inference.py to generate images with your trained LoRA:
```bash
python hunyuan_image_minimal_inference.py \
  --dit "hunyuanimage2.1.safetensors" \
  --text_encoder "qwen_2.5_vl_7b.safetensors" \
  --byt5 "byt5_small_glyphxl_fp16.safetensors" \
  --vae "hunyuan_image_2.1_vae_fp16.safetensors" \
  --lora_weight "my_hunyuan_lora.safetensors" \
  --lora_multiplier 1.0 \
  --attn_mode "torch" \
  --prompt "A cute cartoon penguin in a snowy landscape" \
  --image_size 2048 2048 \
  --infer_steps 50 \
  --guidance_scale 3.5 \
  --flow_shift 5.0 \
  --seed 42 \
  --save_path "output_image.png"
```
The most stable inference resolutions are: 2560×1536, 2304×1792, 2048×2048, 1792×2304, and 1536×2560.
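These resolutions all sit near the same pixel budget of roughly four megapixels, just at different aspect ratios; a quick arithmetic check:

```python
# Stable resolutions from the list above
resolutions = [(2560, 1536), (2304, 1792), (2048, 2048), (1792, 2304), (1536, 2560)]
for w, h in resolutions:
    print(f"{w}x{h}: {w * h / 1e6:.2f} MP, aspect {w / h:.2f}")
```

All entries land between about 3.93 and 4.19 megapixels, so picking a resolution is mostly a choice of aspect ratio.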