ControlNet adds spatial conditioning to diffusion models, letting you control outputs with structured inputs like edge maps, depth maps, and pose skeletons. Instead of a text prompt alone, you pair each training image with a conditioning image that encodes the spatial structure you want the model to learn to follow.
Common conditioning image types include:
  • Canny — edge detection (white lines on black background)
  • Depth — depth maps (lighter = closer)
  • Pose — skeletal pose estimation (e.g., OpenPose)
  • Segmentation — semantic segmentation maps
  • Scribble — rough sketch lines
Each conditioning type must be trained separately with its own matched dataset.

Two approaches

Standard ControlNet

Full ControlNet architecture. Higher capacity, larger output file. Uses train_controlnet.py (SD 1.x/2.x) or sdxl_train_control_net.py (SDXL).

ControlNet-LLLite

Lightweight “LoRA Like Lite” implementation. Smaller model, faster training, less VRAM. Currently SDXL only. Uses sdxl_train_control_net_lllite.py.

ControlNet-LLLite

ControlNet-LLLite is a lighter alternative to full ControlNet, inspired by LoRA’s adapter architecture. Each LLLite module consists of:
  1. A conditioning image embedding that maps a conditioning image into latent space.
  2. A small LoRA-like network added to the U-Net’s Linear and Conv layers (currently CrossAttention: attn1 q/k/v, attn2 q).
This design is much smaller than full ControlNet and works well for most conditioning types, though it has limited capacity for complex conditioning signals or non-standard art styles.
ControlNet-LLLite is an experimental implementation and may change significantly in future releases. It currently supports SDXL only.
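The adapter idea can be sketched as a toy in NumPy. This is an illustrative sketch only, not the actual sd-scripts implementation: the names (`cond_proj`, `down`, `up`), shapes, and the way the conditioning embedding is injected are assumptions; the real modules attach to the U-Net's attention projections listed above.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, rank, cond_dim = 64, 8, 16

# Frozen base weight of some Linear layer in the U-Net (stand-in).
W = rng.normal(size=(d_model, d_model))

# LLLite-style additions (names and shapes are illustrative):
cond_proj = rng.normal(size=(cond_dim, rank))  # conditioning embedding -> adapter bottleneck
down = rng.normal(size=(d_model, rank))        # LoRA-like down projection
up = np.zeros((rank, d_model))                 # up projection, zero-initialized so the
                                               # adapter starts out as a no-op

def forward(x, cond_emb):
    """Base Linear output plus a low-rank update modulated by the conditioning embedding."""
    base = x @ W
    h = x @ down + cond_emb @ cond_proj        # inject conditioning into the bottleneck
    return base + h @ up

x = rng.normal(size=(1, d_model))
cond = rng.normal(size=(1, cond_dim))
out = forward(x, cond)

# With zero-initialized `up`, the adapter leaves the base output unchanged at the start.
assert np.allclose(out, x @ W)
```

The zero-initialized up projection mirrors the common adapter trick of starting training from an identity function, so the base model's behavior is preserved at step zero.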

Preparing the dataset

ControlNet-LLLite uses the DreamBooth-style dataset format. For each training image, you need a conditioning image with the same base filename (the name without the extension). An example dataset configuration:
[general]
flip_aug = false
color_aug = false
resolution = [1024, 1024]

[[datasets]]
batch_size = 8
enable_bucket = false

  [[datasets.subsets]]
  image_dir = "path/to/training/images"
  caption_extension = ".txt"
  conditioning_data_dir = "path/to/conditioning/images"
  • The conditioning image must have the same base filename as the training image.
  • Conditioning images are automatically resized to match the training image.
  • Conditioning images do not need caption files.
  • random_crop is not supported for ControlNet-LLLite datasets.
  • The finetuning-method dataset format (in_json) is not supported; use the DreamBooth directory format.
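Because every training image needs a same-named conditioning image, a quick sanity check before launching a run can catch mismatched pairs early. A minimal stdlib sketch (the directory paths are placeholders):

```python
import os

def check_pairs(image_dir: str, cond_dir: str, exts=(".png", ".jpg", ".webp")):
    """Return base filenames present in image_dir but missing from cond_dir."""
    def stems(d):
        return {os.path.splitext(f)[0] for f in os.listdir(d)
                if f.lower().endswith(exts)}
    return sorted(stems(image_dir) - stems(cond_dir))

# Usage:
# missing = check_pairs("path/to/training/images", "path/to/conditioning/images")
# if missing:
#     raise SystemExit(f"Missing conditioning images for: {missing}")
```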

Generating a synthetic dataset

The easiest way to build a dataset is to:
1. Generate training images

Use your base SDXL model to generate a diverse set of images at 1024×1024. Store them in a directory.
2. Process conditioning images

Apply your conditioning transform (e.g., Canny edge detection) to each generated image and save the results to a separate directory with the same filenames. Example Canny processing script:
import glob
import os

import cv2

IMAGES_DIR = "path/to/generated/images"
CANNY_DIR = "path/to/canny/images"

os.makedirs(CANNY_DIR, exist_ok=True)
for img_file in glob.glob(os.path.join(IMAGES_DIR, "*.png")):
    out_file = os.path.join(CANNY_DIR, os.path.basename(img_file))
    if os.path.exists(out_file):
        continue  # skip images that were already processed
    img = cv2.imread(img_file)
    # Canny edge detection with lower/upper thresholds of 100/200;
    # the output is white edges on a black background.
    canny = cv2.Canny(img, 100, 200)
    cv2.imwrite(out_file, canny)
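The fixed (100, 200) thresholds work for many images, but a widely used heuristic ("auto-Canny") derives the thresholds from the image's median pixel value instead. The helper below is a pure-Python sketch of that heuristic; you could compute the median of the grayscale image and substitute the result for the constants above:

```python
def canny_thresholds(median_pixel: float, sigma: float = 0.33) -> tuple[int, int]:
    """Lower/upper Canny thresholds centered on the median intensity.

    sigma widens or narrows the band around the median; 0.33 is a common default.
    """
    lower = max(0, int((1.0 - sigma) * median_pixel))
    upper = min(255, int((1.0 + sigma) * median_pixel))
    return lower, upper

# e.g. for a mid-gray image (median 128): canny_thresholds(128) -> (85, 170)
```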
3. Create caption files

Create a .txt caption file for each training image. If you generated the images with sdxl_gen_img.py, you can extract captions from the image metadata:
import glob
import os

from PIL import Image

IMAGES_DIR = "path/to/generated/images"

for img_file in glob.glob(os.path.join(IMAGES_DIR, "*.png")):
    cap_file = os.path.splitext(img_file)[0] + ".txt"
    if os.path.exists(cap_file):
        continue  # skip images that already have captions
    img = Image.open(img_file)
    # sdxl_gen_img.py stores the generation prompt in the PNG text metadata.
    prompt = img.text.get("prompt", "")
    with open(cap_file, "w") as f:
        f.write(prompt + "\n")

Training configuration

Use a .toml configuration file for your training run:
pretrained_model_name_or_path = "/path/to/sdxl_model.safetensors"
max_train_epochs = 12
max_data_loader_n_workers = 4
persistent_data_loader_workers = true
seed = 42
gradient_checkpointing = true
mixed_precision = "bf16"
save_precision = "bf16"
full_bf16 = true
optimizer_type = "adamw8bit"
learning_rate = 2e-4
xformers = true
output_dir = "/path/to/output/dir"
output_name = "my_canny_lllite"
save_every_n_epochs = 1
save_model_as = "safetensors"
vae_batch_size = 4
cache_latents = true
cache_latents_to_disk = true
cache_text_encoder_outputs = true
cache_text_encoder_outputs_to_disk = true
network_dim = 64
cond_emb_dim = 32
dataset_config = "/path/to/dataset.toml"
Key parameters:
  • network_dim — Rank of the LoRA-like module. 64 is the default for Canny; reduce to ~32 for simpler conditioning like depth.
  • cond_emb_dim — Dimension of the conditioning image embedding. 32 works well for Canny.
  • full_bf16 — Enable full BFloat16 training (requires RTX 30 series or later). Recommended for 24 GB VRAM.
  • cache_latents_to_disk — Cache VAE latents to disk to free GPU memory during training.
  • cache_text_encoder_outputs_to_disk — Cache text encoder outputs to disk.

Training command

accelerate launch --mixed_precision bf16 sdxl_train_control_net_lllite.py \
  --config_file "lllite_config.toml"
Or pass all arguments directly:
accelerate launch --mixed_precision bf16 sdxl_train_control_net_lllite.py \
  --pretrained_model_name_or_path "sdxl_model.safetensors" \
  --dataset_config "dataset.toml" \
  --output_dir "output" \
  --output_name "canny_lllite" \
  --network_dim 64 \
  --cond_emb_dim 32 \
  --learning_rate 2e-4 \
  --optimizer_type adamw8bit \
  --max_train_epochs 12 \
  --mixed_precision bf16 \
  --full_bf16 \
  --gradient_checkpointing \
  --cache_latents \
  --cache_latents_to_disk \
  --cache_text_encoder_outputs \
  --cache_text_encoder_outputs_to_disk \
  --save_every_n_epochs 1 \
  --save_model_as safetensors

Standard ControlNet (SDXL)

For full ControlNet capacity, use sdxl_train_control_net.py. This produces a larger model with more representational power.
accelerate launch --mixed_precision bf16 sdxl_train_control_net.py \
  --pretrained_model_name_or_path "sdxl_model.safetensors" \
  --dataset_config "dataset.toml" \
  --output_dir "output" \
  --output_name "sdxl_controlnet_canny" \
  --learning_rate 1e-5 \
  --optimizer_type AdamW8bit \
  --max_train_epochs 10 \
  --mixed_precision bf16 \
  --gradient_checkpointing \
  --cache_latents \
  --save_every_n_epochs 1

Inference

To generate images using a trained LLLite model, use sdxl_gen_img.py:
python sdxl_gen_img.py \
  --ckpt path/to/sdxl.safetensors \
  --control_net_lllite_models path/to/canny_lllite.safetensors \
  --guide_image_path path/to/canny_image.png \
  --prompt "your prompt here" \
  --W 1024 --H 1024
The --guide_image_path must already be a processed conditioning image (e.g., a Canny-processed image with white edges on black background). The script does not apply any preprocessing to the guide image.
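Since the guide image is used as-is, it can be worth confirming its dimensions match --W/--H before a long generation run. A minimal stdlib sketch that reads the width and height from a PNG's IHDR chunk (PNG files only; a real pipeline might simply use Pillow's `Image.open(...).size` instead):

```python
import struct

def png_size(path: str) -> tuple[int, int]:
    """Read (width, height) from a PNG file's IHDR chunk."""
    with open(path, "rb") as f:
        header = f.read(24)
    if header[:8] != b"\x89PNG\r\n\x1a\n":
        raise ValueError("not a PNG file")
    # Bytes 16-24 hold the 4-byte big-endian width and height of the IHDR chunk.
    return struct.unpack(">II", header[16:24])

# Usage:
# w, h = png_size("path/to/canny_image.png")
# assert (w, h) == (1024, 1024)
```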
