ControlNet adds spatial conditioning to diffusion models, letting you control outputs with structured inputs like edge maps, depth maps, and pose skeletons. Instead of a text prompt alone, you pair each training image with a conditioning image that encodes the spatial structure you want the model to learn to follow.
Common conditioning image types include:
  • Canny — edge detection (white lines on black background)
  • Depth — depth maps (lighter = closer)
  • Pose — skeletal pose estimation (e.g., OpenPose)
  • Segmentation — semantic segmentation maps
  • Scribble — rough sketch lines
Each conditioning type must be trained separately with its own matched dataset.

Two approaches

Standard ControlNet

Full ControlNet architecture. Higher capacity, larger output file. Uses train_controlnet.py (SD 1.x/2.x) or sdxl_train_control_net.py (SDXL).

ControlNet-LLLite

Lightweight “LoRA Like Lite” implementation. Smaller model, faster training, less VRAM. Currently SDXL only. Uses sdxl_train_control_net_lllite.py.

ControlNet-LLLite

ControlNet-LLLite is a lighter alternative to full ControlNet, inspired by LoRA’s adapter architecture. Each LLLite module consists of:
  1. A conditioning image embedding that maps a conditioning image into latent space.
  2. A small LoRA-like network added to the U-Net’s Linear and Conv layers (currently CrossAttention: attn1 q/k/v, attn2 q).
This design is much smaller than full ControlNet and works well for most conditioning types, though it has limited capacity for complex conditioning signals or non-standard art styles.
ControlNet-LLLite is an experimental implementation and may change significantly in future releases. It currently supports SDXL only.
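The adapter idea can be sketched as a toy in NumPy. This is an illustrative sketch only, not the actual sd-scripts implementation: the names (`cond_proj`, `down`, `up`), shapes, and the way the conditioning embedding is injected are assumptions; the real modules attach to the U-Net's attention projections listed above.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, rank, cond_dim = 64, 8, 16

# Frozen base weight of some Linear layer in the U-Net (stand-in).
W = rng.normal(size=(d_model, d_model))

# LLLite-style additions (names and shapes are illustrative):
cond_proj = rng.normal(size=(cond_dim, rank))  # conditioning embedding -> adapter bottleneck
down = rng.normal(size=(d_model, rank))        # LoRA-like down projection
up = np.zeros((rank, d_model))                 # up projection, zero-initialized so the
                                               # adapter starts out as a no-op

def forward(x, cond_emb):
    """Base Linear output plus a low-rank update modulated by the conditioning embedding."""
    base = x @ W
    h = x @ down + cond_emb @ cond_proj        # inject conditioning into the bottleneck
    return base + h @ up

x = rng.normal(size=(1, d_model))
cond = rng.normal(size=(1, cond_dim))
out = forward(x, cond)

# With zero-initialized `up`, the adapter leaves the base output unchanged at the start.
assert np.allclose(out, x @ W)
```

The zero-initialized up projection mirrors the common adapter trick of starting training from an identity function, so the base model's behavior is preserved at step zero.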

Preparing the dataset

ControlNet-LLLite uses the DreamBooth-style dataset format. For each training image, you need a conditioning image with the same base filename (the name without the extension). An example dataset configuration:
[general]
flip_aug = false
color_aug = false
resolution = [1024, 1024]

[[datasets]]
batch_size = 8
enable_bucket = false

  [[datasets.subsets]]
  image_dir = "path/to/training/images"
  caption_extension = ".txt"
  conditioning_data_dir = "path/to/conditioning/images"
  • The conditioning image must have the same base filename as the training image.
  • Conditioning images are automatically resized to match the training image.
  • Conditioning images do not need caption files.
  • random_crop is not supported for ControlNet-LLLite datasets.
  • The finetuning-method dataset format (in_json) is not supported; use the DreamBooth directory format.
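Because every training image needs a same-named conditioning image, a quick sanity check before launching a run can catch mismatched pairs early. A minimal stdlib sketch (the directory paths are placeholders):

```python
import os

def check_pairs(image_dir: str, cond_dir: str, exts=(".png", ".jpg", ".webp")):
    """Return base filenames present in image_dir but missing from cond_dir."""
    def stems(d):
        return {os.path.splitext(f)[0] for f in os.listdir(d)
                if f.lower().endswith(exts)}
    return sorted(stems(image_dir) - stems(cond_dir))

# Usage:
# missing = check_pairs("path/to/training/images", "path/to/conditioning/images")
# if missing:
#     raise SystemExit(f"Missing conditioning images for: {missing}")
```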

Generating a synthetic dataset

The easiest way to build a dataset is to:
1. Generate training images

Use your base SDXL model to generate a diverse set of images at 1024×1024. Store them in a directory.
2. Process conditioning images

Apply your conditioning transform (e.g., Canny edge detection) to each generated image and save the results to a separate directory with the same filenames. Example Canny processing script:
import glob
import os

import cv2

IMAGES_DIR = "path/to/generated/images"
CANNY_DIR = "path/to/canny/images"

os.makedirs(CANNY_DIR, exist_ok=True)
for img_file in glob.glob(os.path.join(IMAGES_DIR, "*.png")):
    out_file = os.path.join(CANNY_DIR, os.path.basename(img_file))
    if os.path.exists(out_file):
        continue  # skip images that were already processed
    img = cv2.imread(img_file)
    # Canny edge detection with lower/upper thresholds of 100/200;
    # the output is white edges on a black background.
    canny = cv2.Canny(img, 100, 200)
    cv2.imwrite(out_file, canny)
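The fixed (100, 200) thresholds work for many images, but a widely used heuristic ("auto-Canny") derives the thresholds from the image's median pixel value instead. The helper below is a pure-Python sketch of that heuristic; you could compute the median of the grayscale image and substitute the result for the constants above:

```python
def canny_thresholds(median_pixel: float, sigma: float = 0.33) -> tuple[int, int]:
    """Lower/upper Canny thresholds centered on the median intensity.

    sigma widens or narrows the band around the median; 0.33 is a common default.
    """
    lower = max(0, int((1.0 - sigma) * median_pixel))
    upper = min(255, int((1.0 + sigma) * median_pixel))
    return lower, upper

# e.g. for a mid-gray image (median 128): canny_thresholds(128) -> (85, 170)
```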
3. Create caption files

Create a .txt caption file for each training image. If you generated the images with sdxl_gen_img.py, you can extract captions from the image metadata:
import glob
import os

from PIL import Image

IMAGES_DIR = "path/to/generated/images"

for img_file in glob.glob(os.path.join(IMAGES_DIR, "*.png")):
    cap_file = os.path.splitext(img_file)[0] + ".txt"
    if os.path.exists(cap_file):
        continue  # skip images that already have captions
    img = Image.open(img_file)
    # sdxl_gen_img.py stores the generation prompt in the PNG text metadata.
    prompt = img.text.get("prompt", "")
    with open(cap_file, "w") as f:
        f.write(prompt + "\n")

Training configuration

Use a .toml configuration file for your training run:
pretrained_model_name_or_path = "/path/to/sdxl_model.safetensors"
max_train_epochs = 12
max_data_loader_n_workers = 4
persistent_data_loader_workers = true
seed = 42
gradient_checkpointing = true
mixed_precision = "bf16"
save_precision = "bf16"
full_bf16 = true
optimizer_type = "adamw8bit"
learning_rate = 2e-4
xformers = true
output_dir = "/path/to/output/dir"
output_name = "my_canny_lllite"
save_every_n_epochs = 1
save_model_as = "safetensors"
vae_batch_size = 4
cache_latents = true
cache_latents_to_disk = true
cache_text_encoder_outputs = true
cache_text_encoder_outputs_to_disk = true
network_dim = 64
cond_emb_dim = 32
dataset_config = "/path/to/dataset.toml"
Key parameters:
  • network_dim — Rank of the LoRA-like module. 64 is the default for Canny; reduce to ~32 for simpler conditioning like depth.
  • cond_emb_dim — Dimension of the conditioning image embedding. 32 works well for Canny.
  • full_bf16 — Enable full BFloat16 training (requires RTX 30 series or later). Recommended for 24 GB VRAM.
  • cache_latents_to_disk — Cache VAE latents to disk to free GPU memory during training.
  • cache_text_encoder_outputs_to_disk — Cache text encoder outputs to disk.

Training command

accelerate launch --mixed_precision bf16 sdxl_train_control_net_lllite.py \
  --config_file "lllite_config.toml"
Or pass all arguments directly:
accelerate launch --mixed_precision bf16 sdxl_train_control_net_lllite.py \
  --pretrained_model_name_or_path "sdxl_model.safetensors" \
  --dataset_config "dataset.toml" \
  --output_dir "output" \
  --output_name "canny_lllite" \
  --network_dim 64 \
  --cond_emb_dim 32 \
  --learning_rate 2e-4 \
  --optimizer_type adamw8bit \
  --max_train_epochs 12 \
  --mixed_precision bf16 \
  --full_bf16 \
  --gradient_checkpointing \
  --cache_latents \
  --cache_latents_to_disk \
  --cache_text_encoder_outputs \
  --cache_text_encoder_outputs_to_disk \
  --save_every_n_epochs 1 \
  --save_model_as safetensors

Standard ControlNet (SDXL)

For full ControlNet capacity, use sdxl_train_control_net.py. This produces a larger model with more representational power.
accelerate launch --mixed_precision bf16 sdxl_train_control_net.py \
  --pretrained_model_name_or_path "sdxl_model.safetensors" \
  --dataset_config "dataset.toml" \
  --output_dir "output" \
  --output_name "sdxl_controlnet_canny" \
  --learning_rate 1e-5 \
  --optimizer_type AdamW8bit \
  --max_train_epochs 10 \
  --mixed_precision bf16 \
  --gradient_checkpointing \
  --cache_latents \
  --save_every_n_epochs 1

Inference

To generate images using a trained LLLite model, use sdxl_gen_img.py:
python sdxl_gen_img.py \
  --ckpt path/to/sdxl.safetensors \
  --control_net_lllite_models path/to/canny_lllite.safetensors \
  --guide_image_path path/to/canny_image.png \
  --prompt "your prompt here" \
  --W 1024 --H 1024
The --guide_image_path must already be a processed conditioning image (e.g., a Canny-processed image with white edges on black background). The script does not apply any preprocessing to the guide image.
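Since the guide image is used as-is, it can be worth confirming its dimensions match --W/--H before a long generation run. A minimal stdlib sketch that reads the width and height from a PNG's IHDR chunk (PNG files only; a real pipeline might simply use Pillow's `Image.open(...).size` instead):

```python
import struct

def png_size(path: str) -> tuple[int, int]:
    """Read (width, height) from a PNG file's IHDR chunk."""
    with open(path, "rb") as f:
        header = f.read(24)
    if header[:8] != b"\x89PNG\r\n\x1a\n":
        raise ValueError("not a PNG file")
    # Bytes 16-24 hold the 4-byte big-endian width and height of the IHDR chunk.
    return struct.unpack(">II", header[16:24])

# Usage:
# w, h = png_size("path/to/canny_image.png")
# assert (w, h) == (1024, 1024)
```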
