Common conditioning image types include:
- Canny — edge detection (white lines on black background)
- Depth — depth maps (lighter = closer)
- Pose — skeletal pose estimation (e.g., OpenPose)
- Segmentation — semantic segmentation maps
- Scribble — rough sketch lines
Two approaches
Standard ControlNet
Full ControlNet architecture. Higher capacity, larger output file. Uses train_controlnet.py (SD 1.x/2.x) or sdxl_train_control_net.py (SDXL).
ControlNet-LLLite
Lightweight “LoRA Like Lite” implementation. Smaller model, faster training, less VRAM. Currently SDXL only. Uses sdxl_train_control_net_lllite.py.
ControlNet-LLLite
ControlNet-LLLite is a lighter alternative to full ControlNet, inspired by LoRA’s adapter architecture. Each LLLite module consists of:
- A conditioning image embedding that maps a conditioning image into latent space.
- A small LoRA-like network added to the U-Net’s Linear and Conv layers (currently the CrossAttention attn1 q/k/v and attn2 q projections).
Preparing the dataset
ControlNet-LLLite uses the DreamBooth-style dataset format. For each training image, you need a matching conditioning image:
- The conditioning image must have the same base filename (the name without the extension) as the training image.
- Conditioning images are automatically resized to match the training image.
- Conditioning images do not need caption files.
- random_crop is not supported for ControlNet-LLLite datasets.
- The fine-tuning dataset format (in_json) is not supported; use the DreamBooth directory format.
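Under these rules, a layout might look like the following sketch (all directory and file names here are illustrative assumptions, not prescribed names):

```
train_images/
  1_style/
    img_0001.png      <- training image
    img_0001.txt      <- caption
cond_images/
  img_0001.png        <- conditioning image, same base filename, no caption
```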
Generating a synthetic dataset
The easiest way to build a dataset is to:
Generate training images
Use your base SDXL model to generate a diverse set of images at 1024×1024. Store them in a directory.
Process conditioning images
Apply your conditioning transform (e.g., Canny edge detection) to each generated image. Save results to a separate directory with the same filenames.
Example Canny processing script:
Training configuration
Use a .toml configuration file for your training run:
| Parameter | Description |
|---|---|
| network_dim | Rank of the LoRA-like module. 64 is the default for Canny; reduce to ~32 for simpler conditioning like depth. |
| cond_emb_dim | Dimension of the conditioning image embedding. 32 works well for Canny. |
| full_bf16 | Enable full BFloat16 training (requires an RTX 30-series GPU or later). Recommended for 24 GB VRAM. |
| cache_latents_to_disk | Cache VAE latents to disk to free GPU memory during training. |
| cache_text_encoder_outputs_to_disk | Cache text encoder outputs to disk. |
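A sketch of a possible dataset config, passed to the trainer via --dataset_config (directory names are assumptions; training options such as network_dim go on the command line or in a separate args config):

```toml
[general]
resolution = 1024
caption_extension = ".txt"

[[datasets]]
batch_size = 2

  [[datasets.subsets]]
  image_dir = "train_images"
  conditioning_data_dir = "cond_images"
```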
Training command
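For LLLite, a representative invocation might look like the following; the model path and output names are assumptions, so check the script's --help for the exact flags:

```
accelerate launch sdxl_train_control_net_lllite.py \
  --pretrained_model_name_or_path sd_xl_base_1.0.safetensors \
  --dataset_config dataset.toml \
  --network_dim 64 --cond_emb_dim 32 \
  --mixed_precision bf16 --full_bf16 \
  --cache_latents_to_disk --cache_text_encoder_outputs_to_disk \
  --output_dir output --output_name lllite_canny
```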
Standard ControlNet (SDXL)
For full ControlNet capacity, use sdxl_train_control_net.py. This produces a larger model with more representational power.
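A sketch of the corresponding invocation, under the same assumptions about model and dataset paths:

```
accelerate launch sdxl_train_control_net.py \
  --pretrained_model_name_or_path sd_xl_base_1.0.safetensors \
  --dataset_config dataset.toml \
  --mixed_precision bf16 \
  --output_dir output --output_name controlnet_canny
```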
Inference
To generate images using a trained LLLite model, use sdxl_gen_img.py:
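A representative command; the checkpoint, model, and guide-image paths are assumptions, so consult the script's --help for the exact option names:

```
python sdxl_gen_img.py \
  --ckpt sd_xl_base_1.0.safetensors \
  --control_net_lllite_models output/lllite_canny.safetensors \
  --guide_image_path cond_images/sample_canny.png \
  --prompt "a photo of a cat" \
  --W 1024 --H 1024 --steps 32 \
  --outdir outputs
```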
The --guide_image_path must already be a processed conditioning image (e.g., a Canny-processed image with white edges on a black background). The script does not apply any preprocessing to the guide image.