Overview

Before training, each image in your dataset needs a caption or set of tags. Writing captions manually is time-consuming for large datasets. sd-scripts provides automated tagging scripts that generate per-image .txt caption files you can use directly with a DreamBooth-style or fine-tuning-style dataset config. The primary tool is WD14 Tagger (finetune/tag_images_by_wd14_tagger.py), which uses SmilingWolf’s image-to-tag models to annotate images with Danbooru-style tags. For more natural-language descriptions, you can use the BLIP or GIT captioning scripts instead.

WD14 Tagger

Installation

The tagger uses the dependencies installed with sd-scripts. When running with the --onnx flag (recommended), you additionally need the onnx and onnxruntime packages (pip install onnx onnxruntime, or onnxruntime-gpu for GPU inference); without --onnx, the script falls back to TensorFlow, which must be installed instead.

Basic usage

Run the tagger on a directory of training images:
python finetune/tag_images_by_wd14_tagger.py \
  --onnx \
  --repo_id SmilingWolf/wd-swinv2-tagger-v3 \
  --batch_size 4 \
  /path/to/train_data
On the first run, the model files are downloaded automatically to wd14_tagger_model/. Subsequent runs reuse the cached files. The script creates a .txt file next to each image with the same base name:
train_data/
├── img001.png
├── img001.txt    ← generated tag file
├── img002.jpg
└── img002.txt
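After the tagger finishes, it is worth confirming that every image actually received a caption file before training. A minimal sketch (the directory path and extension set are placeholders; adjust to your data):

```python
from pathlib import Path

# Image extensions to check; extend as needed for your dataset.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}

def find_missing_captions(image_dir, caption_ext=".txt"):
    """Return image paths under image_dir that lack a matching caption file."""
    root = Path(image_dir)
    return [
        p for p in root.rglob("*")
        if p.suffix.lower() in IMAGE_EXTS and not p.with_suffix(caption_ext).exists()
    ]

missing = find_missing_captions("/path/to/train_data")
print(f"{len(missing)} image(s) without captions")
```

Because rglob walks subdirectories, this also catches gaps in datasets tagged with --recursive.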

Available models

All SmilingWolf V2 and V3 models on Hugging Face are supported. Specify the repository ID with --repo_id.
| Model | Repo ID | Notes |
|---|---|---|
| SwinV2 V3 | SmilingWolf/wd-swinv2-tagger-v3 | Recommended; strong general performance |
| ViT V3 | SmilingWolf/wd-vit-tagger-v3 | Vision Transformer architecture |
| ViT Large V3 | SmilingWolf/wd-vit-large-tagger-v3 | Larger ViT; higher accuracy, slower |
| EVA02 Large V3 | SmilingWolf/wd-eva02-large-tagger-v3 | Highest accuracy; most VRAM required |
| ConvNext V2 (default) | SmilingWolf/wd-v1-4-convnext-tagger-v2 | Used when --repo_id is omitted |
The default model when --repo_id is omitted is SmilingWolf/wd-v1-4-convnext-tagger-v2. For new projects, one of the V3 models is recommended.

Common options

--onnx (flag)
Use ONNX Runtime for inference. Omit to use TensorFlow instead.

--repo_id (string, default: "SmilingWolf/wd-v1-4-convnext-tagger-v2")
Hugging Face repository ID for the tagger model. See the table above for available models.

--batch_size (number, default: 1)
Number of images to process per batch. Increase for faster tagging if you have sufficient VRAM (e.g. 4 or 8).

--thresh (number, default: 0.35)
Confidence threshold for outputting a tag. Lower values produce more tags with lower precision; higher values produce fewer tags with higher precision.

--general_threshold (number)
Confidence threshold specifically for general tags. Defaults to --thresh if not set.

--character_threshold (number)
Confidence threshold specifically for character tags. Defaults to --thresh if not set.

--remove_underscore (flag)
Replace underscores in tag names with spaces (e.g. blue_eyes → blue eyes). Required for models trained with spaced tags, such as Animagine XL.

--undesired_tags (string)
Comma-separated list of tags to exclude from output. For example: --undesired_tags "greyscale,monochrome".

--recursive (flag)
Process images in subdirectories as well. Use this when your training data spans multiple nested folders.

--caption_extension (string, default: ".txt")
File extension for generated caption files.

--append_tags (flag)
Append new tags to existing caption files instead of overwriting them.

--model_dir (string, default: "wd14_tagger_model")
Directory to save downloaded model files.

--force_download (flag)
Re-download model files even if they already exist locally.
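The threshold options amount to a simple filter over the model's per-tag confidence scores, with the per-category thresholds falling back to --thresh when unset. A rough sketch of that selection logic (illustrative only; the actual script differs in detail):

```python
def select_tags(scores, general_tags, character_tags,
                thresh=0.35, general_threshold=None, character_threshold=None):
    """Keep tags whose confidence meets the relevant threshold.

    scores: dict mapping tag name -> model confidence in [0, 1].
    general_threshold / character_threshold fall back to thresh when unset,
    mirroring the --general_threshold / --character_threshold defaults.
    """
    g_thresh = general_threshold if general_threshold is not None else thresh
    c_thresh = character_threshold if character_threshold is not None else thresh
    kept = []
    for tag, score in scores.items():
        if tag in character_tags and score >= c_thresh:
            kept.append(tag)
        elif tag in general_tags and score >= g_thresh:
            kept.append(tag)
    return kept

# Lowering the threshold admits more (noisier) tags; at the default 0.35,
# "outdoors" (0.30) is dropped:
scores = {"1girl": 0.98, "blue_eyes": 0.62, "outdoors": 0.30}
print(select_tags(scores,
                  general_tags={"1girl", "blue_eyes", "outdoors"},
                  character_tags=set()))  # → ['1girl', 'blue_eyes']
```

Setting a higher --character_threshold than --thresh is a common way to keep general tags permissive while only emitting character names the model is confident about.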

Tag editing options

--use_rating_tags (flag)
Prepend the image rating tag (e.g. general, sensitive, explicit) to the output.

--use_rating_tags_as_last_tag (flag)
Append the rating tag at the end of the output instead of the beginning.

--character_tags_first (flag)
Output character tags before general tags.

--character_tag_expand (flag)
Split compound character tags like chara_name_(series) into chara_name, series.

--always_first_tags (string)
Comma-separated list of tags to always output first when they appear. For example: --always_first_tags "1girl,1boy".

--tag_replacement (string)
Replace specific tags in the output. Format: original,replacement;original2,replacement2. Escape commas and semicolons with \ when needed.

--caption_separator (string, default: ", ")
String used to separate tags in the output file.
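The --tag_replacement format (pairs separated by ;, each pair written original,replacement, with \ escaping literal commas and semicolons) can be sketched as a small parser. This is an illustration of the format only, not the script's own code, and it ignores rarer edge cases such as escaped backslashes:

```python
import re

def parse_tag_replacement(spec):
    """Parse 'orig,repl;orig2,repl2' into a dict, honoring \\-escapes."""
    def unescape(s):
        return s.replace("\\,", ",").replace("\\;", ";")

    # Split on ';' and ',' only when not preceded by a backslash.
    mapping = {}
    for pair in re.split(r"(?<!\\);", spec):
        orig, repl = re.split(r"(?<!\\),", pair, maxsplit=1)
        mapping[unescape(orig)] = unescape(repl)
    return mapping

print(parse_tag_replacement(r"blue_eyes,blue eyes;1girl,one girl"))
# → {'blue_eyes': 'blue eyes', '1girl': 'one girl'}
```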

Example: Animagine XL 3.1 format

Animagine XL 3.1 expects tags with spaces (not underscores), character tags before general tags, and the rating tag at the end. The following command produces captions in that format:
python finetune/tag_images_by_wd14_tagger.py \
  --onnx \
  --repo_id SmilingWolf/wd-swinv2-tagger-v3 \
  --batch_size 4 \
  --remove_underscore \
  --undesired_tags "PUT,YOUR,UNDESIRED,TAGS" \
  --recursive \
  --use_rating_tags_as_last_tag \
  --character_tags_first \
  --character_tag_expand \
  --always_first_tags "1girl,1boy" \
  /path/to/train_data
Replace PUT,YOUR,UNDESIRED,TAGS with any tags you want to suppress, such as greyscale,monochrome,lowres.

BLIP and GIT captioning

For natural-language captions instead of Danbooru-style tags, use the BLIP or GIT captioning scripts included in finetune/.
python finetune/make_captions.py \
  --batch_size 4 \
  /path/to/train_data
BLIP generates sentence-style descriptions such as "a girl with blue hair standing in front of a window", which suits photographic or painterly subjects where natural-language descriptions are more useful than tag lists.
BLIP and GIT captions work well for general fine-tuning. For anime-style or illustration datasets, WD14 Tagger with Danbooru tags typically gives better training signal because the base model was trained on tag-captioned data.

After tagging

Once you have .txt files next to your images, reference the directory in your dataset config:
[[datasets]]
resolution = 1024
enable_bucket = true

  [[datasets.subsets]]
  image_dir = "/path/to/train_data"
  caption_extension = ".txt"
  # class_tokens is not needed when caption files exist for every image
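Before launching training, it can also help to spot-check a few generated captions against their images. A minimal sketch (the path is a placeholder):

```python
from itertools import islice
from pathlib import Path

def preview_captions(image_dir, n=5, caption_ext=".txt"):
    """Print the first n caption files and their contents."""
    for txt in islice(sorted(Path(image_dir).rglob(f"*{caption_ext}")), n):
        print(f"{txt.name}: {txt.read_text(encoding='utf-8').strip()}")

preview_captions("/path/to/train_data")
```

A quick skim of the output catches common problems early, such as a too-low --thresh producing noisy tag lists or a forgotten --remove_underscore.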
See Dataset Configuration for the full list of options you can use to further control caption handling, shuffling, and augmentation.