Overview
Before training, each image in your dataset needs a caption or set of tags. Writing captions manually is time-consuming for large datasets. sd-scripts provides automated tagging scripts that generate per-image `.txt` caption files you can use directly with a DreamBooth-style or fine-tuning-style dataset config.
The primary tool is WD14 Tagger (finetune/tag_images_by_wd14_tagger.py), which uses SmilingWolf’s image-to-tag models to annotate images with Danbooru-style tags. For more natural-language descriptions, you can use the BLIP or GIT captioning scripts instead.
WD14 Tagger
Installation
- ONNX (recommended)
- TensorFlow
ONNX inference is faster and does not require a full TensorFlow installation. Install the runtime that matches your hardware:
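For example, assuming a CUDA-capable GPU (these are the standard ONNX Runtime package names; pick the variant that matches your setup):

```shell
# GPU (CUDA):
pip install onnx onnxruntime-gpu

# CPU only:
pip install onnx onnxruntime
```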
Basic usage
Run the tagger on a directory of training images. On the first run, the model files are downloaded automatically to `wd14_tagger_model/` (configurable with `--model_dir`). Subsequent runs reuse the cached files.
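For example (the directory path is a placeholder; `--batch_size` is optional):

```shell
python finetune/tag_images_by_wd14_tagger.py --onnx --batch_size 4 /path/to/train_data
```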
The script creates a .txt file next to each image with the same base name:
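For example, with hypothetical file names and tags:

```
train_data/
├── img001.png
├── img001.txt   # contains: 1girl, solo, long_hair, blue_eyes, smile
├── img002.jpg
└── img002.txt
```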
Available models
All SmilingWolf V2 and V3 models on Hugging Face are supported. Specify the repository ID with `--repo_id`.
| Model | Repo ID | Notes |
|---|---|---|
| SwinV2 V3 | SmilingWolf/wd-swinv2-tagger-v3 | Recommended; strong general performance |
| ViT V3 | SmilingWolf/wd-vit-tagger-v3 | Vision Transformer architecture |
| ViT Large V3 | SmilingWolf/wd-vit-large-tagger-v3 | Larger ViT; higher accuracy, slower |
| EVA02 Large V3 | SmilingWolf/wd-eva02-large-tagger-v3 | Highest accuracy; most VRAM required |
| ConvNext V2 (default) | SmilingWolf/wd-v1-4-convnext-tagger-v2 | Used when --repo_id is omitted |
The default model when `--repo_id` is omitted is SmilingWolf/wd-v1-4-convnext-tagger-v2. For new projects, one of the V3 models is recommended.
Common options
- `--onnx`: Use ONNX Runtime for inference. Omit to use TensorFlow instead.
- `--repo_id`: Hugging Face repository ID for the tagger model. See the table above for available models.
- `--batch_size`: Number of images to process per batch. Increase for faster tagging if you have sufficient VRAM (e.g. 4 or 8).
- `--thresh`: Confidence threshold for outputting a tag. Lower values produce more tags with lower precision; higher values produce fewer tags with higher precision.
- `--general_threshold`: Confidence threshold specifically for general tags. Defaults to `--thresh` if not set.
- `--character_threshold`: Confidence threshold specifically for character tags. Defaults to `--thresh` if not set.
- `--remove_underscore`: Replace underscores in tag names with spaces (e.g. `blue_eyes` → `blue eyes`). Required for models trained with spaced tags, such as Animagine XL.
- `--undesired_tags`: Comma-separated list of tags to exclude from the output. For example: `--undesired_tags "greyscale,monochrome"`.
- `--recursive`: Process images in subdirectories as well. Use this when your training data spans multiple nested folders.
- `--caption_extension`: File extension for generated caption files.
- `--append_tags`: Append new tags to existing caption files instead of overwriting them.
- `--model_dir`: Directory to save downloaded model files.
- `--force_download`: Re-download model files even if they already exist locally.
Tag editing options
- `--use_rating_tags`: Prepend the image rating tag (e.g. `general`, `sensitive`, `explicit`) to the output.
- `--use_rating_tags_as_last_tag`: Append the rating tag at the end of the output instead of the beginning.
- `--character_tags_first`: Output character tags before general tags.
- `--character_tag_expand`: Split compound character tags like `chara_name_(series)` into `chara_name, series`.
- `--always_first_tags`: Comma-separated list of tags to always output first when they appear. For example: `--always_first_tags "1girl,1boy"`.
- `--tag_replacement`: Replace specific tags in the output. Format: `original,replacement;original2,replacement2`. Escape commas and semicolons with `\` when needed.
- `--caption_separator`: String used to separate tags in the output file.
Example: Animagine XL 3.1 format
Animagine XL 3.1 expects tags with spaces (not underscores), character tags first, and rating tags at the end. Run the following command (on a single line in practice), replacing PUT,YOUR,UNDESIRED,TAGS with any tags you want to suppress, such as greyscale,monochrome,lowres.
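A sketch of such a command, built from the options documented above (the model choice, batch size, and directory path are illustrative):

```shell
python finetune/tag_images_by_wd14_tagger.py \
  --onnx \
  --repo_id SmilingWolf/wd-swinv2-tagger-v3 \
  --batch_size 4 \
  --remove_underscore \
  --character_tags_first \
  --character_tag_expand \
  --use_rating_tags_as_last_tag \
  --always_first_tags "1girl,1boy" \
  --undesired_tags "PUT,YOUR,UNDESIRED,TAGS" \
  /path/to/train_data
```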
BLIP and GIT captioning
For natural-language captions instead of Danbooru-style tags, use the BLIP or GIT captioning scripts included in `finetune/`.
- BLIP
- GIT
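A sketch of typical invocations, assuming the script names used in the sd-scripts repository (`finetune/make_captions.py` for BLIP and `finetune/make_captions_by_git.py` for GIT; flags are illustrative):

```shell
# BLIP captioning; model weights are downloaded on first run
python finetune/make_captions.py --batch_size 4 --caption_extension .txt /path/to/train_data

# GIT captioning
python finetune/make_captions_by_git.py --batch_size 4 --caption_extension .txt /path/to/train_data
```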
BLIP and GIT captions work well for general fine-tuning. For anime-style or illustration datasets, WD14 Tagger with Danbooru tags typically gives better training signal because the base model was trained on tag-captioned data.
After tagging
Once you have `.txt` files next to your images, reference the directory in your dataset config:
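For example, a minimal dataset config in sd-scripts' TOML format; the image directory, resolution, and repeat count are placeholders:

```toml
[general]
caption_extension = ".txt"   # match the tagger's output extension
shuffle_caption = true

[[datasets]]
resolution = 1024

  [[datasets.subsets]]
  image_dir = "/path/to/train_data"
  num_repeats = 1
```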
