Overview
Before training, each image in your dataset needs a caption or set of tags. Writing captions manually is time-consuming for large datasets. sd-scripts provides automated tagging scripts that generate per-image `.txt` caption files you can use directly with a DreamBooth-style or fine-tuning-style dataset config.
The primary tool is WD14 Tagger (finetune/tag_images_by_wd14_tagger.py), which uses SmilingWolf’s image-to-tag models to annotate images with Danbooru-style tags. For more natural-language descriptions, you can use the BLIP or GIT captioning scripts instead.
WD14 Tagger
Installation
- ONNX (recommended)
- TensorFlow
ONNX inference is faster and does not require a full TensorFlow installation. Install the runtime that matches your hardware:
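For example, assuming a CUDA-capable GPU (these are the standard ONNX Runtime package names; pick the variant that matches your setup):

```shell
# GPU (CUDA):
pip install onnx onnxruntime-gpu

# CPU only:
pip install onnx onnxruntime
```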
Basic usage
Run the tagger on a directory of training images. On the first run, the model files are downloaded automatically to `wd14_tagger_model/` (configurable with `--model_dir`). Subsequent runs reuse the cached files.
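For example (the directory path is a placeholder; `--batch_size` is optional):

```shell
python finetune/tag_images_by_wd14_tagger.py --onnx --batch_size 4 /path/to/train_data
```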
The script creates a .txt file next to each image with the same base name:
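For example, with hypothetical file names and tags:

```
train_data/
├── img001.png
├── img001.txt   # contains: 1girl, solo, long_hair, blue_eyes, smile
├── img002.jpg
└── img002.txt
```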
Available models
All SmilingWolf V2 and V3 models on Hugging Face are supported. Specify the repository ID with `--repo_id`.
| Model | Repo ID | Notes |
|---|---|---|
| SwinV2 V3 | SmilingWolf/wd-swinv2-tagger-v3 | Recommended; strong general performance |
| ViT V3 | SmilingWolf/wd-vit-tagger-v3 | Vision Transformer architecture |
| ViT Large V3 | SmilingWolf/wd-vit-large-tagger-v3 | Larger ViT; higher accuracy, slower |
| EVA02 Large V3 | SmilingWolf/wd-eva02-large-tagger-v3 | Highest accuracy; most VRAM required |
| ConvNext V2 (default) | SmilingWolf/wd-v1-4-convnext-tagger-v2 | Used when --repo_id is omitted |
The default model when `--repo_id` is omitted is SmilingWolf/wd-v1-4-convnext-tagger-v2. For new projects, one of the V3 models is recommended.
Common options
- `--onnx`: Use ONNX Runtime for inference. Omit to use TensorFlow instead.
- `--repo_id`: Hugging Face repository ID for the tagger model. See the table above for available models.
- `--batch_size`: Number of images to process per batch. Increase for faster tagging if you have sufficient VRAM (e.g. 4 or 8).
- `--thresh`: Confidence threshold for outputting a tag. Lower values produce more tags with lower precision; higher values produce fewer tags with higher precision.
- `--general_threshold`: Confidence threshold specifically for general tags. Defaults to `--thresh` if not set.
- `--character_threshold`: Confidence threshold specifically for character tags. Defaults to `--thresh` if not set.
- `--remove_underscore`: Replace underscores in tag names with spaces (e.g. `blue_eyes` → `blue eyes`). Required for models trained with spaced tags, such as Animagine XL.
- `--undesired_tags`: Comma-separated list of tags to exclude from the output. For example: `--undesired_tags "greyscale,monochrome"`.
- `--recursive`: Process images in subdirectories as well. Use this when your training data spans multiple nested folders.
- `--caption_extension`: File extension for generated caption files.
- `--append_tags`: Append new tags to existing caption files instead of overwriting them.
- `--model_dir`: Directory to save downloaded model files.
- `--force_download`: Re-download model files even if they already exist locally.
Tag editing options
- `--use_rating_tags`: Prepend the image rating tag (e.g. `general`, `sensitive`, `explicit`) to the output.
- `--use_rating_tags_as_last_tag`: Append the rating tag at the end of the output instead of the beginning.
- `--character_tags_first`: Output character tags before general tags.
- `--character_tag_expand`: Split compound character tags like `chara_name_(series)` into `chara_name, series`.
- `--always_first_tags`: Comma-separated list of tags to always output first when they appear. For example: `--always_first_tags "1girl,1boy"`.
- `--tag_replacement`: Replace specific tags in the output. Format: `original,replacement;original2,replacement2`. Escape commas and semicolons with `\` when needed.
- `--caption_separator`: String used to separate tags in the output file.
Example: Animagine XL 3.1 format
Animagine XL 3.1 expects tags with spaces (not underscores), character tags first, and rating tags at the end. Run the following command (on a single line in practice), replacing PUT,YOUR,UNDESIRED,TAGS with any tags you want to suppress, such as greyscale,monochrome,lowres.
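A sketch of such a command, built from the options documented above (the model choice, batch size, and directory path are illustrative):

```shell
python finetune/tag_images_by_wd14_tagger.py \
  --onnx \
  --repo_id SmilingWolf/wd-swinv2-tagger-v3 \
  --batch_size 4 \
  --remove_underscore \
  --character_tags_first \
  --character_tag_expand \
  --use_rating_tags_as_last_tag \
  --always_first_tags "1girl,1boy" \
  --undesired_tags "PUT,YOUR,UNDESIRED,TAGS" \
  /path/to/train_data
```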
BLIP and GIT captioning
For natural-language captions instead of Danbooru-style tags, use the BLIP or GIT captioning scripts included in `finetune/`.
- BLIP
- GIT
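A sketch of typical invocations, assuming the script names used in the sd-scripts repository (`finetune/make_captions.py` for BLIP and `finetune/make_captions_by_git.py` for GIT; flags are illustrative):

```shell
# BLIP captioning; model weights are downloaded on first run
python finetune/make_captions.py --batch_size 4 --caption_extension .txt /path/to/train_data

# GIT captioning
python finetune/make_captions_by_git.py --batch_size 4 --caption_extension .txt /path/to/train_data
```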
BLIP and GIT captions work well for general fine-tuning. For anime-style or illustration datasets, WD14 Tagger with Danbooru tags typically gives better training signal because the base model was trained on tag-captioned data.
After tagging
Once you have `.txt` files next to your images, reference the directory in your dataset config:
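For example, a minimal dataset config in sd-scripts' TOML format; the image directory, resolution, and repeat count are placeholders:

```toml
[general]
caption_extension = ".txt"   # match the tagger's output extension
shuffle_caption = true

[[datasets]]
resolution = 1024

  [[datasets.subsets]]
  image_dir = "/path/to/train_data"
  num_repeats = 1
```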
