Textual Inversion teaches Stable Diffusion new concepts by learning new token embeddings in the text encoder. Unlike fine-tuning or LoRA, it never touches the U-Net or any model weights — it only optimizes a small set of embedding vectors that represent your concept.

How it works

When you train Textual Inversion, you:
  1. Choose a unique token string (e.g., mychar) that doesn’t exist in the model’s vocabulary.
  2. Initialize that token’s embedding vector from a known word (e.g., girl).
  3. Train the embedding so that, when you write mychar in a prompt, the model generates your concept.
The result is a tiny .safetensors file (kilobytes to a few megabytes) containing only the learned embedding vectors.
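The three steps above can be sketched in a few lines of PyTorch. This is a toy illustration of the mechanic, not the training script: the token IDs are hypothetical, and the table sizes merely match SD 1.x's CLIP text encoder.

```python
import torch

# Toy text-encoder embedding table: 49408 vocab entries, 768-dim vectors
# (sizes match SD 1.x's CLIP text encoder; the values here are random).
vocab_size, dim = 49408, 768
embeddings = torch.nn.Embedding(vocab_size + 1, dim)  # +1 row for "mychar"

# Step 2: initialize the new token's vector from a known word ("girl").
girl_id, mychar_id = 1611, vocab_size  # hypothetical IDs
with torch.no_grad():
    embeddings.weight[mychar_id] = embeddings.weight[girl_id]

# Step 3: freeze the whole table; only the new vector receives gradients.
embeddings.weight.requires_grad_(False)
new_vector = embeddings.weight[mychar_id].clone().requires_grad_(True)
optimizer = torch.optim.AdamW([new_vector], lr=1e-6)
```

Because the optimizer sees only `new_vector`, the U-Net and the rest of the text encoder are untouched, which is why the resulting file is so small.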

Fine-tuning vs LoRA vs Textual Inversion

| | Fine-tuning | LoRA | Textual Inversion |
| --- | --- | --- | --- |
| What’s trained | All model weights | Adapter network | Embedding vectors only |
| File size | Several GB | Few MB to hundreds of MB | Kilobytes to a few MB |
| VRAM requirement | High | Medium | Low |
| Expressive power | Highest | High | Limited to text-space concepts |
| Best for | Full style overhaul | Characters, styles | Simple concepts, styles |

Supported models

| Script | Models |
| --- | --- |
| train_textual_inversion.py | Stable Diffusion 1.x and 2.x |
| sdxl_train_textual_inversion.py | Stable Diffusion XL |
Textual Inversion is not currently supported for FLUX, SD3, or Lumina. For those architectures, use LoRA or fine-tuning instead.

Dataset requirements

Textual Inversion typically needs fewer images than LoRA or fine-tuning. 5–20 images of the concept you want to teach is often sufficient, though more images and varied compositions improve generalization. Create a TOML dataset configuration file:
```toml
[general]
shuffle_caption = false
caption_extension = ".txt"
keep_tokens = 1

[[datasets]]
resolution = 512          # Use 1024 for SDXL
batch_size = 4            # Can be larger than LoRA training
enable_bucket = true

  [[datasets.subsets]]
  image_dir = "path/to/images"
  caption_extension = ".txt"
  num_repeats = 10
```
Caption guidelines: Every caption file must include your --token_string. For example, if your token is mychar, write captions like:
```text
mychar, 1girl, blonde hair, blue eyes, smiling
```
The token string can appear anywhere in the caption but must be present. You can verify token recognition with --debug_dataset — look for token IDs ≥ 49408 (those are your new custom tokens).

Key arguments

--token_string (string, required)
The unique token string for your concept. Must not already exist in the tokenizer’s vocabulary. Use this string in all your training captions (e.g., mychar 1girl).

--init_word (string)
The word used to initialize the embedding vector. Choose something conceptually close to what you’re teaching (e.g., girl, dog, painting). Must resolve to a single token.

--num_vectors_per_token (integer, default: 1)
Number of embedding vectors to use for this token. More vectors give greater expressiveness but consume tokens from the 77-token prompt limit. Values of 2–4 are common. With --token_string=mychar and --num_vectors_per_token=4, the system creates: mychar, mychar1, mychar2, mychar3.
--weights (string)
Path to an existing embedding file to resume training from.

--use_object_template (boolean)
Ignore captions and use built-in object templates like "a photo of a {}". Matches the original Textual Inversion paper implementation. Good for characters and objects.

--use_style_template (boolean)
Ignore captions and use built-in style templates like "a painting in the style of {}". Good for artistic styles.
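To see how template mode sidesteps caption files, here is a sketch. The template strings below are representative examples only; the scripts ship longer built-in lists, and the real implementation may select templates differently (e.g., at random):

```python
# Hypothetical subsets of the object/style template lists from the
# original Textual Inversion paper; the training scripts ship longer lists.
OBJECT_TEMPLATES = ["a photo of a {}", "a rendering of a {}"]
STYLE_TEMPLATES = ["a painting in the style of {}", "a rendering in the style of {}"]

def make_caption(templates: list[str], token: str, step: int) -> str:
    # Each step draws a template instead of reading a caption file and
    # fills the placeholder with the token string.
    return templates[step % len(templates)].format(token)

print(make_caption(OBJECT_TEMPLATES, "mychar", 0))  # a photo of a mychar
```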
| Argument | Recommended value | Notes |
| --- | --- | --- |
| --learning_rate | 1e-6 | Lower than LoRA training; adjust based on results |
| --max_train_steps | 1000–2000 | Fewer steps needed vs fine-tuning |
| --optimizer_type | AdamW8bit | Memory-efficient |
| --mixed_precision | fp16 or bf16 | Reduces VRAM usage |
| --cache_latents | (flag) | Pre-encode VAE outputs; reduces VRAM usage |
| --gradient_checkpointing | (flag) | Additional VRAM savings |
| Argument | Description |
| --- | --- |
| --cache_text_encoder_outputs | Cache text encoder outputs to save VRAM |
| --mixed_precision bf16 | Use bf16 on RTX 30 series or later |

Training commands

```bash
accelerate launch --num_cpu_threads_per_process 1 train_textual_inversion.py \
  --pretrained_model_name_or_path="path/to/model.safetensors" \
  --dataset_config="dataset_config.toml" \
  --output_dir="output" \
  --output_name="my_textual_inversion" \
  --save_model_as="safetensors" \
  --token_string="mychar" \
  --init_word="girl" \
  --num_vectors_per_token=4 \
  --max_train_steps=1600 \
  --learning_rate=1e-6 \
  --optimizer_type="AdamW8bit" \
  --mixed_precision="fp16" \
  --cache_latents \
  --sdpa
```

Using the trained embedding

The trained embedding file is a .safetensors file saved to --output_dir. To use it:

AUTOMATIC1111 WebUI

Place the .safetensors file in the embeddings/ folder. Use the token string in your prompt.

ComfyUI

Load the embedding with the appropriate embedding node and use the token string in prompts.

Diffusers

Load the embedding file with the pipeline's `load_textual_inversion()` method, passing the embedding path and your token string.
In your prompts, simply use the token string you trained — for example, "mychar standing in a park" — and the model will apply the learned embedding automatically.

Troubleshooting

The token string already exists in the vocabulary
  • Use a unique string that doesn’t appear in the model’s vocabulary. Try adding numbers or uncommon character combinations (e.g., mychar123, xyzperson).
The concept isn’t being learned
  • Make sure every caption includes the token string.
  • Try a lower learning rate (e.g., 5e-7).
  • Increase the number of training steps.
  • Enable --cache_latents to stabilize training.
Out-of-memory errors
  • Reduce batch size in the dataset config.
  • Add --gradient_checkpointing.
  • Add --cache_latents.
