Textual Inversion teaches Stable Diffusion new concepts by learning new token embeddings in the text encoder. Unlike fine-tuning or LoRA, it never touches the U-Net or any model weights — it only optimizes a small set of embedding vectors that represent your concept.

How it works

When you train Textual Inversion, you:
  1. Choose a unique token string (e.g., mychar) that doesn’t exist in the model’s vocabulary.
  2. Initialize that token’s embedding vector from a known word (e.g., girl).
  3. Train the embedding so that, when you write mychar in a prompt, the model generates your concept.
The result is a tiny .safetensors file (kilobytes to a few megabytes) containing only the learned embedding vectors.
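The three steps above can be sketched in a few lines of PyTorch. This is a toy illustration of the mechanic, not the training script: the token IDs are hypothetical, and the table sizes merely match SD 1.x's CLIP text encoder.

```python
import torch

# Toy text-encoder embedding table: 49408 vocab entries, 768-dim vectors
# (sizes match SD 1.x's CLIP text encoder; the values here are random).
vocab_size, dim = 49408, 768
embeddings = torch.nn.Embedding(vocab_size + 1, dim)  # +1 row for "mychar"

# Step 2: initialize the new token's vector from a known word ("girl").
girl_id, mychar_id = 1611, vocab_size  # hypothetical IDs
with torch.no_grad():
    embeddings.weight[mychar_id] = embeddings.weight[girl_id]

# Step 3: freeze the whole table; only the new vector receives gradients.
embeddings.weight.requires_grad_(False)
new_vector = embeddings.weight[mychar_id].clone().requires_grad_(True)
optimizer = torch.optim.AdamW([new_vector], lr=1e-6)
```

Because the optimizer sees only `new_vector`, the U-Net and the rest of the text encoder are untouched, which is why the resulting file is so small.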

Fine-tuning vs LoRA vs Textual Inversion

| | Fine-tuning | LoRA | Textual Inversion |
| --- | --- | --- | --- |
| What’s trained | All model weights | Adapter network | Embedding vectors only |
| File size | Several GB | Few MB to hundreds of MB | Kilobytes to a few MB |
| VRAM requirement | High | Medium | Low |
| Expressive power | Highest | High | Limited to text-space concepts |
| Best for | Full style overhaul | Characters, styles | Simple concepts, styles |

Supported models

| Script | Models |
| --- | --- |
| train_textual_inversion.py | Stable Diffusion 1.x and 2.x |
| sdxl_train_textual_inversion.py | Stable Diffusion XL |
Textual Inversion is not currently supported for FLUX, SD3, or Lumina. For those architectures, use LoRA or fine-tuning instead.

Dataset requirements

Textual Inversion typically needs fewer images than LoRA or fine-tuning. 5–20 images of the concept you want to teach is often sufficient, though more images and varied compositions improve generalization. Create a TOML dataset configuration file:
```toml
[general]
shuffle_caption = false
caption_extension = ".txt"
keep_tokens = 1

[[datasets]]
resolution = 512          # Use 1024 for SDXL
batch_size = 4            # Can be larger than LoRA training
enable_bucket = true

  [[datasets.subsets]]
  image_dir = "path/to/images"
  caption_extension = ".txt"
  num_repeats = 10
```
Caption guidelines: Every caption file must include your --token_string. For example, if your token is mychar, write captions like:
```text
mychar, 1girl, blonde hair, blue eyes, smiling
```
The token string can appear anywhere in the caption but must be present. You can verify token recognition with --debug_dataset — look for token IDs ≥ 49408 (those are your new custom tokens).

Key arguments

--token_string (string, required)
The unique token string for your concept. Must not already exist in the tokenizer’s vocabulary. Use this string in all your training captions (e.g., mychar 1girl).

--init_word (string)
The word used to initialize the embedding vector. Choose something conceptually close to what you’re teaching (e.g., girl, dog, painting). Must resolve to a single token.

--num_vectors_per_token (integer, default: 1)
Number of embedding vectors to use for this token. More vectors give greater expressiveness but consume tokens from the 77-token prompt limit. Values of 2–4 are common. With --token_string=mychar and --num_vectors_per_token=4, the system creates: mychar, mychar1, mychar2, mychar3.
--weights (string)
Path to an existing embedding file to resume training from.

--use_object_template (boolean)
Ignore captions and use built-in object templates like "a photo of a {}". Matches the original Textual Inversion paper implementation. Good for characters and objects.

--use_style_template (boolean)
Ignore captions and use built-in style templates like "a painting in the style of {}". Good for artistic styles.
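To see how template mode sidesteps caption files, here is a sketch. The template strings below are representative examples only; the scripts ship longer built-in lists, and the real implementation may select templates differently (e.g., at random):

```python
# Hypothetical subsets of the object/style template lists from the
# original Textual Inversion paper; the training scripts ship longer lists.
OBJECT_TEMPLATES = ["a photo of a {}", "a rendering of a {}"]
STYLE_TEMPLATES = ["a painting in the style of {}", "a rendering in the style of {}"]

def make_caption(templates: list[str], token: str, step: int) -> str:
    # Each step draws a template instead of reading a caption file and
    # fills the placeholder with the token string.
    return templates[step % len(templates)].format(token)

print(make_caption(OBJECT_TEMPLATES, "mychar", 0))  # a photo of a mychar
```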
| Argument | Recommended value | Notes |
| --- | --- | --- |
| --learning_rate | 1e-6 | Lower than LoRA training; adjust based on results |
| --max_train_steps | 1000–2000 | Fewer steps needed vs fine-tuning |
| --optimizer_type | AdamW8bit | Memory-efficient |
| --mixed_precision | fp16 or bf16 | Reduces VRAM usage |
| --cache_latents | (flag) | Pre-encode VAE outputs; reduces VRAM usage |
| --gradient_checkpointing | (flag) | Additional VRAM savings |
| Argument | Description |
| --- | --- |
| --cache_text_encoder_outputs | Cache text encoder outputs to save VRAM |
| --mixed_precision bf16 | Use bf16 on RTX 30 series or later |

Training commands

```bash
accelerate launch --num_cpu_threads_per_process 1 train_textual_inversion.py \
  --pretrained_model_name_or_path="path/to/model.safetensors" \
  --dataset_config="dataset_config.toml" \
  --output_dir="output" \
  --output_name="my_textual_inversion" \
  --save_model_as="safetensors" \
  --token_string="mychar" \
  --init_word="girl" \
  --num_vectors_per_token=4 \
  --max_train_steps=1600 \
  --learning_rate=1e-6 \
  --optimizer_type="AdamW8bit" \
  --mixed_precision="fp16" \
  --cache_latents \
  --sdpa
```

Using the trained embedding

The trained embedding file is a .safetensors file saved to --output_dir. To use it:

AUTOMATIC1111 WebUI

Place the .safetensors file in the embeddings/ folder. Use the token string in your prompt.

ComfyUI

Load the embedding with the appropriate embedding node and use the token string in prompts.

Diffusers

Load the embedding file with the pipeline's `load_textual_inversion()` method, passing the embedding path and your token string.
In your prompts, simply use the token string you trained — for example, "mychar standing in a park" — and the model will apply the learned embedding automatically.

Troubleshooting

The token string already exists in the vocabulary
  • Use a unique string that doesn’t appear in the model’s vocabulary. Try adding numbers or uncommon character combinations (e.g., mychar123, xyzperson).
The concept isn’t being learned
  • Make sure every caption includes the token string.
  • Try a lower learning rate (e.g., 5e-7).
  • Increase the number of training steps.
  • Enable --cache_latents to stabilize training.
Out-of-memory errors
  • Reduce batch size in the dataset config.
  • Add --gradient_checkpointing.
  • Add --cache_latents.
