What is validation loss?

Validation loss measures how well your model performs on data it has never seen during training. By holding out a small portion of your dataset and evaluating on it periodically, you can detect overfitting: the point at which the model memorizes the training images instead of learning general concepts. Without a validation set you only have training loss, which decreases continuously whether the model is genuinely improving or merely memorizing. When training loss is still falling but validation loss is rising, the model has started to overfit; stop training or reduce your learning rate.
Every validation run uses the same random seed for noise generation and timestep selection. This makes the metric deterministic: any change in the validation loss reflects a real change in model weights, not random variance in the evaluation process.
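The fixed-seed behavior can be sketched in a few lines of Python. This is an illustrative sketch, not the trainer's actual code; the function name and the use of the standard library RNG are invented for the example:

```python
import random

def validation_noise_and_timesteps(seed, batch_size, num_timesteps=1000):
    # A dedicated, fixed-seed RNG makes every validation pass identical,
    # so a change in validation loss reflects a change in model weights,
    # not sampling variance. (Hypothetical sketch; the real trainer uses
    # its own noise and timestep sampling.)
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(batch_size)]
    timesteps = [rng.randrange(num_timesteps) for _ in range(batch_size)]
    return noise, timesteps
```

Calling this twice with the same seed yields identical noise and timesteps, which is exactly what makes the metric comparable across evaluation passes.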

How to enable validation

There are two ways to set up a validation split. The dataset config TOML approach is recommended because it gives you precise control over which images are used for validation.

Configuration options

| Argument | TOML key | Description |
| --- | --- | --- |
| `--validation_split <f>` | `validation_split` | Fraction of data to hold out for validation. The command-line argument applies globally; the TOML key applies per subset. |
| `--validate_every_n_steps <n>` | | Run a validation pass every N training steps. |
| `--validate_every_n_epochs <n>` | | Run a validation pass every N epochs. Defaults to once per epoch when not specified. |
| `--max_validation_steps <n>` | | Cap the number of validation batches per evaluation pass; useful if your validation set is large. Omit to use the entire validation set. |
| `--validation_seed <n>` | `validation_seed` | Seed for validation dataloader shuffling. Falls back to the main training `--seed` when not set. |
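A fractional `validation_split` divides each subset into a training portion and a held-out portion. The helper below illustrates the idea; the function name, rounding, and shuffle order are invented for this sketch and may differ from the trainer's actual implementation:

```python
import random

def split_subset(image_paths, validation_split, seed=42):
    # Hypothetical illustration of how a fractional split might divide
    # one subset: shuffle with a fixed seed, then carve off the first
    # fraction as validation. validation_split=1.0 reserves the whole
    # subset for validation, as in the TOML example below.
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n_val = int(len(paths) * validation_split)
    return paths[n_val:], paths[:n_val]  # (train, validation)
```

With 100 images and `validation_split = 0.1`, 90 images would be trained on and 10 held out for the validation metric.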

Complete example

Step 1: Prepare your dataset config

Create a dataset_config.toml that separates training and validation images:
```toml
[general]
shuffle_caption = true
keep_tokens = 1

[[datasets]]
resolution = "1024,1024"
batch_size = 2

  [[datasets.subsets]]
  image_dir = "path/to/your_images"
  caption_extension = ".txt"
  num_repeats = 10

  [[datasets.subsets]]
  image_dir = "path/to/your_validation_images"
  caption_extension = ".txt"
  validation_split = 1.0
```
Step 2: Run training

Launch training with logging enabled so you can view the validation metric in TensorBoard:
```bash
accelerate launch sdxl_train_network.py \
  --pretrained_model_name_or_path="sd_xl_base_1.0.safetensors" \
  --dataset_config="dataset_config.toml" \
  --output_dir="output" \
  --output_name="my_lora" \
  --network_module=networks.lora \
  --network_dim=32 \
  --network_alpha=16 \
  --save_every_n_epochs=1 \
  --learning_rate=1e-4 \
  --optimizer_type="AdamW8bit" \
  --mixed_precision="bf16" \
  --logging_dir=logs
```
Step 3: Monitor in TensorBoard

Open TensorBoard and watch the loss/validation metric alongside loss/current:
```bash
tensorboard --logdir logs
```
Navigate to http://localhost:6006 in your browser. Look for loss/validation in the Scalars panel.

Reading the results

The validation loss is logged to TensorBoard (or Weights & Biases if you use --log_with=wandb) as loss/validation. A healthy training run looks like this:
  • Training loss decreases steadily over time.
  • Validation loss also decreases, tracking training loss loosely.
  • If validation loss starts rising while training loss keeps falling, the model is overfitting. Consider stopping training, reducing the learning rate, or increasing validation_split to give the model more validation signal.
Keep a separate folder of validation images that you do not include in the training subset. Images the model has never trained on give you a more honest overfitting signal than a randomly split subset that shares the same distribution.
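The divergence pattern described above (training loss falling while validation loss rises) can be checked programmatically once you export the two loss curves. This is a hypothetical helper, not part of the training scripts; the function name and the window-based comparison are invented for illustration:

```python
def is_overfitting(train_losses, val_losses, window=5):
    # Flags overfitting by comparing the mean of the most recent
    # `window` values against the mean of the `window` values before
    # them: training loss still improving while validation loss worsens.
    if len(train_losses) < 2 * window or len(val_losses) < 2 * window:
        return False  # not enough history to compare two windows

    def mean(xs):
        return sum(xs) / len(xs)

    train_recent = mean(train_losses[-window:])
    train_prev = mean(train_losses[-2 * window:-window])
    val_recent = mean(val_losses[-window:])
    val_prev = mean(val_losses[-2 * window:-window])
    return train_recent < train_prev and val_recent > val_prev
```

Averaging over a window rather than comparing single points smooths out the step-to-step noise that both loss curves naturally exhibit.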