This page covers advanced training topics: resuming from checkpoints, early stopping, multi-GPU DDP training, and memory optimization with gradient checkpointing.
All examples on this page use the RFDETR.train() high-level API. For custom callbacks, non-default loggers, and fine-grained distributed training control, use the Custom Training API with PyTorch Lightning primitives directly.

Resume training

Resume training from a previously saved checkpoint by passing the path to checkpoint.pth via the resume argument. This restores the full training state so you can continue exactly where you left off. The training loop automatically loads:
  • Model weights
  • Optimizer state
  • Learning rate scheduler state
  • Training epoch number
1. Run initial training

Start a training run and let it produce at least one checkpoint:
from rfdetr import RFDETRMedium

model = RFDETRMedium()

model.train(
    dataset_dir="path/to/dataset",
    epochs=100,
    batch_size=4,
    grad_accum_steps=4,
    lr=1e-4,
    output_dir="output",
)
2. Resume from checkpoint

Pass the checkpoint path to continue training:
from rfdetr import RFDETRMedium

model = RFDETRMedium()

model.train(
    dataset_dir="path/to/dataset",
    epochs=100,
    batch_size=4,
    grad_accum_steps=4,
    lr=1e-4,
    output_dir="output",
    resume="output/checkpoint.pth",
)
  • Use resume="checkpoint.pth" to continue training with full optimizer state restored.
  • Use pretrain_weights="checkpoint_best_total.pth" when initializing a model to start fresh training from those weights — this does not restore optimizer or scheduler state.
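To make the distinction concrete, here is a sketch of starting a fresh run from a previous run's best weights (paths are illustrative). Because this goes through pretrain_weights rather than resume, the optimizer, scheduler, and epoch counter start from scratch:

```python
from rfdetr import RFDETRMedium

# Initialize from earlier best weights; no optimizer/scheduler state is restored.
model = RFDETRMedium(pretrain_weights="output/checkpoint_best_total.pth")

model.train(
    dataset_dir="path/to/new_dataset",  # can be a different dataset
    epochs=50,
    output_dir="output_finetune",
)
```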

Early stopping

Early stopping monitors validation mAP and halts training if improvements stay below a threshold for a set number of epochs. This prevents wasted compute once the model has converged.
from rfdetr import RFDETRMedium

model = RFDETRMedium()

model.train(
    dataset_dir="path/to/dataset",
    epochs=100,
    batch_size=4,
    grad_accum_steps=4,
    lr=1e-4,
    output_dir="output",
    early_stopping=True,
)

Configuration options

Parameter                  Default  Description
early_stopping_patience    10       Number of epochs without improvement before stopping
early_stopping_min_delta   0.001    Minimum mAP change to count as improvement
early_stopping_use_ema     False    Use the EMA model’s mAP for comparisons

Advanced example

model.train(
    dataset_dir="path/to/dataset",
    epochs=200,
    early_stopping=True,
    early_stopping_patience=15,      # wait 15 epochs before stopping
    early_stopping_min_delta=0.005,  # require 0.5% mAP improvement
    early_stopping_use_ema=True,     # track EMA model performance
)

How it works

  1. After each epoch, validation mAP is computed.
  2. If mAP improves by at least min_delta, the patience counter resets.
  3. If mAP doesn’t improve, the patience counter increments.
  4. When the patience counter reaches patience, training stops.
  5. The best checkpoint is already saved as checkpoint_best_total.pth.
Epoch 10: mAP = 0.450 (best: 0.450) - counter: 0
Epoch 11: mAP = 0.455 (best: 0.455) - counter: 0 (improved)
Epoch 12: mAP = 0.454 (best: 0.455) - counter: 1 (no improvement)
Epoch 13: mAP = 0.453 (best: 0.455) - counter: 2
...
Epoch 22: mAP = 0.452 (best: 0.455) - counter: 10 → STOP
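The patience logic traced above can be sketched in plain Python. This is an illustration of the rule, not RF-DETR's actual implementation:

```python
def early_stop(map_history, patience=10, min_delta=0.001):
    """Return the epoch index at which training would stop, or None.

    Sketch of the patience rule: an improvement of at least min_delta
    resets the counter; `patience` consecutive non-improvements stop training.
    """
    best = float("-inf")
    counter = 0
    for epoch, m in enumerate(map_history):
        if m >= best + min_delta:   # improvement of at least min_delta
            best = m
            counter = 0
        else:                       # no meaningful improvement
            counter += 1
            if counter >= patience:
                return epoch        # stop at this epoch
    return None                     # ran to completion

# Mirrors the trace above: one improvement, then a 10-epoch plateau.
history = [0.450, 0.455] + [0.454] * 10
print(early_stop(history))  # → 11
```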

Multi-GPU training

RF-DETR’s training stack is built on PyTorch Lightning, so multi-GPU and multi-node training use the Lightning Trainer strategies directly.

Using RFDETR.train() with multiple GPUs

Create a training script and launch it with torchrun:
# train.py
from rfdetr import RFDETRMedium

model = RFDETRMedium()

model.train(
    dataset_dir="path/to/dataset",
    epochs=100,
    batch_size=4,   # per-GPU batch size
    grad_accum_steps=1,
    lr=1e-4,
    output_dir="output",
    devices="auto", # required — see warning below
)
torchrun --nproc_per_node=4 train.py
build_trainer() defaults to devices=1. Without overriding this, training silently runs on a single GPU even when torchrun launches multiple processes. Pass devices="auto" to use all GPUs visible to the process, or pass an explicit integer (e.g. devices=4).
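If you prefer an explicit count over devices="auto", torchrun exports LOCAL_WORLD_SIZE (the number of processes it launched on the current node), which you can read before calling train(). A small sketch, with the helper name being our own:

```python
import os

def local_device_count(env=os.environ):
    """Per-node process count as set by torchrun; 1 for plain single-process runs."""
    return int(env.get("LOCAL_WORLD_SIZE", 1))

# Under `torchrun --nproc_per_node=4 train.py` this returns 4;
# pass the result as devices=local_device_count() instead of devices="auto".
```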

Batch size with multiple GPUs

When using multiple GPUs, effective batch size is multiplied by the number of GPUs:
effective_batch_size = batch_size × grad_accum_steps × num_gpus
Configurations for an effective batch size of 16:
GPUs  batch_size  grad_accum_steps  Effective
1     4           4                 16
2     4           2                 16
4     4           1                 16
8     2           1                 16
When switching between single and multi-GPU training, remember to adjust batch_size and grad_accum_steps to maintain the same effective batch size.
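The formula and table above can be codified in a small helper for rebalancing grad_accum_steps when the GPU count changes (an illustrative utility, not part of the RF-DETR API):

```python
def effective_batch_size(batch_size, grad_accum_steps, num_gpus=1):
    """effective = batch_size × grad_accum_steps × num_gpus"""
    return batch_size * grad_accum_steps * num_gpus

def accum_steps_for(target, batch_size, num_gpus):
    """Smallest grad_accum_steps whose effective batch size reaches `target`."""
    return max(1, -(-target // (batch_size * num_gpus)))  # ceiling division

# Moving from 1 GPU to 4 GPUs at batch_size=4, keeping effective size 16:
print(accum_steps_for(16, batch_size=4, num_gpus=1))  # → 4
print(accum_steps_for(16, batch_size=4, num_gpus=4))  # → 1
```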

Multi-node training

For training across multiple machines, pass the standard torchrun flags:
torchrun \
    --nproc_per_node=8 \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr="192.168.1.1" \
    --master_port=1234 \
    train.py
Run this command on each node, changing --node_rank accordingly.

Memory optimization

Gradient checkpointing

For large models or high resolutions, enable gradient checkpointing to trade compute for memory. Pass it as a model constructor argument, not as a training parameter:
from rfdetr import RFDETRMedium

# Enable gradient checkpointing at model construction time
model = RFDETRMedium(gradient_checkpointing=True)

model.train(
    dataset_dir="path/to/dataset",
    batch_size=2,
)
This re-computes activations during the backward pass instead of storing them, reducing memory usage by ~30–40% at the cost of ~20% slower training.

Memory-efficient configurations

gradient_checkpointing and resolution are model constructor arguments. batch_size and grad_accum_steps are training parameters.
Memory level        Model constructor                            Training params
Very low (8 GB)     gradient_checkpointing=True, resolution=560  batch_size=1, grad_accum_steps=16
Low (12 GB)         gradient_checkpointing=True                  batch_size=2, grad_accum_steps=8
Medium (16 GB)      (defaults)                                   batch_size=4, grad_accum_steps=4
High (24 GB)        (defaults)                                   batch_size=8, grad_accum_steps=2
Very high (40 GB+)  resolution=784                               batch_size=16, grad_accum_steps=1
Example:
model = RFDETRMedium(gradient_checkpointing=True, resolution=560)
model.train(dataset_dir="path/to/dataset", batch_size=1, grad_accum_steps=16)

Training tips

Learning rate tuning

  • Fine-tuning from COCO weights (default): Use default learning rates (lr=1e-4, lr_encoder=1.5e-4)
  • Small dataset (<1000 images): Consider lower lr (e.g., 5e-5) to prevent overfitting
  • Large dataset (>10000 images): May benefit from higher lr (e.g., 2e-4)
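These rules of thumb can be written down as a starting-point picker. The thresholds come from the bullets above; treat it as a heuristic to tune from, not an official recommendation:

```python
def suggested_lr(num_images):
    """Starting learning rate by dataset size (heuristic from the guidance above)."""
    if num_images < 1000:
        return 5e-5   # small dataset: lower lr to limit overfitting
    if num_images > 10000:
        return 2e-4   # large dataset: may tolerate a higher lr
    return 1e-4       # default fine-tuning lr
```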

Epoch count

Dataset size       Recommended epochs
<500 images        100–200
500–2000 images    50–100
2000–10000 images  30–50
>10000 images      20–30
Use early stopping to automatically determine the optimal stopping point.
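The table above maps onto a simple lookup. The boundary handling (a dataset of exactly 2000 images falls in the 50–100 band here) is our own choice, since the table's ranges overlap; in practice the exact number matters less than pairing it with early stopping:

```python
def suggested_epochs(num_images):
    """Recommended (low, high) epoch range from the table above — a rough guide."""
    if num_images < 500:
        return (100, 200)
    if num_images <= 2000:
        return (50, 100)
    if num_images <= 10000:
        return (30, 50)
    return (20, 30)
```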

Troubleshooting

Out of memory (OOM)

If you encounter CUDA out of memory errors:
  1. Reduce batch_size
  2. Enable gradient_checkpointing=True
  3. Reduce resolution
  4. Increase grad_accum_steps to maintain effective batch size
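Steps 1 and 4 go together: halving batch_size while doubling grad_accum_steps keeps the effective batch size constant. A minimal sketch of that retry rule (an illustrative helper, not part of the library):

```python
def shrink_batch(batch_size, grad_accum_steps):
    """Halve batch_size and double grad_accum_steps, preserving the
    effective batch size (batch_size × grad_accum_steps)."""
    if batch_size <= 1:
        raise ValueError(
            "batch_size is already 1; try gradient_checkpointing=True "
            "or a lower resolution instead"
        )
    return batch_size // 2, grad_accum_steps * 2

# OOM at batch_size=4, grad_accum_steps=4 → retry with:
print(shrink_batch(4, 4))  # → (2, 8)
```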

Training too slow

  1. Increase batch_size (if memory allows)
  2. Use multiple GPUs with DDP
  3. Ensure training is running on the GPU (check device="cuda")
  4. Consider using a smaller model variant (e.g., RFDETRSmall instead of RFDETRLarge)

Loss not decreasing

  1. Check that your dataset is correctly formatted
  2. Verify annotations are correct (bounding boxes in the correct format)
  3. Try reducing the learning rate
  4. Check for class imbalance in your dataset
