This page covers advanced training topics: resuming from checkpoints, early stopping, multi-GPU DDP training, and memory optimization with gradient checkpointing.
All examples on this page use the RFDETR.train() high-level API. For custom callbacks, non-default loggers, and fine-grained distributed training control, use the Custom Training API with PyTorch Lightning primitives directly.

Resume training

Resume training from a previously saved checkpoint by passing the path to checkpoint.pth via the resume argument. This restores the full training state so you can continue exactly where you left off. The training loop automatically loads:
  • Model weights
  • Optimizer state
  • Learning rate scheduler state
  • Training epoch number
1. Run initial training

Start a training run and let it produce at least one checkpoint:
from rfdetr import RFDETRMedium

model = RFDETRMedium()

model.train(
    dataset_dir="path/to/dataset",
    epochs=100,
    batch_size=4,
    grad_accum_steps=4,
    lr=1e-4,
    output_dir="output",
)
2. Resume from checkpoint

Pass the checkpoint path to continue training:
from rfdetr import RFDETRMedium

model = RFDETRMedium()

model.train(
    dataset_dir="path/to/dataset",
    epochs=100,
    batch_size=4,
    grad_accum_steps=4,
    lr=1e-4,
    output_dir="output",
    resume="output/checkpoint.pth",
)
  • Use resume="checkpoint.pth" to continue training with full optimizer state restored.
  • Use pretrain_weights="checkpoint_best_total.pth" when initializing a model to start fresh training from those weights — this does not restore optimizer or scheduler state.
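To make the distinction concrete, here is a sketch of starting a fresh run from a previous run's best weights (paths are illustrative). Because this goes through pretrain_weights rather than resume, the optimizer, scheduler, and epoch counter start from scratch:

```python
from rfdetr import RFDETRMedium

# Initialize from earlier best weights; no optimizer/scheduler state is restored.
model = RFDETRMedium(pretrain_weights="output/checkpoint_best_total.pth")

model.train(
    dataset_dir="path/to/new_dataset",  # can be a different dataset
    epochs=50,
    output_dir="output_finetune",
)
```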

Early stopping

Early stopping monitors validation mAP and halts training if improvements stay below a threshold for a set number of epochs. This prevents wasted compute once the model has converged.
from rfdetr import RFDETRMedium

model = RFDETRMedium()

model.train(
    dataset_dir="path/to/dataset",
    epochs=100,
    batch_size=4,
    grad_accum_steps=4,
    lr=1e-4,
    output_dir="output",
    early_stopping=True,
)

Configuration options

Parameter                  Default  Description
early_stopping_patience    10       Number of epochs without improvement before stopping
early_stopping_min_delta   0.001    Minimum mAP change to count as improvement
early_stopping_use_ema     False    Use the EMA model’s mAP for comparisons

Advanced example

model.train(
    dataset_dir="path/to/dataset",
    epochs=200,
    early_stopping=True,
    early_stopping_patience=15,      # wait 15 epochs before stopping
    early_stopping_min_delta=0.005,  # require 0.5% mAP improvement
    early_stopping_use_ema=True,     # track EMA model performance
)

How it works

  1. After each epoch, validation mAP is computed.
  2. If mAP improves by at least min_delta, the patience counter resets.
  3. If mAP doesn’t improve, the patience counter increments.
  4. When the patience counter reaches patience, training stops.
  5. The best checkpoint is already saved as checkpoint_best_total.pth.
Epoch 10: mAP = 0.450 (best: 0.450) - counter: 0
Epoch 11: mAP = 0.455 (best: 0.455) - counter: 0 (improved)
Epoch 12: mAP = 0.454 (best: 0.455) - counter: 1 (no improvement)
Epoch 13: mAP = 0.453 (best: 0.455) - counter: 2
...
Epoch 22: mAP = 0.452 (best: 0.455) - counter: 10 → STOP
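The patience logic traced above can be sketched in plain Python. This is an illustration of the rule, not RF-DETR's actual implementation:

```python
def early_stop(map_history, patience=10, min_delta=0.001):
    """Return the epoch index at which training would stop, or None.

    Sketch of the patience rule: an improvement of at least min_delta
    resets the counter; `patience` consecutive non-improvements stop training.
    """
    best = float("-inf")
    counter = 0
    for epoch, m in enumerate(map_history):
        if m >= best + min_delta:   # improvement of at least min_delta
            best = m
            counter = 0
        else:                       # no meaningful improvement
            counter += 1
            if counter >= patience:
                return epoch        # stop at this epoch
    return None                     # ran to completion

# Mirrors the trace above: one improvement, then a 10-epoch plateau.
history = [0.450, 0.455] + [0.454] * 10
print(early_stop(history))  # → 11
```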

Multi-GPU training

RF-DETR’s training stack is built on PyTorch Lightning, so multi-GPU and multi-node training use the Lightning Trainer strategies directly.

Using RFDETR.train() with multiple GPUs

Create a training script and launch it with torchrun:
# train.py
from rfdetr import RFDETRMedium

model = RFDETRMedium()

model.train(
    dataset_dir="path/to/dataset",
    epochs=100,
    batch_size=4,   # per-GPU batch size
    grad_accum_steps=1,
    lr=1e-4,
    output_dir="output",
    devices="auto", # required — see warning below
)
torchrun --nproc_per_node=4 train.py
build_trainer() defaults to devices=1. Without overriding this, training silently runs on a single GPU even when torchrun launches multiple processes. Pass devices="auto" to use all GPUs visible to the process, or pass an explicit integer (e.g. devices=4).
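If you prefer an explicit count over devices="auto", torchrun exports LOCAL_WORLD_SIZE (the number of processes it launched on the current node), which you can read before calling train(). A small sketch, with the helper name being our own:

```python
import os

def local_device_count(env=os.environ):
    """Per-node process count as set by torchrun; 1 for plain single-process runs."""
    return int(env.get("LOCAL_WORLD_SIZE", 1))

# Under `torchrun --nproc_per_node=4 train.py` this returns 4;
# pass the result as devices=local_device_count() instead of devices="auto".
```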

Batch size with multiple GPUs

When using multiple GPUs, effective batch size is multiplied by the number of GPUs:
effective_batch_size = batch_size × grad_accum_steps × num_gpus
Configurations for an effective batch size of 16:
GPUs  batch_size  grad_accum_steps  Effective
1     4           4                 16
2     4           2                 16
4     4           1                 16
8     2           1                 16
When switching between single and multi-GPU training, remember to adjust batch_size and grad_accum_steps to maintain the same effective batch size.
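The formula and table above can be codified in a small helper for rebalancing grad_accum_steps when the GPU count changes (an illustrative utility, not part of the RF-DETR API):

```python
def effective_batch_size(batch_size, grad_accum_steps, num_gpus=1):
    """effective = batch_size × grad_accum_steps × num_gpus"""
    return batch_size * grad_accum_steps * num_gpus

def accum_steps_for(target, batch_size, num_gpus):
    """Smallest grad_accum_steps whose effective batch size reaches `target`."""
    return max(1, -(-target // (batch_size * num_gpus)))  # ceiling division

# Moving from 1 GPU to 4 GPUs at batch_size=4, keeping effective size 16:
print(accum_steps_for(16, batch_size=4, num_gpus=1))  # → 4
print(accum_steps_for(16, batch_size=4, num_gpus=4))  # → 1
```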

Multi-node training

For training across multiple machines, pass the standard torchrun flags:
torchrun \
    --nproc_per_node=8 \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr="192.168.1.1" \
    --master_port=1234 \
    train.py
Run this command on each node, changing --node_rank accordingly.

Memory optimization

Gradient checkpointing

For large models or high resolutions, enable gradient checkpointing to trade compute for memory. Pass it as a model constructor argument, not as a training parameter:
from rfdetr import RFDETRMedium

# Enable gradient checkpointing at model construction time
model = RFDETRMedium(gradient_checkpointing=True)

model.train(
    dataset_dir="path/to/dataset",
    batch_size=2,
)
This re-computes activations during the backward pass instead of storing them, reducing memory usage by ~30–40% at the cost of ~20% slower training.

Memory-efficient configurations

gradient_checkpointing and resolution are model constructor arguments. batch_size and grad_accum_steps are training parameters.
Memory level        Model constructor                            Training params
Very low (8 GB)     gradient_checkpointing=True, resolution=560  batch_size=1, grad_accum_steps=16
Low (12 GB)         gradient_checkpointing=True                  batch_size=2, grad_accum_steps=8
Medium (16 GB)      (defaults)                                   batch_size=4, grad_accum_steps=4
High (24 GB)        (defaults)                                   batch_size=8, grad_accum_steps=2
Very high (40 GB+)  resolution=784                               batch_size=16, grad_accum_steps=1
Example:
model = RFDETRMedium(gradient_checkpointing=True, resolution=560)
model.train(dataset_dir="path/to/dataset", batch_size=1, grad_accum_steps=16)

Training tips

Learning rate tuning

  • Fine-tuning from COCO weights (default): Use default learning rates (lr=1e-4, lr_encoder=1.5e-4)
  • Small dataset (<1000 images): Consider lower lr (e.g., 5e-5) to prevent overfitting
  • Large dataset (>10000 images): May benefit from higher lr (e.g., 2e-4)
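These rules of thumb can be written down as a starting-point picker. The thresholds come from the bullets above; treat it as a heuristic to tune from, not an official recommendation:

```python
def suggested_lr(num_images):
    """Starting learning rate by dataset size (heuristic from the guidance above)."""
    if num_images < 1000:
        return 5e-5   # small dataset: lower lr to limit overfitting
    if num_images > 10000:
        return 2e-4   # large dataset: may tolerate a higher lr
    return 1e-4       # default fine-tuning lr
```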

Epoch count

Dataset size       Recommended epochs
<500 images        100–200
500–2000 images    50–100
2000–10000 images  30–50
>10000 images      20–30
Use early stopping to automatically determine the optimal stopping point.
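The table above maps onto a simple lookup. The boundary handling (a dataset of exactly 2000 images falls in the 50–100 band here) is our own choice, since the table's ranges overlap; in practice the exact number matters less than pairing it with early stopping:

```python
def suggested_epochs(num_images):
    """Recommended (low, high) epoch range from the table above — a rough guide."""
    if num_images < 500:
        return (100, 200)
    if num_images <= 2000:
        return (50, 100)
    if num_images <= 10000:
        return (30, 50)
    return (20, 30)
```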

Troubleshooting

Out of memory (OOM)

If you encounter CUDA out of memory errors:
  1. Reduce batch_size
  2. Enable gradient_checkpointing=True
  3. Reduce resolution
  4. Increase grad_accum_steps to maintain effective batch size
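Steps 1 and 4 go together: halving batch_size while doubling grad_accum_steps keeps the effective batch size constant. A minimal sketch of that retry rule (an illustrative helper, not part of the library):

```python
def shrink_batch(batch_size, grad_accum_steps):
    """Halve batch_size and double grad_accum_steps, preserving the
    effective batch size (batch_size × grad_accum_steps)."""
    if batch_size <= 1:
        raise ValueError(
            "batch_size is already 1; try gradient_checkpointing=True "
            "or a lower resolution instead"
        )
    return batch_size // 2, grad_accum_steps * 2

# OOM at batch_size=4, grad_accum_steps=4 → retry with:
print(shrink_batch(4, 4))  # → (2, 8)
```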

Training too slow

  1. Increase batch_size (if memory allows)
  2. Use multiple GPUs with DDP
  3. Ensure training is running on the GPU (check device="cuda")
  4. Consider using a smaller model variant (e.g., RFDETRSmall instead of RFDETRLarge)

Loss not decreasing

  1. Check that your dataset is correctly formatted
  2. Verify annotations are correct (bounding boxes in the correct format)
  3. Try reducing the learning rate
  4. Check for class imbalance in your dataset
