This page covers advanced training topics: resuming from checkpoints, early stopping, multi-GPU DDP training, and memory optimization with gradient checkpointing.
All examples on this page use the `RFDETR.train()` high-level API. For custom callbacks, non-default loggers, and fine-grained distributed training control, use the Custom Training API with PyTorch Lightning primitives directly.
## Resume training
Resume training from a previously saved checkpoint by passing the path to `checkpoint.pth` via the `resume` argument. This restores the full training state so you can continue exactly where you left off.
The training loop automatically loads:
- Model weights
- Optimizer state
- Learning rate scheduler state
- Training epoch number
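The restore flow above can be sketched schematically. This is not the library's actual checkpoint schema: plain dicts stand in for the real state dicts, and the key names are illustrative, while the real `checkpoint.pth` is a torch-serialized file whose layout the library controls.

```python
import pickle

# Schematic sketch of what a resumable checkpoint holds and how resuming
# restores it. Key names are illustrative, not the library's schema.

def save_checkpoint(model_state, optimizer_state, scheduler_state, epoch):
    return pickle.dumps({
        "model": model_state,             # model weights
        "optimizer": optimizer_state,     # e.g. momentum buffers
        "lr_scheduler": scheduler_state,  # e.g. the last_epoch counter
        "epoch": epoch,                   # last completed epoch
    })

def resume_from_checkpoint(blob):
    ckpt = pickle.loads(blob)
    # Resume at the epoch after the last completed one.
    return ckpt["model"], ckpt["optimizer"], ckpt["lr_scheduler"], ckpt["epoch"] + 1

blob = save_checkpoint({"w": 0.5}, {"momentum": 0.1}, {"last_epoch": 41}, epoch=41)
model_state, opt_state, sched_state, start_epoch = resume_from_checkpoint(blob)
print(start_epoch)  # 42
```

The point of restoring all four pieces is that momentum buffers and the scheduler's position matter: reloading weights alone would restart warmup and decay from scratch.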
### Run initial training
Start a training run and let it produce at least one checkpoint:

```python
from rfdetr import RFDETRMedium

model = RFDETRMedium()
model.train(
    dataset_dir="path/to/dataset",
    epochs=100,
    batch_size=4,
    grad_accum_steps=4,
    lr=1e-4,
    output_dir="output",
)
```
### Resume from checkpoint
Pass the checkpoint path to continue training:

Object detection:

```python
from rfdetr import RFDETRMedium

model = RFDETRMedium()
model.train(
    dataset_dir="path/to/dataset",
    epochs=100,
    batch_size=4,
    grad_accum_steps=4,
    lr=1e-4,
    output_dir="output",
    resume="output/checkpoint.pth",
)
```

Image segmentation:

```python
from rfdetr import RFDETRSegMedium

model = RFDETRSegMedium()
model.train(
    dataset_dir="path/to/dataset",
    epochs=100,
    batch_size=4,
    grad_accum_steps=4,
    lr=1e-4,
    output_dir="output",
    resume="output/checkpoint.pth",
)
```
- Use `resume="checkpoint.pth"` to continue training with full optimizer state restored.
- Use `pretrain_weights="checkpoint_best_total.pth"` when initializing a model to start fresh training from those weights; this does not restore optimizer or scheduler state.
## Early stopping
Early stopping monitors validation mAP and halts training if improvements stay below a threshold for a set number of epochs. This prevents wasted compute once the model has converged.
Object detection:

```python
from rfdetr import RFDETRMedium

model = RFDETRMedium()
model.train(
    dataset_dir="path/to/dataset",
    epochs=100,
    batch_size=4,
    grad_accum_steps=4,
    lr=1e-4,
    output_dir="output",
    early_stopping=True,
)
```

Image segmentation:

```python
from rfdetr import RFDETRSegMedium

model = RFDETRSegMedium()
model.train(
    dataset_dir="path/to/dataset",
    epochs=100,
    batch_size=4,
    grad_accum_steps=4,
    lr=1e-4,
    output_dir="output",
    early_stopping=True,
)
```
### Configuration options
| Parameter | Default | Description |
|---|---|---|
| `early_stopping_patience` | 10 | Number of epochs without improvement before stopping |
| `early_stopping_min_delta` | 0.001 | Minimum mAP change to count as improvement |
| `early_stopping_use_ema` | False | Use the EMA model's mAP for comparisons |
### Advanced example
```python
model.train(
    dataset_dir="path/to/dataset",
    epochs=200,
    early_stopping=True,
    early_stopping_patience=15,      # wait 15 epochs without improvement before stopping
    early_stopping_min_delta=0.005,  # require a 0.005 absolute mAP improvement
    early_stopping_use_ema=True,     # track the EMA model's mAP instead
)
```
### How it works

- After each epoch, validation mAP is computed.
- If mAP improves by at least `early_stopping_min_delta`, the patience counter resets.
- If mAP does not improve, the patience counter increments.
- When the patience counter reaches `early_stopping_patience`, training stops.
- The best checkpoint is already saved as `checkpoint_best_total.pth`, so stopping early loses no work.
Example trace with the default patience of 10:

```text
Epoch 10: mAP = 0.450 (best: 0.450) - counter: 0
Epoch 11: mAP = 0.455 (best: 0.455) - counter: 0 (improved)
Epoch 12: mAP = 0.454 (best: 0.455) - counter: 1 (no improvement)
Epoch 13: mAP = 0.453 (best: 0.455) - counter: 2
...
Epoch 22: mAP = 0.452 (best: 0.455) - counter: 10 → STOP
```
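The counter behaviour in that trace can be sketched in a few lines. The class below is illustrative, not the library's internal implementation; only the parameter semantics mirror the `train()` arguments.

```python
# Minimal sketch of the patience logic described above.
class EarlyStopping:
    def __init__(self, patience=10, min_delta=0.001):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.counter = 0

    def step(self, map_value):
        """Return True when training should stop."""
        if map_value - self.best >= self.min_delta:
            self.best = map_value  # improvement: record it and reset the counter
            self.counter = 0
        else:
            self.counter += 1      # no improvement: tick towards patience
        return self.counter >= self.patience

# A run shaped like the trace above: two improving epochs, then a plateau.
stopper = EarlyStopping(patience=10, min_delta=0.001)
maps = [0.450, 0.455] + [0.452] * 10
stopped_at = next(i for i, m in enumerate(maps) if stopper.step(m))
print(stopped_at)  # 11: the counter reaches 10 on the tenth flat epoch
```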
## Multi-GPU training
RF-DETR’s training stack is built on PyTorch Lightning, so multi-GPU and multi-node training use the Lightning Trainer strategies directly.
### Using `RFDETR.train()` with multiple GPUs
Create a training script and launch it with `torchrun`:

```python
# train.py
from rfdetr import RFDETRMedium

model = RFDETRMedium()
model.train(
    dataset_dir="path/to/dataset",
    epochs=100,
    batch_size=4,        # per-GPU batch size
    grad_accum_steps=1,
    lr=1e-4,
    output_dir="output",
    devices="auto",      # required; see the warning below
)
```

```bash
torchrun --nproc_per_node=4 train.py
```
`build_trainer()` defaults to `devices=1`. Without overriding this, training silently runs on a single GPU even when `torchrun` launches multiple processes. Pass `devices="auto"` to use all GPUs visible to the process, or pass an explicit integer (e.g. `devices=4`).
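A quick way to confirm the distributed launch is actually reaching your script: `torchrun` sets the standard `RANK`, `WORLD_SIZE`, and `LOCAL_RANK` environment variables in every process it spawns, so printing them at the top of `train.py` shows whether multiple processes started. This is a throwaway sanity check, not part of the library.

```python
import os

# torchrun exports these for each spawned process; outside a torchrun
# launch they are absent, so we fall back to single-process defaults.
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))
local_rank = int(os.environ.get("LOCAL_RANK", 0))
print(f"rank {rank}/{world_size}, local_rank {local_rank}")
```

With `--nproc_per_node=4` you should see four lines with `world_size` 4; if `world_size` stays at 1, the launch is not reaching your script (and, per the warning above, `devices` must still be set for Lightning to use every process).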
### Batch size with multiple GPUs
When using multiple GPUs, the effective batch size is multiplied by the number of GPUs:

```text
effective_batch_size = batch_size × grad_accum_steps × num_gpus
```
Configurations for an effective batch size of 16:
| GPUs | batch_size | grad_accum_steps | Effective |
|---|---|---|---|
| 1 | 4 | 4 | 16 |
| 2 | 4 | 2 | 16 |
| 4 | 4 | 1 | 16 |
| 8 | 2 | 1 | 16 |
When switching between single and multi-GPU training, remember to adjust batch_size and grad_accum_steps to maintain the same effective batch size.
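One way to keep that bookkeeping honest when moving between machines is a small helper that derives `grad_accum_steps` from a target effective batch size. The function is illustrative, not part of the library.

```python
# Derive grad_accum_steps so that
# batch_size * grad_accum_steps * num_gpus == target_effective.
def accum_steps_for(target_effective, batch_size, num_gpus):
    per_step = batch_size * num_gpus
    if target_effective % per_step != 0:
        raise ValueError(f"{target_effective} not divisible by {per_step}")
    return target_effective // per_step

# Reproduce the table above for an effective batch size of 16.
for gpus, bs in [(1, 4), (2, 4), (4, 4), (8, 2)]:
    print(gpus, bs, accum_steps_for(16, bs, gpus))  # 4, 2, 1, 1
```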
### Multi-node training
For training across multiple machines, pass the standard `torchrun` flags:

```bash
torchrun \
  --nproc_per_node=8 \
  --nnodes=2 \
  --node_rank=0 \
  --master_addr="192.168.1.1" \
  --master_port=1234 \
  train.py
```
Run this command on each node, changing `--node_rank` accordingly.
## Memory optimization
### Gradient checkpointing
For large models or high resolutions, enable gradient checkpointing to trade compute for memory. Pass it as a model constructor argument, not as a training parameter:

```python
from rfdetr import RFDETRMedium

# Enable gradient checkpointing at model construction time
model = RFDETRMedium(gradient_checkpointing=True)

model.train(
    dataset_dir="path/to/dataset",
    batch_size=2,
)
```
This re-computes activations during the backward pass instead of storing them, reducing memory usage by ~30–40% at the cost of ~20% slower training.
### Memory-efficient configurations
`gradient_checkpointing` and `resolution` are model constructor arguments; `batch_size` and `grad_accum_steps` are training parameters.
| Memory level | Model constructor | Training params |
|---|---|---|
| Very low (8 GB) | `gradient_checkpointing=True, resolution=560` | `batch_size=1, grad_accum_steps=16` |
| Low (12 GB) | `gradient_checkpointing=True` | `batch_size=2, grad_accum_steps=8` |
| Medium (16 GB) | (defaults) | `batch_size=4, grad_accum_steps=4` |
| High (24 GB) | (defaults) | `batch_size=8, grad_accum_steps=2` |
| Very high (40 GB+) | `resolution=784` | `batch_size=16, grad_accum_steps=1` |
Example:

```python
model = RFDETRMedium(gradient_checkpointing=True, resolution=560)
model.train(dataset_dir="path/to/dataset", batch_size=1, grad_accum_steps=16)
```
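If you prefer to pick settings programmatically, the table can be codified in a small helper. The tiers and values below come straight from the table; the function itself is hypothetical, not a library API.

```python
# Map available GPU memory (GB) to the table's suggested settings.
# Every tier keeps the effective batch size at 16.
def suggest_config(gpu_memory_gb):
    if gpu_memory_gb < 12:    # very low
        return {"gradient_checkpointing": True, "resolution": 560,
                "batch_size": 1, "grad_accum_steps": 16}
    if gpu_memory_gb < 16:    # low
        return {"gradient_checkpointing": True,
                "batch_size": 2, "grad_accum_steps": 8}
    if gpu_memory_gb < 24:    # medium
        return {"batch_size": 4, "grad_accum_steps": 4}
    if gpu_memory_gb < 40:    # high
        return {"batch_size": 8, "grad_accum_steps": 2}
    return {"resolution": 784, "batch_size": 16, "grad_accum_steps": 1}

print(suggest_config(12))  # the "Low (12 GB)" tier
```

Constructor keys (`gradient_checkpointing`, `resolution`) would go to the model constructor and the rest to `train()`, matching the split described above.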
## Training tips
### Learning rate tuning
- Fine-tuning from COCO weights (default): use the default learning rates (`lr=1e-4`, `lr_encoder=1.5e-4`)
- Small dataset (<1000 images): consider a lower `lr` (e.g., `5e-5`) to prevent overfitting
- Large dataset (>10000 images): may benefit from a higher `lr` (e.g., `2e-4`)
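The bullets above collapse into a tiny heuristic. The thresholds and values are exactly the ones stated; the function is illustrative only, not a library API.

```python
# Suggest a starting learning rate from dataset size, per the bullets above.
def suggest_lr(num_images):
    if num_images < 1000:
        return 5e-5   # small dataset: reduce overfitting risk
    if num_images > 10000:
        return 2e-4   # large dataset: can afford a higher lr
    return 1e-4       # default fine-tuning lr

print(suggest_lr(500), suggest_lr(5000), suggest_lr(50000))  # 5e-05 0.0001 0.0002
```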
### Epoch count
| Dataset size | Recommended epochs |
|---|---|
| <500 images | 100–200 |
| 500–2000 images | 50–100 |
| 2000–10000 images | 30–50 |
| >10000 images | 20–30 |
Use early stopping to automatically determine the optimal stopping point.
## Troubleshooting
### Out of memory (OOM)
If you encounter CUDA out-of-memory errors:

- Reduce `batch_size`
- Enable `gradient_checkpointing=True`
- Reduce `resolution`
- Increase `grad_accum_steps` to maintain the effective batch size
### Training too slow

- Increase `batch_size` (if memory allows)
- Use multiple GPUs with DDP
- Ensure you are training on the GPU (check `device="cuda"`)
- Consider a smaller model variant (e.g., `RFDETRSmall` instead of `RFDETRLarge`)
### Loss not decreasing
- Check that your dataset is correctly formatted
- Verify annotations are correct (bounding boxes in the correct format)
- Try reducing the learning rate
- Check for class imbalance in your dataset