This guide covers running SAM 3 training on local machines with one or more GPUs.
## Quick Start

Train on a local GPU with an existing config:

```bash
python -m sam3.train.train \
  -c configs/roboflow_v100/roboflow_v100_full_ft_100_images.yaml \
  --use-cluster 0 \
  --num-gpus 1
```
## Command-Line Arguments

The training script (`sam3.train.train`) accepts these arguments:

### Required

- `-c, --config`: Path to config file (e.g., `configs/your_config.yaml`)

### Optional

- `--use-cluster`: Whether to use a SLURM cluster
  - `0`: Run locally (default)
  - `1`: Submit to SLURM
- `--num-gpus`: Number of GPUs per node; overrides `launcher.gpus_per_node` in the config
- `--num-nodes`: Number of compute nodes; overrides `launcher.num_nodes` in the config
- `--partition`: SLURM partition (cluster only)
- `--account`: SLURM account (cluster only)
- `--qos`: SLURM QoS (cluster only)
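To make the flag semantics concrete, here is a hypothetical `argparse` parser mirroring the documented arguments. This is an illustration only; the real parser lives inside `sam3.train.train` and may differ in defaults and types.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser mirroring the documented flags, for illustration.
    parser = argparse.ArgumentParser("sam3.train.train")
    parser.add_argument("-c", "--config", required=True,
                        help="Path to config file")
    parser.add_argument("--use-cluster", type=int, default=0, choices=[0, 1],
                        help="0 = run locally, 1 = submit to SLURM")
    parser.add_argument("--num-gpus", type=int, default=None,
                        help="Overrides launcher.gpus_per_node")
    parser.add_argument("--num-nodes", type=int, default=None,
                        help="Overrides launcher.num_nodes")
    parser.add_argument("--partition", default=None, help="SLURM partition")
    parser.add_argument("--account", default=None, help="SLURM account")
    parser.add_argument("--qos", default=None, help="SLURM QoS")
    return parser

# Parse the Quick Start invocation from above
args = build_parser().parse_args(
    ["-c", "configs/my_config.yaml", "--use-cluster", "0", "--num-gpus", "1"]
)
```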
## Single GPU Training

### Prepare Configuration

Create or modify a config file with single-GPU settings:

```yaml
launcher:
  num_nodes: 1
  gpus_per_node: 1

submitit:
  use_cluster: False

scratch:
  train_batch_size: 1
  gradient_accumulation_steps: 4  # Effective batch size = 4
```
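What `gradient_accumulation_steps` does can be sketched with a toy PyTorch loop: gradients from several micro-batches are accumulated before a single optimizer step. The model and names here are illustrative, not the SAM 3 trainer's internals.

```python
import torch
from torch import nn

# Toy model/optimizer; illustrates accumulation only, not the SAM 3 trainer.
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

accum_steps = 4          # gradient_accumulation_steps
optimizer_steps = 0

for i in range(8):       # 8 micro-batches of size 1
    x = torch.randn(1, 4)
    loss = model(x).pow(2).mean()
    (loss / accum_steps).backward()   # scale so accumulated grads average
    if (i + 1) % accum_steps == 0:    # step once per 4 micro-batches
        optimizer.step()
        optimizer.zero_grad()
        optimizer_steps += 1
# 8 micro-batches / 4 accumulation steps = 2 optimizer steps
```

The loss is divided by `accum_steps` so that the accumulated gradient matches what a single batch of the effective size would produce.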
### Run Training

Start training:

```bash
python -m sam3.train.train \
  -c configs/my_config.yaml \
  --use-cluster 0 \
  --num-gpus 1
```
The training will:

- Load the model and dataset
- Initialize on the default GPU (`cuda:0`)
- Start the training loop
- Save checkpoints to `experiment_log_dir/checkpoints/`
### Monitor Progress

Monitor training in another terminal:

```bash
# Watch logs
tail -f experiments/my_training/logs/*.log

# Launch TensorBoard
tensorboard --logdir experiments/my_training/tensorboard
```

Access TensorBoard at http://localhost:6006.
## Multi-GPU Training (Single Node)

Train on multiple GPUs using DistributedDataParallel:

### Configure Multi-GPU

Update the config for multiple GPUs:

```yaml
launcher:
  num_nodes: 1
  gpus_per_node: 4  # Use 4 GPUs

submitit:
  use_cluster: False

scratch:
  train_batch_size: 1  # Per GPU
  # Effective batch size = 4 (1 per GPU × 4 GPUs)
```
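The effective batch size arithmetic generalizes across nodes, GPUs, and gradient accumulation. A small helper for illustration (not part of the codebase):

```python
def effective_batch_size(per_gpu_batch: int, gpus_per_node: int,
                         num_nodes: int = 1, accum_steps: int = 1) -> int:
    """Global batch size seen by the optimizer per step."""
    return per_gpu_batch * gpus_per_node * num_nodes * accum_steps

# 1 per GPU × 4 GPUs on one node, no accumulation → 4
print(effective_batch_size(1, 4))
```

Note that `gradient_accumulation_steps` and adding GPUs are interchangeable ways to reach the same effective batch size; accumulation trades throughput for memory.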
### Specify GPUs

Choose which GPUs to use:

```bash
# Use specific GPUs
export CUDA_VISIBLE_DEVICES=0,1,2,3

# Or use all available GPUs (default)
# CUDA_VISIBLE_DEVICES not set
```
### Launch Training

Start distributed training:

```bash
python -m sam3.train.train \
  -c configs/my_config.yaml \
  --use-cluster 0 \
  --num-gpus 4
```
The script will:

- Spawn 4 processes (one per GPU)
- Set up distributed communication
- Synchronize gradients across GPUs
- Scale throughput by up to ~4× over a single GPU, minus communication overhead
## Training Modes

### Full Training

Train and validate:

```yaml
trainer:
  mode: train  # Train + validation
  max_epochs: 20
  val_epoch_freq: 10  # Validate every 10 epochs
```

### Training Only

Skip validation for faster iterations:

```yaml
trainer:
  mode: train_only  # No validation
  max_epochs: 20
```
### Validation Only

Evaluate a trained checkpoint:

```yaml
trainer:
  mode: val  # Validation only
  checkpoint:
    resume_from: /path/to/checkpoint.pt
```

Run:

```bash
python -m sam3.train.train \
  -c configs/my_config.yaml \
  --use-cluster 0
```
## Resume Training

Resume from a saved checkpoint:

### Automatic Resume

By default, training auto-resumes from `checkpoint.pt` in the save directory:

```yaml
trainer:
  checkpoint:
    save_dir: experiments/my_training/checkpoints
```

If `checkpoint.pt` exists, training continues from that point.

### Explicit Resume

Resume from a specific checkpoint:

```yaml
trainer:
  checkpoint:
    resume_from: /path/to/checkpoint_10.pt
```

This copies the checkpoint to the save directory and resumes.

### Launch

Run training normally:

```bash
python -m sam3.train.train \
  -c configs/my_config.yaml \
  --use-cluster 0
```

The script will detect the checkpoint and resume automatically.
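The resume decision is essentially "explicit `resume_from` wins, otherwise use `checkpoint.pt` in the save directory if it exists, otherwise start fresh." A sketch of that logic with illustrative names (not the trainer's actual implementation):

```python
from pathlib import Path
from typing import Optional

import torch

def find_resume_checkpoint(save_dir: str,
                           resume_from: Optional[str] = None) -> Optional[Path]:
    # Illustrative only: mirrors the behavior described above.
    auto = Path(save_dir) / "checkpoint.pt"
    if resume_from is not None:
        return Path(resume_from)   # explicit resume wins
    if auto.exists():
        return auto                # automatic resume
    return None                    # fresh run

ckpt = find_resume_checkpoint("experiments/my_training/checkpoints")
start_epoch = 0
if ckpt is not None:
    # Assumes the checkpoint dict stores an "epoch" key.
    state = torch.load(ckpt, map_location="cpu")
    start_epoch = state.get("epoch", 0) + 1
```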
## Monitoring Training

### TensorBoard

View training metrics in real time:

```bash
# Start TensorBoard
tensorboard --logdir experiments/my_training/tensorboard --port 6006

# Access at http://localhost:6006
```

Metrics include:

- Training/validation losses
- Learning rates
- Gradient norms
- Memory usage
- Batch processing time
### Log Files

Training logs are saved to:

```
experiments/my_training/
├── logs/
│   ├── train.log          # Primary training log
│   ├── train_stats.json   # Epoch statistics
│   ├── val_stats.json     # Validation results
│   └── best_stats.json    # Best metrics
├── checkpoints/
│   ├── checkpoint.pt      # Latest checkpoint
│   └── checkpoint_10.pt   # Saved epochs
└── tensorboard/
```
### Training Progress

During training, you'll see output like:

```
Train Epoch: [5]
Batch Time: 2.34s | Data Time: 0.12s | Mem: 14.2GB
Losses/train_all_loss: 0.523
Meters: {'detection_AP': 0.45, 'loss': 0.523}
```
## Example Workflows

### Fine-tune on a Custom Dataset

#### Prepare Config

Create `configs/my_dataset.yaml`:

```yaml
paths:
  dataset_root: /data/my_dataset
  experiment_log_dir: /experiments/my_dataset_ft
  bpe_path: sam3/assets/bpe_simple_vocab_16e6.txt.gz

launcher:
  num_nodes: 1
  gpus_per_node: 2

trainer:
  max_epochs: 20
  mode: train

scratch:
  enable_segmentation: True
  train_batch_size: 1
  lr_scale: 0.1
```
Train
python -m sam3.train.train \
-c configs/my_dataset.yaml \
--use-cluster 0 \
--num-gpus 2
Evaluate
After training completes, run validation:# Validation runs automatically, or run separately:
# Modify config mode to 'val' and re-run
### Quick Experiment (Small Dataset)

```bash
# Create a config with limited data
python -m sam3.train.train \
  -c configs/quick_test.yaml \
  --use-cluster 0 \
  --num-gpus 1
```

Config:

```yaml
trainer:
  max_epochs: 5
  skip_first_val: False
  val_epoch_freq: 1

data:
  train:
    dataset:
      limit_ids: 50  # Only 50 images
```
### Debug Training

For debugging:

```yaml
trainer:
  max_epochs: 1

scratch:
  num_train_workers: 0  # Single process
  train_batch_size: 1

launcher:
  gpus_per_node: 1
```

Run:

```bash
python -m sam3.train.train \
  -c configs/debug.yaml \
  --use-cluster 0 \
  --num-gpus 1
```
## Common Issues

### Out of Memory

**Symptom:** CUDA OOM error

**Solutions:**

- Reduce the batch size:

  ```yaml
  scratch:
    train_batch_size: 1
  ```

- Use gradient accumulation:

  ```yaml
  trainer:
    gradient_accumulation_steps: 4
  ```

- Reduce the input resolution.

- Disable segmentation:

  ```yaml
  scratch:
    enable_segmentation: False
  ```
### Slow Training

**Symptom:** Low GPU utilization

**Solutions:**

- Increase data workers:

  ```yaml
  scratch:
    num_train_workers: 8
  ```

- Enable `pin_memory`:

  ```yaml
  data:
    train:
      pin_memory: True
  ```

- Use mixed precision:

  ```yaml
  trainer:
    optim:
      amp:
        enabled: True
        amp_dtype: bfloat16
  ```
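The `amp` settings correspond to PyTorch's autocast mechanism: operations inside the autocast region run in the lower-precision dtype. A CPU-runnable sketch (the trainer would use device type `"cuda"` and its own training loop; the model here is a stand-in):

```python
import torch
from torch import nn

model = nn.Linear(8, 2)
x = torch.randn(4, 8)

device_type = "cuda" if torch.cuda.is_available() else "cpu"
# amp_dtype: bfloat16, matching the config fragment above
with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
    out = model(x)  # matmul runs (and its output is) in bfloat16
```

With `bfloat16`, loss scaling (`GradScaler`) is generally unnecessary because its exponent range matches float32; scaling matters mainly for `float16`.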
### NaN Loss

**Symptom:** Loss becomes NaN

**Solutions:**

- Reduce the learning rate.

- Enable gradient clipping (should be on by default):

  ```yaml
  trainer:
    optim:
      gradient_clip:
        max_norm: 0.1
  ```

- Check the data for corrupted images.

Training stops automatically if the loss becomes NaN; check your data and reduce the learning rate if this occurs.
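Both safeguards amount to a few lines of PyTorch: stop when the loss is non-finite, and clip the global gradient norm before the optimizer step. A generic sketch, not the trainer's code:

```python
import torch
from torch import nn

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(2, 4)
loss = model(x).pow(2).mean()

# Guard: halt on NaN/Inf loss before it poisons the weights
if not torch.isfinite(loss):
    raise RuntimeError("Loss is NaN/Inf; stopping training")

loss.backward()
# max_norm: 0.1 as in the config fragment above; returns the pre-clip norm
pre_clip_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)
optimizer.step()

# After clipping, the global gradient norm is at most max_norm
clipped_norm = torch.norm(
    torch.stack([p.grad.norm() for p in model.parameters()])
)
```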
## Maximize GPU Utilization

- Use AMP (Automatic Mixed Precision):

  ```yaml
  trainer:
    optim:
      amp:
        enabled: True
        amp_dtype: bfloat16
  ```

- Tune the batch size: find the largest batch that fits in memory.

- Use multiple workers:

  ```yaml
  scratch:
    num_train_workers: 8  # try 8-16
  ```

- Enable cuDNN benchmark:

  ```yaml
  trainer:
    cuda:
      cudnn_benchmark: true
  ```
## Faster Iterations

- Skip validation during development: `mode: train_only`
- Use a subset of the data: `limit_ids: 100`
- Reduce workers during debugging: `num_train_workers: 0`
## Next Steps

- **Cluster Training**: scale to SLURM clusters for larger experiments
- **Evaluation**: evaluate your trained models