This guide covers running SAM 3 training on local machines with one or more GPUs.

Quick Start

Train on a local GPU with an existing config:
python -m sam3.train.train \
  -c configs/roboflow_v100/roboflow_v100_full_ft_100_images.yaml \
  --use-cluster 0 \
  --num-gpus 1

Command-Line Arguments

The training script (sam3.train.train) accepts these arguments:

Required

  • -c, --config: Path to config file (e.g., configs/your_config.yaml)

Optional

  • --use-cluster: Whether to use SLURM cluster
    • 0: Run locally (default)
    • 1: Submit to SLURM
  • --num-gpus: Number of GPUs per node
    • Overrides launcher.gpus_per_node in config
  • --num-nodes: Number of compute nodes
    • Overrides launcher.num_nodes in config
  • --partition: SLURM partition (for cluster only)
  • --account: SLURM account (for cluster only)
  • --qos: SLURM QoS (for cluster only)

Single GPU Training

1. Prepare Configuration

Create or modify a config file with single GPU settings:
launcher:
  num_nodes: 1
  gpus_per_node: 1

submitit:
  use_cluster: False

scratch:
  train_batch_size: 1
  gradient_accumulation_steps: 4  # Effective batch size = 4
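The comment in the config above can be sanity-checked directly: the effective batch size is the per-GPU batch size times the accumulation steps times the number of GPUs. A small helper (illustrative only, not part of sam3) makes the arithmetic explicit:

```python
def effective_batch_size(per_gpu_batch: int, accum_steps: int, num_gpus: int) -> int:
    """Effective (optimizer-step) batch size for DDP-style training.

    Gradients from `accum_steps` micro-batches on each of `num_gpus`
    workers are combined before a single optimizer step.
    """
    return per_gpu_batch * accum_steps * num_gpus

# The single-GPU config above: batch 1, 4 accumulation steps, 1 GPU.
print(effective_batch_size(1, 4, 1))  # 4
```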
2. Run Training

Start training:
python -m sam3.train.train \
  -c configs/my_config.yaml \
  --use-cluster 0 \
  --num-gpus 1
The training will:
  • Load the model and dataset
  • Initialize on the default GPU (cuda:0)
  • Start the training loop
  • Save checkpoints to experiment_log_dir/checkpoints/
3. Monitor Progress

Monitor training in another terminal:
# Watch logs
tail -f experiments/my_training/logs/*.log

# Launch TensorBoard
tensorboard --logdir experiments/my_training/tensorboard
Access TensorBoard at http://localhost:6006

Multi-GPU Training (Single Node)

Train on multiple GPUs using DistributedDataParallel:
1. Configure Multi-GPU

Update config for multiple GPUs:
launcher:
  num_nodes: 1
  gpus_per_node: 4  # Use 4 GPUs

submitit:
  use_cluster: False

scratch:
  train_batch_size: 1  # Per GPU
  # Effective batch size = 4 (1 per GPU × 4 GPUs)
2. Specify GPUs

Choose which GPUs to use:
# Use specific GPUs
export CUDA_VISIBLE_DEVICES=0,1,2,3

# Or use all available GPUs (default)
# CUDA_VISIBLE_DEVICES not set
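CUDA_VISIBLE_DEVICES controls which physical GPUs a process can see; the visible devices are then renumbered 0..N-1 inside the process. This stdlib sketch (a simplification of CUDA's actual behavior, not sam3 code) shows how the variable is interpreted:

```python
import os

def visible_gpu_indices(total_gpus: int) -> list:
    """Physical GPU indices visible to the process.

    An unset variable exposes all GPUs; otherwise only the listed
    devices are visible, renumbered 0..N-1 inside the process.
    """
    value = os.environ.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return list(range(total_gpus))
    if value.strip() == "":
        return []  # an empty value explicitly hides all GPUs
    return [int(i) for i in value.split(",")]

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
print(visible_gpu_indices(8))  # [0, 1, 2, 3]
```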
3. Launch Training

Start distributed training:
python -m sam3.train.train \
  -c configs/my_config.yaml \
  --use-cluster 0 \
  --num-gpus 4
The script will:
  • Spawn 4 processes (one per GPU)
  • Set up distributed communication
  • Synchronize gradients across GPUs
  • Scale throughput up to ~4× over a single GPU, depending on data-loading and communication overhead
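The steps above can be sketched in miniature. Distributed launchers generally assign each spawned process a global rank and bind it to one local device; this simplified stdlib sketch (not the actual sam3 launcher) shows the rank-to-device mapping:

```python
def process_layout(num_nodes: int, gpus_per_node: int) -> list:
    """One entry per training process: global rank, node, and local device."""
    world_size = num_nodes * gpus_per_node
    layout = []
    for rank in range(world_size):
        layout.append({
            "rank": rank,                       # global rank, 0..world_size-1
            "node": rank // gpus_per_node,      # which machine
            "local_rank": rank % gpus_per_node, # which GPU on that machine
        })
    return layout

for p in process_layout(num_nodes=1, gpus_per_node=4):
    print(f"rank {p['rank']} -> cuda:{p['local_rank']}")
```

Each process trains on its own GPU (`cuda:local_rank`) and gradients are synchronized across all ranks after each backward pass.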

Training Modes

Full Training

Train and validate:
trainer:
  mode: train  # Train + validation
  max_epochs: 20
  val_epoch_freq: 10  # Validate every 10 epochs
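With max_epochs: 20 and val_epoch_freq: 10, validation runs twice. A tiny helper shows the schedule, assuming (this is an assumption about the semantics) that validation fires on epochs divisible by the frequency, counting from 1:

```python
def validation_epochs(max_epochs: int, val_epoch_freq: int) -> list:
    """Epochs (1-indexed) on which validation is assumed to run."""
    return [e for e in range(1, max_epochs + 1) if e % val_epoch_freq == 0]

print(validation_epochs(20, 10))  # [10, 20]
```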

Training Only

Skip validation for faster iterations:
trainer:
  mode: train_only  # No validation
  max_epochs: 20

Validation Only

Evaluate a trained checkpoint:
trainer:
  mode: val  # Validation only
  checkpoint:
    resume_from: /path/to/checkpoint.pt
Run:
python -m sam3.train.train \
  -c configs/my_config.yaml \
  --use-cluster 0

Resume Training

Resume from a saved checkpoint:
1. Automatic Resume

By default, training auto-resumes from checkpoint.pt in the save directory:
trainer:
  checkpoint:
    save_dir: experiments/my_training/checkpoints
If checkpoint.pt exists, training continues from that point.
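The auto-resume decision amounts to a file-existence check in the save directory. A hedged sketch of that logic (the path handling here is illustrative, not the actual sam3 implementation):

```python
from pathlib import Path

def resume_path(save_dir):
    """Return the checkpoint to resume from, or None for a fresh run."""
    candidate = Path(save_dir) / "checkpoint.pt"
    return candidate if candidate.exists() else None

# Example with a temporary directory standing in for save_dir:
import tempfile
with tempfile.TemporaryDirectory() as d:
    print(resume_path(d))            # None: fresh training run
    (Path(d) / "checkpoint.pt").touch()
    print(resume_path(d).name)       # checkpoint.pt: resume
```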
2. Explicit Resume

Resume from a specific checkpoint:
trainer:
  checkpoint:
    resume_from: /path/to/checkpoint_10.pt
This copies the checkpoint to the save directory and resumes.
3. Launch

Run training normally:
python -m sam3.train.train \
  -c configs/my_config.yaml \
  --use-cluster 0
The script will detect the checkpoint and resume automatically.

Monitoring Training

TensorBoard

View training metrics in real-time:
# Start TensorBoard
tensorboard --logdir experiments/my_training/tensorboard --port 6006

# Access at http://localhost:6006
Metrics include:
  • Training/validation losses
  • Learning rates
  • Gradient norms
  • Memory usage
  • Batch processing time

Log Files

Training logs are saved to:
experiments/my_training/
├── logs/
│   ├── train.log         # Primary training log
│   ├── train_stats.json  # Epoch statistics
│   ├── val_stats.json    # Validation results
│   └── best_stats.json   # Best metrics
├── checkpoints/
│   ├── checkpoint.pt     # Latest checkpoint
│   └── checkpoint_10.pt  # Saved epochs
└── tensorboard/
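The JSON stat files are convenient for quick scripting. A sketch that pulls per-epoch losses out of train_stats.json, assuming (this is an assumption about the file format) one JSON object per line with a loss field:

```python
import json

def read_epoch_losses(stats_text: str) -> list:
    """Extract the loss from each epoch record (one JSON object per line)."""
    losses = []
    for line in stats_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        if "loss" in record:
            losses.append(record["loss"])
    return losses

sample = '{"epoch": 0, "loss": 0.9}\n{"epoch": 1, "loss": 0.6}'
print(read_epoch_losses(sample))  # [0.9, 0.6]
```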

Training Progress

During training, you’ll see output like:
Train Epoch: [5]
Batch Time:  2.34s | Data Time:  0.12s | Mem: 14.2GB
Losses/train_all_loss: 0.523
Meters: {'detection_AP': 0.45, 'loss': 0.523}
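For ad-hoc monitoring, the printed loss lines are easy to scrape. A regex sketch that extracts the loss value, assuming only the log format shown above:

```python
import re

LOSS_RE = re.compile(r"Losses/train_all_loss:\s*([0-9.]+)")

def extract_loss(log_line: str):
    """Return the training loss from a log line, or None if absent."""
    m = LOSS_RE.search(log_line)
    return float(m.group(1)) if m else None

print(extract_loss("Losses/train_all_loss: 0.523"))  # 0.523
print(extract_loss("Batch Time:  2.34s"))            # None
```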

Example Workflows

Fine-tune on Custom Dataset

1. Prepare Config

Create configs/my_dataset.yaml:
paths:
  dataset_root: /data/my_dataset
  experiment_log_dir: /experiments/my_dataset_ft
  bpe_path: sam3/assets/bpe_simple_vocab_16e6.txt.gz

launcher:
  num_nodes: 1
  gpus_per_node: 2

trainer:
  max_epochs: 20
  mode: train

scratch:
  enable_segmentation: True
  train_batch_size: 1
  lr_scale: 0.1
2. Train

python -m sam3.train.train \
  -c configs/my_dataset.yaml \
  --use-cluster 0 \
  --num-gpus 2
3. Evaluate

After training completes, validation runs automatically (when mode is train). To evaluate separately, set mode to val in the config, point checkpoint.resume_from at the trained checkpoint, and re-run the same command.

Quick Experiment (Small Dataset)

# Create a config with limited data
python -m sam3.train.train \
  -c configs/quick_test.yaml \
  --use-cluster 0 \
  --num-gpus 1
Config:
trainer:
  max_epochs: 5
  skip_first_val: False
  val_epoch_freq: 1

data:
  train:
    dataset:
      limit_ids: 50  # Only 50 images

Debug Training

For debugging:
trainer:
  max_epochs: 1
  
scratch:
  num_train_workers: 0  # Single process
  train_batch_size: 1

launcher:
  gpus_per_node: 1
Run:
python -m sam3.train.train \
  -c configs/debug.yaml \
  --use-cluster 0 \
  --num-gpus 1

Common Issues

Out of Memory

Symptom: CUDA out-of-memory (OOM) error

Solutions:
  1. Reduce batch size:
    scratch:
      train_batch_size: 1
    
  2. Use gradient accumulation:
    trainer:
      gradient_accumulation_steps: 4
    
  3. Reduce resolution:
    scratch:
      resolution: 512
    
  4. Disable segmentation:
    scratch:
      enable_segmentation: False
    

Slow Training

Symptom: Low GPU utilization

Solutions:
  1. Increase data workers:
    scratch:
      num_train_workers: 8
    
  2. Enable pin_memory:
    data:
      train:
        pin_memory: True
    
  3. Use mixed precision:
    trainer:
      optim:
        amp:
          enabled: True
          amp_dtype: bfloat16
    

NaN Loss

Symptom: Loss becomes NaN

Solutions:
  1. Reduce learning rate:
    scratch:
      lr_scale: 0.01
    
  2. Enable gradient clipping (should be default):
    trainer:
      optim:
        gradient_clip:
          max_norm: 0.1
    
  3. Check data for corrupted images
Training will automatically stop if loss becomes NaN. Check your data and reduce learning rate if this occurs.
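The stop-on-NaN behavior described above is cheap to reproduce in any training loop; a stdlib sketch of such a guard:

```python
import math

def check_loss(loss: float, step: int) -> None:
    """Raise if the loss is NaN/Inf so training halts instead of silently diverging."""
    if math.isnan(loss) or math.isinf(loss):
        raise RuntimeError(
            f"Loss is {loss} at step {step}; check data and reduce the learning rate."
        )

check_loss(0.52, step=100)  # a finite loss passes silently
try:
    check_loss(float("nan"), step=101)
except RuntimeError as e:
    print(e)
```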

Performance Tips

Maximize GPU Utilization

  1. Use AMP (Automatic Mixed Precision):
    trainer:
      optim:
        amp:
          enabled: True
          amp_dtype: bfloat16
    
  2. Tune batch size: Find the largest batch that fits in memory
  3. Use multiple workers: num_train_workers: 8-16
  4. Enable cudnn benchmark:
    trainer:
      cuda:
        cudnn_benchmark: true
    

Faster Iterations

  • Skip validation during development: mode: train_only
  • Use subset of data: limit_ids: 100
  • Reduce workers during debugging: num_train_workers: 0

Next Steps

  • Cluster Training: scale to SLURM clusters for larger experiments
  • Evaluation: evaluate your trained models
