This guide covers running SAM 3 training on local machines with one or more GPUs.
## Quick Start

Train on a local GPU with an existing config:

```bash
python -m sam3.train.train \
  -c configs/roboflow_v100/roboflow_v100_full_ft_100_images.yaml \
  --use-cluster 0 \
  --num-gpus 1
```
## Command-Line Arguments

The training script (`sam3.train.train`) accepts these arguments:

### Required

- `-c, --config`: Path to config file (e.g., `configs/your_config.yaml`)

### Optional

- `--use-cluster`: Whether to use a SLURM cluster
  - `0`: Run locally (default)
  - `1`: Submit to SLURM
- `--num-gpus`: Number of GPUs per node; overrides `launcher.gpus_per_node` in the config
- `--num-nodes`: Number of compute nodes; overrides `launcher.num_nodes` in the config
- `--partition`: SLURM partition (cluster only)
- `--account`: SLURM account (cluster only)
- `--qos`: SLURM QoS (cluster only)
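To make the flag semantics concrete, here is a hypothetical `argparse` parser mirroring the documented arguments. This is an illustration only; the real parser lives inside `sam3.train.train` and may differ in defaults and types.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser mirroring the documented flags, for illustration.
    parser = argparse.ArgumentParser("sam3.train.train")
    parser.add_argument("-c", "--config", required=True,
                        help="Path to config file")
    parser.add_argument("--use-cluster", type=int, default=0, choices=[0, 1],
                        help="0 = run locally, 1 = submit to SLURM")
    parser.add_argument("--num-gpus", type=int, default=None,
                        help="Overrides launcher.gpus_per_node")
    parser.add_argument("--num-nodes", type=int, default=None,
                        help="Overrides launcher.num_nodes")
    parser.add_argument("--partition", default=None, help="SLURM partition")
    parser.add_argument("--account", default=None, help="SLURM account")
    parser.add_argument("--qos", default=None, help="SLURM QoS")
    return parser

# Parse the Quick Start invocation from above
args = build_parser().parse_args(
    ["-c", "configs/my_config.yaml", "--use-cluster", "0", "--num-gpus", "1"]
)
```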
## Single GPU Training

### Prepare Configuration

Create or modify a config file with single-GPU settings:

```yaml
launcher:
  num_nodes: 1
  gpus_per_node: 1

submitit:
  use_cluster: False

scratch:
  train_batch_size: 1
  gradient_accumulation_steps: 4  # Effective batch size = 4
```
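What `gradient_accumulation_steps` does can be sketched with a toy PyTorch loop: gradients from several micro-batches are accumulated before a single optimizer step. The model and names here are illustrative, not the SAM 3 trainer's internals.

```python
import torch
from torch import nn

# Toy model/optimizer; illustrates accumulation only, not the SAM 3 trainer.
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

accum_steps = 4          # gradient_accumulation_steps
optimizer_steps = 0

for i in range(8):       # 8 micro-batches of size 1
    x = torch.randn(1, 4)
    loss = model(x).pow(2).mean()
    (loss / accum_steps).backward()   # scale so accumulated grads average
    if (i + 1) % accum_steps == 0:    # step once per 4 micro-batches
        optimizer.step()
        optimizer.zero_grad()
        optimizer_steps += 1
# 8 micro-batches / 4 accumulation steps = 2 optimizer steps
```

The loss is divided by `accum_steps` so that the accumulated gradient matches what a single batch of the effective size would produce.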
### Run Training

Start training:

```bash
python -m sam3.train.train \
  -c configs/my_config.yaml \
  --use-cluster 0 \
  --num-gpus 1
```
The training will:

- Load the model and dataset
- Initialize on the default GPU (`cuda:0`)
- Start the training loop
- Save checkpoints to `experiment_log_dir/checkpoints/`
### Monitor Progress

Monitor training in another terminal:

```bash
# Watch logs
tail -f experiments/my_training/logs/*.log

# Launch TensorBoard
tensorboard --logdir experiments/my_training/tensorboard
```

Access TensorBoard at http://localhost:6006.
## Multi-GPU Training (Single Node)

Train on multiple GPUs using DistributedDataParallel:

### Configure Multi-GPU

Update the config for multiple GPUs:

```yaml
launcher:
  num_nodes: 1
  gpus_per_node: 4  # Use 4 GPUs

submitit:
  use_cluster: False

scratch:
  train_batch_size: 1  # Per GPU
  # Effective batch size = 4 (1 per GPU × 4 GPUs)
```
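The effective batch size arithmetic generalizes across nodes, GPUs, and gradient accumulation. A small helper for illustration (not part of the codebase):

```python
def effective_batch_size(per_gpu_batch: int, gpus_per_node: int,
                         num_nodes: int = 1, accum_steps: int = 1) -> int:
    """Global batch size seen by the optimizer per step."""
    return per_gpu_batch * gpus_per_node * num_nodes * accum_steps

# 1 per GPU × 4 GPUs on one node, no accumulation → 4
print(effective_batch_size(1, 4))
```

Note that `gradient_accumulation_steps` and adding GPUs are interchangeable ways to reach the same effective batch size; accumulation trades throughput for memory.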
### Specify GPUs

Choose which GPUs to use:

```bash
# Use specific GPUs
export CUDA_VISIBLE_DEVICES=0,1,2,3

# Or use all available GPUs (default)
# CUDA_VISIBLE_DEVICES not set
```
### Launch Training

Start distributed training:

```bash
python -m sam3.train.train \
  -c configs/my_config.yaml \
  --use-cluster 0 \
  --num-gpus 4
```
The script will:

- Spawn 4 processes (one per GPU)
- Set up distributed communication
- Synchronize gradients across GPUs
- Scale throughput by up to ~4× over a single GPU, minus communication overhead
## Training Modes

### Full Training

Train and validate:

```yaml
trainer:
  mode: train  # Train + validation
  max_epochs: 20
  val_epoch_freq: 10  # Validate every 10 epochs
```

### Training Only

Skip validation for faster iterations:

```yaml
trainer:
  mode: train_only  # No validation
  max_epochs: 20
```
### Validation Only

Evaluate a trained checkpoint:

```yaml
trainer:
  mode: val  # Validation only
  checkpoint:
    resume_from: /path/to/checkpoint.pt
```

Run:

```bash
python -m sam3.train.train \
  -c configs/my_config.yaml \
  --use-cluster 0
```
## Resume Training

Resume from a saved checkpoint:

### Automatic Resume

By default, training auto-resumes from `checkpoint.pt` in the save directory:

```yaml
trainer:
  checkpoint:
    save_dir: experiments/my_training/checkpoints
```

If `checkpoint.pt` exists, training continues from that point.

### Explicit Resume

Resume from a specific checkpoint:

```yaml
trainer:
  checkpoint:
    resume_from: /path/to/checkpoint_10.pt
```

This copies the checkpoint to the save directory and resumes.

### Launch

Run training normally:

```bash
python -m sam3.train.train \
  -c configs/my_config.yaml \
  --use-cluster 0
```

The script will detect the checkpoint and resume automatically.
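The resume decision is essentially "explicit `resume_from` wins, otherwise use `checkpoint.pt` in the save directory if it exists, otherwise start fresh." A sketch of that logic with illustrative names (not the trainer's actual implementation):

```python
from pathlib import Path
from typing import Optional

import torch

def find_resume_checkpoint(save_dir: str,
                           resume_from: Optional[str] = None) -> Optional[Path]:
    # Illustrative only: mirrors the behavior described above.
    auto = Path(save_dir) / "checkpoint.pt"
    if resume_from is not None:
        return Path(resume_from)   # explicit resume wins
    if auto.exists():
        return auto                # automatic resume
    return None                    # fresh run

ckpt = find_resume_checkpoint("experiments/my_training/checkpoints")
start_epoch = 0
if ckpt is not None:
    # Assumes the checkpoint dict stores an "epoch" key.
    state = torch.load(ckpt, map_location="cpu")
    start_epoch = state.get("epoch", 0) + 1
```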
## Monitoring Training

### TensorBoard

View training metrics in real time:

```bash
# Start TensorBoard
tensorboard --logdir experiments/my_training/tensorboard --port 6006

# Access at http://localhost:6006
```

Metrics include:

- Training/validation losses
- Learning rates
- Gradient norms
- Memory usage
- Batch processing time
### Log Files

Training logs are saved to:

```
experiments/my_training/
├── logs/
│   ├── train.log          # Primary training log
│   ├── train_stats.json   # Epoch statistics
│   ├── val_stats.json     # Validation results
│   └── best_stats.json    # Best metrics
├── checkpoints/
│   ├── checkpoint.pt      # Latest checkpoint
│   └── checkpoint_10.pt   # Saved epochs
└── tensorboard/
```
### Training Progress

During training, you'll see output like:

```
Train Epoch: [5]
Batch Time: 2.34s | Data Time: 0.12s | Mem: 14.2GB
Losses/train_all_loss: 0.523
Meters: {'detection_AP': 0.45, 'loss': 0.523}
```
## Example Workflows

### Fine-tune on a Custom Dataset

#### Prepare Config

Create `configs/my_dataset.yaml`:

```yaml
paths:
  dataset_root: /data/my_dataset
  experiment_log_dir: /experiments/my_dataset_ft
  bpe_path: sam3/assets/bpe_simple_vocab_16e6.txt.gz

launcher:
  num_nodes: 1
  gpus_per_node: 2

trainer:
  max_epochs: 20
  mode: train

scratch:
  enable_segmentation: True
  train_batch_size: 1
  lr_scale: 0.1
```
Train
python -m sam3.train.train \
-c configs/my_dataset.yaml \
--use-cluster 0 \
--num-gpus 2
Evaluate
After training completes, run validation:# Validation runs automatically, or run separately:
# Modify config mode to 'val' and re-run
### Quick Experiment (Small Dataset)

```bash
# Create a config with limited data
python -m sam3.train.train \
  -c configs/quick_test.yaml \
  --use-cluster 0 \
  --num-gpus 1
```

Config:

```yaml
trainer:
  max_epochs: 5
  skip_first_val: False
  val_epoch_freq: 1

data:
  train:
    dataset:
      limit_ids: 50  # Only 50 images
```
### Debug Training

For debugging:

```yaml
trainer:
  max_epochs: 1

scratch:
  num_train_workers: 0  # Single process
  train_batch_size: 1

launcher:
  gpus_per_node: 1
```

Run:

```bash
python -m sam3.train.train \
  -c configs/debug.yaml \
  --use-cluster 0 \
  --num-gpus 1
```
## Common Issues

### Out of Memory

**Symptom:** CUDA OOM error

**Solutions:**

- Reduce the batch size:

  ```yaml
  scratch:
    train_batch_size: 1
  ```

- Use gradient accumulation:

  ```yaml
  trainer:
    gradient_accumulation_steps: 4
  ```

- Reduce the input resolution.

- Disable segmentation:

  ```yaml
  scratch:
    enable_segmentation: False
  ```
### Slow Training

**Symptom:** Low GPU utilization

**Solutions:**

- Increase data workers:

  ```yaml
  scratch:
    num_train_workers: 8
  ```

- Enable `pin_memory`:

  ```yaml
  data:
    train:
      pin_memory: True
  ```

- Use mixed precision:

  ```yaml
  trainer:
    optim:
      amp:
        enabled: True
        amp_dtype: bfloat16
  ```
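The `amp` settings correspond to PyTorch's autocast mechanism: operations inside the autocast region run in the lower-precision dtype. A CPU-runnable sketch (the trainer would use device type `"cuda"` and its own training loop; the model here is a stand-in):

```python
import torch
from torch import nn

model = nn.Linear(8, 2)
x = torch.randn(4, 8)

device_type = "cuda" if torch.cuda.is_available() else "cpu"
# amp_dtype: bfloat16, matching the config fragment above
with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
    out = model(x)  # matmul runs (and its output is) in bfloat16
```

With `bfloat16`, loss scaling (`GradScaler`) is generally unnecessary because its exponent range matches float32; scaling matters mainly for `float16`.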
### NaN Loss

**Symptom:** Loss becomes NaN

**Solutions:**

- Reduce the learning rate.

- Enable gradient clipping (should be on by default):

  ```yaml
  trainer:
    optim:
      gradient_clip:
        max_norm: 0.1
  ```

- Check the data for corrupted images.

Training stops automatically if the loss becomes NaN; check your data and reduce the learning rate if this occurs.
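Both safeguards amount to a few lines of PyTorch: stop when the loss is non-finite, and clip the global gradient norm before the optimizer step. A generic sketch, not the trainer's code:

```python
import torch
from torch import nn

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(2, 4)
loss = model(x).pow(2).mean()

# Guard: halt on NaN/Inf loss before it poisons the weights
if not torch.isfinite(loss):
    raise RuntimeError("Loss is NaN/Inf; stopping training")

loss.backward()
# max_norm: 0.1 as in the config fragment above; returns the pre-clip norm
pre_clip_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)
optimizer.step()

# After clipping, the global gradient norm is at most max_norm
clipped_norm = torch.norm(
    torch.stack([p.grad.norm() for p in model.parameters()])
)
```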
## Maximize GPU Utilization

- Use AMP (Automatic Mixed Precision):

  ```yaml
  trainer:
    optim:
      amp:
        enabled: True
        amp_dtype: bfloat16
  ```

- Tune the batch size: find the largest batch that fits in memory.

- Use multiple workers:

  ```yaml
  scratch:
    num_train_workers: 8  # try 8-16
  ```

- Enable cuDNN benchmark:

  ```yaml
  trainer:
    cuda:
      cudnn_benchmark: true
  ```
## Faster Iterations

- Skip validation during development: `mode: train_only`
- Use a subset of the data: `limit_ids: 100`
- Reduce workers during debugging: `num_train_workers: 0`
## Next Steps

- **Cluster Training**: scale to SLURM clusters for larger experiments
- **Evaluation**: evaluate your trained models