This guide covers running SAM 3 training on SLURM-managed computing clusters for large-scale experiments.
## Overview

SAM 3 uses Submitit to interface with SLURM workload managers, enabling:

- Multi-node distributed training
- Automatic job submission and management
- Checkpoint-based preemption handling
- Resource specification (GPUs, memory, time)
## Prerequisites

- Access to a SLURM cluster
- SLURM account and partition allocation
- Shared filesystem (NFS, Lustre, etc.)
- Python environment with SAM 3 installed on all nodes
## Quick Start

Submit a training job to SLURM:

```bash
python -m sam3.train.train \
  -c configs/my_config.yaml \
  --use-cluster 1 \
  --num-gpus 8 \
  --num-nodes 4 \
  --partition gpu \
  --account my_account
```
## SLURM Configuration

Configure SLURM parameters in your config file:

```yaml
submitit:
  # Enable SLURM submission
  use_cluster: True
  # SLURM account (required)
  account: my_account
  # Partition/queue (required)
  partition: gpu
  # Quality of Service (optional)
  qos: high_priority
  # Job timeout in hours
  timeout_hour: 72
  # CPUs per task
  cpus_per_task: 10
  # Port range for distributed training
  port_range: [10000, 65000]
  # Node constraints (optional)
  constraint: volta32gb  # e.g., GPU type
  # Memory allocation
  mem_gb: 128  # Or use 'mem: 128G'
  # Exclude specific nodes (optional)
  exclude_nodes:
    - node001
    - node002
  # Include specific nodes (optional)
  # include_nodes:
  #   - node010
  #   - node011
```
## Command-Line Overrides

Override SLURM settings from the command line:

```bash
python -m sam3.train.train \
  -c configs/my_config.yaml \
  --use-cluster 1 \
  --partition gpu_a100 \
  --account research_team \
  --qos urgent \
  --num-gpus 8 \
  --num-nodes 4
```
## Multi-Node Training

### Configure Multi-Node

Set node and GPU counts:

```yaml
launcher:
  num_nodes: 4
  gpus_per_node: 8

submitit:
  use_cluster: True
  partition: gpu
  account: my_account
  timeout_hour: 72
  cpus_per_task: 10
```

This will request:

- 4 compute nodes
- 8 GPUs per node
- Total: 32 GPUs
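For orientation, each of the 32 training processes gets a global rank derived from its node index and local GPU index. A minimal sketch of that standard arithmetic (the helper names are illustrative, not SAM 3 API):

```python
def world_size(num_nodes: int, gpus_per_node: int) -> int:
    """Total number of data-parallel processes across the job."""
    return num_nodes * gpus_per_node

def global_rank(node_rank: int, local_rank: int, gpus_per_node: int) -> int:
    """Standard mapping used by torch.distributed-style launchers."""
    return node_rank * gpus_per_node + local_rank
```

With the config above, `world_size(4, 8)` is 32, and the last GPU on the last node gets rank 31.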
### Configure Distributed Backend

Set up NCCL for multi-node communication:

```yaml
trainer:
  distributed:
    backend: nccl
    find_unused_parameters: True
    gradient_as_bucket_view: True
    timeout_mins: 30
```
### Submit Job

Submit to SLURM:

```bash
python -m sam3.train.train \
  -c configs/multi_node.yaml \
  --use-cluster 1
```

The script will:

1. Create the submitit logs directory
2. Submit the SLURM job
3. Print the job ID
4. Exit (training runs on the cluster)
### Monitor Job

Check job status:

```bash
# View job queue
squeue -u $USER

# Check job details
scontrol show job <job_id>

# View output logs
tail -f experiments/my_training/submitit_logs/<job_id>_0_log.out
```
## Job Arrays

Run multiple training jobs in parallel (e.g., hyperparameter sweeps):

```yaml
submitit:
  use_cluster: True
  # Job array configuration
  job_array:
    num_tasks: 10  # Run 10 parallel jobs
    task_index: 0  # Current task (set automatically)

launcher:
  num_nodes: 1
  gpus_per_node: 2

# Use task_index in config
scratch:
  lr_scale: ${get_lr_from_index:${submitit.job_array.task_index}}
```

Submit:

```bash
python -m sam3.train.train \
  -c configs/job_array.yaml \
  --use-cluster 1
```

This creates 10 independent jobs, each with a different configuration based on its task_index.
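The `${get_lr_from_index:...}` resolver is assumed to be provided by the config system; it is not defined in this guide. A minimal sketch of what such a resolver function might compute, assuming a geometric schedule over the array index (the base value and decay factor here are made up for illustration):

```python
def get_lr_from_index(task_index: int,
                      base_lr: float = 1e-4,
                      factor: float = 0.5) -> float:
    """Map a job-array task index to a learning-rate scale.

    In a real config this would be registered as an OmegaConf-style
    resolver so that ${get_lr_from_index:...} can call it.
    """
    return base_lr * factor ** task_index
```

Task 0 would then train with the base rate and each subsequent task with half the previous one.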
## Resource Management

### GPU Types

Request specific GPU architectures:

```yaml
submitit:
  constraint: a100  # A100 GPUs only
  # Or: volta, ampere, hopper, etc.
```

### Memory Allocation

Specify memory requirements:

```yaml
submitit:
  # Option 1: Total memory in GB
  mem_gb: 256
  # Option 2: Memory per node
  # mem: 256G
```

### CPU Allocation

```yaml
submitit:
  cpus_per_task: 10  # CPUs per GPU task
```

Rule of thumb: 8-12 CPUs per GPU for data loading.

### Time Limits

```yaml
submitit:
  timeout_hour: 72  # Job will be killed after 72 hours
```
Set the timeout within your partition's limits; jobs requesting more time than the partition allows are rejected immediately.
## Checkpointing and Preemption

SAM 3 handles preemption automatically.

### Automatic Resume

When a job is preempted or crashes:

1. The checkpoint is saved to `experiment_log_dir/checkpoints/checkpoint.pt`
2. Resubmit the same command
3. Training resumes from the last checkpoint

```bash
# Resubmit after preemption
python -m sam3.train.train \
  -c configs/my_config.yaml \
  --use-cluster 1
```
### Checkpoint Frequency

```yaml
trainer:
  checkpoint:
    save_dir: ${launcher.experiment_log_dir}/checkpoints
    save_freq: 0  # Save only latest (recommended for preemption)
    # Or save every N epochs:
    # save_freq: 5
    # Save specific epochs
    save_list: [10, 20, 30]
```
### Initialize After Preemption

For large models with partial checkpoint saving:

```yaml
trainer:
  checkpoint:
    skip_saving_parameters:
      - 'backbone.vision_backbone.*'
    initialize_after_preemption: True
    model_weight_initializer:
      _target_: my_initializer
      path: /path/to/pretrained.pt
```
## Monitoring

### SLURM Logs

Submitit creates logs in `experiment_log_dir/submitit_logs/`:

```
submitit_logs/
├── <job_id>_0_log.out  # stdout from rank 0
├── <job_id>_0_log.err  # stderr from rank 0
├── <job_id>_1_log.out  # stdout from rank 1
└── ...
```

Monitor logs:

```bash
# Watch primary worker
tail -f experiments/my_training/submitit_logs/<job_id>_0_log.out

# Check for errors
grep -i error experiments/my_training/submitit_logs/*.err
```
### SLURM Commands

```bash
# View queue
squeue -u $USER

# Job details
scontrol show job <job_id>

# Cancel job
scancel <job_id>

# Job efficiency (after completion)
seff <job_id>

# Accounting info
sacct -j <job_id> --format=JobID,JobName,State,Elapsed,MaxRSS,MaxVMSize
```
### TensorBoard on Cluster

Access TensorBoard from the cluster:

```bash
# On cluster login node
tensorboard --logdir /path/to/experiments/tensorboard --port 6006

# On your local machine (SSH tunnel)
ssh -L 6006:localhost:6006 <username>@<cluster-hostname>

# Access at http://localhost:6006
```
## Example Configurations

### Single-Node 8-GPU Job

```yaml
paths:
  experiment_log_dir: /cluster/work/my_experiment

launcher:
  num_nodes: 1
  gpus_per_node: 8

submitit:
  use_cluster: True
  account: my_account
  partition: gpu
  timeout_hour: 24
  cpus_per_task: 10
  mem_gb: 128

trainer:
  max_epochs: 20

scratch:
  train_batch_size: 2  # Per GPU; effective batch = 16
```

Submit:

```bash
python -m sam3.train.train -c configs/single_node.yaml --use-cluster 1
```
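The effective batch size noted in the config comment is the per-GPU batch multiplied by the total GPU count. A quick sanity check of that arithmetic (the helper name is illustrative):

```python
def effective_batch_size(per_gpu_batch: int,
                         num_nodes: int,
                         gpus_per_node: int) -> int:
    """Global batch size across all data-parallel workers."""
    return per_gpu_batch * num_nodes * gpus_per_node
```

The single-node example gives 2 × 1 × 8 = 16; scaling up to 8 nodes with a per-GPU batch of 4 gives 256.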
### Multi-Node Large-Scale Training

```yaml
paths:
  experiment_log_dir: /cluster/work/large_scale

launcher:
  num_nodes: 8
  gpus_per_node: 8

submitit:
  use_cluster: True
  account: research_team
  partition: gpu_a100
  qos: high
  timeout_hour: 72
  cpus_per_task: 12
  mem_gb: 256
  constraint: a100

trainer:
  max_epochs: 50
  distributed:
    backend: nccl
    comms_dtype: bfloat16  # Compress gradients

scratch:
  train_batch_size: 4  # Effective batch = 4 × 64 GPUs = 256
  num_train_workers: 12
```

Submit:

```bash
python -m sam3.train.train \
  -c configs/large_scale.yaml \
  --use-cluster 1 \
  --partition gpu_a100
```
### Hyperparameter Sweep with Job Arrays

```yaml
submitit:
  use_cluster: True
  account: my_account
  partition: gpu
  timeout_hour: 12
  job_array:
    num_tasks: 5  # 5 different learning rates

launcher:
  num_nodes: 1
  gpus_per_node: 2

# Define sweep values
lr_values:
  - 0.0001
  - 0.00005
  - 0.00001
  - 0.000005
  - 0.000001

scratch:
  lr_transformer: ${lr_values.${submitit.job_array.task_index}}
```
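Inside each array task, the task index typically comes from the standard `SLURM_ARRAY_TASK_ID` environment variable that SLURM exports. A minimal sketch of selecting a sweep value that way (how SAM 3 actually wires this up internally is an assumption):

```python
import os

def select_lr(values, default_index=0):
    """Pick this task's sweep value using SLURM's array-task index."""
    # SLURM exports SLURM_ARRAY_TASK_ID inside every job-array task;
    # fall back to default_index when running outside an array.
    idx = int(os.environ.get("SLURM_ARRAY_TASK_ID", default_index))
    return values[idx]
```

Task 2 of the sweep above would thus train with a learning rate of 0.00001.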
## Troubleshooting

### Job Fails to Start

**Check**: SLURM limits and permissions

```bash
# View account limits
sacctmgr show assoc where user=$USER format=account,partition,qos,maxjobs,maxnodes

# Check partition availability
sinfo -p gpu
```

### Out of Memory on Cluster

**Increase memory request**:

```yaml
submitit:
  mem_gb: 256  # Increase as needed
```

### NCCL Timeout Errors

**Increase timeout**:

```yaml
trainer:
  distributed:
    timeout_mins: 60  # Default is 30
```

**Check network**:

```bash
# Test node-to-node communication
export NCCL_DEBUG=INFO
```

### Job Preempted Repeatedly

**Use higher QoS or longer timeout**:

```yaml
submitit:
  qos: high_priority
  timeout_hour: 72
```

### Slow Startup

**Cause**: Loading the model on all GPUs

**Solution**: Load on CPU first (default):

```yaml
trainer:
  model:
    device: cpus  # Load to CPU, then move to GPU
```
On shared filesystems, many workers reading the same checkpoint simultaneously can cause slowdowns. Consider copying checkpoints to local node storage.
## Best Practices

### Checkpoint Strategy

- Use a shared filesystem for the checkpoint directory
- Save frequently on preemptible partitions
- Test resume before long jobs:

```bash
# Run 1 epoch, then resume
python -m sam3.train.train -c config.yaml --use-cluster 1

# After completion, resubmit to test resume
python -m sam3.train.train -c config.yaml --use-cluster 1
```

### Resource Requests

- **Start small**: Test on 1 node before scaling
- **Profile memory**: Check nvidia-smi during training
- **Use seff**: Analyze resource usage after jobs complete

### Data Loading

- **Shared filesystem**: Ensure the dataset is on shared storage
- **Multiple workers**: Use 8-12 workers per GPU
- **Pin memory**: Enable for faster transfers:

```yaml
data:
  train:
    pin_memory: True
    num_workers: 10
```
### Job Naming

```yaml
submitit:
  name: sam3_roboflow_v100_run1  # Helpful for queue viewing
```
## Advanced Topics

### Custom SLURM Parameters

Pass additional SLURM options:

```yaml
submitit:
  # srun arguments
  srun_args:
    cpu_bind: cores  # CPU binding strategy
  # Comment for accounting
  comment: "SAM3 training experiment 42"
```
### Node Selection

```yaml
submitit:
  # Exclude problematic nodes
  exclude_nodes:
    - node042
    - node043
  # Or explicitly include nodes
  include_nodes:
    - node010
    - node011
    - node012
    - node013
```
### Environment Variables

Set environment variables for the job:

```yaml
trainer:
  env_variables:
    NCCL_DEBUG: INFO
    NCCL_IB_DISABLE: 1
    PYTORCH_CUDA_ALLOC_CONF: max_split_size_mb:512
```
## Next Steps

- **Configuration**: Deep dive into configuration options
- **Evaluation**: Evaluate trained models