This guide covers running SAM 3 training on SLURM-managed computing clusters for large-scale experiments.

Overview

SAM 3 uses Submitit to interface with SLURM workload managers, enabling:
  • Multi-node distributed training
  • Automatic job submission and management
  • Checkpoint-based preemption handling
  • Resource specification (GPUs, memory, time)
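Conceptually, Submitit turns these settings into a SLURM batch script and submits it on your behalf. A minimal stdlib sketch of that translation (field names here are illustrative, not SAM 3's actual config schema):

```python
# Illustrative sketch: how a launcher like Submitit might render SLURM
# directives from a config dict. Field names are hypothetical.
def render_sbatch(cfg: dict) -> str:
    lines = ["#!/bin/bash"]
    lines.append(f"#SBATCH --job-name={cfg['name']}")
    lines.append(f"#SBATCH --account={cfg['account']}")
    lines.append(f"#SBATCH --partition={cfg['partition']}")
    lines.append(f"#SBATCH --nodes={cfg['num_nodes']}")
    lines.append(f"#SBATCH --gpus-per-node={cfg['gpus_per_node']}")
    lines.append(f"#SBATCH --time={cfg['timeout_hour']}:00:00")
    lines.append("srun python -m sam3.train.train -c " + cfg["config"])
    return "\n".join(lines)

script = render_sbatch({
    "name": "sam3_train", "account": "my_account", "partition": "gpu",
    "num_nodes": 4, "gpus_per_node": 8, "timeout_hour": 72,
    "config": "configs/my_config.yaml",
})
print(script)
```

In practice you never write this script yourself — Submitit generates and submits it, then requeues the job on preemption.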

Prerequisites

  • Access to a SLURM cluster
  • SLURM account and partition allocation
  • Shared filesystem (NFS, Lustre, etc.)
  • Python environment with SAM 3 installed on all nodes

Quick Start

Submit a training job to SLURM:
python -m sam3.train.train \
  -c configs/my_config.yaml \
  --use-cluster 1 \
  --num-gpus 8 \
  --num-nodes 4 \
  --partition gpu \
  --account my_account

SLURM Configuration

Configure SLURM parameters in your config file:
submitit:
  # Enable SLURM submission
  use_cluster: True
  
  # SLURM account (required)
  account: my_account
  
  # Partition/queue (required)
  partition: gpu
  
  # Quality of Service (optional)
  qos: high_priority
  
  # Job timeout in hours
  timeout_hour: 72
  
  # CPUs per task
  cpus_per_task: 10
  
  # Port range for distributed training
  port_range: [10000, 65000]
  
  # Node constraints (optional)
  constraint: volta32gb  # e.g., GPU type
  
  # Memory allocation
  mem_gb: 128  # Or use 'mem: 128G'
  
  # Exclude specific nodes (optional)
  exclude_nodes:
    - node001
    - node002
  
  # Include specific nodes (optional)
  # include_nodes:
  #   - node010
  #   - node011
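The port_range setting above bounds the rendezvous port used for distributed initialization. A stdlib sketch of how a free port in such a range can be found (Submitit's own selection logic differs; this only illustrates the idea):

```python
import socket

def pick_free_port(low: int = 10000, high: int = 65000) -> int:
    """Ask the OS for a free port (bind to 0) and retry until it
    falls inside the allowed range."""
    for _ in range(100):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.bind(("", 0))
            port = s.getsockname()[1]
        if low <= port <= high:
            return port
    raise RuntimeError("no free port found in range")

port = pick_free_port()
```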

Command-Line Overrides

Override SLURM settings from command line:
python -m sam3.train.train \
  -c configs/my_config.yaml \
  --use-cluster 1 \
  --partition gpu_a100 \
  --account research_team \
  --qos urgent \
  --num-gpus 8 \
  --num-nodes 4

Multi-Node Training

Step 1: Configure Multi-Node

Set node and GPU counts:
launcher:
  num_nodes: 4
  gpus_per_node: 8

submitit:
  use_cluster: True
  partition: gpu
  account: my_account
  timeout_hour: 72
  cpus_per_task: 10
This will request:
  • 4 compute nodes
  • 8 GPUs per node
  • Total: 32 GPUs
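The world size and effective batch size follow directly from these two numbers; a quick sanity check (per-GPU batch size is an illustrative value):

```python
num_nodes = 4
gpus_per_node = 8
per_gpu_batch = 2  # illustrative; set by scratch.train_batch_size

world_size = num_nodes * gpus_per_node        # total ranks / GPUs = 32
effective_batch = per_gpu_batch * world_size  # global batch per step

print(world_size, effective_batch)  # 32 64
```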
Step 2: Configure Distributed Backend

Set up NCCL for multi-node communication:
trainer:
  distributed:
    backend: nccl
    find_unused_parameters: True
    gradient_as_bucket_view: True
    timeout_mins: 30
Step 3: Submit Job

Submit to SLURM:
python -m sam3.train.train \
  -c configs/multi_node.yaml \
  --use-cluster 1
The script will:
  • Create submitit logs directory
  • Submit SLURM job
  • Print job ID
  • Exit (training runs on cluster)
Step 4: Monitor Job

Check job status:
# View job queue
squeue -u $USER

# Check job details
scontrol show job <job_id>

# View output logs
tail -f experiments/my_training/submitit_logs/<job_id>_0_log.out

Job Arrays

Run multiple training jobs in parallel (e.g., hyperparameter sweeps):
submitit:
  use_cluster: True
  
  # Job array configuration
  job_array:
    num_tasks: 10  # Run 10 parallel jobs
    task_index: 0  # Current task (set automatically)

launcher:
  num_nodes: 1
  gpus_per_node: 2

# Use task_index in config
scratch:
  lr_scale: ${get_lr_from_index:${submitit.job_array.task_index}}
Submit:
python -m sam3.train.train \
  -c configs/job_array.yaml \
  --use-cluster 1
This creates 10 independent jobs, each with a different configuration based on task_index.
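Under the hood, SLURM exposes each array task's index through the SLURM_ARRAY_TASK_ID environment variable. A hypothetical resolver like get_lr_from_index above could be sketched as:

```python
import os

LR_TABLE = [1e-4, 5e-5, 1e-5, 5e-6, 1e-6]  # illustrative sweep values

def get_lr_from_index(task_index=None):
    """Map a job-array task index to a learning rate. Falls back to
    the SLURM_ARRAY_TASK_ID env var when no index is given."""
    if task_index is None:
        task_index = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    return LR_TABLE[task_index]

# Simulate running as array task 2:
os.environ["SLURM_ARRAY_TASK_ID"] = "2"
print(get_lr_from_index())  # 1e-05
```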

Resource Management

GPU Types

Request specific GPU architectures:
submitit:
  constraint: a100  # A100 GPUs only
  # Or: volta, ampere, hopper, etc.

Memory Allocation

Specify memory requirements:
submitit:
  # Option 1: Total memory in GB
  mem_gb: 256
  
  # Option 2: Memory per node
  # mem: 256G

CPU Allocation

submitit:
  cpus_per_task: 10  # CPUs per GPU task
Rule of thumb: 8-12 CPUs per GPU for data loading.

Time Limits

submitit:
  timeout_hour: 72  # Job will be killed after 72 hours
Set the timeout within your partition's limits; jobs requesting more time than the partition allows are rejected or left pending with reason PartitionTimeLimit.

Checkpointing and Preemption

SAM 3 handles preemption automatically:

Automatic Resume

When a job is preempted or crashes:
  1. Checkpoint is saved to experiment_log_dir/checkpoints/checkpoint.pt
  2. Resubmit the same command
  3. Training resumes from the last checkpoint
# Resubmit after preemption
python -m sam3.train.train \
  -c configs/my_config.yaml \
  --use-cluster 1
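The resume decision reduces to a checkpoint-existence check at startup. A stdlib-only sketch of the pattern (SAM 3's actual loading logic lives in the trainer):

```python
from pathlib import Path

def resolve_checkpoint(experiment_log_dir):
    """Return the checkpoint path to resume from, or None for a fresh start."""
    ckpt = Path(experiment_log_dir) / "checkpoints" / "checkpoint.pt"
    return ckpt if ckpt.exists() else None

# The same command is run on every (re)submission; training resumes
# automatically whenever the checkpoint file is present.
ckpt = resolve_checkpoint("/tmp/nonexistent_experiment")
print(ckpt)  # None — fresh run
```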

Checkpoint Frequency

trainer:
  checkpoint:
    save_dir: ${launcher.experiment_log_dir}/checkpoints
    save_freq: 0  # Save only latest (recommended for preemption)
    # Or save every N epochs:
    # save_freq: 5
    
    # Save specific epochs
    save_list: [10, 20, 30]
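The interaction between save_freq and save_list can be summarized as a small predicate — a sketch of the semantics described above, not SAM 3's actual code:

```python
def should_keep_epoch_checkpoint(epoch, save_freq=0, save_list=()):
    """save_freq == 0: only the rolling 'latest' checkpoint is kept.
    save_freq == N (> 0): additionally keep a copy every N epochs.
    save_list: always keep copies at these specific epochs."""
    if epoch in save_list:
        return True
    return save_freq > 0 and epoch % save_freq == 0

print(should_keep_epoch_checkpoint(10, save_freq=5))            # True
print(should_keep_epoch_checkpoint(7, save_list=[7]))           # True
print(should_keep_epoch_checkpoint(7, save_freq=5))             # False
```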

Initialize After Preemption

For large models with partial checkpoint saving:
trainer:
  checkpoint:
    skip_saving_parameters:
      - 'backbone.vision_backbone.*'
    initialize_after_preemption: True
    
    model_weight_initializer:
      _target_: my_initializer
      path: /path/to/pretrained.pt

Monitoring

SLURM Logs

Submitit creates logs in experiment_log_dir/submitit_logs/:
submitit_logs/
├── <job_id>_0_log.out      # stdout from rank 0
├── <job_id>_0_log.err      # stderr from rank 0
├── <job_id>_1_log.out      # stdout from rank 1
└── ...
Monitor logs:
# Watch primary worker
tail -f experiments/my_training/submitit_logs/<job_id>_0_log.out

# Check for errors
grep -i error experiments/my_training/submitit_logs/*.err

SLURM Commands

# View queue
squeue -u $USER

# Job details
scontrol show job <job_id>

# Cancel job
scancel <job_id>

# Job efficiency (after completion)
seff <job_id>

# Accounting info
sacct -j <job_id> --format=JobID,JobName,State,Elapsed,MaxRSS,MaxVMSize

TensorBoard on Cluster

Access TensorBoard from cluster:
# On cluster login node
tensorboard --logdir /path/to/experiments/tensorboard --port 6006

# On your local machine (SSH tunnel)
ssh -L 6006:localhost:6006 [email protected]

# Access at http://localhost:6006

Example Configurations

Single-Node 8-GPU Job

paths:
  experiment_log_dir: /cluster/work/my_experiment

launcher:
  num_nodes: 1
  gpus_per_node: 8

submitit:
  use_cluster: True
  account: my_account
  partition: gpu
  timeout_hour: 24
  cpus_per_task: 10
  mem_gb: 128

trainer:
  max_epochs: 20

scratch:
  train_batch_size: 2  # per GPU; effective batch = 2 × 8 GPUs = 16
Submit:
python -m sam3.train.train -c configs/single_node.yaml --use-cluster 1

Multi-Node Large-Scale Training

paths:
  experiment_log_dir: /cluster/work/large_scale

launcher:
  num_nodes: 8
  gpus_per_node: 8

submitit:
  use_cluster: True
  account: research_team
  partition: gpu_a100
  qos: high
  timeout_hour: 72
  cpus_per_task: 12
  mem_gb: 256
  constraint: a100

trainer:
  max_epochs: 50
  distributed:
    backend: nccl
    comms_dtype: bfloat16  # Compress gradients

scratch:
  train_batch_size: 4  # Effective batch = 4 × 64 GPUs = 256
  num_train_workers: 12
Submit:
python -m sam3.train.train \
  -c configs/large_scale.yaml \
  --use-cluster 1 \
  --partition gpu_a100
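When scaling from one node to eight, a common rule of thumb (not SAM 3-specific, and worth validating empirically) is to scale the learning rate linearly with the effective batch size:

```python
base_lr = 1e-4   # illustrative: LR tuned at a reference batch size
base_batch = 32  # illustrative: the reference effective batch size

num_nodes, gpus_per_node, per_gpu_batch = 8, 8, 4
effective_batch = num_nodes * gpus_per_node * per_gpu_batch  # 256

# Linear scaling rule: LR grows proportionally with the global batch.
scaled_lr = base_lr * effective_batch / base_batch
print(effective_batch, scaled_lr)  # 256 0.0008
```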

Hyperparameter Sweep with Job Arrays

submitit:
  use_cluster: True
  account: my_account
  partition: gpu
  timeout_hour: 12
  
  job_array:
    num_tasks: 5  # 5 different learning rates

launcher:
  num_nodes: 1
  gpus_per_node: 2

# Define sweep values
lr_values:
  - 0.0001
  - 0.00005
  - 0.00001
  - 0.000005
  - 0.000001

scratch:
  lr_transformer: ${lr_values.${submitit.job_array.task_index}}

Troubleshooting

Job Fails to Start

Check: SLURM limits and permissions
# View account limits
sacctmgr show assoc where user=$USER format=account,partition,qos,maxjobs,maxnodes

# Check partition availability
sinfo -p gpu

Out of Memory on Cluster

Increase memory request:
submitit:
  mem_gb: 256  # Increase as needed

NCCL Timeout Errors

Increase timeout:
trainer:
  distributed:
    timeout_mins: 60  # Default is 30
Check network:
# Test node-to-node communication
export NCCL_DEBUG=INFO

Job Preempted Repeatedly

Use higher QoS or longer timeout:
submitit:
  qos: high_priority
  timeout_hour: 72

Slow Startup

Cause: loading the model on all GPUs simultaneously.
Solution: load on CPU first (the default):
trainer:
  model:
    device: cpus  # Load to CPU, then move to GPU
On shared filesystems, many workers reading the same checkpoint simultaneously can cause slowdowns. Consider copying checkpoints to local node storage.
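A minimal sketch of staging a checkpoint to node-local storage before loading (paths illustrative; many clusters expose local scratch via the TMPDIR environment variable):

```python
import os
import shutil
import tempfile

def stage_to_local(shared_path):
    """Copy a file from shared storage to node-local scratch and return
    the local path. Skips the copy if the file is already staged."""
    local_dir = os.environ.get("TMPDIR", tempfile.gettempdir())
    local_path = os.path.join(local_dir, os.path.basename(shared_path))
    if not os.path.exists(local_path):
        shutil.copy2(shared_path, local_path)
    return local_path

# Demo with a throwaway file standing in for a checkpoint on shared storage.
src = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")
with open(src, "wb") as f:
    f.write(b"fake-weights")
local = stage_to_local(src)
print(local)
```

On multi-GPU nodes, have only one rank per node perform the copy and the rest wait on a barrier.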

Best Practices

Checkpoint Strategy

  1. Use shared filesystem for checkpoint_dir
  2. Save frequently on preemptible partitions
  3. Test resume before long jobs:
# Run 1 epoch, then resume
python -m sam3.train.train -c config.yaml --use-cluster 1
# After completion, resubmit to test resume
python -m sam3.train.train -c config.yaml --use-cluster 1

Resource Requests

  1. Start small: Test on 1 node before scaling
  2. Profile memory: Check nvidia-smi during training
  3. Use seff: Analyze resource usage after jobs complete

Data Loading

  1. Shared filesystem: Ensure dataset is on shared storage
  2. Multiple workers: Use 8-12 workers per GPU
  3. Pin memory: Enable for faster transfers:
data:
  train:
    pin_memory: True
    num_workers: 10

Job Naming

submitit:
  name: sam3_roboflow_v100_run1  # Helpful for queue viewing

Advanced Topics

Custom SLURM Parameters

Pass additional SLURM options:
submitit:
  # srun arguments
  srun_args:
    cpu_bind: cores  # CPU binding strategy
  
  # Comment for accounting
  comment: "SAM3 training experiment 42"

Node Selection

submitit:
  # Exclude problematic nodes
  exclude_nodes:
    - node042
    - node043
  
  # Or explicitly include nodes
  include_nodes:
    - node010
    - node011
    - node012
    - node013

Environment Variables

Set environment variables for the job:
trainer:
  env_variables:
    NCCL_DEBUG: INFO
    NCCL_IB_DISABLE: 1
    PYTORCH_CUDA_ALLOC_CONF: max_split_size_mb:512

Next Steps

  • Configuration: deep dive into configuration options
  • Evaluation: evaluate trained models
