This guide covers running SAM 3 training on SLURM-managed computing clusters for large-scale experiments.
## Overview

SAM 3 uses Submitit to interface with SLURM workload managers, enabling:

- Multi-node distributed training
- Automatic job submission and management
- Checkpoint-based preemption handling
- Resource specification (GPUs, memory, time)
## Prerequisites

- Access to a SLURM cluster
- SLURM account and partition allocation
- Shared filesystem (NFS, Lustre, etc.)
- Python environment with SAM 3 installed on all nodes
## Quick Start

Submit a training job to SLURM:

```bash
python -m sam3.train.train \
  -c configs/my_config.yaml \
  --use-cluster 1 \
  --num-gpus 8 \
  --num-nodes 4 \
  --partition gpu \
  --account my_account
```
## SLURM Configuration

Configure SLURM parameters in your config file:

```yaml
submitit:
  # Enable SLURM submission
  use_cluster: True
  # SLURM account (required)
  account: my_account
  # Partition/queue (required)
  partition: gpu
  # Quality of Service (optional)
  qos: high_priority
  # Job timeout in hours
  timeout_hour: 72
  # CPUs per task
  cpus_per_task: 10
  # Port range for distributed training
  port_range: [10000, 65000]
  # Node constraints (optional)
  constraint: volta32gb  # e.g., GPU type
  # Memory allocation
  mem_gb: 128  # Or use 'mem: 128G'
  # Exclude specific nodes (optional)
  exclude_nodes:
    - node001
    - node002
  # Include specific nodes (optional)
  # include_nodes:
  #   - node010
  #   - node011
```
## Command-Line Overrides

Override SLURM settings from the command line:

```bash
python -m sam3.train.train \
  -c configs/my_config.yaml \
  --use-cluster 1 \
  --partition gpu_a100 \
  --account research_team \
  --qos urgent \
  --num-gpus 8 \
  --num-nodes 4
```
## Multi-Node Training

### Configure Multi-Node

Set node and GPU counts:

```yaml
launcher:
  num_nodes: 4
  gpus_per_node: 8

submitit:
  use_cluster: True
  partition: gpu
  account: my_account
  timeout_hour: 72
  cpus_per_task: 10
```

This will request:

- 4 compute nodes
- 8 GPUs per node
- Total: 32 GPUs
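For orientation, each of the 32 training processes gets a global rank derived from its node index and local GPU index. A minimal sketch of that standard arithmetic (the helper names are illustrative, not SAM 3 API):

```python
def world_size(num_nodes: int, gpus_per_node: int) -> int:
    """Total number of data-parallel processes across the job."""
    return num_nodes * gpus_per_node

def global_rank(node_rank: int, local_rank: int, gpus_per_node: int) -> int:
    """Standard mapping used by torch.distributed-style launchers."""
    return node_rank * gpus_per_node + local_rank
```

With the config above, `world_size(4, 8)` is 32, and the last GPU on the last node gets rank 31.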
### Configure Distributed Backend

Set up NCCL for multi-node communication:

```yaml
trainer:
  distributed:
    backend: nccl
    find_unused_parameters: True
    gradient_as_bucket_view: True
    timeout_mins: 30
```
### Submit Job

Submit to SLURM:

```bash
python -m sam3.train.train \
  -c configs/multi_node.yaml \
  --use-cluster 1
```

The script will:

1. Create the submitit logs directory
2. Submit the SLURM job
3. Print the job ID
4. Exit (training runs on the cluster)
### Monitor Job

Check job status:

```bash
# View job queue
squeue -u $USER

# Check job details
scontrol show job <job_id>

# View output logs
tail -f experiments/my_training/submitit_logs/<job_id>_0_log.out
```
## Job Arrays

Run multiple training jobs in parallel (e.g., hyperparameter sweeps):

```yaml
submitit:
  use_cluster: True
  # Job array configuration
  job_array:
    num_tasks: 10  # Run 10 parallel jobs
    task_index: 0  # Current task (set automatically)

launcher:
  num_nodes: 1
  gpus_per_node: 2

# Use task_index in config
scratch:
  lr_scale: ${get_lr_from_index:${submitit.job_array.task_index}}
```

Submit:

```bash
python -m sam3.train.train \
  -c configs/job_array.yaml \
  --use-cluster 1
```

This creates 10 independent jobs, each with a different configuration based on its task_index.
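The `${get_lr_from_index:...}` resolver is assumed to be provided by the config system; it is not defined in this guide. A minimal sketch of what such a resolver function might compute, assuming a geometric schedule over the array index (the base value and decay factor here are made up for illustration):

```python
def get_lr_from_index(task_index: int,
                      base_lr: float = 1e-4,
                      factor: float = 0.5) -> float:
    """Map a job-array task index to a learning-rate scale.

    In a real config this would be registered as an OmegaConf-style
    resolver so that ${get_lr_from_index:...} can call it.
    """
    return base_lr * factor ** task_index
```

Task 0 would then train with the base rate and each subsequent task with half the previous one.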
## Resource Management

### GPU Types

Request specific GPU architectures:

```yaml
submitit:
  constraint: a100  # A100 GPUs only
  # Or: volta, ampere, hopper, etc.
```

### Memory Allocation

Specify memory requirements:

```yaml
submitit:
  # Option 1: Total memory in GB
  mem_gb: 256
  # Option 2: Memory per node
  # mem: 256G
```

### CPU Allocation

```yaml
submitit:
  cpus_per_task: 10  # CPUs per GPU task
```

Rule of thumb: 8-12 CPUs per GPU for data loading.

### Time Limits

```yaml
submitit:
  timeout_hour: 72  # Job will be killed after 72 hours
```
Set the timeout within your partition's limits; jobs requesting more time than the partition allows are rejected immediately.
## Checkpointing and Preemption

SAM 3 handles preemption automatically.

### Automatic Resume

When a job is preempted or crashes:

1. The checkpoint is saved to `experiment_log_dir/checkpoints/checkpoint.pt`
2. Resubmit the same command
3. Training resumes from the last checkpoint

```bash
# Resubmit after preemption
python -m sam3.train.train \
  -c configs/my_config.yaml \
  --use-cluster 1
```
### Checkpoint Frequency

```yaml
trainer:
  checkpoint:
    save_dir: ${launcher.experiment_log_dir}/checkpoints
    save_freq: 0  # Save only latest (recommended for preemption)
    # Or save every N epochs:
    # save_freq: 5
    # Save specific epochs
    save_list: [10, 20, 30]
```
### Initialize After Preemption

For large models with partial checkpoint saving:

```yaml
trainer:
  checkpoint:
    skip_saving_parameters:
      - 'backbone.vision_backbone.*'
    initialize_after_preemption: True
    model_weight_initializer:
      _target_: my_initializer
      path: /path/to/pretrained.pt
```
## Monitoring

### SLURM Logs

Submitit creates logs in `experiment_log_dir/submitit_logs/`:

```
submitit_logs/
├── <job_id>_0_log.out  # stdout from rank 0
├── <job_id>_0_log.err  # stderr from rank 0
├── <job_id>_1_log.out  # stdout from rank 1
└── ...
```

Monitor logs:

```bash
# Watch primary worker
tail -f experiments/my_training/submitit_logs/<job_id>_0_log.out

# Check for errors
grep -i error experiments/my_training/submitit_logs/*.err
```
### SLURM Commands

```bash
# View queue
squeue -u $USER

# Job details
scontrol show job <job_id>

# Cancel job
scancel <job_id>

# Job efficiency (after completion)
seff <job_id>

# Accounting info
sacct -j <job_id> --format=JobID,JobName,State,Elapsed,MaxRSS,MaxVMSize
```
### TensorBoard on Cluster

Access TensorBoard from the cluster:

```bash
# On cluster login node
tensorboard --logdir /path/to/experiments/tensorboard --port 6006

# On your local machine (SSH tunnel)
ssh -L 6006:localhost:6006 <username>@<cluster-hostname>

# Access at http://localhost:6006
```
## Example Configurations

### Single-Node 8-GPU Job

```yaml
paths:
  experiment_log_dir: /cluster/work/my_experiment

launcher:
  num_nodes: 1
  gpus_per_node: 8

submitit:
  use_cluster: True
  account: my_account
  partition: gpu
  timeout_hour: 24
  cpus_per_task: 10
  mem_gb: 128

trainer:
  max_epochs: 20

scratch:
  train_batch_size: 2  # Per GPU; effective batch = 16
```

Submit:

```bash
python -m sam3.train.train -c configs/single_node.yaml --use-cluster 1
```
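The effective batch size noted in the config comment is the per-GPU batch multiplied by the total GPU count. A quick sanity check of that arithmetic (the helper name is illustrative):

```python
def effective_batch_size(per_gpu_batch: int,
                         num_nodes: int,
                         gpus_per_node: int) -> int:
    """Global batch size across all data-parallel workers."""
    return per_gpu_batch * num_nodes * gpus_per_node
```

The single-node example gives 2 × 1 × 8 = 16; scaling up to 8 nodes with a per-GPU batch of 4 gives 256.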
### Multi-Node Large-Scale Training

```yaml
paths:
  experiment_log_dir: /cluster/work/large_scale

launcher:
  num_nodes: 8
  gpus_per_node: 8

submitit:
  use_cluster: True
  account: research_team
  partition: gpu_a100
  qos: high
  timeout_hour: 72
  cpus_per_task: 12
  mem_gb: 256
  constraint: a100

trainer:
  max_epochs: 50
  distributed:
    backend: nccl
    comms_dtype: bfloat16  # Compress gradients

scratch:
  train_batch_size: 4  # Effective batch = 4 × 64 GPUs = 256
  num_train_workers: 12
```

Submit:

```bash
python -m sam3.train.train \
  -c configs/large_scale.yaml \
  --use-cluster 1 \
  --partition gpu_a100
```
### Hyperparameter Sweep with Job Arrays

```yaml
submitit:
  use_cluster: True
  account: my_account
  partition: gpu
  timeout_hour: 12
  job_array:
    num_tasks: 5  # 5 different learning rates

launcher:
  num_nodes: 1
  gpus_per_node: 2

# Define sweep values
lr_values:
  - 0.0001
  - 0.00005
  - 0.00001
  - 0.000005
  - 0.000001

scratch:
  lr_transformer: ${lr_values.${submitit.job_array.task_index}}
```
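Inside each array task, the task index typically comes from the standard `SLURM_ARRAY_TASK_ID` environment variable that SLURM exports. A minimal sketch of selecting a sweep value that way (how SAM 3 actually wires this up internally is an assumption):

```python
import os

def select_lr(values, default_index=0):
    """Pick this task's sweep value using SLURM's array-task index."""
    # SLURM exports SLURM_ARRAY_TASK_ID inside every job-array task;
    # fall back to default_index when running outside an array.
    idx = int(os.environ.get("SLURM_ARRAY_TASK_ID", default_index))
    return values[idx]
```

Task 2 of the sweep above would thus train with a learning rate of 0.00001.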
## Troubleshooting

### Job Fails to Start

**Check**: SLURM limits and permissions

```bash
# View account limits
sacctmgr show assoc where user=$USER format=account,partition,qos,maxjobs,maxnodes

# Check partition availability
sinfo -p gpu
```

### Out of Memory on Cluster

**Increase memory request**:

```yaml
submitit:
  mem_gb: 256  # Increase as needed
```

### NCCL Timeout Errors

**Increase timeout**:

```yaml
trainer:
  distributed:
    timeout_mins: 60  # Default is 30
```

**Check network**:

```bash
# Test node-to-node communication
export NCCL_DEBUG=INFO
```

### Job Preempted Repeatedly

**Use higher QoS or longer timeout**:

```yaml
submitit:
  qos: high_priority
  timeout_hour: 72
```

### Slow Startup

**Cause**: Loading the model on all GPUs

**Solution**: Load on CPU first (default):

```yaml
trainer:
  model:
    device: cpus  # Load to CPU, then move to GPU
```
On shared filesystems, many workers reading the same checkpoint simultaneously can cause slowdowns. Consider copying checkpoints to local node storage.
## Best Practices

### Checkpoint Strategy

- Use a shared filesystem for the checkpoint directory
- Save frequently on preemptible partitions
- Test resume before long jobs:

```bash
# Run 1 epoch, then resume
python -m sam3.train.train -c config.yaml --use-cluster 1

# After completion, resubmit to test resume
python -m sam3.train.train -c config.yaml --use-cluster 1
```

### Resource Requests

- **Start small**: Test on 1 node before scaling
- **Profile memory**: Check nvidia-smi during training
- **Use seff**: Analyze resource usage after jobs complete

### Data Loading

- **Shared filesystem**: Ensure the dataset is on shared storage
- **Multiple workers**: Use 8-12 workers per GPU
- **Pin memory**: Enable for faster transfers:

```yaml
data:
  train:
    pin_memory: True
    num_workers: 10
```
### Job Naming

```yaml
submitit:
  name: sam3_roboflow_v100_run1  # Helpful for queue viewing
```
## Advanced Topics

### Custom SLURM Parameters

Pass additional SLURM options:

```yaml
submitit:
  # srun arguments
  srun_args:
    cpu_bind: cores  # CPU binding strategy
  # Comment for accounting
  comment: "SAM3 training experiment 42"
```
### Node Selection

```yaml
submitit:
  # Exclude problematic nodes
  exclude_nodes:
    - node042
    - node043
  # Or explicitly include nodes
  include_nodes:
    - node010
    - node011
    - node012
    - node013
```
### Environment Variables

Set environment variables for the job:

```yaml
trainer:
  env_variables:
    NCCL_DEBUG: INFO
    NCCL_IB_DISABLE: 1
    PYTORCH_CUDA_ALLOC_CONF: max_split_size_mb:512
```
## Next Steps

- **Configuration**: Deep dive into configuration options
- **Evaluation**: Evaluate trained models