Configuration Structure
Training configs are located in `sam3/train/configs/` and are organized hierarchically.
Basic Configuration Example
A minimal training configuration combines only a few of the sections described below.

Configuration Sections
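As an overview, a minimal file might combine the sections described below roughly like this. This is a sketch only: the key names are hypothetical and should be checked against the actual files in `sam3/train/configs/`; the `scratch` parameter names are taken from the Scratch Parameters section of this page.

```yaml
# Hypothetical layout combining the sections described on this page.
# Keys are illustrative; check the real schema in sam3/train/configs/.
paths:
  dataset_root: /data/my_dataset       # where the training data lives (hypothetical key)
  experiment_dir: /checkpoints/exp1    # where outputs are written (hypothetical key)

launcher:
  num_nodes: 1                         # single-machine training (hypothetical key)
  gpus_per_node: 8                     # hypothetical key

scratch:
  # Names below come from the Scratch Parameters section of this page.
  train_batch_size: 1
  max_epochs: 20
  lr_scale: 0.1
```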
Paths
Define dataset and output paths.

Launcher

Configure distributed training resources.

Submitit (SLURM)

SLURM cluster configuration.

Trainer

Main training parameters.

Model

Model architecture configuration.

Data

Dataset and dataloader configuration.

Transforms

Data augmentation pipeline.

Optimizer

Optimizer and scheduler configuration.

Loss Functions

Loss configuration for detection and segmentation.

Checkpoint

Checkpoint saving configuration.

Logging

Logging and monitoring.

Distributed Training

Distributed training settings.

CUDA Settings

CUDA optimization options.

Scratch Parameters
Common training hyperparameters are collected in the `scratch` section.
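Using the parameter names that appear in the Configuration Tips below, a `scratch` section might look like this. The values shown are illustrative, not the repo's actual defaults.

```yaml
scratch:
  resolution: 1024                  # input image resolution (illustrative value)
  train_batch_size: 1               # per-GPU batch size
  num_train_workers: 8              # dataloader worker processes
  lr_scale: 0.1                     # learning-rate scaling factor
  max_epochs: 20                    # total training epochs
  val_epoch_freq: 10                # run validation every N epochs
  gradient_accumulation_steps: 1    # accumulate gradients over N steps
  enable_segmentation: True         # train the segmentation head
```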
Configuration Tips
For Small Datasets
- Reduce learning rate: `lr_scale: 0.01`
- More epochs: `max_epochs: 50`
- Frequent validation: `val_epoch_freq: 5`
For Large Datasets
- Standard learning rate: `lr_scale: 0.1`
- Fewer epochs: `max_epochs: 20`
- Less frequent validation: `val_epoch_freq: 10`
For Memory Constraints
- Smaller resolution: `resolution: 512`
- Gradient accumulation: `gradient_accumulation_steps: 4`
- Reduce workers: `num_train_workers: 2`
For Speed
- Disable segmentation: `enable_segmentation: False`
- Larger batch size: `train_batch_size: 2`
- More workers: `num_train_workers: 16`
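The tips above all edit keys in the `scratch` section. For example, a memory-constrained run would apply the memory tips together, and a speed-oriented run the speed tips; the values below come from the lists above, while the surrounding structure is a hedged sketch.

```yaml
# Memory-constrained run (values from the tips above)
scratch:
  resolution: 512                   # smaller input resolution
  gradient_accumulation_steps: 4    # simulate a larger effective batch
  num_train_workers: 2              # fewer dataloader workers

# Speed-oriented run would instead set:
# scratch:
#   enable_segmentation: False
#   train_batch_size: 2
#   num_train_workers: 16
```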
Next Steps
- Local Training: run training with your configuration.
- Cluster Training: scale to SLURM clusters.