Training Capabilities
SAM 3’s training system provides:
- Fine-tuning on custom datasets with COCO-format annotations
- Detection and segmentation training modes
- Distributed training across multiple GPUs and nodes
- Flexible configuration using Hydra YAML configs
- Mixed precision training with AMP support
- SLURM cluster integration for large-scale training
- Checkpoint management with automatic resumption
Use Cases
Domain Adaptation
Fine-tune SAM 3 on domain-specific data to improve performance:
- Medical imaging (X-rays, MRI scans)
- Aerial/satellite imagery
- Industrial defect detection
- Wildlife monitoring
- Custom object categories
Task Specialization
Adapt the model for specific tasks:
- Small object detection
- Dense instance segmentation
- Few-shot learning scenarios
- Custom prompt engineering
Training Modes
Detection Only
Train for bounding box detection without mask outputs.
Detection + Segmentation
Train for both bounding boxes and instance masks.
Data Requirements
Annotation Format
Training data must be in COCO JSON format.
Dataset Structure
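As a rough illustration of the COCO layout, the sketch below builds a minimal annotation dictionary and runs two basic checks on it. The top-level keys (`images`, `categories`, `annotations`) and the `[x, y, width, height]` box convention come from the COCO format itself. The file name, category name, coordinates, and the helper functions `check_coco` and `has_masks` are hypothetical and are not part of SAM 3.

```python
import json

# Minimal COCO-style annotation data: one image, one category, one object.
# File name, category name, and coordinates are made up for illustration.
coco = {
    "images": [
        {"id": 1, "file_name": "0001.jpg", "width": 640, "height": 480},
    ],
    "categories": [
        {"id": 1, "name": "defect"},
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 1,
            "bbox": [100.0, 120.0, 50.0, 40.0],  # [x, y, width, height] in pixels
            "area": 2000.0,
            "iscrowd": 0,
            # Polygon as a flat [x1, y1, x2, y2, ...] list.
            "segmentation": [[100.0, 120.0, 150.0, 120.0, 150.0, 160.0, 100.0, 160.0]],
        },
    ],
}


def check_coco(data):
    """Return a list of human-readable problems found in a COCO-style dict."""
    problems = []
    image_ids = {img["id"] for img in data.get("images", [])}
    category_ids = {cat["id"] for cat in data.get("categories", [])}
    for ann in data.get("annotations", []):
        if ann.get("image_id") not in image_ids:
            problems.append(f"annotation {ann.get('id')}: unknown image_id")
        if ann.get("category_id") not in category_ids:
            problems.append(f"annotation {ann.get('id')}: unknown category_id")
        bbox = ann.get("bbox", [])
        if len(bbox) != 4 or bbox[2] <= 0 or bbox[3] <= 0:
            problems.append(f"annotation {ann.get('id')}: bad bbox")
    return problems


def has_masks(data):
    """True if every annotation carries a non-empty segmentation entry."""
    anns = data.get("annotations", [])
    return bool(anns) and all(ann.get("segmentation") for ann in anns)


# Round-trip through JSON text, as the data would be stored on disk.
loaded = json.loads(json.dumps(coco, indent=2))
print("problems:", check_coco(loaded))
print("usable for mask training:", has_masks(loaded))
```

A check like `has_masks` is one way to decide whether a dataset can support the Detection + Segmentation mode, since box-only annotations carry no mask supervision.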
Training Architecture
Components
The training system consists of:
- Trainer (sam3.train.trainer.Trainer) - Main training loop coordinator
- Model - SAM 3 architecture with vision and language backbones
- Loss Functions - Configurable loss components for detection/segmentation
- Optimizers - AdamW with custom parameter groups
- Schedulers - Learning rate scheduling with warmup/cooldown
- Data Loaders - Multi-worker data loading with transforms
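To make "warmup/cooldown" scheduling concrete, here is a minimal sketch of one common shape: linear warmup followed by cosine decay. This page does not say which curve SAM 3 uses, so the function `lr_at_step`, its parameter names, and all default values are assumptions chosen for illustration only.

```python
import math


def lr_at_step(step, total_steps, base_lr=1e-4, warmup_steps=100, min_lr=1e-6):
    """Learning rate at `step`: linear warmup, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Ramp linearly from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine


# Print the rate at a few points across a hypothetical 1000-step run.
for step in (0, 50, 99, 100, 550, 1000):
    print(step, f"{lr_at_step(step, total_steps=1000):.2e}")
```

The warmup phase keeps early updates small while optimizer statistics are still noisy, and the decay phase lowers the rate as training approaches the end of its step budget.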
Training Flow
- Initialization - Load the model, set up the distributed backend
- Data Loading - Apply transforms, create batches
- Forward Pass - Process images through model
- Loss Calculation - Compute detection/segmentation losses
- Backward Pass - Gradient computation with AMP
- Optimization - Update model parameters
- Validation - Periodic evaluation on validation set
- Checkpointing - Save model state for resumption
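The order of the steps above can be sketched with a toy loop in plain Python. This is not the SAM 3 trainer: the "model" is a single weight fitted by SGD, and the function `train`, the JSON checkpoint file, and all hyperparameters are hypothetical. Distributed setup and AMP are omitted. The sketch only shows how resumption, the per-step passes, validation, and checkpointing fit together.

```python
import json
import os
import random
import tempfile


def train(ckpt_path, data, val_data, epochs=5, lr=0.05):
    """Fit y = w * x by SGD, validating and checkpointing after every epoch."""
    # Initialization: resume from a checkpoint if one exists.
    state = {"epoch": 0, "w": 0.0}
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)

    for epoch in range(state["epoch"], epochs):
        running_loss = 0.0
        for x, y in data:                      # data loading (batch size 1)
            pred = state["w"] * x              # forward pass
            running_loss += (pred - y) ** 2    # loss calculation (squared error)
            grad = 2.0 * (pred - y) * x        # backward pass: d(loss)/dw
            state["w"] -= lr * grad            # optimization step
        # Validation: mean squared error on held-out pairs.
        val_loss = sum((state["w"] * x - y) ** 2 for x, y in val_data) / len(val_data)
        # Checkpointing: store the next epoch to run; write, then rename.
        state["epoch"] = epoch + 1
        tmp_path = ckpt_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, ckpt_path)
        print(f"epoch {epoch + 1}: train_loss={running_loss / len(data):.6f} "
              f"val_loss={val_loss:.6f}")
    return state


# Synthetic data for y = 3x, split into training and validation pairs.
rng = random.Random(0)
points = [(x, 3.0 * x) for x in (rng.uniform(-1, 1) for _ in range(40))]
train_set, val_set = points[:32], points[32:]

with tempfile.TemporaryDirectory() as tmpdir:
    ckpt = os.path.join(tmpdir, "ckpt.json")
    first = train(ckpt, train_set, val_set, epochs=2)   # stops after 2 epochs
    final = train(ckpt, train_set, val_set, epochs=5)   # resumes at epoch 3
```

Writing the checkpoint to a temporary file and renaming it means an interruption during the save leaves the previous checkpoint intact, which is what makes automatic resumption safe.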
Performance Considerations
Memory Requirements
- Minimum: 1 GPU with 16GB VRAM for small datasets
- Recommended: Multiple GPUs with 24GB+ VRAM
- Resolution: Default 1008x1008 (configurable)
Training Speed
Typical training speeds:
- Single GPU: ~100-200 images/hour
- 8 GPUs: ~800-1600 images/hour
- Depends on image resolution, batch size, and hardware
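For planning, the throughput ranges above translate into a rough time per epoch. The sketch below does that arithmetic for a hypothetical dataset of 10,000 images. The dataset size and the helper `epoch_hours` are assumptions, and real throughput will vary with the factors listed above.

```python
def epoch_hours(num_images, images_per_hour_low, images_per_hour_high):
    """Return (best_case, worst_case) hours for one pass over num_images."""
    return num_images / images_per_hour_high, num_images / images_per_hour_low


dataset_size = 10_000  # hypothetical dataset size

# Throughput ranges taken from the list above.
for label, low, high in [("1 GPU", 100, 200), ("8 GPUs", 800, 1600)]:
    best, worst = epoch_hours(dataset_size, low, high)
    print(f"{label}: {best:.2f} to {worst:.2f} hours per epoch")
```

With these numbers, one epoch takes 50 to 100 hours on a single GPU and 6.25 to 12.5 hours on 8 GPUs, which is why multi-GPU training is the recommended setup for anything beyond small datasets.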
Next Steps
Setup Environment
Install dependencies and prepare your environment
Configuration
Learn about training configuration options
Local Training
Run training on local GPUs
Cluster Training
Scale to SLURM clusters