SAM 3 supports fine-tuning on custom datasets for both detection and segmentation tasks. The training framework is built on PyTorch with distributed training support via DDP (DistributedDataParallel).

Training Capabilities

SAM 3’s training system provides:
  • Fine-tuning on custom datasets with COCO-format annotations
  • Detection and segmentation training modes
  • Distributed training across multiple GPUs and nodes
  • Flexible configuration using Hydra YAML configs
  • Mixed precision training with AMP support
  • SLURM cluster integration for large-scale training
  • Checkpoint management with automatic resumption

Use Cases

Domain Adaptation

Fine-tune SAM 3 on domain-specific data to improve performance:
  • Medical imaging (X-rays, MRI scans)
  • Aerial/satellite imagery
  • Industrial defect detection
  • Wildlife monitoring
  • Custom object categories

Task Specialization

Adapt the model for specific tasks:
  • Small object detection
  • Dense instance segmentation
  • Few-shot learning scenarios
  • Custom prompt engineering

Training Modes

Detection Only

Train for bounding box detection without mask outputs:
scratch:
  enable_segmentation: False
This mode trains faster and uses less memory, making it suitable for detection-focused applications.

Detection + Segmentation

Train for both bounding boxes and instance masks:
scratch:
  enable_segmentation: True
Enables full instance segmentation capabilities with pixel-level masks.
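Since SAM 3 uses Hydra configs, the flag can also be set from the command line with standard Hydra override syntax. The fragment below is illustrative; only scratch.enable_segmentation is documented here, and the surrounding file layout is an assumption:

```yaml
# Illustrative Hydra config fragment. Only scratch.enable_segmentation is
# documented on this page; the equivalent command-line override would be
# scratch.enable_segmentation=False (standard Hydra dotted-key syntax).
scratch:
  enable_segmentation: True   # set to False for detection-only training
```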

Data Requirements

Annotation Format

Training data must be in COCO JSON format:
{
  "images": [
    {
      "id": 1,
      "file_name": "image_001.jpg",
      "width": 1920,
      "height": 1080
    }
  ],
  "annotations": [
    {
      "id": 1,
      "image_id": 1,
      "category_id": 1,
      "bbox": [x, y, width, height],
      "segmentation": [...],  // Required only for segmentation training
      "area": 12345,
      "iscrowd": 0
    }
  ],
  "categories": [
    {"id": 1, "name": "object_class"}
  ]
}
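A quick structural sanity check of an annotation file before training can catch format errors early. The sketch below validates the three required top-level keys and the cross-references between them using only the standard library (the helper function is our own, not part of SAM 3):

```python
import json

def check_coco(path):
    """Minimal structural check for a COCO-format annotation file."""
    with open(path) as f:
        coco = json.load(f)
    # The three top-level keys shown above must all be present.
    for key in ("images", "annotations", "categories"):
        assert key in coco, f"missing top-level key: {key}"
    image_ids = {img["id"] for img in coco["images"]}
    category_ids = {cat["id"] for cat in coco["categories"]}
    for ann in coco["annotations"]:
        # Every annotation must point at a real image and category.
        assert ann["image_id"] in image_ids, f"annotation {ann['id']}: unknown image_id"
        assert ann["category_id"] in category_ids, f"annotation {ann['id']}: unknown category_id"
        # bbox is [x, y, width, height]; width/height must be positive.
        x, y, w, h = ann["bbox"]
        assert w > 0 and h > 0, f"annotation {ann['id']}: degenerate bbox"
    return len(coco["images"]), len(coco["annotations"])
```

Running this over each split's `_annotations.coco.json` before a long training run is cheap insurance.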

Dataset Structure

dataset/
├── train/
│   ├── images/
│   │   ├── image_001.jpg
│   │   └── ...
│   └── _annotations.coco.json
└── test/
    ├── images/
    └── _annotations.coco.json
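Before launching a run, it is worth verifying that this layout actually exists on disk. A small stdlib-only helper (hypothetical, not a SAM 3 utility) that reports anything missing:

```python
from pathlib import Path

def check_dataset_layout(root):
    """Verify the train/test split layout described above; return missing paths."""
    root = Path(root)
    missing = []
    for split in ("train", "test"):
        if not (root / split / "images").is_dir():
            missing.append(f"{split}/images/")
        if not (root / split / "_annotations.coco.json").is_file():
            missing.append(f"{split}/_annotations.coco.json")
    return missing  # an empty list means the layout is complete
```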

Training Architecture

Components

The training system consists of:
  • Trainer (sam3.train.trainer.Trainer) - Main training loop coordinator
  • Model - SAM 3 architecture with vision and language backbones
  • Loss Functions - Configurable loss components for detection/segmentation
  • Optimizers - AdamW with custom parameter groups
  • Schedulers - Learning rate scheduling with warmup/cooldown
  • Data Loaders - Multi-worker data loading with transforms

Training Flow

  1. Initialization - Load model, setup distributed backend
  2. Data Loading - Apply transforms, create batches
  3. Forward Pass - Process images through model
  4. Loss Calculation - Compute detection/segmentation losses
  5. Backward Pass - Gradient computation with AMP
  6. Optimization - Update model parameters
  7. Validation - Periodic evaluation on validation set
  8. Checkpointing - Save model state for resumption
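The eight steps above can be sketched as a skeleton loop. This is a schematic with toy stand-ins, not the actual sam3.train.trainer.Trainer; in a real run the arithmetic stubs would be the model forward pass, AMP-scaled backward pass, and optimizer step:

```python
def train(num_epochs, batches, val_every=1, ckpt=None):
    """Schematic of the training flow; each stub stands in for a real step."""
    state = ckpt or {"epoch": 0, "params": 0.0}            # 1. init, or resume from checkpoint
    for epoch in range(state["epoch"], num_epochs):
        for batch in batches:                               # 2. data loading
            preds = [state["params"] + x for x in batch]    # 3. forward pass (stub)
            loss = sum((p - x) ** 2                         # 4. loss calculation
                       for p, x in zip(preds, batch)) / len(batch)
            grad = 2 * state["params"]                      # 5. backward pass (here: analytic)
            state["params"] -= 0.1 * grad                   # 6. optimizer update
        if (epoch + 1) % val_every == 0:
            pass                                            # 7. validation hook
        state["epoch"] = epoch + 1                          # 8. checkpoint state for resumption
    return state
```

Passing the returned state back in as `ckpt` resumes from the saved epoch, mirroring the automatic-resumption behavior described above.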

Performance Considerations

Memory Requirements

  • Minimum: 1 GPU with 16GB VRAM for small datasets
  • Recommended: Multiple GPUs with 24GB+ VRAM
  • Resolution: Default 1008x1008 (configurable)

Training Speed

Typical training speeds:
  • Single GPU: ~100-200 images/hour
  • 8 GPUs: ~800-1600 images/hour
  • Depends on image resolution, batch size, and hardware
Training SAM 3 requires significant computational resources. For best results, use multiple GPUs or a SLURM cluster. Single GPU training is possible but slow.
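As a rough planning aid, the throughput figures above can be turned into a wall-clock estimate. The helper below is hypothetical (not a SAM 3 utility), and the numbers it takes are whatever you measure on your own hardware:

```python
def estimated_hours(num_images: int, epochs: int, images_per_hour: float) -> float:
    """Rough wall-clock estimate from a measured training throughput."""
    return num_images * epochs / images_per_hour

# e.g. 5,000 images for 10 epochs at ~1,200 images/hour on 8 GPUs
print(round(estimated_hours(5000, 10, 1200), 1))  # → 41.7
```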

Next Steps

Setup Environment

Install dependencies and prepare your environment

Configuration

Learn about training configuration options

Local Training

Run training on local GPUs

Cluster Training

Scale to SLURM clusters
