Overview
SAM 3 is a unified foundation model for promptable segmentation in images and videos. With 848M parameters, it introduces a novel decoupled detector-tracker design that minimizes task interference and scales efficiently with diverse data.
Core Architecture
SAM 3 consists of three main components:
- Vision Encoder - Shared backbone for feature extraction
- Detector - DETR-based detection and segmentation
- Tracker - SAM 2-style memory-based tracking
Decoupled Design
Unlike previous unified models, SAM 3 uses a decoupled detector-tracker architecture:
- The detector handles text-prompted and exemplar-based segmentation
- The tracker manages temporal consistency and interactive refinement
- Both components share the same vision encoder to reduce parameters and computation
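The decoupled design can be sketched in a few lines. This is an illustrative toy, not SAM 3's actual code: the class names (`SharedEncoder`, `Detector`, `Tracker`) and all dimensions are hypothetical, but the data flow mirrors the description above: one backbone pass feeds both heads.

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Stand-in for the shared vision backbone (hypothetical)."""
    def __init__(self, dim=64):
        super().__init__()
        self.conv = nn.Conv2d(3, dim, kernel_size=4, stride=4)

    def forward(self, x):
        return self.conv(x)  # one backbone pass, reused by both heads

class Detector(nn.Module):
    """Stand-in for the DETR-style detection head (hypothetical)."""
    def __init__(self, dim=64, num_queries=10):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.box_head = nn.Linear(dim, 4)

    def forward(self, feats):
        pooled = feats.flatten(2).mean(-1)             # (B, dim)
        q = self.queries.unsqueeze(0) + pooled[:, None]
        return self.box_head(q)                        # (B, Q, 4) boxes

class Tracker(nn.Module):
    """Stand-in for the SAM 2-style tracking head (hypothetical)."""
    def __init__(self, dim=64):
        super().__init__()
        self.mask_head = nn.Conv2d(dim, 1, kernel_size=1)

    def forward(self, feats):
        return self.mask_head(feats)                   # (B, 1, H', W') mask logits

encoder, detector, tracker = SharedEncoder(), Detector(), Tracker()
img = torch.randn(1, 3, 64, 64)
feats = encoder(img)      # shared features, computed once
boxes = detector(feats)   # detection path
masks = tracker(feats)    # tracking path
```

Because the encoder runs once per image, the two heads can be trained and refined independently without duplicating backbone parameters or compute.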
Vision Encoder
The vision encoder is based on a dual ViT-Det architecture that produces multi-scale features (see source/sam3/model/vl_combiner.py).
Vision Processing Pipeline
The vision encoder produces two sets of features:
- SAM 3 features for the detector with text-visual fusion
- SAM 2 features for the tracker for interactive refinement
Detector Architecture
The detector is a DETR-based transformer model that predicts object masks and bounding boxes (see source/sam3/model/sam3_image.py).
Detector Components
Transformer Encoder
Fuses image features with text and geometric prompts through cross-attention
Transformer Decoder
Uses learned object queries to predict instance-level outputs
Segmentation Head
Generates high-resolution mask predictions from decoder outputs
Presence Token
Novel token that improves discrimination between similar prompts
Transformer Architecture
The transformer follows a standard encoder-decoder design:

Encoder (source/sam3/model/encoder.py:380-461):
- Processes multi-level image features
- Performs cross-attention with text prompts
- Adds pooled text features to image features for fusion
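The pooled-text fusion step can be illustrated with plain tensors. This is a simplification under assumed dimensions, not the real encoder: pooled text features are broadcast-added to every image token before cross-attention.

```python
import torch

# Illustrative dimensions (assumptions, not SAM 3's actual sizes)
num_img_tokens, num_text_tokens, dim = 196, 12, 32
image_tokens = torch.randn(1, num_img_tokens, dim)
text_tokens = torch.randn(1, num_text_tokens, dim)

# Pool the text prompt into a single vector, then broadcast-add it to
# every image token so each spatial location carries prompt context.
pooled_text = text_tokens.mean(dim=1, keepdim=True)  # (1, 1, dim)
fused = image_tokens + pooled_text                   # (1, 196, dim)
```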
Decoder (source/sam3/model/decoder.py:407-608):
- Uses learned object queries
- Applies self-attention and cross-attention with image features
- Predicts bounding boxes and classification scores
- Supports iterative box refinement
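The query-based decoding loop can be sketched with PyTorch's built-in transformer modules. Dimensions, layer counts, and head counts below are illustrative, not SAM 3's configuration; iterative box refinement is noted in a comment rather than implemented.

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions)
dim, num_queries, num_layers = 32, 8, 2

layer = nn.TransformerDecoderLayer(d_model=dim, nhead=4, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
box_head = nn.Linear(dim, 4)

queries = torch.randn(1, num_queries, dim)  # learned object queries
image_tokens = torch.randn(1, 196, dim)     # flattened encoder features

# Each decoder layer applies self-attention over the queries and
# cross-attention against the image tokens.
hs = decoder(queries, image_tokens)         # (1, 8, 32)
boxes = box_head(hs).sigmoid()              # normalized (cx, cy, w, h)
# With iterative refinement, each layer would instead predict a delta
# applied to the previous layer's box estimate.
```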
Presence Token Innovation
A key architectural innovation is the presence token, a learned token that helps discriminate between closely related prompts (see source/sam3/model/decoder.py). The presence token:
- Participates in self-attention with object queries
- Predicts whether objects matching the prompt exist in the image
- Improves performance on negative prompts (when no matching object exists)
- Helps distinguish between similar but different concepts
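The presence-token idea can be sketched as follows. The names and the gating rule here are assumptions, not SAM 3's exact formulation: a single learned token is processed jointly with the object queries, and its image-level "does anything match the prompt?" score modulates the per-query scores.

```python
import torch
import torch.nn as nn

dim, num_queries = 32, 8  # illustrative sizes
queries = torch.randn(1, num_queries, dim)
presence_token = nn.Parameter(torch.randn(1, 1, dim))

# In the real model, transformer self-attention layers would run over this
# concatenated sequence; here we use the raw tokens directly.
tokens = torch.cat([presence_token, queries], dim=1)

presence_head = nn.Linear(dim, 1)  # image-level presence logit
score_head = nn.Linear(dim, 1)     # per-query match logits

presence_prob = presence_head(tokens[:, :1]).sigmoid()  # (1, 1, 1)
query_probs = score_head(tokens[:, 1:]).sigmoid()       # (1, 8, 1)

# A query only scores highly if the prompt is present in the image at all,
# which suppresses false positives on negative prompts.
final_scores = query_probs * presence_prob
```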
Tracker Architecture
The tracker inherits the SAM 2 architecture for video segmentation and interactive refinement (see source/sam3/model/sam3_tracker_base.py).
Memory-Based Tracking
The tracker uses a memory attention mechanism to maintain temporal consistency.

Tracker Components
| Component | Purpose | Location |
|---|---|---|
| SAM Prompt Encoder | Encodes point, box, and mask prompts | source/sam3/sam/prompt_encoder.py |
| SAM Mask Decoder | Two-way transformer for mask prediction | source/sam3/sam/mask_decoder.py |
| Memory Encoder | Encodes previous masks into memory | source/sam3/model/memory.py |
| Memory Attention | Cross-attention to past frames | source/sam3/model/sam3_tracker_base.py:559-794 |
The tracker maintains up to num_maskmem=7 memory frames by default:
- 1 conditioning frame (initial prompt)
- 6 previous frames for temporal context
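This memory layout can be modeled with a bounded queue. A minimal sketch, assuming num_maskmem=7 as stated above; the function names are illustrative, not the tracker's API. The conditioning (prompted) frame is kept permanently while the most recent frames rotate through a fixed-size buffer.

```python
from collections import deque

NUM_MASKMEM = 7          # default memory size from the docs above
cond_memory = None       # conditioning frame, kept for the whole video
recent = deque(maxlen=NUM_MASKMEM - 1)  # rolling window of recent frames

def add_frame(frame_feats, is_conditioning=False):
    """Add a frame's encoded memory; old recent frames are evicted (hypothetical helper)."""
    global cond_memory
    if is_conditioning:
        cond_memory = frame_feats
    else:
        recent.append(frame_feats)

def memory_bank():
    """Frames the tracker would cross-attend to (hypothetical helper)."""
    return ([cond_memory] if cond_memory is not None else []) + list(recent)

add_frame("f0", is_conditioning=True)
for t in range(1, 10):
    add_frame(f"f{t}")

# The bank keeps the prompt frame plus the 6 most recent frames:
# ['f0', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9']
```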
Shared Vision Encoder
Both detector and tracker share the same vision encoder to maximize efficiency (see source/sam3/model/sam3_tracker_base.py):
- Reduces the total parameter count
- Enables faster inference
- Ensures consistency between detection and tracking features
Model Size and Performance
How many parameters does SAM 3 have?
SAM 3 has 848M parameters in total. The vision encoder is shared between the detector and tracker, which significantly reduces the parameter count compared to having separate encoders.
What is the computational cost?
The main computational cost comes from:
- Vision encoder: Processes images at multiple scales
- Detector transformer: 6 encoder layers + 6 decoder layers
- Tracker transformer: Encoder-only architecture with memory attention
Can I use torch.compile with SAM 3?
Yes! SAM 3 supports torch.compile for performance optimization:
- The vision backbone can be compiled separately
- The decoder supports compilation after warmup
- Enable it with compile=True in the model builder
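The mechanism can be demonstrated on a stand-in module. This sketch compiles a small backbone directly with torch.compile rather than using SAM 3's compile=True builder flag; the eager backend is chosen here only so the example runs without a full compiler toolchain.

```python
import torch

# A stand-in backbone (illustrative, not SAM 3's vision encoder)
backbone = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1),
    torch.nn.ReLU(),
)

# backend="eager" traces the graph without codegen; drop the argument to
# use the default inductor backend. The first call acts as warmup.
compiled = torch.compile(backbone, backend="eager")
out = compiled(torch.randn(1, 3, 32, 32))
```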
Key Innovations
Decoupled Architecture
Separate detector and tracker modules prevent task interference while still sharing the vision encoder
Presence Token
Novel token improves discrimination between similar prompts and handles negative cases
Memory Attention
Efficient temporal modeling through object pointers and spatial memory
Multi-Scale Features
Hierarchical feature pyramid enables detection at different scales
Next Steps
Prompting
Learn about different prompting types
Image Segmentation
Understand image segmentation workflow
Video Segmentation
Explore video tracking and segmentation
Quick Start
Get started with SAM 3