
Overview

SAM 3 is a unified foundation model for promptable segmentation in images and videos. With 848M parameters, it introduces a novel decoupled detector-tracker design that minimizes task interference and scales efficiently with diverse data.
(Figure: SAM 3 architecture diagram)

Core Architecture

SAM 3 consists of three main components:
  1. Vision Encoder - Shared backbone for feature extraction
  2. Detector - DETR-based detection and segmentation
  3. Tracker - SAM 2-style memory-based tracking

Decoupled Design

Unlike previous unified models, SAM 3 uses a decoupled detector-tracker architecture:
  • The detector handles text-prompted and exemplar-based segmentation
  • The tracker manages temporal consistency and interactive refinement
  • Both components share the same vision encoder to reduce parameters and computation
This separation prevents task interference and allows each component to specialize in its domain.
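The split can be pictured with a minimal sketch. The class and layer names below are invented for illustration (they are not the real SAM 3 modules): one shared encoder computes features once, and two independent heads consume them.

```python
import torch
from torch import nn

class TinySharedEncoderModel(nn.Module):
    """Illustrative only: one shared encoder feeding two task heads."""

    def __init__(self, dim=32):
        super().__init__()
        self.encoder = nn.Conv2d(3, dim, kernel_size=3, padding=1)  # shared backbone
        self.detector_head = nn.Conv2d(dim, dim, kernel_size=1)     # detection branch
        self.tracker_head = nn.Conv2d(dim, dim, kernel_size=1)      # tracking branch

    def forward(self, img):
        feats = self.encoder(img)  # computed once, consumed by both heads
        return self.detector_head(feats), self.tracker_head(feats)

model = TinySharedEncoderModel()
det_feats, trk_feats = model(torch.randn(1, 3, 16, 16))
```

Because the heads only share the encoder output, gradients and capacity in each branch stay specialized, which is the intuition behind the decoupled design.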

Vision Encoder

The vision encoder is based on a dual ViT-Det architecture that produces multi-scale features.
source/sam3/model/vl_combiner.py
class SAM3VLBackbone(nn.Module):
    """Combines vision and language backbones."""

    def __init__(self, visual: Sam3DualViTDetNeck, text, ...):
        super().__init__()
        self.vision_backbone = visual
        self.language_backbone = text

Vision Processing Pipeline

  1. Image Input - Input images are processed at multiple resolutions
  2. Feature Extraction - The ViT backbone extracts hierarchical features at different scales
  3. Feature Pyramid - Features are organized into a Feature Pyramid Network (FPN) structure
  4. Dual Output - Produces features for both the SAM 3 detector and the SAM 2 tracker
The vision encoder produces two sets of features:
  • SAM 3 features for the detector with text-visual fusion
  • SAM 2 features for the tracker for interactive refinement
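A feature pyramid of this kind can be sketched in a few lines. This is a toy pooling-based pyramid, not the actual SAM 3 FPN, which uses learned lateral connections:

```python
import torch
import torch.nn.functional as F

def toy_feature_pyramid(feat, num_levels=3):
    """Illustrative: build a pyramid by repeatedly halving spatial resolution."""
    levels = [feat]
    for _ in range(num_levels - 1):
        levels.append(F.max_pool2d(levels[-1], kernel_size=2))
    return levels

pyramid = toy_feature_pyramid(torch.randn(1, 8, 32, 32))
# spatial sizes: 32x32, 16x16, 8x8
```

Each level covers the same image at a coarser stride, which is what lets the detector find both small and large objects.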

Detector Architecture

The detector is a DETR-based transformer model that predicts object masks and bounding boxes.
source/sam3/model/sam3_image.py
class Sam3Image(torch.nn.Module):
    def __init__(
        self,
        backbone: SAM3VLBackbone,
        transformer,
        input_geometry_encoder,
        segmentation_head=None,
        ...
    ):
        super().__init__()
        self.backbone = backbone
        self.geometry_encoder = input_geometry_encoder
        self.transformer = transformer
        self.segmentation_head = segmentation_head

Detector Components

  • Transformer Encoder - Fuses image features with text and geometric prompts through cross-attention
  • Transformer Decoder - Uses learned object queries to predict instance-level outputs
  • Segmentation Head - Generates high-resolution mask predictions from decoder outputs
  • Presence Token - A novel token that improves discrimination between similar prompts

Transformer Architecture

The transformer follows a standard encoder-decoder design.
Encoder (source/sam3/model/encoder.py:380-461):
  • Processes multi-level image features
  • Performs cross-attention with text prompts
  • Adds pooled text features to image features for fusion
Decoder (source/sam3/model/decoder.py:407-608):
  • Uses learned object queries
  • Applies self-attention and cross-attention with image features
  • Predicts bounding boxes and classification scores
  • Supports iterative box refinement
The decoder uses box-aware positional encoding (boxRPB) to improve localization accuracy by encoding the relationship between queries and spatial positions.
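The query-based decoding step can be sketched as follows. Dimensions and module names are illustrative only; the real decoder adds boxRPB, iterative refinement, and multiple stacked layers:

```python
import torch
from torch import nn

class ToyQueryDecoder(nn.Module):
    """Illustrative: learned object queries cross-attend to image features,
    then each query predicts one candidate box."""

    def __init__(self, num_queries=10, dim=32):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)  # learned object queries
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.box_head = nn.Linear(dim, 4)              # per-query box (cx, cy, w, h)

    def forward(self, img_feats):  # img_feats: (B, H*W, dim)
        b = img_feats.shape[0]
        q = self.queries.weight.unsqueeze(0).expand(b, -1, -1)
        q, _ = self.cross_attn(q, img_feats, img_feats)  # queries read the image
        return self.box_head(q)                          # (B, num_queries, 4)

decoder = ToyQueryDecoder()
boxes = decoder(torch.randn(2, 64, 32))
```

The key idea is that each query is a free slot that learns, through cross-attention, to bind to at most one object instance.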

Presence Token Innovation

A key architectural innovation is the presence token - a learned token that helps discriminate between closely related prompts.
source/sam3/model/decoder.py
if self.presence_token is not None:
    # Presence token helps distinguish "player in white" vs "player in red"
    presence_out = self.presence_token.weight[None].expand(1, bs, -1)
The presence token:
  • Participates in self-attention with object queries
  • Predicts whether objects matching the prompt exist in the image
  • Improves performance on negative prompts (when no matching object exists)
  • Helps distinguish between similar but different concepts
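One way such a token could be wired in is sketched below. Shapes and module names are assumptions made for the example, not the actual SAM 3 implementation:

```python
import torch
from torch import nn

class ToyPresenceHead(nn.Module):
    """Illustrative: a learned presence token joins the object queries in
    self-attention, then predicts whether any matching object exists."""

    def __init__(self, num_queries=10, dim=32):
        super().__init__()
        self.presence_token = nn.Embedding(1, dim)
        self.self_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.presence_head = nn.Linear(dim, 1)

    def forward(self, queries):  # queries: (B, Q, dim)
        b = queries.shape[0]
        p = self.presence_token.weight.unsqueeze(0).expand(b, -1, -1)
        tokens = torch.cat([p, queries], dim=1)       # token attends with queries
        tokens, _ = self.self_attn(tokens, tokens, tokens)
        return self.presence_head(tokens[:, 0])       # (B, 1) presence logit

head = ToyPresenceHead()
presence_logit = head(torch.randn(2, 10, 32))
```

Because the token sees every object query during self-attention, it can aggregate global evidence and output a single "does anything match this prompt?" score.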

Tracker Architecture

The tracker inherits the SAM 2 architecture for video segmentation and interactive refinement.
source/sam3/model/sam3_tracker_base.py
class Sam3TrackerBase(torch.nn.Module):
    def __init__(
        self,
        backbone,
        transformer,  # Encoder-only transformer
        maskmem_backbone,  # Memory encoder
        num_maskmem=7,  # Number of memory frames
        ...
    ):
        super().__init__()
        self.backbone = backbone
        self.transformer = transformer  # Encoder-only
        self.maskmem_backbone = maskmem_backbone

Memory-Based Tracking

The tracker uses a memory attention mechanism to maintain temporal consistency:
  1. Memory Encoding - Previous frame masks are encoded into spatial memory features
  2. Object Pointers - Output tokens from the mask decoder become object pointers
  3. Temporal Encoding - Object pointers are augmented with temporal positional encoding
  4. Memory Attention - Current frame features attend to past frame memories via cross-attention
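The core memory-attention step reduces to a cross-attention in which the current frame queries the memory bank. The dimensions below are made up for the sketch:

```python
import torch
from torch import nn

# Illustrative: current-frame features (queries) cross-attend to features
# stacked from past frames (keys/values).
dim = 32
mem_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

curr = torch.randn(1, 64, dim)        # current frame: 64 spatial tokens
memory = torch.randn(1, 7 * 64, dim)  # 7 memory frames, flattened together
fused, _ = mem_attn(curr, memory, memory)  # memory-conditioned frame features
```

The output has the same shape as the current-frame features, so it drops straight into the rest of the per-frame pipeline while carrying temporal context.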

Tracker Components

  • SAM Prompt Encoder - Encodes point, box, and mask prompts (source/sam3/sam/prompt_encoder.py)
  • SAM Mask Decoder - Two-way transformer for mask prediction (source/sam3/sam/mask_decoder.py)
  • Memory Encoder - Encodes previous masks into memory (source/sam3/model/memory.py)
  • Memory Attention - Cross-attention to past frames (source/sam3/model/sam3_tracker_base.py:559-794)
The tracker maintains up to num_maskmem=7 memory frames by default:
  • 1 conditioning frame (initial prompt)
  • 6 previous frames for temporal context
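The frame-selection rule can be sketched as a small helper. The function below is hypothetical, written only to make the "1 conditioning + 6 recent" split concrete:

```python
def select_memory_frames(cond_frame, past_frames, num_maskmem=7):
    """Illustrative: keep the conditioning frame plus the most recent
    (num_maskmem - 1) past frames; older frames fall out of the bank."""
    recent = past_frames[-(num_maskmem - 1):]
    return [cond_frame] + recent

bank = select_memory_frames(0, list(range(1, 11)))
# keeps frame 0 plus the six most recent frames 5..10
```

In effect the memory bank behaves like a FIFO queue of recent frames, anchored by the originally prompted frame so the target identity never drifts away from the user's input.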

Shared Vision Encoder

Both detector and tracker share the same vision encoder to maximize efficiency:
source/sam3/model/sam3_tracker_base.py
def forward_image(self, img_batch):
    """Get image features using the SAM3 backbone."""
    backbone_out = self.backbone.forward_image(img_batch)["sam2_backbone_out"]
    # Features are shared between detector and tracker
    return backbone_out
This sharing:
  • Reduces the total parameter count
  • Enables faster inference
  • Ensures consistency between detection and tracking features

Model Size and Performance

SAM 3 has 848M parameters in total. The vision encoder is shared between the detector and tracker, which significantly reduces the parameter count compared to having separate encoders.
The main computational cost comes from:
  • Vision encoder: Processes images at multiple scales
  • Detector transformer: 6 encoder layers + 6 decoder layers
  • Tracker transformer: Encoder-only architecture with memory attention
The decoupled design allows running only the detector for image tasks or only the tracker for interactive refinement.
SAM 3 supports torch.compile for performance optimization:
  • Vision backbone can be compiled separately
  • Decoder supports compilation after warmup
  • Enable with compile=True in model builder
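A generic sketch of what torch.compile does with a module is shown below. The model here is a stand-in, and the "eager" backend is chosen so the example needs no codegen toolchain; in SAM 3 itself you would use the compile=True flag in the model builder instead:

```python
import torch
from torch import nn

# Stand-in model; SAM 3 compiles its vision backbone and decoder similarly.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
x = torch.randn(2, 8)

eager_out = model(x)
compiled = torch.compile(model, backend="eager")  # no codegen with "eager" backend
compiled_out = compiled(x)                        # same numerics as the eager run
```

The first call to a compiled module triggers tracing, which is why the docs note that the decoder "supports compilation after warmup": subsequent calls with the same shapes reuse the captured graph.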

Key Innovations

  • Decoupled Architecture - Separate detector and tracker components prevent task interference while sharing the vision encoder
  • Presence Token - A novel learned token improves discrimination between similar prompts and handles negative cases
  • Memory Attention - Efficient temporal modeling through object pointers and spatial memory
  • Multi-Scale Features - A hierarchical feature pyramid enables detection at different scales

Next Steps

  • Prompting - Learn about the different prompt types
  • Image Segmentation - Understand the image segmentation workflow
  • Video Segmentation - Explore video tracking and segmentation
  • Quick Start - Get started with SAM 3
