
Overview

SAM 3 is a unified foundation model for promptable segmentation in images and videos. With 848M parameters, it introduces a novel decoupled detector-tracker design that minimizes task interference and scales efficiently with diverse data.
(Figure: SAM 3 architecture diagram)

Core Architecture

SAM 3 consists of three main components:
  1. Vision Encoder - Shared backbone for feature extraction
  2. Detector - DETR-based detection and segmentation
  3. Tracker - SAM 2-style memory-based tracking

Decoupled Design

Unlike previous unified models, SAM 3 uses a decoupled detector-tracker architecture:
  • The detector handles text-prompted and exemplar-based segmentation
  • The tracker manages temporal consistency and interactive refinement
  • Both components share the same vision encoder to reduce parameters and computation
This separation prevents task interference and allows each component to specialize in its domain.
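The split can be pictured with a minimal sketch. The class and layer names below are invented for illustration (they are not the real SAM 3 modules): one shared encoder computes features once, and two independent heads consume them.

```python
import torch
from torch import nn

class TinySharedEncoderModel(nn.Module):
    """Illustrative only: one shared encoder feeding two task heads."""

    def __init__(self, dim=32):
        super().__init__()
        self.encoder = nn.Conv2d(3, dim, kernel_size=3, padding=1)  # shared backbone
        self.detector_head = nn.Conv2d(dim, dim, kernel_size=1)     # detection branch
        self.tracker_head = nn.Conv2d(dim, dim, kernel_size=1)      # tracking branch

    def forward(self, img):
        feats = self.encoder(img)  # computed once, consumed by both heads
        return self.detector_head(feats), self.tracker_head(feats)

model = TinySharedEncoderModel()
det_feats, trk_feats = model(torch.randn(1, 3, 16, 16))
```

Because the heads only share the encoder output, gradients and capacity in each branch stay specialized, which is the intuition behind the decoupled design.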

Vision Encoder

The vision encoder is based on a dual ViT-Det architecture that produces multi-scale features.
source/sam3/model/vl_combiner.py
class SAM3VLBackbone(nn.Module):
    """Combines vision and language backbones."""

    def __init__(self, visual: Sam3DualViTDetNeck, text, ...):
        super().__init__()
        self.vision_backbone = visual
        self.language_backbone = text

Vision Processing Pipeline

  1. Image Input - Input images are processed at multiple resolutions
  2. Feature Extraction - The ViT backbone extracts hierarchical features at different scales
  3. Feature Pyramid - Features are organized into a Feature Pyramid Network (FPN) structure
  4. Dual Output - Produces features for both the SAM 3 detector and the SAM 2 tracker
The vision encoder produces two sets of features:
  • SAM 3 features for the detector with text-visual fusion
  • SAM 2 features for the tracker for interactive refinement
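A feature pyramid of this kind can be sketched in a few lines. This is a toy pooling-based pyramid, not the actual SAM 3 FPN, which uses learned lateral connections:

```python
import torch
import torch.nn.functional as F

def toy_feature_pyramid(feat, num_levels=3):
    """Illustrative: build a pyramid by repeatedly halving spatial resolution."""
    levels = [feat]
    for _ in range(num_levels - 1):
        levels.append(F.max_pool2d(levels[-1], kernel_size=2))
    return levels

pyramid = toy_feature_pyramid(torch.randn(1, 8, 32, 32))
# spatial sizes: 32x32, 16x16, 8x8
```

Each level covers the same image at a coarser stride, which is what lets the detector find both small and large objects.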

Detector Architecture

The detector is a DETR-based transformer model that predicts object masks and bounding boxes.
source/sam3/model/sam3_image.py
class Sam3Image(torch.nn.Module):
    def __init__(
        self,
        backbone: SAM3VLBackbone,
        transformer,
        input_geometry_encoder,
        segmentation_head=None,
        ...
    ):
        super().__init__()
        self.backbone = backbone
        self.geometry_encoder = input_geometry_encoder
        self.transformer = transformer
        self.segmentation_head = segmentation_head

Detector Components

  • Transformer Encoder - Fuses image features with text and geometric prompts through cross-attention
  • Transformer Decoder - Uses learned object queries to predict instance-level outputs
  • Segmentation Head - Generates high-resolution mask predictions from decoder outputs
  • Presence Token - A novel token that improves discrimination between similar prompts

Transformer Architecture

The transformer follows a standard encoder-decoder design.
Encoder (source/sam3/model/encoder.py:380-461):
  • Processes multi-level image features
  • Performs cross-attention with text prompts
  • Adds pooled text features to image features for fusion
Decoder (source/sam3/model/decoder.py:407-608):
  • Uses learned object queries
  • Applies self-attention and cross-attention with image features
  • Predicts bounding boxes and classification scores
  • Supports iterative box refinement
The decoder uses box-aware positional encoding (boxRPB) to improve localization accuracy by encoding the relationship between queries and spatial positions.
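The query-based decoding step can be sketched as follows. Dimensions and module names are illustrative only; the real decoder adds boxRPB, iterative refinement, and multiple stacked layers:

```python
import torch
from torch import nn

class ToyQueryDecoder(nn.Module):
    """Illustrative: learned object queries cross-attend to image features,
    then each query predicts one candidate box."""

    def __init__(self, num_queries=10, dim=32):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)  # learned object queries
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.box_head = nn.Linear(dim, 4)              # per-query box (cx, cy, w, h)

    def forward(self, img_feats):  # img_feats: (B, H*W, dim)
        b = img_feats.shape[0]
        q = self.queries.weight.unsqueeze(0).expand(b, -1, -1)
        q, _ = self.cross_attn(q, img_feats, img_feats)  # queries read the image
        return self.box_head(q)                          # (B, num_queries, 4)

decoder = ToyQueryDecoder()
boxes = decoder(torch.randn(2, 64, 32))
```

The key idea is that each query is a free slot that learns, through cross-attention, to bind to at most one object instance.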

Presence Token Innovation

A key architectural innovation is the presence token - a learned token that helps discriminate between closely related prompts.
source/sam3/model/decoder.py
if self.presence_token is not None:
    # Presence token helps distinguish "player in white" vs "player in red"
    presence_out = self.presence_token.weight[None].expand(1, bs, -1)
The presence token:
  • Participates in self-attention with object queries
  • Predicts whether objects matching the prompt exist in the image
  • Improves performance on negative prompts (when no matching object exists)
  • Helps distinguish between similar but different concepts
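One way such a token could be wired in is sketched below. Shapes and module names are assumptions made for the example, not the actual SAM 3 implementation:

```python
import torch
from torch import nn

class ToyPresenceHead(nn.Module):
    """Illustrative: a learned presence token joins the object queries in
    self-attention, then predicts whether any matching object exists."""

    def __init__(self, num_queries=10, dim=32):
        super().__init__()
        self.presence_token = nn.Embedding(1, dim)
        self.self_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.presence_head = nn.Linear(dim, 1)

    def forward(self, queries):  # queries: (B, Q, dim)
        b = queries.shape[0]
        p = self.presence_token.weight.unsqueeze(0).expand(b, -1, -1)
        tokens = torch.cat([p, queries], dim=1)       # token attends with queries
        tokens, _ = self.self_attn(tokens, tokens, tokens)
        return self.presence_head(tokens[:, 0])       # (B, 1) presence logit

head = ToyPresenceHead()
presence_logit = head(torch.randn(2, 10, 32))
```

Because the token sees every object query during self-attention, it can aggregate global evidence and output a single "does anything match this prompt?" score.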

Tracker Architecture

The tracker inherits the SAM 2 architecture for video segmentation and interactive refinement.
source/sam3/model/sam3_tracker_base.py
class Sam3TrackerBase(torch.nn.Module):
    def __init__(
        self,
        backbone,
        transformer,  # Encoder-only transformer
        maskmem_backbone,  # Memory encoder
        num_maskmem=7,  # Number of memory frames
        ...
    ):
        super().__init__()
        self.backbone = backbone
        self.transformer = transformer  # Encoder-only
        self.maskmem_backbone = maskmem_backbone

Memory-Based Tracking

The tracker uses a memory attention mechanism to maintain temporal consistency:
  1. Memory Encoding - Previous frame masks are encoded into spatial memory features
  2. Object Pointers - Output tokens from the mask decoder become object pointers
  3. Temporal Encoding - Object pointers are augmented with temporal positional encoding
  4. Memory Attention - Current frame features attend to past frame memories via cross-attention
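The core memory-attention step reduces to a cross-attention in which the current frame queries the memory bank. The dimensions below are made up for the sketch:

```python
import torch
from torch import nn

# Illustrative: current-frame features (queries) cross-attend to features
# stacked from past frames (keys/values).
dim = 32
mem_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

curr = torch.randn(1, 64, dim)        # current frame: 64 spatial tokens
memory = torch.randn(1, 7 * 64, dim)  # 7 memory frames, flattened together
fused, _ = mem_attn(curr, memory, memory)  # memory-conditioned frame features
```

The output has the same shape as the current-frame features, so it drops straight into the rest of the per-frame pipeline while carrying temporal context.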

Tracker Components

  • SAM Prompt Encoder - Encodes point, box, and mask prompts (source/sam3/sam/prompt_encoder.py)
  • SAM Mask Decoder - Two-way transformer for mask prediction (source/sam3/sam/mask_decoder.py)
  • Memory Encoder - Encodes previous masks into memory (source/sam3/model/memory.py)
  • Memory Attention - Cross-attention to past frames (source/sam3/model/sam3_tracker_base.py:559-794)
The tracker maintains up to num_maskmem=7 memory frames by default:
  • 1 conditioning frame (initial prompt)
  • 6 previous frames for temporal context
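The frame-selection rule can be sketched as a small helper. The function below is hypothetical, written only to make the "1 conditioning + 6 recent" split concrete:

```python
def select_memory_frames(cond_frame, past_frames, num_maskmem=7):
    """Illustrative: keep the conditioning frame plus the most recent
    (num_maskmem - 1) past frames; older frames fall out of the bank."""
    recent = past_frames[-(num_maskmem - 1):]
    return [cond_frame] + recent

bank = select_memory_frames(0, list(range(1, 11)))
# keeps frame 0 plus the six most recent frames 5..10
```

In effect the memory bank behaves like a FIFO queue of recent frames, anchored by the originally prompted frame so the target identity never drifts away from the user's input.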

Shared Vision Encoder

Both detector and tracker share the same vision encoder to maximize efficiency:
source/sam3/model/sam3_tracker_base.py
def forward_image(self, img_batch):
    """Get image features using the SAM3 backbone."""
    backbone_out = self.backbone.forward_image(img_batch)["sam2_backbone_out"]
    # Features are shared between detector and tracker
    return backbone_out
This sharing:
  • Reduces the total parameter count
  • Enables faster inference
  • Ensures consistency between detection and tracking features

Model Size and Performance

SAM 3 has 848M parameters in total. The vision encoder is shared between the detector and tracker, which significantly reduces the parameter count compared to having separate encoders.
The main computational cost comes from:
  • Vision encoder: Processes images at multiple scales
  • Detector transformer: 6 encoder layers + 6 decoder layers
  • Tracker transformer: Encoder-only architecture with memory attention
The decoupled design allows running only the detector for image tasks or only the tracker for interactive refinement.
SAM 3 supports torch.compile for performance optimization:
  • Vision backbone can be compiled separately
  • Decoder supports compilation after warmup
  • Enable with compile=True in model builder
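A generic sketch of what torch.compile does with a module is shown below. The model here is a stand-in, and the "eager" backend is chosen so the example needs no codegen toolchain; in SAM 3 itself you would use the compile=True flag in the model builder instead:

```python
import torch
from torch import nn

# Stand-in model; SAM 3 compiles its vision backbone and decoder similarly.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
x = torch.randn(2, 8)

eager_out = model(x)
compiled = torch.compile(model, backend="eager")  # no codegen with "eager" backend
compiled_out = compiled(x)                        # same numerics as the eager run
```

The first call to a compiled module triggers tracing, which is why the docs note that the decoder "supports compilation after warmup": subsequent calls with the same shapes reuse the captured graph.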

Key Innovations

  • Decoupled Architecture - Separate detector and tracker components prevent task interference while sharing the vision encoder
  • Presence Token - A novel learned token improves discrimination between similar prompts and handles negative cases
  • Memory Attention - Efficient temporal modeling through object pointers and spatial memory
  • Multi-Scale Features - A hierarchical feature pyramid enables detection at different scales

Next Steps

  • Prompting - Learn about the different prompt types
  • Image Segmentation - Understand the image segmentation workflow
  • Video Segmentation - Explore video tracking and segmentation
  • Quick Start - Get started with SAM 3
