
Overview

SAM 3 supports multiple prompting modalities to segment objects in images and videos. You can use text descriptions, visual cues like points and boxes, or even provide example masks to guide segmentation.
SAM 3 can handle more than 270K unique concepts, over 50× more than existing benchmarks, making it truly open-vocabulary.

Prompt Types

SAM 3 supports five main types of prompts:

Text

Natural language descriptions

Points

Positive/negative click coordinates

Boxes

Bounding box coordinates

Masks

Binary segmentation masks

Exemplars

Example images or regions

Text Prompting

Text prompting allows you to describe objects using natural language. SAM 3’s language encoder processes text and fuses it with visual features.

Basic Text Prompts

from sam3.model.sam3_image_processor import Sam3Processor
from PIL import Image

# Initialize processor
processor = Sam3Processor(model)
image = Image.open("image.jpg")
inference_state = processor.set_image(image)

# Text prompt
output = processor.set_text_prompt(
    state=inference_state,
    prompt="a dog"
)

masks = output["masks"]  # Predicted masks
boxes = output["boxes"]  # Bounding boxes
scores = output["scores"]  # Confidence scores

Advanced Text Prompts

SAM 3 handles complex, compositional text prompts:
# Specific attributes
output = processor.set_text_prompt(
    state=inference_state,
    prompt="a player in white uniform"
)

# Spatial relationships
output = processor.set_text_prompt(
    state=inference_state,
    prompt="person sitting on the bench"
)

# Multiple attributes
output = processor.set_text_prompt(
    state=inference_state,
    prompt="red sports car with open doors"
)
The presence token in SAM 3’s architecture helps distinguish between similar prompts like “player in white” vs “player in red”.

Text Encoding Architecture

Text prompts are processed through a language backbone:
source/sam3/model/vl_combiner.py
def forward_text(self, captions, input_boxes=None, device="cuda"):
    """Encode text prompts."""
    text_attention_mask, text_memory, text_embeds = self.language_backbone(
        captions, input_boxes, device=device
    )
    return {
        "language_features": text_memory,
        "language_mask": text_attention_mask,
        "language_embeds": text_embeds
    }
The encoded text features are then fused with image features in the transformer encoder through cross-attention.
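As a minimal sketch of how such cross-attention fusion works (this is illustrative, not SAM 3's actual encoder; the module and dimensions here are assumptions), image tokens attend to the encoded text tokens and the result is added back residually:

```python
import torch
import torch.nn as nn

class TextImageFusion(nn.Module):
    """Illustrative cross-attention fusion: image tokens attend to text tokens."""

    def __init__(self, d_model=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, img_feats, txt_feats, txt_pad_mask=None):
        # img_feats: (B, N_img, d), txt_feats: (B, N_txt, d)
        # key_padding_mask hides padded text tokens from attention
        fused, _ = self.attn(
            query=img_feats, key=txt_feats, value=txt_feats,
            key_padding_mask=txt_pad_mask,
        )
        return self.norm(img_feats + fused)  # residual connection

fusion = TextImageFusion()
img = torch.randn(1, 64, 256)  # e.g. an 8x8 feature map, flattened
txt = torch.randn(1, 5, 256)   # 5 encoded text tokens
out = fusion(img, txt)
print(out.shape)  # torch.Size([1, 64, 256])
```

Note that the image token count is preserved: the text prompt conditions the image features without changing their shape.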

Geometric Prompting

Geometric prompts include points, boxes, and masks. These are encoded by the SequenceGeometryEncoder.

Point Prompts

Points can be positive (foreground) or negative (background):
# Positive and negative points
points = [[100, 200], [150, 250]]  # (x, y) coordinates
point_labels = [1, 0]  # 1 = positive, 0 = negative

output = processor.set_point_prompt(
    state=inference_state,
    points=points,
    point_labels=point_labels
)
Point Encoding (source/sam3/model/geometry_encoders.py:589-630):
  1. Direct Projection: Points are directly projected to d_model dimensions
  2. Pooling: Features are sampled from the image backbone at point locations
  3. Positional Encoding: Sine-cosine positional encoding is applied
  4. Label Embedding: Positive/negative labels are embedded and added
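The steps above can be sketched as follows. This is a simplified stand-in for the real encoder (the backbone-pooling step is omitted, and the module names are assumptions), but it shows how projection, positional encoding, and label embeddings combine:

```python
import math
import torch
import torch.nn as nn

def sine_cosine_pos_enc(xy, d_model):
    """Sine-cosine positional encoding for normalized (x, y) coordinates."""
    half = d_model // 2  # half the channels encode x, half encode y
    freqs = torch.exp(torch.arange(0, half, 2).float() * (-math.log(10000.0) / half))
    enc = []
    for coord in (xy[..., 0:1], xy[..., 1:2]):
        angles = coord * freqs  # (N, half/2)
        enc.append(torch.cat([angles.sin(), angles.cos()], dim=-1))
    return torch.cat(enc, dim=-1)  # (N, d_model)

class PointEncoder(nn.Module):
    """Illustrative point encoder following the four steps above."""

    def __init__(self, d_model=256, num_labels=2):
        super().__init__()
        self.proj = nn.Linear(2, d_model)                     # 1. direct projection
        self.label_embed = nn.Embedding(num_labels, d_model)  # 4. label embedding

    def forward(self, points, labels, img_size=(1024, 1024)):
        xy = points / torch.tensor(img_size, dtype=torch.float32)  # normalize to [0, 1]
        feats = self.proj(xy)                                  # 1. projection
        # (2. pooling from the image backbone is omitted in this sketch)
        feats = feats + sine_cosine_pos_enc(xy, feats.shape[-1])  # 3. positional encoding
        feats = feats + self.label_embed(labels)               # 4. label embedding
        return feats

enc = PointEncoder()
pts = torch.tensor([[100.0, 200.0], [150.0, 250.0]])
lbls = torch.tensor([1, 0])  # positive, negative
feats = enc(pts, lbls)
print(feats.shape)  # torch.Size([2, 256])
```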

Box Prompts

Bounding boxes define regions of interest:
# Box in [x, y, width, height] format
boxes = [[50, 50, 200, 300]]
box_labels = [1]  # 1 = positive box

output = processor.set_box_prompt(
    state=inference_state,
    boxes=boxes,
    box_labels=box_labels
)
Box Encoding Options:
  1. Direct Projection: Box coordinates → linear layer → embeddings
  2. ROI Pooling: Extract features via ROI Align at box locations
  3. Positional Encoding: Encode box center and size with sine-cosine encoding
source/sam3/model/geometry_encoders.py
class SequenceGeometryEncoder(nn.Module):
    def __init__(
        self,
        encode_boxes_as_points: bool,
        boxes_direct_project: bool,
        boxes_pool: bool,
        boxes_pos_enc: bool,
        ...
    ):
        # Multiple box encoding strategies
        if boxes_direct_project:
            self.boxes_direct_project = nn.Linear(4, d_model)
        if boxes_pool:
            self.boxes_pool_project = nn.Conv2d(d_model, d_model, roi_size)
        if boxes_pos_enc:
            self.boxes_pos_enc_project = nn.Linear(d_model + 2, d_model)
SAM 3 can also encode boxes as two corner points (top-left and bottom-right) when encode_boxes_as_points=True. This unifies the representation and uses 6 label types:
  • Regular point (positive/negative)
  • Top-left corner (positive/negative)
  • Bottom-right corner (positive/negative)
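A small sketch of that conversion, assuming the even/odd label convention implied by the table later in this page (0/1 = regular point, 2/3 = top-left, 4/5 = bottom-right, with odd values positive; the exact assignment is an assumption):

```python
def box_to_corner_points(box, positive=True):
    """Convert an [x, y, w, h] box into two labeled corner points.

    Assumed label scheme: 2/3 = top-left (neg/pos), 4/5 = bottom-right (neg/pos).
    """
    x, y, w, h = box
    top_left = [x, y]
    bottom_right = [x + w, y + h]
    tl_label = 3 if positive else 2
    br_label = 5 if positive else 4
    return [top_left, bottom_right], [tl_label, br_label]

points, labels = box_to_corner_points([50, 50, 200, 300])
print(points)  # [[50, 50], [250, 350]]
print(labels)  # [3, 5]
```

With this representation, boxes flow through the same point-encoding path as click prompts, with only the label embedding distinguishing corners from regular points.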

Mask Prompts

Mask prompts provide dense segmentation guidance:
import numpy as np

# Binary mask (H, W)
mask = np.zeros((height, width), dtype=np.uint8)
mask[100:200, 100:200] = 1  # Mark region

output = processor.set_mask_prompt(
    state=inference_state,
    mask=mask
)
Mask Encoding (source/sam3/model/geometry_encoders.py:683-715):
  1. Downsampling: Masks are downsampled to match feature map resolution
  2. Feature Fusion: Downsampled masks are fused with image features
  3. Positional Encoding: Spatial positional encoding is computed
  4. Flattening: Mask features are flattened into a sequence of tokens
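A minimal sketch of these steps, under simplifying assumptions (the fusion here is a plain elementwise gate and positional encoding is omitted; the real encoder differs):

```python
import torch
import torch.nn.functional as F

def encode_mask_prompt(mask, img_feats):
    """Illustrative mask encoding following the four steps above.

    mask: (H, W) binary tensor; img_feats: (C, h, w) feature map.
    """
    C, h, w = img_feats.shape
    # 1. Downsample the mask to the feature map resolution
    m = F.interpolate(mask[None, None].float(), size=(h, w), mode="bilinear")
    # 2. Fuse with image features (simple gating in this sketch)
    fused = img_feats * m[0]
    # 3. Positional encoding would be added here (omitted in this sketch)
    # 4. Flatten spatial dims into a sequence of tokens
    return fused.flatten(1).transpose(0, 1)  # (h*w, C)

mask = torch.zeros(512, 512)
mask[100:200, 100:200] = 1
feats = torch.randn(256, 32, 32)
tokens = encode_mask_prompt(mask, feats)
print(tokens.shape)  # torch.Size([1024, 256])
```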

Combining Multiple Prompts

SAM 3’s Prompt class allows combining different prompt types:
source/sam3/model/geometry_encoders.py
class Prompt:
    """Utility class to manipulate geometric prompts."""
    
    def __init__(
        self,
        box_embeddings=None,
        box_mask=None,
        point_embeddings=None,
        point_mask=None,
        box_labels=None,
        point_labels=None,
        mask_embeddings=None,
        mask_mask=None,
        mask_labels=None,
    ):
        # Stores all prompt types with attention masks and labels

Example: Text + Points

# Start with text
output = processor.set_text_prompt(
    state=inference_state,
    prompt="a dog"
)

# Refine with points
output = processor.set_point_prompt(
    state=inference_state,
    points=[[100, 150]],
    point_labels=[1],
    add_to_existing=True  # Combine with existing text prompt
)

Example: Text + Boxes

# Combine text and box
output = processor.set_text_and_box_prompt(
    state=inference_state,
    prompt="person",
    boxes=[[50, 50, 200, 300]]
)

Prompt Encoding Pipeline

The complete prompt encoding pipeline:
source/sam3/model/sam3_image.py
def _encode_prompt(
    self,
    backbone_out,
    find_input,
    geometric_prompt,
    encode_text=True,
):
    # 1. Get text features
    txt_feats = backbone_out["language_features"][:, txt_ids]
    txt_masks = backbone_out["language_mask"][txt_ids]
    
    # 2. Get image features
    img_feats, img_pos_embeds = self._get_img_feats(backbone_out, img_ids)
    
    # 3. Encode geometry (points, boxes, masks)
    geo_feats, geo_masks = self.geometry_encoder(
        geo_prompt=geometric_prompt,
        img_feats=img_feats,
        img_sizes=vis_feat_sizes,
        img_pos_embeds=img_pos_embeds,
    )
    
    # 4. Concatenate all prompts
    prompt = torch.cat([txt_feats, geo_feats], dim=0)
    prompt_mask = torch.cat([txt_masks, geo_masks], dim=1)
    
    return prompt, prompt_mask
All prompts are concatenated into a unified sequence that the transformer encoder processes together.

Video Prompting

For video segmentation, prompts can be added on specific frames:
from sam3.model_builder import build_sam3_video_predictor

predictor = build_sam3_video_predictor()

# Start session
response = predictor.handle_request({
    "type": "start_session",
    "resource_path": "video.mp4"
})

session_id = response["session_id"]

# Add text prompt on frame 0
response = predictor.handle_request({
    "type": "add_prompt",
    "session_id": session_id,
    "frame_index": 0,
    "text": "person wearing red shirt"
})

# Add refinement points on frame 10
response = predictor.handle_request({
    "type": "add_prompt",
    "session_id": session_id,
    "frame_index": 10,
    "points": [[200, 300]],
    "point_labels": [1]
})

# Propagate to all frames
for frame_output in predictor.handle_stream_request({
    "type": "propagate_in_video",
    "session_id": session_id,
    "propagation_direction": "both"
}):
    frame_idx = frame_output["frame_index"]
    masks = frame_output["outputs"]["masks"]

Prompt Best Practices

Use text when:
  • You want to segment all instances of a concept
  • The concept is well-defined (“dog”, “car”, “person in blue”)
  • You want open-vocabulary capabilities
Use geometric prompts when:
  • You need to specify exact instances
  • The concept is ambiguous or hard to describe
  • You want interactive refinement
  • You’re doing instance segmentation tasks
More specific is generally better:
  • ✅ “red sports car” → Better than “car”
  • ✅ “person wearing blue jacket” → Better than “person”
  • ✅ “golden retriever” → Better than “dog”
SAM 3 handles compositional prompts well, so include attributes like color, size, position, and state.
SAM 3’s presence token enables handling negative cases where no matching object exists. If a prompt doesn’t match any objects in the image, the model will output empty masks with low scores.
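In practice this means you can filter outputs by score to detect the "no match" case. A small sketch, assuming the output dict layout shown in the earlier examples (masks, boxes, and scores as parallel sequences; the threshold value is an assumption to tune per use case):

```python
import numpy as np

def filter_by_score(output, threshold=0.5):
    """Keep only detections whose confidence meets the threshold."""
    scores = np.asarray(output["scores"])
    keep = scores >= threshold
    return {
        "masks": [m for m, k in zip(output["masks"], keep) if k],
        "boxes": [b for b, k in zip(output["boxes"], keep) if k],
        "scores": scores[keep].tolist(),
    }

# Toy example: one confident match, one near-miss
output = {
    "masks": ["mask_a", "mask_b"],
    "boxes": [[10, 10, 50, 50], [0, 0, 5, 5]],
    "scores": [0.92, 0.12],
}
filtered = filter_by_score(output)
print(filtered["scores"])  # [0.92]
```

An empty result after filtering indicates the prompt likely matched nothing in the image.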
You can combine multiple prompt types simultaneously:
  • Text + Points
  • Text + Boxes
  • Text + Boxes + Points
  • Points + Masks
The Prompt class concatenates all prompts into a unified sequence for the transformer to process.

Prompt Label Types

Different prompts use different label schemes:
| Prompt Type | Labels | Meaning |
|---|---|---|
| Points | 0, 1 | 0 = background, 1 = foreground |
| Boxes | 0, 1 | 0 = negative, 1 = positive |
| Boxes as Points | 0-5 | 0/1 = regular point, 2/3 = top-left, 4/5 = bottom-right |
| Masks | 0, 1 | 0 = negative, 1 = positive |
| Text | N/A | No explicit labels (always positive) |
All geometric prompts support attention masks to handle variable-length sequences efficiently using padding.
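Padding with an attention mask can be sketched as follows (an illustrative helper, not SAM 3's code; it follows PyTorch's key_padding_mask convention where True marks padding):

```python
import torch

def pad_prompt_sequences(seqs, d_model=4):
    """Pad variable-length prompt sequences and build a padding mask."""
    max_len = max(s.shape[0] for s in seqs)
    batch = torch.zeros(len(seqs), max_len, d_model)
    pad_mask = torch.ones(len(seqs), max_len, dtype=torch.bool)  # True = padding
    for i, s in enumerate(seqs):
        batch[i, : s.shape[0]] = s
        pad_mask[i, : s.shape[0]] = False  # real tokens
    return batch, pad_mask

# Two prompts of different lengths (e.g. 2 point tokens vs 1 box token)
a = torch.randn(2, 4)
b = torch.randn(1, 4)
batch, mask = pad_prompt_sequences([a, b])
print(batch.shape)  # torch.Size([2, 2, 4])
print(mask)         # tensor([[False, False], [False,  True]])
```

The mask lets the transformer attend only to real prompt tokens while processing the whole batch in one pass.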

Next Steps

Image Segmentation

Learn the complete image segmentation workflow

Video Segmentation

Understand video tracking and propagation

Image Inference

See prompting examples for images

Video Inference

See prompting examples for videos
