
Overview

SAM 3 supports multiple prompting modalities to segment objects in images and videos. You can use text descriptions, visual cues like points and boxes, or even provide example masks to guide segmentation.
SAM 3 can handle more than 270K unique concepts, over 50× more than existing benchmarks, making it truly open-vocabulary.

Prompt Types

SAM 3 supports five main types of prompts:

Text

Natural language descriptions

Points

Positive/negative click coordinates

Boxes

Bounding box coordinates

Masks

Binary segmentation masks

Exemplars

Example images or regions

Text Prompting

Text prompting allows you to describe objects using natural language. SAM 3’s language encoder processes text and fuses it with visual features.

Basic Text Prompts

from sam3.model.sam3_image_processor import Sam3Processor
from PIL import Image

# Initialize processor
processor = Sam3Processor(model)
image = Image.open("image.jpg")
inference_state = processor.set_image(image)

# Text prompt
output = processor.set_text_prompt(
    state=inference_state,
    prompt="a dog"
)

masks = output["masks"]  # Predicted masks
boxes = output["boxes"]  # Bounding boxes
scores = output["scores"]  # Confidence scores

Advanced Text Prompts

SAM 3 handles complex, compositional text prompts:
# Specific attributes
output = processor.set_text_prompt(
    state=inference_state,
    prompt="a player in white uniform"
)

# Spatial relationships
output = processor.set_text_prompt(
    state=inference_state,
    prompt="person sitting on the bench"
)

# Multiple attributes
output = processor.set_text_prompt(
    state=inference_state,
    prompt="red sports car with open doors"
)
The presence token in SAM 3’s architecture helps distinguish between similar prompts like “player in white” vs “player in red”.

Text Encoding Architecture

Text prompts are processed through a language backbone:
source/sam3/model/vl_combiner.py
def forward_text(self, captions, input_boxes=None, device="cuda"):
    """Encode text prompts."""
    text_attention_mask, text_memory, text_embeds = self.language_backbone(
        captions, input_boxes, device=device
    )
    return {
        "language_features": text_memory,
        "language_mask": text_attention_mask,
        "language_embeds": text_embeds
    }
The encoded text features are then fused with image features in the transformer encoder through cross-attention.
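As a minimal sketch of how such cross-attention fusion works (this is illustrative, not SAM 3's actual encoder; the module and dimensions here are assumptions), image tokens attend to the encoded text tokens and the result is added back residually:

```python
import torch
import torch.nn as nn

class TextImageFusion(nn.Module):
    """Illustrative cross-attention fusion: image tokens attend to text tokens."""

    def __init__(self, d_model=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, img_feats, txt_feats, txt_pad_mask=None):
        # img_feats: (B, N_img, d), txt_feats: (B, N_txt, d)
        # key_padding_mask hides padded text tokens from attention
        fused, _ = self.attn(
            query=img_feats, key=txt_feats, value=txt_feats,
            key_padding_mask=txt_pad_mask,
        )
        return self.norm(img_feats + fused)  # residual connection

fusion = TextImageFusion()
img = torch.randn(1, 64, 256)  # e.g. an 8x8 feature map, flattened
txt = torch.randn(1, 5, 256)   # 5 encoded text tokens
out = fusion(img, txt)
print(out.shape)  # torch.Size([1, 64, 256])
```

Note that the image token count is preserved: the text prompt conditions the image features without changing their shape.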

Geometric Prompting

Geometric prompts include points, boxes, and masks. These are encoded by the SequenceGeometryEncoder.

Point Prompts

Points can be positive (foreground) or negative (background):
# Positive and negative points
points = [[100, 200], [150, 250]]  # (x, y) coordinates
point_labels = [1, 0]  # 1 = positive, 0 = negative

output = processor.set_point_prompt(
    state=inference_state,
    points=points,
    point_labels=point_labels
)
Point Encoding (source/sam3/model/geometry_encoders.py:589-630):
  1. Direct Projection: Points are directly projected to d_model dimensions
  2. Pooling: Features are sampled from the image backbone at point locations
  3. Positional Encoding: Sine-cosine positional encoding is applied
  4. Label Embedding: Positive/negative labels are embedded and added
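The steps above can be sketched as follows. This is a simplified stand-in for the real encoder (the backbone-pooling step is omitted, and the module names are assumptions), but it shows how projection, positional encoding, and label embeddings combine:

```python
import math
import torch
import torch.nn as nn

def sine_cosine_pos_enc(xy, d_model):
    """Sine-cosine positional encoding for normalized (x, y) coordinates."""
    half = d_model // 2  # half the channels encode x, half encode y
    freqs = torch.exp(torch.arange(0, half, 2).float() * (-math.log(10000.0) / half))
    enc = []
    for coord in (xy[..., 0:1], xy[..., 1:2]):
        angles = coord * freqs  # (N, half/2)
        enc.append(torch.cat([angles.sin(), angles.cos()], dim=-1))
    return torch.cat(enc, dim=-1)  # (N, d_model)

class PointEncoder(nn.Module):
    """Illustrative point encoder following the four steps above."""

    def __init__(self, d_model=256, num_labels=2):
        super().__init__()
        self.proj = nn.Linear(2, d_model)                     # 1. direct projection
        self.label_embed = nn.Embedding(num_labels, d_model)  # 4. label embedding

    def forward(self, points, labels, img_size=(1024, 1024)):
        xy = points / torch.tensor(img_size, dtype=torch.float32)  # normalize to [0, 1]
        feats = self.proj(xy)                                  # 1. projection
        # (2. pooling from the image backbone is omitted in this sketch)
        feats = feats + sine_cosine_pos_enc(xy, feats.shape[-1])  # 3. positional encoding
        feats = feats + self.label_embed(labels)               # 4. label embedding
        return feats

enc = PointEncoder()
pts = torch.tensor([[100.0, 200.0], [150.0, 250.0]])
lbls = torch.tensor([1, 0])  # positive, negative
feats = enc(pts, lbls)
print(feats.shape)  # torch.Size([2, 256])
```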

Box Prompts

Bounding boxes define regions of interest:
# Box in [x, y, width, height] format
boxes = [[50, 50, 200, 300]]
box_labels = [1]  # 1 = positive box

output = processor.set_box_prompt(
    state=inference_state,
    boxes=boxes,
    box_labels=box_labels
)
Box Encoding Options:
  1. Direct Projection: Box coordinates → linear layer → embeddings
  2. ROI Pooling: Extract features via ROI Align at box locations
  3. Positional Encoding: Encode box center and size with sine-cosine encoding
source/sam3/model/geometry_encoders.py
class SequenceGeometryEncoder(nn.Module):
    def __init__(
        self,
        encode_boxes_as_points: bool,
        boxes_direct_project: bool,
        boxes_pool: bool,
        boxes_pos_enc: bool,
        ...
    ):
        # Multiple box encoding strategies
        if boxes_direct_project:
            self.boxes_direct_project = nn.Linear(4, d_model)
        if boxes_pool:
            self.boxes_pool_project = nn.Conv2d(d_model, d_model, roi_size)
        if boxes_pos_enc:
            self.boxes_pos_enc_project = nn.Linear(d_model + 2, d_model)
SAM 3 can also encode boxes as two corner points (top-left and bottom-right) when encode_boxes_as_points=True. This unifies the representation and uses 6 label types:
  • Regular point (positive/negative)
  • Top-left corner (positive/negative)
  • Bottom-right corner (positive/negative)
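A small sketch of that conversion, assuming the even/odd label convention implied by the table later in this page (0/1 = regular point, 2/3 = top-left, 4/5 = bottom-right, with odd values positive; the exact assignment is an assumption):

```python
def box_to_corner_points(box, positive=True):
    """Convert an [x, y, w, h] box into two labeled corner points.

    Assumed label scheme: 2/3 = top-left (neg/pos), 4/5 = bottom-right (neg/pos).
    """
    x, y, w, h = box
    top_left = [x, y]
    bottom_right = [x + w, y + h]
    tl_label = 3 if positive else 2
    br_label = 5 if positive else 4
    return [top_left, bottom_right], [tl_label, br_label]

points, labels = box_to_corner_points([50, 50, 200, 300])
print(points)  # [[50, 50], [250, 350]]
print(labels)  # [3, 5]
```

With this representation, boxes flow through the same point-encoding path as click prompts, with only the label embedding distinguishing corners from regular points.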

Mask Prompts

Mask prompts provide dense segmentation guidance:
import numpy as np

# Binary mask (H, W)
mask = np.zeros((height, width), dtype=np.uint8)
mask[100:200, 100:200] = 1  # Mark region

output = processor.set_mask_prompt(
    state=inference_state,
    mask=mask
)
Mask Encoding (source/sam3/model/geometry_encoders.py:683-715):
  1. Downsampling: Masks are downsampled to match feature map resolution
  2. Feature Fusion: Downsampled masks are fused with image features
  3. Positional Encoding: Spatial positional encoding is computed
  4. Flattening: Mask features are flattened into a sequence of tokens
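A minimal sketch of these steps, under simplifying assumptions (the fusion here is a plain elementwise gate and positional encoding is omitted; the real encoder differs):

```python
import torch
import torch.nn.functional as F

def encode_mask_prompt(mask, img_feats):
    """Illustrative mask encoding following the four steps above.

    mask: (H, W) binary tensor; img_feats: (C, h, w) feature map.
    """
    C, h, w = img_feats.shape
    # 1. Downsample the mask to the feature map resolution
    m = F.interpolate(mask[None, None].float(), size=(h, w), mode="bilinear")
    # 2. Fuse with image features (simple gating in this sketch)
    fused = img_feats * m[0]
    # 3. Positional encoding would be added here (omitted in this sketch)
    # 4. Flatten spatial dims into a sequence of tokens
    return fused.flatten(1).transpose(0, 1)  # (h*w, C)

mask = torch.zeros(512, 512)
mask[100:200, 100:200] = 1
feats = torch.randn(256, 32, 32)
tokens = encode_mask_prompt(mask, feats)
print(tokens.shape)  # torch.Size([1024, 256])
```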

Combining Multiple Prompts

SAM 3’s Prompt class allows combining different prompt types:
source/sam3/model/geometry_encoders.py
class Prompt:
    """Utility class to manipulate geometric prompts."""
    
    def __init__(
        self,
        box_embeddings=None,
        box_mask=None,
        point_embeddings=None,
        point_mask=None,
        box_labels=None,
        point_labels=None,
        mask_embeddings=None,
        mask_mask=None,
        mask_labels=None,
    ):
        # Stores all prompt types with attention masks and labels

Example: Text + Points

# Start with text
output = processor.set_text_prompt(
    state=inference_state,
    prompt="a dog"
)

# Refine with points
output = processor.set_point_prompt(
    state=inference_state,
    points=[[100, 150]],
    point_labels=[1],
    add_to_existing=True  # Combine with existing text prompt
)

Example: Text + Boxes

# Combine text and box
output = processor.set_text_and_box_prompt(
    state=inference_state,
    prompt="person",
    boxes=[[50, 50, 200, 300]]
)

Prompt Encoding Pipeline

The complete prompt encoding pipeline:
source/sam3/model/sam3_image.py
def _encode_prompt(
    self,
    backbone_out,
    find_input,
    geometric_prompt,
    encode_text=True,
):
    # 1. Get text features
    txt_feats = backbone_out["language_features"][:, txt_ids]
    txt_masks = backbone_out["language_mask"][txt_ids]
    
    # 2. Get image features
    img_feats, img_pos_embeds = self._get_img_feats(backbone_out, img_ids)
    
    # 3. Encode geometry (points, boxes, masks)
    geo_feats, geo_masks = self.geometry_encoder(
        geo_prompt=geometric_prompt,
        img_feats=img_feats,
        img_sizes=vis_feat_sizes,
        img_pos_embeds=img_pos_embeds,
    )
    
    # 4. Concatenate all prompts
    prompt = torch.cat([txt_feats, geo_feats], dim=0)
    prompt_mask = torch.cat([txt_masks, geo_masks], dim=1)
    
    return prompt, prompt_mask
All prompts are concatenated into a unified sequence that the transformer encoder processes together.

Video Prompting

For video segmentation, prompts can be added on specific frames:
from sam3.model_builder import build_sam3_video_predictor

predictor = build_sam3_video_predictor()

# Start session
response = predictor.handle_request({
    "type": "start_session",
    "resource_path": "video.mp4"
})

session_id = response["session_id"]

# Add text prompt on frame 0
response = predictor.handle_request({
    "type": "add_prompt",
    "session_id": session_id,
    "frame_index": 0,
    "text": "person wearing red shirt"
})

# Add refinement points on frame 10
response = predictor.handle_request({
    "type": "add_prompt",
    "session_id": session_id,
    "frame_index": 10,
    "points": [[200, 300]],
    "point_labels": [1]
})

# Propagate to all frames
for frame_output in predictor.handle_stream_request({
    "type": "propagate_in_video",
    "session_id": session_id,
    "propagation_direction": "both"
}):
    frame_idx = frame_output["frame_index"]
    masks = frame_output["outputs"]["masks"]

Prompt Best Practices

Use text when:
  • You want to segment all instances of a concept
  • The concept is well-defined (“dog”, “car”, “person in blue”)
  • You want open-vocabulary capabilities
Use geometric prompts when:
  • You need to specify exact instances
  • The concept is ambiguous or hard to describe
  • You want interactive refinement
  • You’re doing instance segmentation tasks
More specific is generally better:
  • ✅ “red sports car” → Better than “car”
  • ✅ “person wearing blue jacket” → Better than “person”
  • ✅ “golden retriever” → Better than “dog”
SAM 3 handles compositional prompts well, so include attributes like color, size, position, and state.
SAM 3’s presence token enables handling negative cases where no matching object exists. If a prompt doesn’t match any objects in the image, the model will output empty masks with low scores.
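In practice this means you can filter outputs by score to detect the "no match" case. A small sketch, assuming the output dict layout shown in the earlier examples (masks, boxes, and scores as parallel sequences; the threshold value is an assumption to tune per use case):

```python
import numpy as np

def filter_by_score(output, threshold=0.5):
    """Keep only detections whose confidence meets the threshold."""
    scores = np.asarray(output["scores"])
    keep = scores >= threshold
    return {
        "masks": [m for m, k in zip(output["masks"], keep) if k],
        "boxes": [b for b, k in zip(output["boxes"], keep) if k],
        "scores": scores[keep].tolist(),
    }

# Toy example: one confident match, one near-miss
output = {
    "masks": ["mask_a", "mask_b"],
    "boxes": [[10, 10, 50, 50], [0, 0, 5, 5]],
    "scores": [0.92, 0.12],
}
filtered = filter_by_score(output)
print(filtered["scores"])  # [0.92]
```

An empty result after filtering indicates the prompt likely matched nothing in the image.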
You can combine multiple prompt types simultaneously:
  • Text + Points
  • Text + Boxes
  • Text + Boxes + Points
  • Points + Masks
The Prompt class concatenates all prompts into a unified sequence for the transformer to process.

Prompt Label Types

Different prompts use different label schemes:
| Prompt Type | Labels | Meaning |
|---|---|---|
| Points | 0, 1 | 0 = background, 1 = foreground |
| Boxes | 0, 1 | 0 = negative, 1 = positive |
| Boxes as Points | 0-5 | 0/1 = regular point, 2/3 = top-left, 4/5 = bottom-right |
| Masks | 0, 1 | 0 = negative, 1 = positive |
| Text | N/A | No explicit labels (always positive) |
All geometric prompts support attention masks to handle variable-length sequences efficiently using padding.
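Padding with an attention mask can be sketched as follows (an illustrative helper, not SAM 3's code; it follows PyTorch's key_padding_mask convention where True marks padding):

```python
import torch

def pad_prompt_sequences(seqs, d_model=4):
    """Pad variable-length prompt sequences and build a padding mask."""
    max_len = max(s.shape[0] for s in seqs)
    batch = torch.zeros(len(seqs), max_len, d_model)
    pad_mask = torch.ones(len(seqs), max_len, dtype=torch.bool)  # True = padding
    for i, s in enumerate(seqs):
        batch[i, : s.shape[0]] = s
        pad_mask[i, : s.shape[0]] = False  # real tokens
    return batch, pad_mask

# Two prompts of different lengths (e.g. 2 point tokens vs 1 box token)
a = torch.randn(2, 4)
b = torch.randn(1, 4)
batch, mask = pad_prompt_sequences([a, b])
print(batch.shape)  # torch.Size([2, 2, 4])
print(mask)         # tensor([[False, False], [False,  True]])
```

The mask lets the transformer attend only to real prompt tokens while processing the whole batch in one pass.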

Next Steps

Image Segmentation

Learn the complete image segmentation workflow

Video Segmentation

Understand video tracking and propagation

Image Inference

See prompting examples for images

Video Inference

See prompting examples for videos
