Overview
SAM 3 supports multiple prompting modalities to segment objects in images and videos. You can use text descriptions, visual cues like points and boxes, or even provide example masks to guide segmentation.
SAM 3 can handle 270K+ unique concepts, over 50× more than existing benchmarks, which makes it open-vocabulary.
Prompt Types
SAM 3 supports five main types of prompts:
- Text: Natural language descriptions
- Points: Positive/negative click coordinates
- Boxes: Bounding box coordinates
- Masks: Binary segmentation masks
- Exemplars: Example images or regions
Text Prompting
Text prompting allows you to describe objects using natural language. SAM 3’s language encoder processes text and fuses it with visual features.
Basic Text Prompts
Advanced Text Prompts
SAM 3 handles complex, compositional text prompts.
Text Encoding Architecture
Text prompts are processed through a language backbone (source/sam3/model/vl_combiner.py).
Geometric Prompting
Geometric prompts include points, boxes, and masks. These are encoded by the SequenceGeometryEncoder.
Point Prompts
Points can be positive (foreground) or negative (background). See source/sam3/model/geometry_encoders.py:589-630.
Box Prompts
Bounding boxes define regions of interest. Box encoding (source/sam3/model/geometry_encoders.py) involves:
- Direct Projection: Box coordinates → linear layer → embeddings
- ROI Pooling: Extract features via ROI Align at box locations
- Positional Encoding: Encode box center and size with sine-cosine encoding
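To make the positional-encoding step concrete, here is a minimal sketch of a generic sine-cosine encoding applied to a box's center and size. It is not taken from SAM 3's source: the function names, the number of frequencies, the temperature of 10000, and the assumption that coordinates are normalized to [0, 1] are all illustrative choices.

```python
import math
from typing import List, Tuple

Box = Tuple[float, float, float, float]


def xyxy_to_cxcywh(box: Box) -> Box:
    """Convert (x0, y0, x1, y1) to (center_x, center_y, width, height)."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2, x1 - x0, y1 - y0)


def sincos_encode(value: float, num_freqs: int = 4,
                  temperature: float = 10000.0) -> List[float]:
    """Encode one scalar as sin/cos pairs at num_freqs frequencies.

    The frequency schedule is a common choice, assumed here for illustration.
    """
    feats: List[float] = []
    for i in range(num_freqs):
        freq = 2 * math.pi / (temperature ** (i / num_freqs))
        feats.append(math.sin(value * freq))
        feats.append(math.cos(value * freq))
    return feats


def encode_box(box_xyxy: Box, num_freqs: int = 4) -> List[float]:
    """Concatenate the encodings of center_x, center_y, width, and height."""
    out: List[float] = []
    for v in xyxy_to_cxcywh(box_xyxy):
        out.extend(sincos_encode(v, num_freqs))
    return out


embedding = encode_box((0.25, 0.25, 0.75, 0.75))
print(len(embedding))  # 4 scalars * 4 frequencies * 2 (sin, cos) = 32
```

In a real model the resulting vector would typically pass through a learned projection before being combined with the other box features; that step is omitted here.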
Can boxes be converted to points?
Yes! SAM 3 can encode boxes as two corner points (top-left and bottom-right) when encode_boxes_as_points=True. This unifies the representation and uses 6 label types:
- Regular point (positive/negative)
- Top-left corner (positive/negative)
- Bottom-right corner (positive/negative)
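The corner-point conversion can be sketched as follows. This is an illustration, not SAM 3's code: `box_to_corner_points` is a hypothetical helper, and the mapping of a box label to corner labels by adding 2 (top-left) or 4 (bottom-right) is an assumption chosen to fit the 0-5 label range, with the lower value of each pair taken to mean negative.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]

# Assumed offsets: labels 0/1 are regular points, 2/3 top-left corners,
# 4/5 bottom-right corners. Within each pair, the lower value is negative.
TOP_LEFT_OFFSET = 2
BOTTOM_RIGHT_OFFSET = 4


def box_to_corner_points(box_xyxy: Box, label: int) -> List[Tuple[Point, int]]:
    """Turn one labeled (x0, y0, x1, y1) box into two labeled corner points."""
    if label not in (0, 1):
        raise ValueError("box label must be 0 (negative) or 1 (positive)")
    x0, y0, x1, y1 = box_xyxy
    return [
        ((x0, y0), label + TOP_LEFT_OFFSET),
        ((x1, y1), label + BOTTOM_RIGHT_OFFSET),
    ]


print(box_to_corner_points((0.1, 0.2, 0.6, 0.9), label=1))
# [((0.1, 0.2), 3), ((0.6, 0.9), 5)]
```

Once boxes are expressed this way, points and boxes can share a single point encoder, with the label alone telling the model which role each coordinate plays.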
Mask Prompts
Mask prompts provide dense segmentation guidance. See source/sam3/model/geometry_encoders.py:683-715.
Combining Multiple Prompts
SAM 3’s Prompt class (source/sam3/model/geometry_encoders.py) allows combining different prompt types.
Example: Text + Points
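A minimal sketch of pairing a text prompt with point clicks. `PromptSketch` and `add_point` are hypothetical names used only for illustration and are not SAM 3's Prompt class; the use of coordinates normalized to [0, 1] is also an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class PromptSketch:
    """Hypothetical container for a text prompt plus labeled clicks."""
    text: Optional[str] = None
    points: List[Tuple[float, float]] = field(default_factory=list)
    point_labels: List[int] = field(default_factory=list)  # 1 = foreground, 0 = background

    def add_point(self, x: float, y: float, positive: bool = True) -> None:
        """Append one click; coordinates are assumed normalized to [0, 1]."""
        if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
            raise ValueError("coordinates must lie in [0, 1]")
        self.points.append((x, y))
        self.point_labels.append(1 if positive else 0)


prompt = PromptSketch(text="dog")
prompt.add_point(0.42, 0.55, positive=True)   # click on the instance to keep
prompt.add_point(0.80, 0.30, positive=False)  # click on a region to exclude
print(prompt.point_labels)  # [1, 0]
```

Here the text names the concept, while the clicks narrow the result to a particular instance.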
Example: Text + Boxes
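A minimal sketch of pairing a text prompt with positive and negative boxes. `TextBoxPromptSketch` and `add_box` are hypothetical names and not part of SAM 3; the (x0, y0, x1, y1) layout and the [0, 1] normalization are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed normalized


@dataclass
class TextBoxPromptSketch:
    """Hypothetical container for a text prompt plus labeled boxes."""
    text: Optional[str] = None
    boxes: List[Box] = field(default_factory=list)
    box_labels: List[int] = field(default_factory=list)  # 1 = positive, 0 = negative

    def add_box(self, box: Box, positive: bool = True) -> None:
        """Append one box after checking that it is well-formed."""
        x0, y0, x1, y1 = box
        if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
            raise ValueError("box must satisfy 0 <= x0 < x1 <= 1 and 0 <= y0 < y1 <= 1")
        self.boxes.append(box)
        self.box_labels.append(1 if positive else 0)


prompt = TextBoxPromptSketch(text="car")
prompt.add_box((0.10, 0.40, 0.35, 0.70), positive=True)   # an example of the target
prompt.add_box((0.60, 0.45, 0.90, 0.75), positive=False)  # a look-alike to exclude
print(prompt.box_labels)  # [1, 0]
```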
Prompt Encoding Pipeline
In the complete prompt encoding pipeline (source/sam3/model/sam3_image.py), all prompts are concatenated into a unified sequence that the transformer encoder processes together.
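The concatenation step can be sketched in plain Python. This is an illustration under stated assumptions rather than SAM 3's implementation: `concat_prompt_tokens` is a hypothetical helper, tokens are shown as short lists of floats, and padding to a fixed length with a boolean mask is a common convention assumed here.

```python
from typing import Dict, List, Tuple

Token = List[float]


def concat_prompt_tokens(
    groups: Dict[str, List[Token]], max_len: int
) -> Tuple[List[Token], List[bool], List[str]]:
    """Concatenate token groups in insertion order, then pad to max_len.

    Returns (tokens, padding_mask, sources). padding_mask is True at padded
    positions so that attention can ignore them.
    """
    tokens: List[Token] = []
    sources: List[str] = []
    for name, group in groups.items():
        tokens.extend(group)
        sources.extend([name] * len(group))
    if len(tokens) > max_len:
        raise ValueError("prompt sequence exceeds max_len")
    dim = len(tokens[0]) if tokens else 0
    n_pad = max_len - len(tokens)
    mask = [False] * len(tokens) + [True] * n_pad
    tokens = tokens + [[0.0] * dim for _ in range(n_pad)]
    sources = sources + ["pad"] * n_pad
    return tokens, mask, sources


groups = {
    "text": [[0.1, 0.2], [0.3, 0.4]],
    "points": [[0.5, 0.6]],
    "boxes": [[0.7, 0.8]],
}
tokens, mask, sources = concat_prompt_tokens(groups, max_len=6)
print(sources)  # ['text', 'text', 'points', 'boxes', 'pad', 'pad']
print(mask)     # [False, False, False, False, True, True]
```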
Video Prompting
For video segmentation, prompts can be added on specific frames.
Prompt Best Practices
When should I use text vs geometric prompts?
Use text when:
- You want to segment all instances of a concept
- The concept is well-defined (“dog”, “car”, “person in blue”)
- You want open-vocabulary capabilities
Use geometric prompts when:
- You need to specify exact instances
- The concept is ambiguous or hard to describe
- You want interactive refinement
- You’re doing instance segmentation tasks
How specific should text prompts be?
More specific is generally better:
- ✅ “red sports car” → Better than “car”
- ✅ “person wearing blue jacket” → Better than “person”
- ✅ “golden retriever” → Better than “dog”
Can I use negative text prompts?
SAM 3’s presence token enables handling negative cases where no matching object exists. If a prompt doesn’t match any objects in the image, the model will output empty masks with low scores.
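Because a non-matching prompt yields low scores, downstream code can treat "no prediction above a threshold" as "concept not present". The sketch below assumes scores are plain floats in [0, 1]; the helper names and the 0.5 threshold are illustrative assumptions, not values taken from SAM 3.

```python
from typing import List


def filter_by_score(scores: List[float], threshold: float = 0.5) -> List[int]:
    """Return the indices of predictions whose score meets the threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]


def concept_present(scores: List[float], threshold: float = 0.5) -> bool:
    """Treat the concept as present only if at least one prediction survives."""
    return len(filter_by_score(scores, threshold)) > 0


print(filter_by_score([0.91, 0.12, 0.67]))  # [0, 2]
print(concept_present([0.03, 0.08]))        # False
```

The right threshold depends on the application, so it is worth tuning on your own data rather than relying on a fixed default.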
How many prompts can I combine?
You can combine multiple prompt types simultaneously:
- Text + Points
- Text + Boxes
- Text + Boxes + Points
- Points + Masks
The Prompt class concatenates all prompts into a unified sequence for the transformer to process.
Prompt Label Types
Different prompts use different label schemes:
| Prompt Type | Labels | Meaning |
|---|---|---|
| Points | 0, 1 | 0 = background, 1 = foreground |
| Boxes | 0, 1 | 0 = negative, 1 = positive |
| Boxes as Points | 0-5 | 0/1 = regular point, 2/3 = top-left, 4/5 = bottom-right |
| Masks | 0, 1 | 0 = negative, 1 = positive |
| Text | N/A | No explicit labels (always positive) |
Next Steps
- Image Segmentation: Learn the complete image segmentation workflow
- Video Segmentation: Understand video tracking and propagation
- Image Inference: See prompting examples for images
- Video Inference: See prompting examples for videos