SAM 3: Segment Anything with Concepts
SAM 3 is a unified foundation model for promptable segmentation in images and videos. It can detect, segment, and track objects using text or visual prompts such as points, boxes, and masks.

Key Features
Open-Vocabulary Segmentation
Segment any object using natural language descriptions. SAM 3 handles over 270K unique concepts, reaching 75-80% of human performance.
Multi-Modal Prompting
Prompt with text, points, boxes, masks, or combinations thereof for precise segmentation control.
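As a rough illustration of how mixed prompts might be represented, here is a minimal sketch of a prompt container. The class name and fields are hypothetical, not SAM 3's actual API; the point is only that text and geometric prompts can be combined in one request.

```python
from dataclasses import dataclass, field
from typing import Optional, List, Tuple

# Hypothetical prompt container -- SAM 3's real API may differ.
@dataclass
class SegmentationPrompt:
    text: Optional[str] = None  # open-vocabulary concept, e.g. "a red bicycle"
    points: List[Tuple[float, float, int]] = field(default_factory=list)  # (x, y, label); 1=fg, 0=bg
    boxes: List[Tuple[float, float, float, float]] = field(default_factory=list)  # (x0, y0, x1, y1)

    def is_valid(self) -> bool:
        """A prompt must carry at least one modality."""
        return bool(self.text or self.points or self.boxes)

# Modalities can be mixed: a text concept refined by an exemplar box.
prompt = SegmentationPrompt(
    text="a player in white",
    boxes=[(120.0, 40.0, 260.0, 310.0)],
)
print(prompt.is_valid())  # True
```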
Video Tracking
Track and segment objects across video frames with temporal consistency and interactive refinement capabilities.
Unified Architecture
An 848M-parameter model with a decoupled detector-tracker design that scales efficiently with data.
What’s New in SAM 3
Compared to its predecessor SAM 2, SAM 3 introduces:

- Concept-based segmentation: Exhaustively segment all instances of an open-vocabulary concept specified by text or exemplars
- Presence token: Improved discrimination between closely related prompts (e.g., “a player in white” vs. “a player in red”)
- Massive concept coverage: Trained on over 4 million unique concepts, drawn from the largest high-quality open-vocabulary segmentation dataset to date
- Decoupled architecture: Separate detector and tracker minimize task interference and improve performance
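To give intuition for the presence token, the toy sketch below separates "is the concept present?" from "where are its instances?". The function and thresholds are illustrative assumptions, not SAM 3's actual implementation.

```python
# Illustrative sketch of presence gating (NOT the actual SAM 3 code):
# a global presence score first decides whether the concept appears at all,
# and per-instance scores are only kept when the concept is present.

def segment_concept(instance_scores, presence_score, threshold=0.5):
    """Return confident instance scores only if the concept is judged present."""
    if presence_score < threshold:
        return []  # concept absent: suppress all instance hypotheses
    return [s for s in instance_scores if s >= threshold]

# "a player in white": concept present, two confident instances survive.
print(segment_concept([0.9, 0.7, 0.2], presence_score=0.95))  # [0.9, 0.7]
# "a player in red" on the same frame: concept judged absent, so even a
# high-scoring instance is suppressed rather than mislabeled.
print(segment_concept([0.8], presence_score=0.1))  # []
```

Decoupling recognition from localization in this way is what helps discriminate closely related prompts such as the two player descriptions above.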
SAM 3 achieves state-of-the-art results on instance segmentation and box detection benchmarks including LVIS, COCO, and the new SA-Co dataset.
Performance Highlights
SAM 3 demonstrates exceptional performance across multiple benchmarks:

- SA-Co/Gold (Instance Segmentation): 54.1 cgF1 (vs. 72.8 human performance)
- LVIS (Instance Segmentation): 48.5 AP
- COCO (Box Detection): 56.4 AP
- SA-V Video Test: 58.0 pHOTA
Common Use Cases
Image Segmentation
Segment objects in images using text descriptions or visual prompts for content analysis and editing.
Video Object Tracking
Track specific objects across video frames for surveillance, sports analysis, or content creation.
Interactive Annotation
Create high-quality annotations with point and box prompts for dataset creation.
Visual Search
Find all instances of specific concepts in large image or video collections.
Get Started
Installation
Install SAM 3 and set up your environment
Quick Start
Run your first segmentation in minutes
Guides
Explore guides for image and video inference
Architecture Overview
SAM 3 consists of three main components:

- Shared Vision Encoder: Extracts visual features from images or video frames
- Detector: DETR-based model conditioned on text, geometry, and image exemplars
- Tracker: Inherits SAM 2 transformer encoder-decoder architecture for video segmentation
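The decoupled flow above can be sketched as a toy pipeline: the encoder runs once per frame, the detector finds concept instances independently of track state, and the tracker only links detections over time. All function bodies are stubs for illustration; the real components are a vision transformer encoder, a DETR-style detector, and a SAM 2-style memory tracker.

```python
# Toy sketch of the decoupled detector-tracker flow (illustrative only).

def encode(frame):
    """Shared vision encoder: one feature map reused by detector and tracker."""
    return {"frame": frame, "features": f"feat({frame})"}

def detect(encoded, concept):
    """Detector: finds all instances of the concept in a single frame."""
    # Stub: pretend every frame contains two instances of the concept.
    return [f"{concept}#{i}@{encoded['frame']}" for i in range(2)]

def track(prev_tracks, detections):
    """Tracker: links per-frame detections into temporally consistent tracks."""
    if not prev_tracks:
        return [[d] for d in detections]  # start one track per detection
    return [t + [d] for t, d in zip(prev_tracks, detections)]  # naive 1:1 association

tracks = []
for frame in ["f0", "f1", "f2"]:
    enc = encode(frame)        # encoder runs once per frame
    dets = detect(enc, "dog")  # detection does not depend on track state
    tracks = track(tracks, dets)

print(len(tracks))     # 2 tracks
print(len(tracks[0]))  # each track spans 3 frames
```

Because detection and tracking only communicate through the detections list, errors in one task do not feed back into the other, which is the interference the decoupled design avoids.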
Next Steps
Ready to get started? Follow our installation guide to set up SAM 3, then try the quick start tutorial to run your first segmentation.

Installation Guide
Install SAM 3 and configure your environment