SAM 3: Segment Anything with Concepts

SAM 3 is a unified foundation model for promptable segmentation in images and videos. It can detect, segment, and track objects using text or visual prompts such as points, boxes, and masks.

Key Features

Open-Vocabulary Segmentation

Segment any object from a natural-language description. SAM 3 covers over 270K unique concepts and reaches 75-80% of human performance.

Multi-Modal Prompting

Prompt with text, points, boxes, masks, or combinations thereof for precise segmentation control.
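As an illustrative sketch only (not the actual SAM 3 API), the different prompt modalities can be thought of as one combinable structure; every name below is hypothetical:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Prompt:
    """Hypothetical container combining the prompt modalities SAM 3 accepts."""
    text: Optional[str] = None                            # open-vocabulary concept, e.g. "a player in white"
    points: List[Tuple[int, int]] = field(default_factory=list)   # (x, y) clicks
    point_labels: List[int] = field(default_factory=list)         # 1 = foreground, 0 = background
    boxes: List[Tuple[int, int, int, int]] = field(default_factory=list)  # (x1, y1, x2, y2)

    def modalities(self) -> List[str]:
        """List which prompt types are set, e.g. for routing to the right encoder."""
        out = []
        if self.text is not None:
            out.append("text")
        if self.points:
            out.append("points")
        if self.boxes:
            out.append("boxes")
        return out

# Combine a text concept with a refining foreground click.
p = Prompt(text="a player in white", points=[(320, 180)], point_labels=[1])
```

The point here is that modalities compose: a text concept selects what to segment, while points or boxes refine where.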

Video Tracking

Track and segment objects across video frames with temporal consistency and interactive refinement capabilities.
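To build intuition for temporal consistency, here is a toy frame-to-frame association step, greedy IoU matching of previous-frame tracks to current detections. This is a deliberately simplified sketch, not SAM 3's tracker:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def associate(tracks, detections, thresh=0.5):
    """Greedily match track IDs to current-frame detections by best IoU."""
    matches, used = {}, set()
    for tid, tbox in tracks.items():
        best, best_iou = None, thresh
        for i, dbox in enumerate(detections):
            if i in used:
                continue
            score = iou(tbox, dbox)
            if score > best_iou:
                best, best_iou = i, score
        if best is not None:
            matches[tid] = best
            used.add(best)
    return matches
```

A real tracker replaces box IoU with learned appearance and mask memory, but the identity-maintenance loop is the same shape.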

Unified Architecture

An 848M-parameter model with a decoupled detector-tracker design that scales efficiently with data.

What’s New in SAM 3

Compared to its predecessor SAM 2, SAM 3 introduces:
  • Concept-based segmentation: Exhaustively segment all instances of an open-vocabulary concept specified by text or exemplars
  • Presence token: Improved discrimination between closely related prompts (e.g., “a player in white” vs. “a player in red”)
  • Massive concept coverage: Trained on the largest high-quality open-vocabulary segmentation dataset, spanning more than 4 million unique concepts
  • Decoupled architecture: Separate detector and tracker minimize task interference and improve performance
SAM 3 achieves state-of-the-art results on instance segmentation and box detection benchmarks including LVIS, COCO, and the new SA-Co dataset.
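The presence token separates "is this concept in the image at all?" (recognition) from "where is it?" (localization). A minimal sketch of that factorization, using a global presence probability to gate per-box scores; the function name is hypothetical:

```python
def gated_scores(presence_prob, box_probs):
    """Final confidence = global presence probability x per-box localization score.

    A near-zero presence score for a prompt like "a player in red" suppresses
    every box for that prompt, even when individual localization scores are
    high for a visually similar prompt such as "a player in white".
    """
    return [presence_prob * p for p in box_probs]

# Concept absent: confident-looking boxes are suppressed globally.
low = gated_scores(0.05, [0.9, 0.8])
# Concept present: box scores pass through nearly unchanged.
high = gated_scores(0.98, [0.9, 0.8])
```

Gating on a single global signal is what sharpens discrimination between closely related prompts.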

Performance Highlights

SAM 3 demonstrates exceptional performance across multiple benchmarks:
  • SA-Co/Gold (Instance Segmentation): 54.1 cgF1 (vs. 72.8 human performance)
  • LVIS (Instance Segmentation): 48.5 AP
  • COCO (Box Detection): 56.4 AP
  • SA-V Video Test: 58.0 pHOTA

Common Use Cases

Image Segmentation

Segment objects in images using text descriptions or visual prompts for content analysis and editing.

Video Object Tracking

Track specific objects across video frames for surveillance, sports analysis, or content creation.

Interactive Annotation

Create high-quality annotations with point and box prompts for dataset creation.

Visual Search

Find all instances of specific concepts in large image or video collections.

Get Started

Installation

Install SAM 3 and set up your environment

Quick Start

Run your first segmentation in minutes

Guides

Explore guides for image and video inference

Architecture Overview

SAM 3 consists of three main components:
  1. Shared Vision Encoder: Extracts visual features from images or video frames
  2. Detector: DETR-based model conditioned on text, geometry, and image exemplars
  3. Tracker: Inherits SAM 2 transformer encoder-decoder architecture for video segmentation
The decoupled design allows each component to specialize in its task while sharing a common visual representation.
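The three components above can be sketched structurally as follows. These are stand-in classes to show the data flow (one shared encoding feeding two independent heads), not SAM 3's real modules:

```python
class SharedVisionEncoder:
    """Stand-in for the shared backbone: turns a frame into features."""
    def encode(self, frame):
        return {"frame": frame, "features": f"feat({frame})"}

class Detector:
    """Stand-in for the DETR-style detector conditioned on a text prompt."""
    def detect(self, features, text_prompt):
        return [{"concept": text_prompt, "frame": features["frame"]}]

class Tracker:
    """Stand-in for the SAM 2-style tracker that propagates detections."""
    def __init__(self):
        self.memory = []  # per-frame record, mimicking temporal state
    def track(self, features, detections):
        self.memory.append(detections)
        return {"frame": features["frame"], "tracks": detections}

# The encoder runs once per frame; detector and tracker consume the
# same features but never call into each other.
encoder, detector, tracker = SharedVisionEncoder(), Detector(), Tracker()
for frame in ["f0", "f1"]:
    feats = encoder.encode(frame)
    dets = detector.detect(feats, "a player in white")
    tracker.track(feats, dets)
```

Keeping the detector and tracker behind separate interfaces is what lets each specialize without task interference.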

Next Steps

Ready to get started? Follow our installation guide to set up SAM 3, then try the quick start tutorial to run your first segmentation.
