SAM 3 provides comprehensive evaluation benchmarks for promptable concept segmentation (PCS) in both images and videos. The evaluation suite consists of the SA-Co dataset, which includes two image benchmarks (Gold and Silver) and one video benchmark (VEval).

SA-Co Dataset

The SA-Co dataset is designed to evaluate open-vocabulary segmentation capabilities with text prompts. It contains images and videos paired with noun phrases (NPs), each exhaustively annotated with instance masks for all objects matching the phrase.
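As a concrete illustration, one datapoint pairs a noun phrase with every instance mask that matches it, and a phrase with no matching objects serves as a negative prompt. The field names below are hypothetical, chosen for illustration only, and are not the released schema:

```python
# Hypothetical annotation record for one (image, noun phrase) pair.
# An exhaustive annotation lists a mask for EVERY instance matching
# the phrase; a negative prompt has an empty instance list.
record = {
    "image_id": "img_00042",
    "noun_phrase": "red umbrella",
    "instances": [
        {"mask_rle": "...", "bbox": [34, 50, 120, 210]},
        {"mask_rle": "...", "bbox": [300, 41, 90, 180]},
    ],
}

negative_record = {
    "image_id": "img_00042",
    "noun_phrase": "zebra",
    "instances": [],  # concept absent: the model should predict nothing
}

def is_negative(rec):
    """A prompt is negative when no instance in the image matches the phrase."""
    return len(rec["instances"]) == 0

print(is_negative(record), is_negative(negative_record))
```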

Dataset Components

SA-Co/Gold

High-quality image benchmark with 3 independent annotations per datapoint

SA-Co/Silver

Diverse image benchmark spanning 10 different domains

SA-Co/VEval

Video benchmark with 3 domains for temporal segmentation

Benchmark Scale

SAM 3 achieves 75-80% of human performance on the SA-Co benchmark, which contains:
  • 270,000+ unique concepts - over 50 times more than existing benchmarks
  • 4 million+ annotated concepts in the training data
  • Multiple annotation domains covering diverse visual scenarios

Evaluation Metrics

Primary Metric: cgF1

The official metric for all SA-Co benchmarks is cgF1 (concept-grounded F1 score). This metric evaluates:
  • Detection accuracy - correctly identifying object instances
  • Segmentation quality - mask precision at the instance level
  • Negative prompt handling - correctly rejecting non-existent objects
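To make instance-level scoring concrete, here is a minimal sketch: greedily match predicted masks to ground-truth masks at an IoU threshold, then compute F1 over the matches. Masks are simplified to sets of pixel indices; this is an illustrative simplification, not the official cgF1 implementation.

```python
def mask_iou(a, b):
    """IoU between two masks, represented here as sets of pixel indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def instance_f1(preds, gts, iou_thresh=0.5):
    """Greedy one-to-one matching of predictions to ground truth, then F1.

    preds, gts: lists of pixel-index sets (stand-ins for instance masks).
    """
    unmatched = list(range(len(gts)))
    tp = 0
    for p in preds:
        best_j, best_iou = None, iou_thresh
        for j in unmatched:
            iou = mask_iou(p, gts[j])
            if iou >= best_iou:
                best_j, best_iou = j, iou
        if best_j is not None:
            unmatched.remove(best_j)
            tp += 1
    fp = len(preds) - tp  # hallucinated instances
    fn = len(gts) - tp    # missed instances
    denom = 2 * tp + fp + fn
    # Predicting nothing on a negative (empty) prompt counts as correct.
    return 2 * tp / denom if denom else 1.0

# Two GT instances; the model finds one accurately and hallucinates one.
gt = [{1, 2, 3, 4}, {10, 11, 12}]
pred = [{1, 2, 3}, {50, 51}]
print(instance_f1(pred, gt))  # 0.5
```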

Additional Metrics

  • IL_MCC - Instance-level Matthews Correlation Coefficient
  • positive_micro_F1 / pmF1 - F1 score computed only on positive (present) prompts
  • pHOTA - Promptable Higher Order Tracking Accuracy (video)
  • AP - Average Precision (for comparison with standard benchmarks)
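IL_MCC is the standard Matthews correlation coefficient applied to instance-level confusion counts (the SAM 3 paper reportedly combines IL_MCC with pmF1 to form cgF1). The textbook formula, as a hedged sketch:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient: +1 perfect, 0 chance-level, -1 inverted.

    Unlike F1, MCC also rewards true negatives, which makes it suitable for
    scoring negative (absent-concept) prompts.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

print(round(mcc(tp=90, tn=85, fp=10, fn=15), 3))
```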

Benchmark Results

Image Segmentation Performance

On the SA-Co/Gold benchmark, SAM 3 compares to human annotators and prior models as follows:

Model         SA-Co/Gold cgF1    LVIS AP    LVIS cgF1
Human         72.8               -          -
SAM 3         54.1               48.5       37.2
DINO-X        21.3               38.5       -
OWLv2*        24.6               43.4       29.3
Gemini 2.5    13.0               -          13.4

Video Segmentation Performance

On the SA-Co/VEval benchmarks:
Dataset               Human cgF1    SAM 3 cgF1    SAM 3 pHOTA
SA-V test             53.1          30.3          58.0
YT-Temporal-1B test   71.2          50.8          69.9
SmartGlasses test     58.5          36.4          63.6
SAM 3 achieves approximately 74-75% of human performance across the SA-Co benchmarks, representing a significant advancement in open-vocabulary segmentation capabilities.
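As a sanity check, the headline ratio can be recomputed from the Gold numbers above (the helper name is ours, not part of the benchmark tooling):

```python
def fraction_of_human(model_score, human_score):
    """Express a model's benchmark score as a fraction of the human score."""
    return model_score / human_score

# SA-Co/Gold cgF1: SAM 3 scores 54.1 against a human score of 72.8
print(f"{fraction_of_human(54.1, 72.8):.1%}")  # ~74.3% of human performance
```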

Download Locations

All SA-Co datasets are available from two hosting platforms:

Hugging Face

Roboflow

Next Steps

Run Evaluations

Learn how to run evaluations on your own predictions

Dataset Details

Explore the detailed structure of each benchmark
