SA-Co/Gold is a high-quality benchmark for promptable concept segmentation (PCS) in images. Each datapoint is independently annotated by 3 human annotators, allowing measurement of human agreement and establishing an upper bound for model performance.

Overview

The benchmark contains images paired with noun phrases (NPs), each annotated exhaustively with masks for all object instances matching the phrase. SA-Co/Gold comprises 7 specialized subsets, each targeting a different annotation domain.

Dataset Composition

7 Annotation Domains

MetaCLIP captioner NPs: natural language descriptions from MetaCLIP’s captioning model
  • 33,393 image-NP pairs
  • 20,144 image-NP-masks
  • Source: MetaCLIP dataset
SA-1B captioner NPs: natural language descriptions from SA-1B’s captioning model
  • 13,258 image-NP pairs
  • 30,306 image-NP-masks
  • Source: SA-1B dataset
Attributes: object descriptions with specific attributes (color, size, state)
  • 9,245 image-NP pairs
  • 3,663 image-NP-masks
  • Source: MetaCLIP dataset
Crowded Scenes: dense scenes with multiple overlapping instances
  • 20,687 image-NP pairs
  • 50,417 image-NP-masks
  • Source: MetaCLIP dataset
Wiki-Common1K: common everyday objects from Wikipedia
  • 65,502 image-NP pairs
  • 6,448 image-NP-masks
  • Source: MetaCLIP dataset
Wiki-Food/Drink: food and beverage items from Wikipedia
  • 13,951 image-NP pairs
  • 9,825 image-NP-masks
  • Source: MetaCLIP dataset
Wiki-Sports Equipment: sports-related objects from Wikipedia
  • 12,166 image-NP pairs
  • 5,075 image-NP-masks
  • Source: MetaCLIP dataset

Total Statistics

| Domain | Media | # Image-NPs | # Image-NP-Masks |
|---|---|---|---|
| MetaCLIP captioner NPs | MetaCLIP | 33,393 | 20,144 |
| SA-1B captioner NPs | SA-1B | 13,258 | 30,306 |
| Attributes | MetaCLIP | 9,245 | 3,663 |
| Crowded Scenes | MetaCLIP | 20,687 | 50,417 |
| Wiki-Common1K | MetaCLIP | 65,502 | 6,448 |
| Wiki-Food/Drink | MetaCLIP | 13,951 | 9,825 |
| Wiki-Sports Equipment | MetaCLIP | 12,166 | 5,075 |
Image-NPs include both positive (object present) and negative (object absent) prompts. Image-NP-Masks count only the positive annotations with actual instance masks.

Download Dataset

Annotations

Download GT annotations from:

Images

There are two image sources:

1. MetaCLIP Images (6 out of 7 subsets)

Used in: MetaCLIP captioner NPs, Attributes, Crowded Scenes, Wiki-Common1K, Wiki-Food/Drink, Wiki-Sports Equipment
# Download from Roboflow
wget https://universe.roboflow.com/sa-co-gold/gold-metaclip-merged-a-release-test/

2. SA-1B Images (1 out of 7 subsets)

Used in: SA-1B captioner NPs
# Download from Roboflow
wget https://universe.roboflow.com/sa-co-gold/gold-sa-1b-merged-a-release-test/

# Or download from SA-1B directly
# Access the dynamic link for sa_co_gold.tar from:
# https://ai.meta.com/datasets/segment-anything-downloads/

Annotation Format

The annotation format is derived from COCO format with extensions for open-vocabulary segmentation.

Key Fields

Images

[
  {
    "id": 10000000,
    "file_name": "1/1001/metaclip_1_1001_c122868928880ae52b33fae1.jpeg",
    "text_input": "chili",
    "width": 600,
    "height": 600,
    "queried_category": "0",
    "is_instance_exhaustive": 1,
    "is_pixel_exhaustive": 1
  }
]
  • id: Unique identifier for the image-NP pair
  • text_input: The noun phrase prompt
  • file_name: Relative image path
  • is_instance_exhaustive: Boolean (1 = all instances correctly annotated)
  • is_pixel_exhaustive: Boolean (1 = all pixels covered, allows crowd segments)
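The exhaustiveness flags make it easy to restrict evaluation to fully annotated pairs. A minimal sketch over the "images" list loaded from the GT JSON (the function name is illustrative, not part of any released tooling):

```python
def exhaustive_pairs(images):
    """Keep only image-NP pairs where every matching instance was annotated."""
    return [img for img in images if img.get("is_instance_exhaustive") == 1]
```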

Annotations

[
  {
    "id": 1,
    "image_id": 10000000,
    "area": 0.002477777777777778,
    "bbox": [0.443, 0.0, 0.108, 0.058],
    "segmentation": {
      "counts": "`kk42fb01O1O1O1O001O1O...",
      "size": [600, 600]
    },
    "category_id": 1,
    "iscrowd": 0
  }
]
  • area: Mask area, normalized (fraction of total image pixels)
  • bbox: Bounding box in [x, y, w, h] format, normalized by image dimensions
  • segmentation: Mask in RLE (Run-Length Encoding) format
  • iscrowd: Boolean (1 = segment covers multiple instances)
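Because `bbox` is normalized, it must be rescaled by the image record's `width`/`height` before drawing or comparing in pixel space. A minimal sketch (`bbox_to_pixels` is an illustrative helper); the RLE mask itself can be decoded with `pycocotools`:

```python
def bbox_to_pixels(bbox, width, height):
    """Rescale a normalized [x, y, w, h] box to absolute pixel coordinates."""
    x, y, w, h = bbox
    return [x * width, y * height, w * width, h * height]

# Example values from the annotation record above (a 600x600 image):
print(bbox_to_pixels([0.443, 0.0, 0.108, 0.058], 600, 600))

# The "segmentation" field is COCO compressed RLE; with pycocotools installed:
#   from pycocotools import mask as mask_utils
#   binary_mask = mask_utils.decode(ann["segmentation"])  # (H, W) uint8 array
```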
Positive vs Negative NPs: An image-NP pair is “positive” if it has corresponding annotations (object exists), and “negative” if it has no annotations (object doesn’t exist). Both are used in evaluation to test false positive rejection.
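Separating positives from negatives is then just a matter of checking whether an image-NP pair has any annotations. A minimal sketch over a loaded GT JSON dict (the function name is illustrative):

```python
def split_positive_negative(coco):
    """Split image-NP pairs into positives (object present, has masks)
    and negatives (object absent, no annotations)."""
    annotated = {ann["image_id"] for ann in coco.get("annotations", [])}
    positives = [img for img in coco["images"] if img["id"] in annotated]
    negatives = [img for img in coco["images"] if img["id"] not in annotated]
    return positives, negatives
```

The negatives matter: a model that hallucinates masks for absent concepts is penalized on exactly these pairs.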

Benchmark Results

Overall Performance

| Model | Average cgF1 | IL_MCC | positive_micro_F1 |
|---|---|---|---|
| Human | 72.8 | - | - |
| SAM 3 | 54.06 | 0.82 | 66.11 |
| OWLv2* | 24.59 | 0.57 | 42.0 |
| DINO-X | 21.26 | 0.38 | 55.2 |
| OWLv2 | 17.27 | 0.46 | 36.8 |
| APE | 16.41 | 0.40 | 36.9 |
| Gemini 2.5 | 13.03 | 0.29 | 46.1 |
| LLMDet-L | 6.50 | 0.21 | 27.3 |
| gDino-T | 3.25 | 0.15 | 16.2 |
OWLv2* was partially trained on LVIS, giving it an advantage on standard benchmarks but not on SA-Co/Gold’s novel concepts.
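IL_MCC appears to be a Matthews correlation coefficient computed at the image level, i.e. over binary presence/absence decisions per image-NP pair; that interpretation is an assumption here. As a reference point, the standard MCC formula from confusion counts:

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from binary confusion counts.

    Ranges from -1 (total disagreement) through 0 (chance) to +1 (perfect);
    returns 0.0 when the denominator vanishes, by convention.
    """
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0
```

Unlike accuracy, MCC stays informative when positives and negatives are imbalanced, which is why it is a common choice for presence/absence scoring.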

Per-Domain Results (SAM 3)

| Domain | cgF1 | IL_MCC | positive_micro_F1 |
|---|---|---|---|
| Wiki-Sports Equipment | 65.52 | 0.89 | 73.75 |
| Crowded Scenes | 61.08 | 0.90 | 67.73 |
| Wiki-Common1K | 54.93 | 0.76 | 72.00 |
| SA-1B captioner NPs | 53.69 | 0.86 | 62.55 |
| Wiki-Food/Drink | 53.41 | 0.79 | 67.28 |
| MetaCLIP captioner NPs | 47.26 | 0.81 | 58.58 |
| Attributes | 42.53 | 0.70 | 60.85 |

Visualization

View examples from the dataset:
# See the example notebook
jupyter notebook examples/saco_gold_silver_vis_example.ipynb
Or view GT annotations and predictions side-by-side:
jupyter notebook examples/sam3_data_and_predictions_visualization.ipynb

Triple Annotation Design

Each datapoint has 3 independent annotations (a, b, c) to measure human agreement:
  • Annotators may disagree on precise mask borders
  • Annotators may disagree on the number of instances
  • Annotators may disagree on whether the phrase exists in the image
  • Dashed borders indicate special “group masks” covering multiple instances when separation is too difficult
This triple annotation allows evaluation to pick the most favorable annotation (oracle setting), providing a fairer comparison against the human upper bound.
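The oracle setting can be sketched as scoring a prediction against each of the three annotations and keeping the best match (`oracle_score` and the metric signature are illustrative, not the benchmark's actual evaluation code):

```python
def oracle_score(prediction, annotations, metric):
    """Score a prediction against each human annotation (a, b, c) and
    keep the most favorable one, as in the oracle setting."""
    return max(metric(prediction, gt) for gt in annotations)
```

With a per-pair metric such as mask IoU, the oracle never penalizes the model for cases where the human annotators themselves disagreed.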

Next Steps

Run Evaluations

Learn how to evaluate SAM 3 on SA-Co/Gold

SA-Co/Silver

Explore the diverse Silver benchmark
