Overview
The benchmark contains images paired with noun phrases (NPs), each annotated exhaustively with masks for all object instances matching the phrase. SA-Co/Gold comprises 7 specialized subsets, each targeting a different annotation domain.

Dataset Composition
7 Annotation Domains
MetaCLIP Captioner NPs
Natural language descriptions from MetaCLIP’s captioning model
- 33,393 image-NP pairs
- 20,144 image-NP-masks
- Source: MetaCLIP dataset
SA-1B Captioner NPs
Natural language descriptions from SA-1B’s captioning model
- 13,258 image-NP pairs
- 30,306 image-NP-masks
- Source: SA-1B dataset
Attributes
Object descriptions with specific attributes (color, size, state)
- 9,245 image-NP pairs
- 3,663 image-NP-masks
- Source: MetaCLIP dataset
Crowded Scenes
Dense scenes with multiple overlapping instances
- 20,687 image-NP pairs
- 50,417 image-NP-masks
- Source: MetaCLIP dataset
Wiki-Common1K
Common everyday objects from Wikipedia
- 65,502 image-NP pairs
- 6,448 image-NP-masks
- Source: MetaCLIP dataset
Wiki-Food/Drink
Food and beverage items from Wikipedia
- 13,951 image-NP pairs
- 9,825 image-NP-masks
- Source: MetaCLIP dataset
Wiki-Sports Equipment
Sports-related objects from Wikipedia
- 12,166 image-NP pairs
- 5,075 image-NP-masks
- Source: MetaCLIP dataset
Total Statistics
| Domain | Media | # Image-NPs | # Image-NP-Masks |
|---|---|---|---|
| MetaCLIP captioner NPs | MetaCLIP | 33,393 | 20,144 |
| SA-1B captioner NPs | SA-1B | 13,258 | 30,306 |
| Attributes | MetaCLIP | 9,245 | 3,663 |
| Crowded Scenes | MetaCLIP | 20,687 | 50,417 |
| Wiki-Common1K | MetaCLIP | 65,502 | 6,448 |
| Wiki-Food&Drink | MetaCLIP | 13,951 | 9,825 |
| Wiki-Sports Equipment | MetaCLIP | 12,166 | 5,075 |
| Total | - | 168,202 | 125,878 |
Image-NPs include both positive (object present) and negative (object absent) prompts. Image-NP-Masks count only the positive annotations with actual instance masks.
Download Dataset
Annotations
Download GT annotations from:
- Hugging Face: facebook/SACo-Gold
- Roboflow: sa-co-gold
Images
There are two image sources:

1. MetaCLIP Images (6 out of 7 subsets)
Used in: MetaCLIP captioner NPs, Attributes, Crowded Scenes, Wiki-Common1K, Wiki-Food/Drink, Wiki-Sports Equipment

2. SA-1B Images (1 out of 7 subsets)
Used in: SA-1B captioner NPs

Annotation Format
The annotation format is derived from the COCO format, with extensions for open-vocabulary segmentation.

Key Fields
Images
- id: Unique identifier for the image-NP pair
- text_input: The noun phrase prompt
- file_name: Relative image path
- is_instance_exhaustive: Boolean (1 = all instances correctly annotated)
- is_pixel_exhaustive: Boolean (1 = all pixels covered, allows crowd segments)
Annotations
- bbox: Bounding box in [x, y, w, h] format, normalized by image dimensions
- segmentation: Mask in RLE (Run-Length Encoding) format
- iscrowd: Boolean (1 = segment covers multiple instances)
Positive vs Negative NPs: An image-NP pair is “positive” if it has corresponding annotations (object exists), and “negative” if it has no annotations (object doesn’t exist). Both are used in evaluation to test false positive rejection.
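As an illustration, a minimal record with the fields above might look like the sketch below. The concrete values are made up, and the tiny decoder handles only uncompressed COCO-style RLE (alternating run lengths of 0s and 1s in column-major order); official tooling typically uses pycocotools for compressed RLE.

```python
# Hypothetical SA-Co/Gold-style record; field names from this page, values invented.
record = {
    "images": [{
        "id": 0,
        "text_input": "red apple",            # the noun-phrase prompt
        "file_name": "images/000000.jpg",
        "is_instance_exhaustive": 1,
        "is_pixel_exhaustive": 1,
    }],
    "annotations": [{
        "image_id": 0,
        "bbox": [0.25, 0.25, 0.5, 0.5],       # [x, y, w, h], normalized
        "segmentation": {"size": [4, 4], "counts": [5, 6, 5]},
        "iscrowd": 0,
    }],
}

def decode_rle(rle):
    """Decode uncompressed COCO-style RLE into a nested-list binary mask (H x W)."""
    h, w = rle["size"]
    flat, val = [], 0
    for run in rle["counts"]:
        flat.extend([val] * run)  # runs alternate between 0 and 1
        val = 1 - val
    # Column-major layout: pixel (row i, col j) lives at flat[j*h + i]
    return [[flat[j * h + i] for j in range(w)] for i in range(h)]

# A positive NP has at least one annotation; a negative NP has none.
positives = {a["image_id"] for a in record["annotations"]}
is_positive = record["images"][0]["id"] in positives
```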
Benchmark Results
Overall Performance
| Model | Average cgF1 | IL_MCC | positive_micro_F1 |
|---|---|---|---|
| Human | 72.8 | - | - |
| SAM 3 | 54.06 | 0.82 | 66.11 |
| OWLv2* | 24.59 | 0.57 | 42.0 |
| DINO-X | 21.26 | 0.38 | 55.2 |
| OWLv2 | 17.27 | 0.46 | 36.8 |
| APE | 16.41 | 0.40 | 36.9 |
| Gemini 2.5 | 13.03 | 0.29 | 46.1 |
| LLMDet-L | 6.50 | 0.21 | 27.3 |
| gDino-T | 3.25 | 0.15 | 16.2 |
OWLv2* was partially trained on LVIS, giving it an advantage on standard benchmarks but not on SA-Co/Gold’s novel concepts.
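The columns are related: IL_MCC scores the image-level decision of whether the phrase is present at all (rewarding rejection of negative prompts), while positive_micro_F1 scores localization quality on positive prompts only. A minimal sketch, assuming the composition described in the SAM 3 paper, where cgF1 is roughly IL_MCC × positive_micro_F1 (with the F1 expressed in percent):

```python
import math

def il_mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient over image-level presence decisions
    (tp = positive NP detected, tn = negative NP correctly rejected, etc.)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def cg_f1(mcc: float, positive_micro_f1: float) -> float:
    """Classification-gated F1: mask F1 on positives, gated by the
    image-level presence decision (assumed composition, not official code)."""
    return mcc * positive_micro_f1

# Perfect presence classification gives MCC = 1, so cgF1 reduces to the mask F1:
print(cg_f1(il_mcc(tp=50, tn=50, fp=0, fn=0), 66.11))  # -> 66.11
```

Plugging in the SAM 3 row (0.82 × 66.11 ≈ 54.2) roughly reproduces the reported 54.06; small gaps are expected from per-domain aggregation and rounding.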
Per-Domain Results (SAM 3)
| Domain | cgF1 | IL_MCC | positive_micro_F1 |
|---|---|---|---|
| Wiki-Sports Equipment | 65.52 | 0.89 | 73.75 |
| Crowded Scenes | 61.08 | 0.90 | 67.73 |
| Wiki-Common1K | 54.93 | 0.76 | 72.00 |
| SA-1B captioner NPs | 53.69 | 0.86 | 62.55 |
| Wiki-Food/Drink | 53.41 | 0.79 | 67.28 |
| MetaCLIP captioner NPs | 47.26 | 0.81 | 58.58 |
| Attributes | 42.53 | 0.70 | 60.85 |
Visualization
View examples from the dataset.

Triple Annotation Design
Each datapoint has 3 independent annotations (a, b, c) to measure human agreement:
- Annotators may disagree on precise mask borders
- Annotators may disagree on the number of instances
- Annotators may disagree on whether the phrase exists in the image
- Dashed borders indicate special “group masks” covering multiple instances when separation is too difficult
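One simple way to quantify the agreement these triple annotations enable is mean pairwise IoU between the three annotators' merged masks. The sketch below uses flattened binary masks and is a hypothetical illustration, not the benchmark's official agreement measure:

```python
from itertools import combinations

def iou(m1, m2):
    """IoU of two flattened binary masks of equal length."""
    inter = sum(p and q for p, q in zip(m1, m2))
    union = sum(p or q for p, q in zip(m1, m2))
    # Both empty: annotators agree the phrase is absent from the image.
    return inter / union if union else 1.0

def mean_pairwise_iou(masks):
    """Average IoU over all annotator pairs, e.g. (a,b), (a,c), (b,c)."""
    pairs = list(combinations(masks, 2))
    return sum(iou(m1, m2) for m1, m2 in pairs) / len(pairs)

# Toy flattened masks from annotators a, b, c over the same image:
a = [1, 1, 0, 0]
b = [1, 1, 1, 0]
c = [1, 0, 0, 0]
print(mean_pairwise_iou([a, b, c]))  # -> 0.5
```

A low score on a datapoint can come from any of the disagreement sources listed above: border placement, instance count, or presence of the phrase itself.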
Next Steps
Run Evaluations
Learn how to evaluate SAM 3 on SA-Co/Gold
SA-Co/Silver
Explore the diverse Silver benchmark