Overview
The benchmark contains images paired with noun phrases (NPs), each annotated exhaustively with masks for all object instances matching the phrase. SA-Co/Gold comprises 7 specialized subsets, each targeting a different annotation domain.

Dataset Composition
7 Annotation Domains
MetaCLIP Captioner NPs
Natural language descriptions from MetaCLIP’s captioning model
- 33,393 image-NP pairs
- 20,144 image-NP-masks
- Source: MetaCLIP dataset
SA-1B Captioner NPs
Natural language descriptions from SA-1B’s captioning model
- 13,258 image-NP pairs
- 30,306 image-NP-masks
- Source: SA-1B dataset
Attributes
Object descriptions with specific attributes (color, size, state)
- 9,245 image-NP pairs
- 3,663 image-NP-masks
- Source: MetaCLIP dataset
Crowded Scenes
Dense scenes with multiple overlapping instances
- 20,687 image-NP pairs
- 50,417 image-NP-masks
- Source: MetaCLIP dataset
Wiki-Common1K
Common everyday objects from Wikipedia
- 65,502 image-NP pairs
- 6,448 image-NP-masks
- Source: MetaCLIP dataset
Wiki-Food/Drink
Food and beverage items from Wikipedia
- 13,951 image-NP pairs
- 9,825 image-NP-masks
- Source: MetaCLIP dataset
Wiki-Sports Equipment
Sports-related objects from Wikipedia
- 12,166 image-NP pairs
- 5,075 image-NP-masks
- Source: MetaCLIP dataset
Total Statistics
| Domain | Media | # Image-NPs | # Image-NP-Masks |
|---|---|---|---|
| MetaCLIP captioner NPs | MetaCLIP | 33,393 | 20,144 |
| SA-1B captioner NPs | SA-1B | 13,258 | 30,306 |
| Attributes | MetaCLIP | 9,245 | 3,663 |
| Crowded Scenes | MetaCLIP | 20,687 | 50,417 |
| Wiki-Common1K | MetaCLIP | 65,502 | 6,448 |
| Wiki-Food&Drink | MetaCLIP | 13,951 | 9,825 |
| Wiki-Sports Equipment | MetaCLIP | 12,166 | 5,075 |
| Total | - | 168,202 | 125,878 |
Image-NPs include both positive (object present) and negative (object absent) prompts. Image-NP-Masks count only the positive annotations with actual instance masks.
Download Dataset
Annotations
Download GT annotations from:
- Hugging Face: facebook/SACo-Gold
- Roboflow: sa-co-gold
Images
There are two image sources:

1. MetaCLIP Images (6 out of 7 subsets)
Used in: MetaCLIP captioner NPs, Attributes, Crowded Scenes, Wiki-Common1K, Wiki-Food/Drink, Wiki-Sports Equipment

2. SA-1B Images (1 out of 7 subsets)
Used in: SA-1B captioner NPs

Annotation Format
The annotation format is derived from the COCO format, with extensions for open-vocabulary segmentation.

Key Fields
Images
- id: Unique identifier for the image-NP pair
- text_input: The noun phrase prompt
- file_name: Relative image path
- is_instance_exhaustive: Boolean (1 = all instances correctly annotated)
- is_pixel_exhaustive: Boolean (1 = all pixels covered, allows crowd segments)
Annotations
- bbox: Bounding box in [x, y, w, h] format, normalized by image dimensions
- segmentation: Mask in RLE (Run-Length Encoding) format
- iscrowd: Boolean (1 = segment covers multiple instances)
Positive vs Negative NPs: An image-NP pair is “positive” if it has corresponding annotations (object exists), and “negative” if it has no annotations (object doesn’t exist). Both are used in evaluation to test false positive rejection.
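As an illustration, a minimal record with the fields above might look like the sketch below. The concrete values are made up, and the tiny decoder handles only uncompressed COCO-style RLE (alternating run lengths of 0s and 1s in column-major order); official tooling typically uses pycocotools for compressed RLE.

```python
# Hypothetical SA-Co/Gold-style record; field names from this page, values invented.
record = {
    "images": [{
        "id": 0,
        "text_input": "red apple",            # the noun-phrase prompt
        "file_name": "images/000000.jpg",
        "is_instance_exhaustive": 1,
        "is_pixel_exhaustive": 1,
    }],
    "annotations": [{
        "image_id": 0,
        "bbox": [0.25, 0.25, 0.5, 0.5],       # [x, y, w, h], normalized
        "segmentation": {"size": [4, 4], "counts": [5, 6, 5]},
        "iscrowd": 0,
    }],
}

def decode_rle(rle):
    """Decode uncompressed COCO-style RLE into a nested-list binary mask (H x W)."""
    h, w = rle["size"]
    flat, val = [], 0
    for run in rle["counts"]:
        flat.extend([val] * run)  # runs alternate between 0 and 1
        val = 1 - val
    # Column-major layout: pixel (row i, col j) lives at flat[j*h + i]
    return [[flat[j * h + i] for j in range(w)] for i in range(h)]

# A positive NP has at least one annotation; a negative NP has none.
positives = {a["image_id"] for a in record["annotations"]}
is_positive = record["images"][0]["id"] in positives
```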
Benchmark Results
Overall Performance
| Model | Average cgF1 | IL_MCC | positive_micro_F1 |
|---|---|---|---|
| Human | 72.8 | - | - |
| SAM 3 | 54.06 | 0.82 | 66.11 |
| OWLv2* | 24.59 | 0.57 | 42.0 |
| DINO-X | 21.26 | 0.38 | 55.2 |
| OWLv2 | 17.27 | 0.46 | 36.8 |
| APE | 16.41 | 0.40 | 36.9 |
| Gemini 2.5 | 13.03 | 0.29 | 46.1 |
| LLMDet-L | 6.50 | 0.21 | 27.3 |
| gDino-T | 3.25 | 0.15 | 16.2 |
OWLv2* was partially trained on LVIS, giving it an advantage on standard benchmarks but not on SA-Co/Gold’s novel concepts.
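The columns are related: IL_MCC scores the image-level decision of whether the phrase is present at all (rewarding rejection of negative prompts), while positive_micro_F1 scores localization quality on positive prompts only. A minimal sketch, assuming the composition described in the SAM 3 paper, where cgF1 is roughly IL_MCC × positive_micro_F1 (with the F1 expressed in percent):

```python
import math

def il_mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient over image-level presence decisions
    (tp = positive NP detected, tn = negative NP correctly rejected, etc.)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def cg_f1(mcc: float, positive_micro_f1: float) -> float:
    """Classification-gated F1: mask F1 on positives, gated by the
    image-level presence decision (assumed composition, not official code)."""
    return mcc * positive_micro_f1

# Perfect presence classification gives MCC = 1, so cgF1 reduces to the mask F1:
print(cg_f1(il_mcc(tp=50, tn=50, fp=0, fn=0), 66.11))  # -> 66.11
```

Plugging in the SAM 3 row (0.82 × 66.11 ≈ 54.2) roughly reproduces the reported 54.06; small gaps are expected from per-domain aggregation and rounding.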
Per-Domain Results (SAM 3)
| Domain | cgF1 | IL_MCC | positive_micro_F1 |
|---|---|---|---|
| Wiki-Sports Equipment | 65.52 | 0.89 | 73.75 |
| Crowded Scenes | 61.08 | 0.90 | 67.73 |
| Wiki-Common1K | 54.93 | 0.76 | 72.00 |
| SA-1B captioner NPs | 53.69 | 0.86 | 62.55 |
| Wiki-Food/Drink | 53.41 | 0.79 | 67.28 |
| MetaCLIP captioner NPs | 47.26 | 0.81 | 58.58 |
| Attributes | 42.53 | 0.70 | 60.85 |
Visualization
View examples from the dataset.

Triple Annotation Design
Each datapoint has 3 independent annotations (a, b, c) to measure human agreement:
- Annotators may disagree on precise mask borders
- Annotators may disagree on the number of instances
- Annotators may disagree on whether the phrase exists in the image
- Dashed borders indicate special “group masks” covering multiple instances when separation is too difficult
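One simple way to quantify the agreement these triple annotations enable is mean pairwise IoU between the three annotators' merged masks. The sketch below uses flattened binary masks and is a hypothetical illustration, not the benchmark's official agreement measure:

```python
from itertools import combinations

def iou(m1, m2):
    """IoU of two flattened binary masks of equal length."""
    inter = sum(p and q for p, q in zip(m1, m2))
    union = sum(p or q for p, q in zip(m1, m2))
    # Both empty: annotators agree the phrase is absent from the image.
    return inter / union if union else 1.0

def mean_pairwise_iou(masks):
    """Average IoU over all annotator pairs, e.g. (a,b), (a,c), (b,c)."""
    pairs = list(combinations(masks, 2))
    return sum(iou(m1, m2) for m1, m2 in pairs) / len(pairs)

# Toy flattened masks from annotators a, b, c over the same image:
a = [1, 1, 0, 0]
b = [1, 1, 1, 0]
c = [1, 0, 0, 0]
print(mean_pairwise_iou([a, b, c]))  # -> 0.5
```

A low score on a datapoint can come from any of the disagreement sources listed above: border placement, instance count, or presence of the phrase itself.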
Next Steps
Run Evaluations
Learn how to evaluate SAM 3 on SA-Co/Gold
SA-Co/Silver
Explore the diverse Silver benchmark