Overview
The benchmark contains images paired with noun phrases (NPs), each exhaustively annotated with masks for all object instances matching the phrase. SA-Co/Silver comprises 10 subsets covering diverse visual domains.
Dataset Composition
10 Annotation Domains
BDD100k - Driving Scenes
Urban driving scenarios from the Berkeley DeepDrive (BDD100K) dataset
- 5,546 image-NP pairs
- 13,210 image-NP-masks
- Domain: Autonomous driving
DROID - Robotics
Robot manipulation scenarios from diverse environments
- 9,445 image-NP pairs
- 11,098 image-NP-masks
- Domain: Robotics and manipulation
Ego4D - Egocentric Video
First-person perspective frames from daily activities
- 12,608 image-NP pairs
- 24,049 image-NP-masks
- Domain: Egocentric vision
MyFoodRepo-273 - Food Recognition
Food dishes and ingredients
- 20,985 image-NP pairs
- 28,347 image-NP-masks
- Domain: Food recognition
GeoDE - Geographic Diversity
Images from geographically diverse locations worldwide
- 14,850 image-NP pairs
- 7,570 image-NP-masks
- Domain: Geographic diversity
iNaturalist-2017 - Wildlife
Natural world observations of plants and animals
- 1,439,051 image-NP pairs
- 48,899 image-NP-masks
- Domain: Biodiversity and nature
National Gallery of Art - Art
Artworks from the National Gallery of Art collection
- 22,294 image-NP pairs
- 18,991 image-NP-masks
- Domain: Art and cultural heritage
SA-V - General Video
Diverse video frames from the Segment Anything Video (SA-V) dataset
- 18,337 image-NP pairs
- 39,683 image-NP-masks
- Domain: General video understanding
YT-Temporal-1B - YouTube
Frames from YouTube videos across various categories
- 7,816 image-NP pairs
- 12,221 image-NP-masks
- Domain: Web video
Fathomnet - Underwater
Marine life and underwater environments
- 287,193 image-NP pairs
- 14,174 image-NP-masks
- Domain: Marine biology
Statistics Table
| Domain | # Image-NPs | # Image-NP-Masks |
|---|---|---|
| BDD100k | 5,546 | 13,210 |
| DROID | 9,445 | 11,098 |
| Ego4D | 12,608 | 24,049 |
| MyFoodRepo-273 | 20,985 | 28,347 |
| GeoDE | 14,850 | 7,570 |
| iNaturalist-2017 | 1,439,051 | 48,899 |
| National Gallery of Art | 22,294 | 18,991 |
| SA-V | 18,337 | 39,683 |
| YT-Temporal-1B | 7,816 | 12,221 |
| Fathomnet | 287,193 | 14,174 |
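The counts in the table differ by orders of magnitude across domains, so it can help to total them and look at the ratio of masks to image-NP pairs. The sketch below only re-uses the numbers from the table; the names `STATS` and `summarize` are made up for illustration. Where a domain has fewer masks than pairs (for example GeoDE, iNaturalist-2017 and Fathomnet), at least `pairs - masks` of its image-NP pairs must carry no mask at all.

```python
# Per-domain counts copied from the statistics table:
# (number of image-NP pairs, number of image-NP-masks).
STATS = {
    "BDD100k": (5_546, 13_210),
    "DROID": (9_445, 11_098),
    "Ego4D": (12_608, 24_049),
    "MyFoodRepo-273": (20_985, 28_347),
    "GeoDE": (14_850, 7_570),
    "iNaturalist-2017": (1_439_051, 48_899),
    "National Gallery of Art": (22_294, 18_991),
    "SA-V": (18_337, 39_683),
    "YT-Temporal-1B": (7_816, 12_221),
    "Fathomnet": (287_193, 14_174),
}


def summarize(stats):
    """Return total pairs, total masks, and the masks-per-pair ratio per domain."""
    total_pairs = sum(pairs for pairs, _ in stats.values())
    total_masks = sum(masks for _, masks in stats.values())
    masks_per_pair = {name: masks / pairs for name, (pairs, masks) in stats.items()}
    return total_pairs, total_masks, masks_per_pair


total_pairs, total_masks, masks_per_pair = summarize(STATS)
print(f"{total_pairs:,} image-NP pairs, {total_masks:,} image-NP-masks")

# List domains from the sparsest to the densest annotation ratio.
for name, ratio in sorted(masks_per_pair.items(), key=lambda kv: kv[1]):
    print(f"{name:<24} {ratio:.3f} masks per pair")
```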
Download Dataset
Annotations
Download the ground-truth (GT) annotations from:
- Hugging Face: facebook/SACo-Silver
- Roboflow: sa-co-silver
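For the Hugging Face source, one possible way to fetch the files programmatically is `snapshot_download` from the `huggingface_hub` package. This is a sketch under assumptions: that `facebook/SACo-Silver` is hosted as a dataset repository, that `huggingface_hub` is installed, and that you have already authenticated if the repository requires it. The function name `download_silver_annotations` and the default target directory are hypothetical.

```python
REPO_ID = "facebook/SACo-Silver"  # repository name as given on this page


def download_silver_annotations(local_dir="saco_silver_annotations"):
    """Download a snapshot of the annotation repository and return its local path.

    Assumes the repository is of type "dataset"; adjust repo_type if it is not.
    """
    # Imported lazily so this module can be loaded without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",
        local_dir=local_dir,
    )
```

Calling `download_silver_annotations()` performs the network download and returns the directory that contains the fetched files.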
Images and Frames
Each domain has different download instructions:
- Image Datasets
- Frame Datasets
Annotation Format
The annotation format is identical to that of SA-Co/Gold and is derived from the COCO format.
Example from DROID Domain
Images
Annotations
For detailed field descriptions, see the SA-Co/Gold annotation format, which is identical.
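Because the format derives from COCO, a file can be expected to hold an `images` list and an `annotations` list linked through `image_id`. The following minimal sketch groups annotations per entry on a toy in-memory example rather than real data. The top-level keys, `id`, `image_id` and `bbox` follow the standard COCO layout; the `text_input` key holding the noun phrase and the helper name `group_annotations` are assumptions made for illustration, so check the SA-Co/Gold field descriptions for the actual names.

```python
from collections import defaultdict

# Toy data in a COCO-like layout. Each "images" entry stands for one
# image-NP pair; "text_input" is a hypothetical key for the noun phrase.
toy = {
    "images": [
        {"id": 1, "file_name": "frame_0001.jpg", "text_input": "mug"},
        {"id": 2, "file_name": "frame_0001.jpg", "text_input": "screwdriver"},
    ],
    "annotations": [
        {"id": 10, "image_id": 1, "bbox": [12.0, 30.0, 40.0, 55.0]},
        {"id": 11, "image_id": 1, "bbox": [80.0, 28.0, 38.0, 50.0]},
    ],
}


def group_annotations(coco):
    """Map every image entry id to the list of annotations that reference it."""
    by_image = defaultdict(list)
    for ann in coco["annotations"]:
        by_image[ann["image_id"]].append(ann)
    # Entries without any annotation still appear, with an empty list.
    return {img["id"]: by_image.get(img["id"], []) for img in coco["images"]}


grouped = group_annotations(toy)
# Entries with no matching instance (no masks) for the queried phrase.
negatives = [image_id for image_id, anns in grouped.items() if not anns]
print({image_id: len(anns) for image_id, anns in grouped.items()}, negatives)
```

Keeping entries with an empty annotation list matters here, since the statistics table shows that several domains have fewer masks than image-NP pairs.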
Benchmark Results
Overall Performance
| Model | Average cgF1 | IL_MCC | pmF1 |
|---|---|---|---|
| SAM 3 | 49.57 | 0.76 | 65.17 |
| OWLv2* | 11.23 | 0.32 | 31.18 |
| Gemini 2.5 | 9.67 | 0.19 | 45.51 |
| OWLv2 | 8.18 | 0.23 | 32.55 |
| LLMDet-L | 6.73 | 0.17 | 28.19 |
| gDino-T | 3.09 | 0.12 | 19.75 |
Per-Domain Results (SAM 3)
| Domain | cgF1 | IL_MCC | pmF1 |
|---|---|---|---|
| iNaturalist | 70.07 | 0.89 | 78.73 |
| National Gallery of Art | 65.80 | 0.82 | 80.67 |
| MyFoodRepo-273 | 52.96 | 0.79 | 67.21 |
| Fathomnet | 51.53 | 0.86 | 59.98 |
| BDD100k | 46.61 | 0.78 | 60.13 |
| DROID | 45.58 | 0.76 | 60.35 |
| GeoDE | 44.36 | 0.67 | 66.05 |
| YT-Temporal-1B | 42.07 | 0.72 | 58.36 |
| Ego4D | 38.64 | 0.62 | 62.56 |
| SA-V | 38.06 | 0.66 | 57.62 |
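The SAM 3 row in the overall table can be reproduced from the per-domain numbers, assuming the overall figures are unweighted (macro) averages over the ten domains. The sketch below checks that assumption against the values on this page; `PER_DOMAIN`, `REPORTED` and `macro_average` are illustrative names.

```python
from statistics import mean

# SAM 3 per-domain results from the table above: (cgF1, IL_MCC, pmF1).
PER_DOMAIN = {
    "iNaturalist": (70.07, 0.89, 78.73),
    "National Gallery of Art": (65.80, 0.82, 80.67),
    "MyFoodRepo-273 (Food Recognition)": (52.96, 0.79, 67.21),
    "Fathomnet": (51.53, 0.86, 59.98),
    "BDD100k": (46.61, 0.78, 60.13),
    "DROID": (45.58, 0.76, 60.35),
    "YT-Temporal-1B": (42.07, 0.72, 58.36),
    "Ego4D": (38.64, 0.62, 62.56),
    "SA-V": (38.06, 0.66, 57.62),
    "GeoDE": (44.36, 0.67, 66.05),
}

# SAM 3 row of the overall table: (Average cgF1, IL_MCC, pmF1).
REPORTED = (49.57, 0.76, 65.17)


def macro_average(rows):
    """Unweighted mean of each metric column, rounded to two decimals."""
    columns = zip(*rows.values())
    return tuple(round(mean(column), 2) for column in columns)


averages = macro_average(PER_DOMAIN)
print("macro averages:", averages)
print("matches overall row:", averages == REPORTED)
```

Because the average is unweighted, small domains such as BDD100k count as much as iNaturalist-2017 despite the large difference in the number of image-NP pairs.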
Visualization
View examples from the dataset:
Offline Evaluation
If you have predictions in COCO result format:
Next Steps
Run Evaluations
Learn how to evaluate SAM 3 on SA-Co/Silver
SA-Co/VEval
Explore the video benchmark