SA-Co/Silver Benchmark

SA-Co/Silver is a large-scale, diverse benchmark for promptable concept segmentation (PCS) in images. It covers 10 domains, from food to robotics to underwater imagery, and, unlike SA-Co/Gold, provides a single ground-truth annotation per datapoint.

Overview

The benchmark contains images paired with noun phrases (NPs), each exhaustively annotated with masks for all object instances matching the phrase. SA-Co/Silver comprises 10 subsets covering diverse visual domains.

Because SA-Co/Silver has only one annotation per datapoint (unlike Gold’s triple annotations), results may slightly underestimate model performance: a single annotation cannot account for the different valid interpretations of a query.
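To make the effect concrete, here is a toy sketch. It assumes, purely for illustration, that scoring against several annotations takes the best match across annotators; the masks, the query, and the `iou` helper are hypothetical and this is not the official evaluation code.

```python
def iou(a: set, b: set) -> float:
    """Intersection-over-union of two masks given as sets of pixel indices."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical query "the cup": one annotator includes the handle, another does not.
annotator_with_handle = set(range(0, 100))
annotator_without_handle = set(range(0, 80))

# A prediction that follows the "without handle" interpretation.
prediction = set(range(0, 80))

# With a single ground truth, the prediction is penalized for a valid reading.
single_gt_score = iou(prediction, annotator_with_handle)

# With several ground truths (assumed best-match scoring), it is not.
multi_gt_score = max(
    iou(prediction, gt)
    for gt in (annotator_with_handle, annotator_without_handle)
)

print(f"single annotation: {single_gt_score:.2f}, best of two: {multi_gt_score:.2f}")
```

In this toy case the same prediction scores lower against the single annotation than against the best of two, which is the kind of gap the note above describes.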

Dataset Composition

10 Annotation Domains

BDD100k - Driving Scenes

Urban driving scenarios from the Berkeley DeepDrive dataset (BDD100K)

5,546 image-NP pairs
13,210 image-NP-masks
Domain: Autonomous driving

DROID - Robotics

Robot manipulation scenarios from diverse environments

9,445 image-NP pairs
11,098 image-NP-masks
Domain: Robotics and manipulation

Ego4D - Egocentric Video

First-person perspective frames from daily activities

12,608 image-NP pairs
24,049 image-NP-masks
Domain: Egocentric vision

MyFoodRepo-273 - Food Recognition

Food dishes and ingredients

20,985 image-NP pairs
28,347 image-NP-masks
Domain: Food recognition

GeoDE - Geographic Diversity

Images from geographically diverse locations worldwide

14,850 image-NP pairs
7,570 image-NP-masks
Domain: Geographic diversity

iNaturalist-2017 - Wildlife

Natural world observations of plants and animals

1,439,051 image-NP pairs
48,899 image-NP-masks
Domain: Biodiversity and nature

National Gallery of Art - Art

Artworks from the National Gallery of Art collection

22,294 image-NP pairs
18,991 image-NP-masks
Domain: Art and cultural heritage

SA-V - General Video

Diverse video frames from the Segment Anything Video (SA-V) dataset

18,337 image-NP pairs
39,683 image-NP-masks
Domain: General video understanding

YT-Temporal-1B - YouTube

Frames from YouTube videos across various categories

7,816 image-NP pairs
12,221 image-NP-masks
Domain: Web video

Fathomnet - Underwater

Marine life and underwater environments

287,193 image-NP pairs
14,174 image-NP-masks
Domain: Marine biology

Statistics Table

Domain	# Image-NPs	# Image-NP-Masks
BDD100k	5,546	13,210
DROID	9,445	11,098
Ego4D	12,608	24,049
MyFoodRepo-273	20,985	28,347
GeoDE	14,850	7,570
iNaturalist-2017	1,439,051	48,899
National Gallery of Art	22,294	18,991
SA-V	18,337	39,683
YT-Temporal-1B	7,816	12,221
Fathomnet	287,193	14,174
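The table can be summarized with a few lines of Python. The sketch below copies the numbers from the table and computes totals and the number of masks per image-NP pair; the variable names are illustrative. A ratio well below 1 (as for iNaturalist-2017 and Fathomnet) would be consistent with many pairs having no matching instance, though that interpretation is an assumption rather than something stated here.

```python
# (image-NP pairs, image-NP-masks) per domain, copied from the table above.
STATS = {
    "BDD100k": (5_546, 13_210),
    "DROID": (9_445, 11_098),
    "Ego4D": (12_608, 24_049),
    "MyFoodRepo-273": (20_985, 28_347),
    "GeoDE": (14_850, 7_570),
    "iNaturalist-2017": (1_439_051, 48_899),
    "National Gallery of Art": (22_294, 18_991),
    "SA-V": (18_337, 39_683),
    "YT-Temporal-1B": (7_816, 12_221),
    "Fathomnet": (287_193, 14_174),
}

total_pairs = sum(pairs for pairs, _ in STATS.values())
total_masks = sum(masks for _, masks in STATS.values())

# Average number of masks per image-NP pair in each domain.
masks_per_pair = {name: masks / pairs for name, (pairs, masks) in STATS.items()}

for name, ratio in sorted(masks_per_pair.items(), key=lambda kv: kv[1]):
    print(f"{name:<24} {ratio:6.3f} masks per image-NP pair")

print(f"Total image-NP pairs: {total_pairs:,}")
print(f"Total image-NP-masks: {total_masks:,}")
```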

Download Dataset

Annotations

Download the ground-truth (GT) annotations from:

Hugging Face: facebook/SACo-Silver
Roboflow: sa-co-silver

Images and Frames

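Download instructions differ by domain. The image datasets (GeoDE, National Gallery of Art, BDD100k, Food Recognition Challenge 2022, iNaturalist, Fathomnet) are listed first, followed by the frame datasets (DROID, SA-V, Ego4D, YT-Temporal-1B).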

GeoDE

# Option 1: Download processed images from Roboflow
#   https://universe.roboflow.com/sa-co-silver/geode/

# Option 2: Process raw images yourself
# 1. Download from https://geodiverse-data-collection.cs.princeton.edu/
# 2. Run preprocessing
python preprocess_silver_geode_bdd100k_food_rec.py \
  --annotation_file <ANNOTATIONS>/silver_geode_merged_test.json \
  --raw_images_folder <RAW_GEODE_IMAGES_FOLDER> \
  --processed_images_folder <PROCESSED_GEODE_IMAGES_FOLDER> \
  --dataset_name geode

National Gallery of Art (NGA)

# Download and preprocess automatically
python download_preprocess_nga.py \
  --annotation_file <ANNOTATIONS>/silver_nga_art_merged_test.json \
  --raw_images_folder <RAW_NGA_IMAGES_FOLDER> \
  --processed_images_folder <PROCESSED_NGA_IMAGES_FOLDER>

BDD100k

# 1. Download 100K Images from http://bdd-data.berkeley.edu/download.html
# 2. Preprocess
python preprocess_silver_geode_bdd100k_food_rec.py \
  --annotation_file <ANNOTATIONS>/silver_bdd100k_merged_test.json \
  --raw_images_folder <RAW_BDD_IMAGES_FOLDER> \
  --processed_images_folder <PROCESSED_BDD_IMAGES_FOLDER> \
  --dataset_name bdd100k

Food Recognition Challenge 2022

# 1. Download from https://www.aicrowd.com/challenges/food-recognition-benchmark-2022
#    File: [Round 2] public_validation_set_2.0.tar.gz
# 2. Preprocess
python preprocess_silver_geode_bdd100k_food_rec.py \
  --annotation_file <ANNOTATIONS>/silver_food_rec_merged_test.json \
  --raw_images_folder <RAW_FOOD_IMAGES_FOLDER> \
  --processed_images_folder <PROCESSED_FOOD_IMAGES_FOLDER> \
  --dataset_name food_rec
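GeoDE, BDD100k, and Food Recognition all go through the same preprocessing script and differ only in the annotation file, the folders, and `--dataset_name`. A minimal sketch of driving the three runs from Python follows; the `annotations`, `raw`, and `processed` directory names and the per-dataset subfolder layout are hypothetical placeholders, and the `subprocess` call is left commented out so the snippet only prints the commands.

```python
from pathlib import Path

# Hypothetical local layout; replace with your own folders.
ANNOTATIONS = Path("annotations")
RAW_ROOT = Path("raw")
PROCESSED_ROOT = Path("processed")

# --dataset_name value -> annotation file name, as used in the commands above.
DATASETS = {
    "geode": "silver_geode_merged_test.json",
    "bdd100k": "silver_bdd100k_merged_test.json",
    "food_rec": "silver_food_rec_merged_test.json",
}


def build_command(name: str) -> list:
    """Assemble the argv list for one preprocessing run."""
    return [
        "python", "preprocess_silver_geode_bdd100k_food_rec.py",
        "--annotation_file", str(ANNOTATIONS / DATASETS[name]),
        "--raw_images_folder", str(RAW_ROOT / name),
        "--processed_images_folder", str(PROCESSED_ROOT / name),
        "--dataset_name", name,
    ]


commands = [build_command(name) for name in DATASETS]
for cmd in commands:
    print(" ".join(cmd))
    # import subprocess; subprocess.run(cmd, check=True)  # uncomment to execute
```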

iNaturalist

# Download and extract automatically
python download_inaturalist.py \
  --raw_images_folder <RAW_INATURALIST_IMAGES_FOLDER> \
  --processed_images_folder <PROCESSED_INATURALIST_IMAGES_FOLDER>

Fathomnet

# 1. Install FathomNet API
pip install fathomnet

# 2. Download images
python download_fathomnet.py \
  --processed_images_folder <PROCESSED_FATHOMNET_IMAGES_FOLDER>

Before downloading frame datasets, update CONFIG_FRAMES.yaml with the correct path_annotations path where the annotation files are located.

DROID

# 1. Install gsutil
pip install gsutil

# 2. Update CONFIG_FRAMES.yaml with droid_path

# 3. Download videos
python download_videos.py droid

# 4. Extract frames
python extract_frames.py droid

SA-V

# 1. Get download links from https://ai.meta.com/datasets/segment-anything-video-downloads/
# 2. Update CONFIG_FRAMES.yaml with sav_path and download link
# 3. Download videos
python download_videos.py sav

# 4. Extract frames
python extract_frames.py sav

Ego4D

# 1. Accept license at https://ego4d-data.org/docs/start-here/#license-agreement
# 2. Configure AWS credentials
pip install awscli ego4d
aws configure  # Use credentials from email

# 3. Update CONFIG_FRAMES.yaml with AWS credentials and ego4d_path
# 4. Download clips
python download_videos.py ego4d

# 5. Extract frames
python extract_frames.py ego4d

YT-Temporal-1B

# 1. Install yt-dlp
python3 -m pip install -U "yt-dlp[default]"

# 2. Create cookies.txt following:
#    https://github.com/yt-dlp/yt-dlp/wiki/Extractors#exporting-youtube-cookies

# 3. Update CONFIG_FRAMES.yaml with cookies_path and yt1b_path
# 4. Download videos
python download_videos.py yt1b

# 5. Extract frames
python extract_frames.py yt1b
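Each frame dataset above needs specific entries in CONFIG_FRAMES.yaml before `download_videos.py` is run. The following sketch checks a parsed config for the keys named in the steps. It assumes the keys sit at the top level of the file, which is not confirmed here; the keys for the SA-V download link and the Ego4D AWS credentials are not named in the steps and are therefore omitted. Loading the YAML itself (for example with PyYAML's `yaml.safe_load`) is left out to keep the snippet dependency-free.

```python
# Key needed by every frame dataset, per the note before the DROID steps.
COMMON_KEYS = ["path_annotations"]

# Dataset-specific keys named in the steps above (assumed to be top-level).
DATASET_KEYS = {
    "droid": ["droid_path"],
    "sav": ["sav_path"],
    "ego4d": ["ego4d_path"],
    "yt1b": ["cookies_path", "yt1b_path"],
}


def missing_keys(config: dict, dataset: str) -> list:
    """Return the required keys that are absent or empty for one dataset."""
    required = COMMON_KEYS + DATASET_KEYS[dataset]
    return [key for key in required if not config.get(key)]


# Hypothetical config contents, as they might look after parsing the YAML file.
config = {
    "path_annotations": "/data/saco_silver/annotations",
    "yt1b_path": "/data/yt1b",
}

print(missing_keys(config, "yt1b"))   # ['cookies_path']
print(missing_keys(config, "droid"))  # ['droid_path']
```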

Annotation Format

The annotation format is identical to that of SA-Co/Gold and is derived from the COCO format.

Example from DROID Domain

Images

[
  {
    "id": 10000000,
    "file_name": "AUTOLab_failure_2023-07-07_Fri_Jul__7_18:50:36_2023_recordings_MP4_22008760/00002.jpg",
    "text_input": "the large wooden table",
    "width": 1280,
    "height": 720,
    "queried_category": "3",
    "is_instance_exhaustive": 1,
    "is_pixel_exhaustive": 1
  }
]

Annotations

[
  {
    "area": 0.17324327256944444,
    "id": 1,
    "image_id": 10000000,
    "bbox": [0.0375, 0.5083, 0.8383, 0.4917],
    "segmentation": {
      "counts": "[^R11]f03O0O100O2N100O...",
      "size": [720, 1280]
    },
    "category_id": 1,
    "iscrowd": 0
  }
]

For detailed field descriptions, see the SA-Co/Gold annotation format, which is identical.
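As a rough illustration of how the two lists fit together, the sketch below joins annotations to their image entry via `image_id` and converts the box to pixels. It assumes that `bbox` is `[x, y, width, height]` normalized by the image size and that `area` is normalized by the pixel count; the example values are consistent with that reading (the area multiplies out to roughly 159,661 pixels for a 1280×720 frame), but treat it as an assumption and check the SA-Co/Gold field descriptions. The helper name `to_pixel_bbox` is illustrative.

```python
from collections import defaultdict

# Trimmed copies of the DROID entries shown above.
images = [
    {"id": 10000000, "text_input": "the large wooden table",
     "width": 1280, "height": 720},
]
annotations = [
    {"id": 1, "image_id": 10000000,
     "bbox": [0.0375, 0.5083, 0.8383, 0.4917],
     "area": 0.17324327256944444},
]


def to_pixel_bbox(bbox, width, height):
    """Scale an [x, y, w, h] box from assumed [0, 1] coordinates to pixels."""
    x, y, w, h = bbox
    return [x * width, y * height, w * width, h * height]


# Index annotations by image_id; entries without instances get an empty list.
anns_by_image = defaultdict(list)
for ann in annotations:
    anns_by_image[ann["image_id"]].append(ann)

rows = []
for img in images:
    for ann in anns_by_image.get(img["id"], []):
        box = to_pixel_bbox(ann["bbox"], img["width"], img["height"])
        mask_pixels = ann["area"] * img["width"] * img["height"]
        rows.append((img["text_input"], box, mask_pixels))

for phrase, box, mask_pixels in rows:
    print(phrase, [round(v, 1) for v in box], round(mask_pixels))
```

The `segmentation` field looks like COCO compressed RLE, so a library such as pycocotools (`pycocotools.mask.decode`) should be able to turn it into a binary mask; that step is omitted here because the `counts` string in the example is truncated.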

Benchmark Results

Overall Performance

Model	Average cgF1	IL_MCC	pmF1
SAM 3	49.57	0.76	65.17
OWLv2*	11.23	0.32	31.18
Gemini 2.5	9.67	0.19	45.51
OWLv2	8.18	0.23	32.55
LLMDet-L	6.73	0.17	28.19
gDino-T	3.09	0.12	19.75

Per-Domain Results (SAM 3)

Domain	cgF1	IL_MCC	pmF1
iNaturalist	70.07	0.89	78.73
National Gallery of Art	65.80	0.82	80.67
Food Recognition	52.96	0.79	67.21
Fathomnet	51.53	0.86	59.98
BDD100k	46.61	0.78	60.13
DROID	45.58	0.76	60.35
YT-Temporal-1B	42.07	0.72	58.36
Ego4D	38.64	0.62	62.56
SA-V	38.06	0.66	57.62
GeoDE	44.36	0.67	66.05
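The SAM 3 row in the overall table appears to be the unweighted mean of the ten per-domain rows. The short check below, using the values copied from the per-domain table, reproduces the overall numbers; that the official averages are computed this way is an inference from the figures, not something stated here.

```python
from statistics import mean

# (cgF1, IL_MCC, pmF1) for SAM 3, copied from the per-domain table above.
SAM3_PER_DOMAIN = {
    "iNaturalist": (70.07, 0.89, 78.73),
    "National Gallery of Art": (65.80, 0.82, 80.67),
    "Food Recognition": (52.96, 0.79, 67.21),
    "Fathomnet": (51.53, 0.86, 59.98),
    "BDD100k": (46.61, 0.78, 60.13),
    "DROID": (45.58, 0.76, 60.35),
    "YT-Temporal-1B": (42.07, 0.72, 58.36),
    "Ego4D": (38.64, 0.62, 62.56),
    "SA-V": (38.06, 0.66, 57.62),
    "GeoDE": (44.36, 0.67, 66.05),
}

# Transpose the rows into three metric columns and average each one.
avg_cgf1, avg_il_mcc, avg_pmf1 = (mean(col) for col in zip(*SAM3_PER_DOMAIN.values()))

print(f"cgF1 {avg_cgf1:.2f} | IL_MCC {avg_il_mcc:.2f} | pmF1 {avg_pmf1:.2f}")
```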

Visualization

View examples from the dataset:

# See the example notebook
jupyter notebook examples/saco_gold_silver_vis_example.ipynb

Offline Evaluation

If you have predictions in COCO result format:

# Evaluate all subsets
jupyter notebook examples/saco_gold_silver_eval_example.ipynb

# Or evaluate a single subset
python scripts/eval/standalone_cgf1.py \
  --pred_file /path/to/coco_predictions_segm.json \
  --gt_files /path/to/annotations/silver_bdd100k_merged_test.json
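For reference, a standard COCO segmentation results file is a JSON list with one entry per predicted instance. The sketch below writes such a file using the generic COCO fields (`image_id`, `category_id`, `segmentation`, `score`). Whether the SA-Co evaluation scripts expect additional fields or a particular coordinate convention is not stated here, so treat this as an assumption; the score and the RLE placeholder are made up for illustration.

```python
import json
import tempfile
from pathlib import Path

# One entry per predicted instance. Values are illustrative placeholders.
predictions = [
    {
        "image_id": 10000000,          # matches the "id" of the image entry
        "category_id": 1,
        "segmentation": {
            "counts": "<compressed RLE string>",  # placeholder, not a real mask
            "size": [720, 1280],                   # [height, width]
        },
        "score": 0.91,                 # hypothetical confidence
    }
]

# Written to a temporary folder here; use your own output path in practice.
out_path = Path(tempfile.mkdtemp()) / "coco_predictions_segm.json"
out_path.write_text(json.dumps(predictions))

# Read the file back to confirm it round-trips as a list of dicts.
loaded = json.loads(out_path.read_text())
print(f"Wrote {len(loaded)} prediction(s) to {out_path}")
```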

Next Steps

Run Evaluations

SA-Co/VEval
