This guide covers running evaluations on all SA-Co benchmarks: Gold, Silver, and VEval. You can evaluate SAM 3 directly or run offline evaluation on your own predictions.

Prerequisites

Install Dependencies

# For Gold and Silver evaluations
pip install -e ".[train,dev]"

# For VEval video evaluations (additional)
pip install -e ".[veval]"

Download Datasets

Before running evaluations, download the datasets and annotation files for the benchmarks you plan to run (SA-Co/Gold, SA-Co/Silver, and/or SA-Co/VEval).

Configuration

Edit Base Configuration

All evaluations require updating sam3/train/configs/eval_base.yaml with your dataset paths:
# Edit this file with:
# - Paths to downloaded images/videos
# - Paths to annotation files
# - Output directory for predictions
# - (Optional) SLURM cluster configuration
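To make the required edits concrete, here is a purely illustrative sketch of what such a configuration might contain. All key names below are hypothetical; use the keys that actually appear in your checkout's sam3/train/configs/eval_base.yaml.

```yaml
# Illustrative only - key names are hypothetical placeholders.
data:
  image_root: /data/saco/images            # downloaded images/videos
  annotation_dir: /data/saco/annotations   # downloaded annotation JSON files
output_dir: /data/saco/predictions         # must exist before running
# slurm:                                   # optional cluster configuration
#   partition: my-partition
```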

Running SAM 3 Evaluations

SA-Co/Gold Evaluation

SA-Co/Gold has 7 subsets to evaluate. You can run locally or on a SLURM cluster.
# Run on local machine with GPUs
python sam3/train/train.py \
  -c configs/gold_image_evals/sam3_gold_image_metaclip_nps.yaml \
  --use-cluster 0 \
  --num-gpus 1
Adjust --num-gpus based on your available hardware.

All 7 Gold Subsets

python sam3/train/train.py \
  -c configs/gold_image_evals/sam3_gold_image_metaclip_nps.yaml \
  --use-cluster 0 --num-gpus 1
python sam3/train/train.py \
  -c configs/gold_image_evals/sam3_gold_image_sa1b_nps.yaml \
  --use-cluster 0 --num-gpus 1
This subset uses SA-1B images, not MetaCLIP images.
python sam3/train/train.py \
  -c configs/gold_image_evals/sam3_gold_image_attributes.yaml \
  --use-cluster 0 --num-gpus 1
python sam3/train/train.py \
  -c configs/gold_image_evals/sam3_gold_image_crowded.yaml \
  --use-cluster 0 --num-gpus 1
python sam3/train/train.py \
  -c configs/gold_image_evals/sam3_gold_image_wiki_common.yaml \
  --use-cluster 0 --num-gpus 1
python sam3/train/train.py \
  -c configs/gold_image_evals/sam3_gold_image_fg_food.yaml \
  --use-cluster 0 --num-gpus 1
python sam3/train/train.py \
  -c configs/gold_image_evals/sam3_gold_image_fg_sports.yaml \
  --use-cluster 0 --num-gpus 1
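The seven invocations above can be scripted. A minimal sketch that builds the same command lines (subset names taken from the config filenames above; print them for inspection, or swap `print` for `subprocess.run` to execute them sequentially):

```python
import shlex

# The 7 Gold subsets, named after the config files listed above.
GOLD_SUBSETS = ["metaclip_nps", "sa1b_nps", "attributes", "crowded",
                "wiki_common", "fg_food", "fg_sports"]

def gold_eval_cmd(subset: str, num_gpus: int = 1) -> str:
    """Build the train.py command line for one Gold subset."""
    cfg = f"configs/gold_image_evals/sam3_gold_image_{subset}.yaml"
    return shlex.join(["python", "sam3/train/train.py", "-c", cfg,
                       "--use-cluster", "0", "--num-gpus", str(num_gpus)])

for subset in GOLD_SUBSETS:
    print(gold_eval_cmd(subset))
```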

SA-Co/Silver Evaluation

SA-Co/Silver has 10 subsets to evaluate.
# Example: BDD100k subset
python sam3/train/train.py \
  -c configs/silver_image_evals/sam3_gold_image_bdd100k.yaml \
  --use-cluster 0 \
  --num-gpus 1
Replace the config file for other subsets:
  • sam3_gold_image_bdd100k.yaml
  • sam3_gold_image_droid.yaml
  • sam3_gold_image_ego4d.yaml
  • sam3_gold_image_food_rec.yaml
  • sam3_gold_image_geode.yaml
  • sam3_gold_image_inaturalist.yaml
  • sam3_gold_image_nga_art.yaml
  • sam3_gold_image_sav.yaml
  • sam3_gold_image_yt1b.yaml
  • sam3_gold_image_fathomnet.yaml

Offline Evaluation

If you have predictions in COCO result format, you can run offline evaluation.

Standalone CGF1 Evaluator

For a single subset:
python scripts/eval/standalone_cgf1.py \
  --pred_file /path/to/predictions/coco_predictions_segm.json \
  --gt_files \
    /path/to/annotations/gold_metaclip_merged_a_release_test.json \
    /path/to/annotations/gold_metaclip_merged_b_release_test.json \
    /path/to/annotations/gold_metaclip_merged_c_release_test.json
SA-Co/Gold provides 3 GT files (a, b, c) per subset, one per annotator in the triple annotation. The evaluator scores predictions against the most favorable of the three annotations (the oracle setting).

Evaluation Notebooks

For aggregated results across all subsets:
# Run evaluation on all Gold/Silver subsets
jupyter notebook examples/saco_gold_silver_eval_example.ipynb

VEval Evaluation Script

SA-Co/VEval uses a specialized evaluator for video data:
python sam3/eval/saco_veval_eval.py one \
  --gt_annot_file data/annotation/saco_veval_sav_test.json \
  --pred_file data/predictions/saco_veval_sav_test_pred.json \
  --eval_res_file data/results/saco_veval_sav_test_eval_res.json

Evaluation Scripts

Alternatively, you can evaluate all subsets programmatically:
# Gold: Evaluate all 7 subsets and aggregate
python scripts/eval/gold/eval_sam3.py

# Silver: evaluate all 10 subsets (no aggregate script is provided; adapt the Gold one)
# VEval: use saco_veval_eval.py with the `all` subcommand instead of `one` (shown above)

Understanding Metrics

Primary Metric: cgF1

The cgF1 (concept-grounded F1) metric combines:
  • Instance-level segmentation quality (IoU > 0.5)
  • Positive/negative prompt discrimination
  • Per-category F1 scores, averaged across all concepts
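As a toy illustration of the per-concept averaging step only (the counts and concept names below are made up, and the full cgF1 additionally accounts for positive/negative discrimination; the official implementation is scripts/eval/standalone_cgf1.py):

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """Instance-level F1 from true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical per-concept (TP, FP, FN) counts after matching instances
# at the IoU > 0.5 threshold.
per_concept = {"dog": (8, 2, 1), "red car": (3, 0, 4), "zebra": (0, 1, 0)}
macro_f1 = sum(f1(*c) for c in per_concept.values()) / len(per_concept)
print(f"macro F1 over concepts: {macro_f1:.3f}")
```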

Additional Metrics

IL_MCC

Instance-level Matthews Correlation Coefficient: measures the quality of positive/negative (concept present vs. absent) classification.
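For reference, the textbook MCC formula on binary present/absent decisions (how the official evaluator accumulates the instance-level counts may differ; this sketch only shows the standard definition):

```python
from math import sqrt

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; ranges from -1 (total
    disagreement) through 0 (chance) to +1 (perfect prediction)."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```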

positive_micro_F1

F1 score computed only on positive (concept present) prompts: measures segmentation quality when the object exists.

pHOTA

Promptable Higher Order Tracking Accuracy: measures video tracking quality under text conditioning.

Changing Evaluation Type

By default, evaluations use segmentation masks. To evaluate bounding boxes instead:
from sam3.eval.cgf1_eval import CGF1Evaluator

evaluator = CGF1Evaluator(
    gt_path=gt_files,
    verbose=True,
    iou_type="bbox"  # Change from "segm" to "bbox"
)

results = evaluator.evaluate(pred_file)

Prediction Format

Your predictions must follow COCO result format:
[
  {
    "image_id": 10000000,
    "category_id": 1,
    "segmentation": {
      "counts": "...",
      "size": [600, 600]
    },
    "bbox": [x, y, w, h],
    "score": 0.95
  }
]
For SA-Co benchmarks:
  • image_id must match the image-NP pair ID (not just the image)
  • Each image-NP pair has a unique ID
  • For negative prompts, simply don’t include predictions (empty = correct rejection)
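The last point means confident rejections are expressed by omission. A minimal sketch of that pattern, filtering raw model outputs before serialization (the IDs, boxes, and score threshold below are hypothetical; `counts` holds the RLE string):

```python
import json

# Hypothetical raw outputs for two image-NP pairs.
raw = [
    {"image_id": 10000000, "category_id": 1,
     "segmentation": {"counts": "...", "size": [600, 600]},
     "bbox": [12.0, 40.0, 80.0, 60.0], "score": 0.95},
    {"image_id": 10000001, "category_id": 1,
     "segmentation": {"counts": "...", "size": [600, 600]},
     "bbox": [0.0, 0.0, 5.0, 5.0], "score": 0.05},
]

THRESH = 0.5  # hypothetical confidence cutoff
# Pair 10000001 ends up with no entries: an empty result there
# counts as a correct rejection of a negative prompt.
preds = [p for p in raw if p["score"] >= THRESH]

with open("coco_predictions_segm.json", "w") as fh:
    json.dump(preds, fh)
```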

Visualization

Visualize ground truth annotations and predictions:
# View ground truth annotations
jupyter notebook examples/saco_gold_silver_vis_example.ipynb

Troubleshooting

SA-Co/Gold requires all 3 annotation files (a, b, c) for each subset. Make sure you’ve downloaded all three:
  • gold_*_merged_a_release_test.json
  • gold_*_merged_b_release_test.json
  • gold_*_merged_c_release_test.json
For YT-Temporal-1B, re-downloaded YouTube videos may differ from the versions used during annotation. Use the annotation update script to reconcile them:
python scripts/eval/veval/saco_yt1b_annot_update.py \
  --yt1b_media_dir data/media/saco_yt1b/JPEGImages_6fps \
  --yt1b_input_annot_path data/annotation/saco_veval_yt1b_test.json \
  --yt1b_output_annot_path data/annotation/saco_veval_yt1b_test_updated.json
Ensure eval_base.yaml points to the correct output directory where predictions will be saved; the directory must exist before running evaluations.
If you run out of GPU memory, reduce --num-gpus or process subsets sequentially rather than in parallel. For VEval, consider processing fewer videos at once.

Next Steps

Benchmark Details

Learn more about SA-Co benchmarks and metrics

API Reference

Explore SAM 3 predictor API for generating predictions
