This guide covers running evaluations on all SA-Co benchmarks: Gold, Silver, and VEval. You can evaluate SAM 3 directly or run offline evaluation on your own predictions.

Prerequisites

Install Dependencies

# For Gold and Silver evaluations
pip install -e ".[train,dev]"

# For VEval video evaluations (additional)
pip install -e ".[veval]"

Download Datasets

Before running evaluations, download the datasets and annotation files for the benchmarks you plan to run (SA-Co/Gold, SA-Co/Silver, and/or SA-Co/VEval).

Configuration

Edit Base Configuration

All evaluations require updating sam3/train/configs/eval_base.yaml with your dataset paths:
# Edit this file with:
# - Paths to downloaded images/videos
# - Paths to annotation files
# - Output directory for predictions
# - (Optional) SLURM cluster configuration
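To make the required edits concrete, here is a purely illustrative sketch of what such a configuration might contain. All key names below are hypothetical; use the keys that actually appear in your checkout's sam3/train/configs/eval_base.yaml.

```yaml
# Illustrative only - key names are hypothetical placeholders.
data:
  image_root: /data/saco/images            # downloaded images/videos
  annotation_dir: /data/saco/annotations   # downloaded annotation JSON files
output_dir: /data/saco/predictions         # must exist before running
# slurm:                                   # optional cluster configuration
#   partition: my-partition
```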

Running SAM 3 Evaluations

SA-Co/Gold Evaluation

SA-Co/Gold has 7 subsets to evaluate. You can run locally or on a SLURM cluster.
# Run on local machine with GPUs
python sam3/train/train.py \
  -c configs/gold_image_evals/sam3_gold_image_metaclip_nps.yaml \
  --use-cluster 0 \
  --num-gpus 1
Adjust --num-gpus based on your available hardware.

All 7 Gold Subsets

python sam3/train/train.py \
  -c configs/gold_image_evals/sam3_gold_image_metaclip_nps.yaml \
  --use-cluster 0 --num-gpus 1
python sam3/train/train.py \
  -c configs/gold_image_evals/sam3_gold_image_sa1b_nps.yaml \
  --use-cluster 0 --num-gpus 1
This subset uses SA-1B images, not MetaCLIP images.
python sam3/train/train.py \
  -c configs/gold_image_evals/sam3_gold_image_attributes.yaml \
  --use-cluster 0 --num-gpus 1
python sam3/train/train.py \
  -c configs/gold_image_evals/sam3_gold_image_crowded.yaml \
  --use-cluster 0 --num-gpus 1
python sam3/train/train.py \
  -c configs/gold_image_evals/sam3_gold_image_wiki_common.yaml \
  --use-cluster 0 --num-gpus 1
python sam3/train/train.py \
  -c configs/gold_image_evals/sam3_gold_image_fg_food.yaml \
  --use-cluster 0 --num-gpus 1
python sam3/train/train.py \
  -c configs/gold_image_evals/sam3_gold_image_fg_sports.yaml \
  --use-cluster 0 --num-gpus 1
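The seven invocations above can be scripted. A minimal sketch that builds the same command lines (subset names taken from the config filenames above; print them for inspection, or swap `print` for `subprocess.run` to execute them sequentially):

```python
import shlex

# The 7 Gold subsets, named after the config files listed above.
GOLD_SUBSETS = ["metaclip_nps", "sa1b_nps", "attributes", "crowded",
                "wiki_common", "fg_food", "fg_sports"]

def gold_eval_cmd(subset: str, num_gpus: int = 1) -> str:
    """Build the train.py command line for one Gold subset."""
    cfg = f"configs/gold_image_evals/sam3_gold_image_{subset}.yaml"
    return shlex.join(["python", "sam3/train/train.py", "-c", cfg,
                       "--use-cluster", "0", "--num-gpus", str(num_gpus)])

for subset in GOLD_SUBSETS:
    print(gold_eval_cmd(subset))
```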

SA-Co/Silver Evaluation

SA-Co/Silver has 10 subsets to evaluate.
# Example: BDD100k subset
python sam3/train/train.py \
  -c configs/silver_image_evals/sam3_gold_image_bdd100k.yaml \
  --use-cluster 0 \
  --num-gpus 1
Replace the config file for other subsets:
  • sam3_gold_image_bdd100k.yaml
  • sam3_gold_image_droid.yaml
  • sam3_gold_image_ego4d.yaml
  • sam3_gold_image_food_rec.yaml
  • sam3_gold_image_geode.yaml
  • sam3_gold_image_inaturalist.yaml
  • sam3_gold_image_nga_art.yaml
  • sam3_gold_image_sav.yaml
  • sam3_gold_image_yt1b.yaml
  • sam3_gold_image_fathomnet.yaml

Offline Evaluation

If you have predictions in COCO result format, you can run offline evaluation.

Standalone CGF1 Evaluator

For a single subset:
python scripts/eval/standalone_cgf1.py \
  --pred_file /path/to/predictions/coco_predictions_segm.json \
  --gt_files \
    /path/to/annotations/gold_metaclip_merged_a_release_test.json \
    /path/to/annotations/gold_metaclip_merged_b_release_test.json \
    /path/to/annotations/gold_metaclip_merged_c_release_test.json
SA-Co/Gold provides 3 GT files (a, b, c) per subset, one per annotator in the triple annotation. The evaluator scores predictions against the most favorable of the three annotations (the oracle setting).

Evaluation Notebooks

For aggregated results across all subsets:
# Run evaluation on all Gold/Silver subsets
jupyter notebook examples/saco_gold_silver_eval_example.ipynb

VEval Evaluation Script

SA-Co/VEval uses a specialized evaluator for video data:
python sam3/eval/saco_veval_eval.py one \
  --gt_annot_file data/annotation/saco_veval_sav_test.json \
  --pred_file data/predictions/saco_veval_sav_test_pred.json \
  --eval_res_file data/results/saco_veval_sav_test_eval_res.json

Evaluation Scripts

Alternatively, you can evaluate all subsets programmatically:
# Gold: Evaluate all 7 subsets and aggregate
python scripts/eval/gold/eval_sam3.py

# Silver: evaluate all 10 subsets (no aggregate script is provided; adapt the Gold one)
# VEval: use saco_veval_eval.py with the `all` subcommand instead of `one` (shown above)

Understanding Metrics

Primary Metric: cgF1

The cgF1 (concept-grounded F1) metric combines:
  • Instance-level segmentation quality (IoU > 0.5)
  • Positive/negative prompt discrimination
  • Per-category F1 scores, averaged across all concepts
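As a toy illustration of the per-concept averaging step only (the counts and concept names below are made up, and the full cgF1 additionally accounts for positive/negative discrimination; the official implementation is scripts/eval/standalone_cgf1.py):

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """Instance-level F1 from true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical per-concept (TP, FP, FN) counts after matching instances
# at the IoU > 0.5 threshold.
per_concept = {"dog": (8, 2, 1), "red car": (3, 0, 4), "zebra": (0, 1, 0)}
macro_f1 = sum(f1(*c) for c in per_concept.values()) / len(per_concept)
print(f"macro F1 over concepts: {macro_f1:.3f}")
```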

Additional Metrics

IL_MCC

Instance-level Matthews Correlation Coefficient: measures the quality of positive/negative (concept present vs. absent) classification.
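For reference, the textbook MCC formula on binary present/absent decisions (how the official evaluator accumulates the instance-level counts may differ; this sketch only shows the standard definition):

```python
from math import sqrt

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; ranges from -1 (total
    disagreement) through 0 (chance) to +1 (perfect prediction)."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```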

positive_micro_F1

F1 score computed only on positive (concept present) prompts: measures segmentation quality when the object exists.

pHOTA

Promptable Higher Order Tracking Accuracy: measures video tracking quality under text conditioning.

Changing Evaluation Type

By default, evaluations use segmentation masks. To evaluate bounding boxes instead:
from sam3.eval.cgf1_eval import CGF1Evaluator

evaluator = CGF1Evaluator(
    gt_path=gt_files,
    verbose=True,
    iou_type="bbox"  # Change from "segm" to "bbox"
)

results = evaluator.evaluate(pred_file)

Prediction Format

Your predictions must follow COCO result format:
[
  {
    "image_id": 10000000,
    "category_id": 1,
    "segmentation": {
      "counts": "...",
      "size": [600, 600]
    },
    "bbox": [x, y, w, h],
    "score": 0.95
  }
]
For SA-Co benchmarks:
  • image_id must match the image-NP pair ID (not just the image)
  • Each image-NP pair has a unique ID
  • For negative prompts, simply don’t include predictions (empty = correct rejection)
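The last point means confident rejections are expressed by omission. A minimal sketch of that pattern, filtering raw model outputs before serialization (the IDs, boxes, and score threshold below are hypothetical; `counts` holds the RLE string):

```python
import json

# Hypothetical raw outputs for two image-NP pairs.
raw = [
    {"image_id": 10000000, "category_id": 1,
     "segmentation": {"counts": "...", "size": [600, 600]},
     "bbox": [12.0, 40.0, 80.0, 60.0], "score": 0.95},
    {"image_id": 10000001, "category_id": 1,
     "segmentation": {"counts": "...", "size": [600, 600]},
     "bbox": [0.0, 0.0, 5.0, 5.0], "score": 0.05},
]

THRESH = 0.5  # hypothetical confidence cutoff
# Pair 10000001 ends up with no entries: an empty result there
# counts as a correct rejection of a negative prompt.
preds = [p for p in raw if p["score"] >= THRESH]

with open("coco_predictions_segm.json", "w") as fh:
    json.dump(preds, fh)
```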

Visualization

Visualize ground truth annotations and predictions:
# View ground truth annotations
jupyter notebook examples/saco_gold_silver_vis_example.ipynb

Troubleshooting

SA-Co/Gold requires all 3 annotation files (a, b, c) for each subset. Make sure you’ve downloaded all three:
  • gold_*_merged_a_release_test.json
  • gold_*_merged_b_release_test.json
  • gold_*_merged_c_release_test.json
For YT-Temporal-1B, re-downloaded YouTube videos may differ from the versions used during annotation. Use the annotation update script to reconcile them:
python scripts/eval/veval/saco_yt1b_annot_update.py \
  --yt1b_media_dir data/media/saco_yt1b/JPEGImages_6fps \
  --yt1b_input_annot_path data/annotation/saco_veval_yt1b_test.json \
  --yt1b_output_annot_path data/annotation/saco_veval_yt1b_test_updated.json
Ensure eval_base.yaml points to the correct output directory where predictions will be saved; the directory must exist before running evaluations.
If you run out of GPU memory, reduce --num-gpus or process subsets sequentially rather than in parallel. For VEval, consider processing fewer videos at once.

Next Steps

Benchmark Details

Learn more about SA-Co benchmarks and metrics

API Reference

Explore SAM 3 predictor API for generating predictions
