## Prerequisites

### Install Dependencies

### Download Datasets
Before running evaluations, download the appropriate datasets:

- SA-Co/Gold: See download instructions
- SA-Co/Silver: See download instructions
- SA-Co/VEval: See download instructions
## Configuration

### Edit Base Configuration
All evaluations require updating `sam3/train/configs/eval_base.yaml` with your dataset paths:
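For illustration only, the edit might look like the fragment below. The key names are hypothetical placeholders, not the file's actual schema — match them to the keys you find in `eval_base.yaml`:

```yaml
# Hypothetical fragment of sam3/train/configs/eval_base.yaml --
# key names are placeholders; use the keys actually present in the file.
paths:
  gold_data_root: /data/sa_co/gold        # where SA-Co/Gold was downloaded
  silver_data_root: /data/sa_co/silver    # where SA-Co/Silver was downloaded
  veval_data_root: /data/sa_co/veval      # where SA-Co/VEval was downloaded
  output_dir: /experiments/sam3_eval      # predictions are written here
```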
## Running SAM 3 Evaluations

### SA-Co/Gold Evaluation

SA-Co/Gold has 7 subsets to evaluate. You can run locally or on a SLURM cluster.

- Local Execution
- SLURM Cluster
Adjust `--num-gpus` based on your available hardware.

#### All 7 Gold Subsets
- MetaCLIP Captioner NPs
- SA-1B Captioner NPs (note: this subset uses SA-1B images, not MetaCLIP images)
- Attributes
- Crowded Scenes
- Wiki-Common1K
- Wiki-Food/Drink
- Wiki-Sports Equipment
### SA-Co/Silver Evaluation

SA-Co/Silver has 10 subsets to evaluate:

- `sam3_gold_image_bdd100k.yaml`
- `sam3_gold_image_droid.yaml`
- `sam3_gold_image_ego4d.yaml`
- `sam3_gold_image_food_rec.yaml`
- `sam3_gold_image_geode.yaml`
- `sam3_gold_image_inaturalist.yaml`
- `sam3_gold_image_nga_art.yaml`
- `sam3_gold_image_sav.yaml`
- `sam3_gold_image_yt1b.yaml`
- `sam3_gold_image_fathomnet.yaml`
## Offline Evaluation

If you have predictions in COCO result format, you can run offline evaluation.

### Standalone CGF1 Evaluator
For a single subset:

- SA-Co/Gold
- SA-Co/Silver
Gold requires 3 GT files (a, b, c) representing the triple annotations. The evaluator picks the most favorable annotation (oracle setting).
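As a sketch of that oracle selection (illustrative only — the function name and scores below are invented, not the evaluator's real API):

```python
# Illustrative sketch of the oracle setting: score predictions against each
# of the three GT annotation files (a, b, c) and report the most favorable
# result. Function and score values are hypothetical.

def oracle_cgf1(scores_by_annotation):
    """Return the best cgF1 across the triple annotations a/b/c."""
    return max(scores_by_annotation.values())

scores = {"a": 61.2, "b": 63.8, "c": 62.5}  # cgF1 vs. each GT file
best = oracle_cgf1(scores)  # 63.8
```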
### Evaluation Notebooks

For aggregated results across all subsets:

### VEval Evaluation Script
SA-Co/VEval uses a specialized evaluator for video data:

- Single Dataset
- All 6 Datasets
### Evaluation Scripts
Alternatively, you can evaluate all subsets programmatically:

## Understanding Metrics
### Primary Metric: cgF1

The cgF1 (concept-grounded F1) metric combines:

- Instance-level segmentation quality (IoU > 0.5)
- Positive/negative prompt discrimination
- Per-category F1 scores, averaged across all concepts
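As a toy illustration of the per-category averaging step (a simplified sketch — the real evaluator also handles mask matching and negative prompts, and the counts below are made up):

```python
# Simplified, unofficial sketch of a cgF1-style computation: per-concept F1
# from matched instances (a match requires IoU > 0.5), macro-averaged
# across all concepts. Counts here are toy inputs, not evaluator output.

def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(per_concept_counts):
    """per_concept_counts: {concept: (tp, fp, fn)} after IoU > 0.5 matching."""
    scores = [f1(*counts) for counts in per_concept_counts.values()]
    return sum(scores) / len(scores)

counts = {"dog": (8, 2, 1), "striped shirt": (3, 1, 2)}
score = macro_f1(counts)  # average of the two per-concept F1 scores
```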
### Additional Metrics

- **IL_MCC**: instance-level Matthews correlation coefficient; measures the quality of positive/negative classification.
- **positive_micro_F1**: F1 score computed only on positive (present) prompts; measures segmentation quality when the object exists.
- **pHOTA**: promptable Higher Order Tracking Accuracy; measures video tracking quality with text conditioning.
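To illustrate what IL_MCC measures, here is a generic Matthews correlation coefficient over positive/negative prompt decisions (a sketch of the standard formula, not the repository's implementation):

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient: +1 = perfect, 0 = chance, -1 = inverted."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Perfect positive/negative discrimination over 20 prompts:
print(mcc(tp=10, tn=10, fp=0, fn=0))  # 1.0
```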
## Changing Evaluation Type

By default, evaluations use segmentation masks. To evaluate bounding boxes instead:

## Prediction Format
Your predictions must follow the COCO result format:
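For reference, a minimal hand-written entry in standard COCO results format (the ids and values below are invented):

```python
import json

# Each prediction is one entry in a JSON list, in standard COCO results
# format. "bbox" is [x, y, width, height] for box evaluation; mask
# evaluation uses a "segmentation" (RLE) field instead. Ids are invented.
predictions = [
    {
        "image_id": 42,        # image the prediction belongs to
        "category_id": 1,      # id of the prompted concept
        "bbox": [100.0, 50.0, 80.0, 120.0],
        "score": 0.92,         # model confidence
    },
]

serialized = json.dumps(predictions)  # write this list to a .json file
```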
## Visualization

Visualize ground truth annotations and predictions:

## Troubleshooting
### Gold evaluation warning: expected 3 GT files
SA-Co/Gold requires all 3 annotation files (a, b, c) for each subset. Make sure you’ve downloaded all three:
- `gold_*_merged_a_release_test.json`
- `gold_*_merged_b_release_test.json`
- `gold_*_merged_c_release_test.json`
### VEval frame shifting issues
For YT-Temporal-1B, re-downloaded YouTube videos may have different specifications than during annotation. Use the annotation update script:
### Predictions path not found

Ensure `eval_base.yaml` has the correct output directory where predictions will be saved. The directory must exist before running evaluations.

### Out of memory errors
Reduce `--num-gpus` or process subsets sequentially rather than in parallel. For VEval, consider processing fewer videos at once.

## Next Steps
- **Benchmark Details**: learn more about SA-Co benchmarks and metrics.
- **API Reference**: explore the SAM 3 predictor API for generating predictions.