SA-Co/VEval is an evaluation dataset for promptable concept segmentation and tracking in videos. The benchmark comprises three domains, each with validation and test splits, covering diverse video sources ranging from general web videos to egocentric footage.
## Overview

SA-Co/VEval evaluates models on their ability to segment and track objects specified by noun phrases throughout video sequences. The annotation format is similar to the YTVIS format.

**Frame Shifting Alert for YT-Temporal-1B:** Due to the nature of YouTube videos, re-downloaded videos may not be exactly the same as those used during annotation. Additionally, ffmpeg may produce inconsistent frame extraction across environments, which can affect evaluation reproducibility.
## Dataset Composition

### 3 Video Domains
- **SA-V**: General video frames from the Segment Anything Video dataset
  - License: CC-BY-NC 4.0
  - Diverse everyday scenarios
- **YT-Temporal-1B**: YouTube videos from the YT-Temporal-1B dataset
  - License: CC-BY-NC 4.0
  - Web video diversity
- **SmartGlasses**: Egocentric videos from smart glasses
  - License: CC-BY 4.0
  - First-person perspective
## Installation

Install the environment required for SA-Co/VEval:

```shell
pip install -e ".[veval]"
```
This enables:

- `scripts/eval/veval/saco_yt1b_downloader.py` - preparing YT-Temporal-1B frames
- `examples/saco_veval_eval_example.ipynb` - running the offline evaluator
- `examples/saco_veval_vis_example.ipynb` - loading and visualizing data
## Expected Folder Structure

```
data/
├── annotation/
│   ├── saco_veval_sav_test.json
│   ├── saco_veval_sav_val.json
│   ├── saco_veval_smartglasses_test.json
│   ├── saco_veval_smartglasses_val.json
│   ├── saco_veval_yt1b_test.json
│   └── saco_veval_yt1b_val.json
└── media/
    ├── saco_sav/
    │   └── JPEGImages_24fps/
    ├── saco_sg/
    │   └── JPEGImages_6fps/
    └── saco_yt1b/
        └── JPEGImages_6fps/
```
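Before running anything, it can help to sanity-check that the layout above is in place. The sketch below assumes this exact directory tree; `missing_paths` is a hypothetical helper for illustration, not part of the SAM 3 repository.

```python
import os
import tempfile

# Relative paths expected under the data root, per the tree above.
EXPECTED = [
    "annotation/saco_veval_sav_test.json",
    "annotation/saco_veval_sav_val.json",
    "annotation/saco_veval_smartglasses_test.json",
    "annotation/saco_veval_smartglasses_val.json",
    "annotation/saco_veval_yt1b_test.json",
    "annotation/saco_veval_yt1b_val.json",
    "media/saco_sav/JPEGImages_24fps",
    "media/saco_sg/JPEGImages_6fps",
    "media/saco_yt1b/JPEGImages_6fps",
]

def missing_paths(root):
    """Return the expected paths that do not exist under `root`."""
    return [p for p in EXPECTED if not os.path.exists(os.path.join(root, p))]

# Demo: build the layout in a temporary directory, then verify nothing is missing.
with tempfile.TemporaryDirectory() as root:
    for rel in EXPECTED:
        full = os.path.join(root, rel)
        if rel.endswith(".json"):
            os.makedirs(os.path.dirname(full), exist_ok=True)
            open(full, "w").close()  # empty placeholder file
        else:
            os.makedirs(full, exist_ok=True)
    print(missing_paths(root))  # []
```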
## Download Dataset

### Annotations

Download GT annotations from Hugging Face:

```
# All annotation files are in the annotation/ directory:
# - saco_veval_sav_test.json / saco_veval_sav_val.json
# - saco_veval_yt1b_test.json / saco_veval_yt1b_val.json
# - saco_veval_smartglasses_test.json / saco_veval_smartglasses_val.json
```
### Ready-to-Use Data (Roboflow)

Download preprocessed data from Roboflow.
### Download via Preprocessing Steps

#### SA-V

Download the SA-V videos following the instructions on the SA-V dataset page; only `sav_test.tar` and `sav_val.tar` are needed. Then extract and merge:

```shell
cd data/media/saco_sav

# Download using dynamic links from the SA-V website
wget -O sav_test.tar <sav_test.tar download link>
wget -O sav_val.tar <sav_val.tar download link>

# Extract
tar -xf sav_test.tar
tar -xf sav_val.tar

# Merge JPEGImages_24fps folders
mkdir JPEGImages_24fps
chmod -R u+w sav_test/ sav_val/
mv sav_test/JPEGImages_24fps/* JPEGImages_24fps/
mv sav_val/JPEGImages_24fps/* JPEGImages_24fps/
```

Ignore the `Annotations_6fps` folders: those contain SAM 2 annotations, not SA-Co/VEval annotations.
#### YT-Temporal-1B

Prerequisites:

- Download `media/yt1b_start_end_time.json` from Hugging Face
- Create `cookies.txt` following the yt-dlp cookie export instructions

**YouTube Account Ban Risk:** Please see the full warnings in the yt-dlp documentation regarding the risk of YouTube account bans when using cookies for video downloads.
Download and prepare frames:

```shell
python scripts/eval/veval/saco_yt1b_downloader.py \
    --data_dir data/media/saco_yt1b \
    --cookies_file data/media/saco_yt1b/cookies.txt \
    --yt1b_start_end_time_file data/media/saco_yt1b/yt1b_start_end_time.json \
    --yt1b_frame_prep_log_file data/media/saco_yt1b/yt1b_frame_prep.log
```
Parameters:

- `data_dir`: directory to download videos and store extracted frames
- `cookies_file`: the `cookies.txt` file created above
- `yt1b_start_end_time_file`: the `yt1b_start_end_time.json` downloaded above
- `yt1b_frame_prep_log_file`: log file tracking download and extraction status
Update annotations for unavailable videos:

```shell
python scripts/eval/veval/saco_yt1b_annot_update.py \
    --yt1b_media_dir data/media/saco_yt1b/JPEGImages_6fps \
    --yt1b_input_annot_path data/annotation/saco_veval_yt1b_val.json \
    --yt1b_output_annot_path data/annotation/saco_veval_yt1b_val_updated.json \
    --yt1b_annot_update_log_path data/annotation/saco_veval_yt1b_val_updated.log
```
Not all YouTube videos may remain available, since videos can be deleted or made private. The update script removes annotations for unavailable videos. If videos are missing, evaluation results may not be directly comparable with the reported numbers.
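The kind of filtering the update script performs can be sketched as follows. This is an illustration of the behavior described above, not the actual `saco_yt1b_annot_update.py` logic; the field names follow the JSON structure documented later in this page, and the assumption that frame directories are named by `video_name` is mine.

```python
import os
import tempfile

def drop_unavailable(data, media_dir):
    """Drop videos with no frame directory on disk, plus the annotations
    and video-NP pairs that reference them (illustrative sketch)."""
    available = {
        v["id"] for v in data["videos"]
        if os.path.isdir(os.path.join(media_dir, v["video_name"]))
    }
    return {
        **data,
        "videos": [v for v in data["videos"] if v["id"] in available],
        "annotations": [a for a in data["annotations"] if a["video_id"] in available],
        "video_np_pairs": [p for p in data["video_np_pairs"] if p["video_id"] in available],
    }

# Demo with mock data: only "vid_a" has frames on disk.
with tempfile.TemporaryDirectory() as media_dir:
    os.makedirs(os.path.join(media_dir, "vid_a"))
    data = {
        "videos": [
            {"id": 1, "video_name": "vid_a"},
            {"id": 2, "video_name": "vid_b"},  # no frames on disk
        ],
        "annotations": [{"id": 10, "video_id": 1}, {"id": 11, "video_id": 2}],
        "video_np_pairs": [{"id": 100, "video_id": 2, "num_masklets": 0}],
    }
    updated = drop_unavailable(data, media_dir)
    print([v["id"] for v in updated["videos"]])  # [1]
```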
#### SmartGlasses

Download SmartGlasses videos:

```shell
cd data

# Download from Hugging Face
hf download facebook/SACo-VEval media/saco_sg.tar.gz \
    --repo-type dataset \
    --local-dir .

# Extract
cd media
tar -xzf saco_sg.tar.gz
```

The frames will be in `data/media/saco_sg/JPEGImages_6fps/`.
## Annotation Format

The format is similar to the YTVIS format.

### JSON Structure
```
data {
    "info": info,
    "videos": [video],
    "annotations": [annotation],      # Only positive masklets
    "categories": [category],         # Global noun phrase ID map
    "video_np_pairs": [video_np_pair] # Both positive and negative
}
```
### Field Definitions

```
info {
    "version": "v1",
    "date": "2025-09-24",
    "description": "SA-Co/VEval SA-V Test"
}
```
```
video {
    "id": int,
    "video_name": str,       # e.g., "sav_000000"
    "file_names": List[str],
    "height": int,
    "width": int,
    "length": int
}
```
List of positive masklets only:

```
annotation {
    "id": int,
    "segmentations": List[RLE],  # One per frame
    "bboxes": List[List[int, int, int, int]],
    "areas": List[int],
    "iscrowd": int,
    "video_id": str,
    "height": int,
    "width": int,
    "category_id": int,          # Maps to categories
    "noun_phrase": str
}
```
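Since `segmentations` holds one entry per frame, frames where the masklet is not visible can be detected by their empty entries. The sketch below assumes (as in YTVIS) that such entries are `None`; the RLE payloads here are placeholder strings, not real COCO RLE.

```python
def visible_frames(annotation):
    """Frame indices where the masklet has a segmentation entry.

    Assumes YTVIS-style per-frame lists with None for invisible frames.
    """
    return [i for i, seg in enumerate(annotation["segmentations"]) if seg is not None]

# Mock annotation record: visible in frames 1, 2, and 4 of a 5-frame video.
ann = {
    "id": 7,
    "segmentations": [None, "RLE_1", "RLE_2", None, "RLE_3"],
    "noun_phrase": "red car",
}
print(visible_frames(ann))  # [1, 2, 4]
```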
Global noun phrase ID mapping across all 3 domains:

```
category {
    "id": int,
    "name": str  # The noun phrase
}
```
All video-NP pairs (positive and negative):

```
video_np_pair {
    "id": int,
    "video_id": str,     # Maps to videos.id
    "category_id": int,  # Maps to categories.id
    "noun_phrase": str,
    "num_masklets": int  # > 0 = positive, 0 = negative
}
```
- `num_masklets > 0`: positive pair (the object exists in the video)
- `num_masklets = 0`: negative pair (the object does not exist)
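The positive/negative convention above makes it easy to partition the pairs when loading an annotation file. A minimal sketch with made-up records:

```python
# Mock video-NP pairs following the schema above.
pairs = [
    {"id": 1, "video_id": "1", "noun_phrase": "dog", "num_masklets": 2},
    {"id": 2, "video_id": "1", "noun_phrase": "zebra", "num_masklets": 0},
    {"id": 3, "video_id": "2", "noun_phrase": "cup", "num_masklets": 1},
]

# Partition by num_masklets: > 0 means the object exists, 0 means it does not.
positives = [p for p in pairs if p["num_masklets"] > 0]
negatives = [p for p in pairs if p["num_masklets"] == 0]
print(len(positives), len(negatives))  # 2 1
```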
### Example Visualization

See the annotation format and data visualization:

```shell
jupyter notebook examples/saco_veval_vis_example.ipynb
```
## Benchmark Results

### Performance Across Domains

| Model | SA-V cgF1 | SA-V pHOTA | YT1B cgF1 | YT1B pHOTA | SmartGlasses cgF1 | SmartGlasses pHOTA |
|-------|-----------|------------|-----------|------------|-------------------|--------------------|
| Human | 53.1 | 70.5 | 71.2 | 78.4 | 58.5 | 72.3 |
| SAM 3 | 30.3 | 58.0 | 50.8 | 69.9 | 36.4 | 63.6 |
### Additional Benchmarks

SAM 3 also achieves strong performance on standard video segmentation benchmarks:

| Benchmark | Metric | SAM 3 Score |
|-----------|--------|-------------|
| LVVIS test | mAP | 36.3 |
| BURST test | HOTA | 44.5 |
pHOTA (Promptable Higher Order Tracking Accuracy) measures both detection and tracking quality over time, accounting for text prompt conditioning.
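For intuition: the standard HOTA metric combines a detection accuracy term (DetA) and an association accuracy term (AssA) as a geometric mean, so a tracker must do well on both to score high. Whether pHOTA uses exactly this combination is an assumption here; the snippet is a toy illustration of the HOTA-style structure, not the benchmark's implementation.

```python
import math

def hota(det_a, ass_a):
    """Geometric mean of detection and association accuracy (HOTA-style)."""
    return math.sqrt(det_a * ass_a)

# A tracker strong at detection but weak at association is penalized:
print(round(hota(0.64, 0.49), 3))  # 0.56
```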
## Running Offline Evaluation

Two evaluation methods are provided:

### 1. Example Notebook

Load eval results or run evaluation on the fly:

```shell
jupyter notebook examples/saco_veval_eval_example.ipynb
```
### 2. Evaluation Script

The script `sam3/eval/saco_veval_eval.py` supports two modes:
**Single Dataset**

```shell
python sam3/eval/saco_veval_eval.py one \
    --gt_annot_file data/annotation/saco_veval_sav_test.json \
    --pred_file data/predictions/saco_veval_sav_test_pred.json \
    --eval_res_file data/results/saco_veval_sav_test_eval_res.json
```

Parameters: `gt_annot_file` (ground truth annotation file), `pred_file` (model predictions in the same format), `eval_res_file` (output file for evaluation results).

**All Datasets**

```shell
python sam3/eval/saco_veval_eval.py all \
    --gt_annot_dir data/annotation \
    --pred_dir data/predictions \
    --eval_res_dir data/results
```

This evaluates all 6 SA-Co/VEval datasets: SA-V (test + val), YT-Temporal-1B (test + val), and SmartGlasses (test + val). Parameters: `gt_annot_dir` (directory containing all GT annotation files), `pred_dir` (directory containing all prediction files), `eval_res_dir` (directory where evaluation results will be written).
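The six GT/prediction file pairs implied by the `all` mode can be enumerated from the annotation filenames. The `_pred` suffix below follows the single-dataset example above, but the script's actual naming expectations may differ; `eval_file_pairs` is a hypothetical helper for illustration.

```python
import os

DOMAINS = ["sav", "yt1b", "smartglasses"]
SPLITS = ["val", "test"]

def eval_file_pairs(gt_dir, pred_dir):
    """Enumerate (GT annotation, prediction) file pairs for all 6 datasets,
    assuming the saco_veval_{domain}_{split}.json naming convention."""
    pairs = []
    for domain in DOMAINS:
        for split in SPLITS:
            stem = f"saco_veval_{domain}_{split}"
            pairs.append((
                os.path.join(gt_dir, f"{stem}.json"),
                os.path.join(pred_dir, f"{stem}_pred.json"),
            ))
    return pairs

print(len(eval_file_pairs("data/annotation", "data/predictions")))  # 6
```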
## Next Steps

- **Run Evaluations**: learn how to evaluate SAM 3 on SA-Co/VEval
- **Overview**: return to the evaluation overview