SA-Co/VEval is an evaluation dataset for promptable concept segmentation and tracking in videos. The benchmark comprises three domains, each with validation and test splits, covering diverse video sources ranging from general web videos to egocentric footage.
## Overview

SA-Co/VEval evaluates models on their ability to segment and track objects specified by noun phrases throughout video sequences. The annotation format is similar to the YTVIS format.

**Frame Shifting Alert for YT-Temporal-1B:** Due to the nature of YouTube videos, re-downloaded videos may not be exactly the same as those used during annotation. Additionally, ffmpeg may produce inconsistent frame extraction across environments, which can affect evaluation reproducibility.
## Dataset Composition

### 3 Video Domains
- **SA-V**: General video frames from the Segment Anything Video dataset
  - License: CC-BY-NC 4.0
  - Diverse everyday scenarios
- **YT-Temporal-1B**: YouTube videos from the YT-Temporal-1B dataset
  - License: CC-BY-NC 4.0
  - Web video diversity
- **SmartGlasses**: Egocentric videos from smart glasses
  - License: CC-BY 4.0
  - First-person perspective
## Installation

Install the environment required for SA-Co/VEval:

```shell
pip install -e ".[veval]"
```
This enables:

- `scripts/eval/veval/saco_yt1b_downloader.py` - preparing YT-Temporal-1B frames
- `examples/saco_veval_eval_example.ipynb` - running the offline evaluator
- `examples/saco_veval_vis_example.ipynb` - loading and visualizing data
## Expected Folder Structure

```
data/
├── annotation/
│   ├── saco_veval_sav_test.json
│   ├── saco_veval_sav_val.json
│   ├── saco_veval_smartglasses_test.json
│   ├── saco_veval_smartglasses_val.json
│   ├── saco_veval_yt1b_test.json
│   └── saco_veval_yt1b_val.json
└── media/
    ├── saco_sav/
    │   └── JPEGImages_24fps/
    ├── saco_sg/
    │   └── JPEGImages_6fps/
    └── saco_yt1b/
        └── JPEGImages_6fps/
```
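Before running anything, it can help to sanity-check that the layout above is in place. The sketch below assumes this exact directory tree; `missing_paths` is a hypothetical helper for illustration, not part of the SAM 3 repository.

```python
import os
import tempfile

# Relative paths expected under the data root, per the tree above.
EXPECTED = [
    "annotation/saco_veval_sav_test.json",
    "annotation/saco_veval_sav_val.json",
    "annotation/saco_veval_smartglasses_test.json",
    "annotation/saco_veval_smartglasses_val.json",
    "annotation/saco_veval_yt1b_test.json",
    "annotation/saco_veval_yt1b_val.json",
    "media/saco_sav/JPEGImages_24fps",
    "media/saco_sg/JPEGImages_6fps",
    "media/saco_yt1b/JPEGImages_6fps",
]

def missing_paths(root):
    """Return the expected paths that do not exist under `root`."""
    return [p for p in EXPECTED if not os.path.exists(os.path.join(root, p))]

# Demo: build the layout in a temporary directory, then verify nothing is missing.
with tempfile.TemporaryDirectory() as root:
    for rel in EXPECTED:
        full = os.path.join(root, rel)
        if rel.endswith(".json"):
            os.makedirs(os.path.dirname(full), exist_ok=True)
            open(full, "w").close()  # empty placeholder file
        else:
            os.makedirs(full, exist_ok=True)
    print(missing_paths(root))  # []
```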
## Download Dataset

### Annotations

Download GT annotations from Hugging Face:

```
# All annotation files are in the annotation/ directory:
# - saco_veval_sav_test.json / saco_veval_sav_val.json
# - saco_veval_yt1b_test.json / saco_veval_yt1b_val.json
# - saco_veval_smartglasses_test.json / saco_veval_smartglasses_val.json
```
### Ready-to-Use Data (Roboflow)

Download preprocessed data from Roboflow.
### Download via Preprocessing Steps

#### SA-V

Download the SA-V videos following the instructions on the SA-V dataset page; only `sav_test.tar` and `sav_val.tar` are needed. Then extract and merge:

```shell
cd data/media/saco_sav

# Download using dynamic links from the SA-V website
wget -O sav_test.tar <sav_test.tar download link>
wget -O sav_val.tar <sav_val.tar download link>

# Extract
tar -xf sav_test.tar
tar -xf sav_val.tar

# Merge JPEGImages_24fps folders
mkdir JPEGImages_24fps
chmod -R u+w sav_test/ sav_val/
mv sav_test/JPEGImages_24fps/* JPEGImages_24fps/
mv sav_val/JPEGImages_24fps/* JPEGImages_24fps/
```

Ignore the `Annotations_6fps` folders: those contain SAM 2 annotations, not SA-Co/VEval annotations.
#### YT-Temporal-1B

Prerequisites:

- Download `media/yt1b_start_end_time.json` from Hugging Face
- Create `cookies.txt` following the yt-dlp cookie export instructions

**YouTube Account Ban Risk:** Please see the full warnings in the yt-dlp documentation regarding the risk of YouTube account bans when using cookies for video downloads.
Download and prepare frames:

```shell
python scripts/eval/veval/saco_yt1b_downloader.py \
    --data_dir data/media/saco_yt1b \
    --cookies_file data/media/saco_yt1b/cookies.txt \
    --yt1b_start_end_time_file data/media/saco_yt1b/yt1b_start_end_time.json \
    --yt1b_frame_prep_log_file data/media/saco_yt1b/yt1b_frame_prep.log
```
Parameters:

- `data_dir`: directory to download videos and store extracted frames
- `cookies_file`: the `cookies.txt` file created above
- `yt1b_start_end_time_file`: the `yt1b_start_end_time.json` downloaded above
- `yt1b_frame_prep_log_file`: log file tracking download and extraction status
Update annotations for unavailable videos:

```shell
python scripts/eval/veval/saco_yt1b_annot_update.py \
    --yt1b_media_dir data/media/saco_yt1b/JPEGImages_6fps \
    --yt1b_input_annot_path data/annotation/saco_veval_yt1b_val.json \
    --yt1b_output_annot_path data/annotation/saco_veval_yt1b_val_updated.json \
    --yt1b_annot_update_log_path data/annotation/saco_veval_yt1b_val_updated.log
```
Not all YouTube videos may remain available, since videos can be deleted or made private. The update script removes annotations for unavailable videos. If videos are missing, evaluation results may not be directly comparable with the reported numbers.
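The kind of filtering the update script performs can be sketched as follows. This is an illustration of the behavior described above, not the actual `saco_yt1b_annot_update.py` logic; the field names follow the JSON structure documented later in this page, and the assumption that frame directories are named by `video_name` is mine.

```python
import os
import tempfile

def drop_unavailable(data, media_dir):
    """Drop videos with no frame directory on disk, plus the annotations
    and video-NP pairs that reference them (illustrative sketch)."""
    available = {
        v["id"] for v in data["videos"]
        if os.path.isdir(os.path.join(media_dir, v["video_name"]))
    }
    return {
        **data,
        "videos": [v for v in data["videos"] if v["id"] in available],
        "annotations": [a for a in data["annotations"] if a["video_id"] in available],
        "video_np_pairs": [p for p in data["video_np_pairs"] if p["video_id"] in available],
    }

# Demo with mock data: only "vid_a" has frames on disk.
with tempfile.TemporaryDirectory() as media_dir:
    os.makedirs(os.path.join(media_dir, "vid_a"))
    data = {
        "videos": [
            {"id": 1, "video_name": "vid_a"},
            {"id": 2, "video_name": "vid_b"},  # no frames on disk
        ],
        "annotations": [{"id": 10, "video_id": 1}, {"id": 11, "video_id": 2}],
        "video_np_pairs": [{"id": 100, "video_id": 2, "num_masklets": 0}],
    }
    updated = drop_unavailable(data, media_dir)
    print([v["id"] for v in updated["videos"]])  # [1]
```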
#### SmartGlasses

Download SmartGlasses videos:

```shell
cd data

# Download from Hugging Face
hf download facebook/SACo-VEval media/saco_sg.tar.gz \
    --repo-type dataset \
    --local-dir .

# Extract
cd media
tar -xzf saco_sg.tar.gz
```

The frames will be in `data/media/saco_sg/JPEGImages_6fps/`.
## Annotation Format

The format is similar to the YTVIS format.

### JSON Structure
```
data {
    "info": info,
    "videos": [video],
    "annotations": [annotation],      # Only positive masklets
    "categories": [category],         # Global noun phrase ID map
    "video_np_pairs": [video_np_pair] # Both positive and negative
}
```
### Field Definitions

```
info {
    "version": "v1",
    "date": "2025-09-24",
    "description": "SA-Co/VEval SA-V Test"
}
```
```
video {
    "id": int,
    "video_name": str,       # e.g., "sav_000000"
    "file_names": List[str],
    "height": int,
    "width": int,
    "length": int
}
```
List of positive masklets only:

```
annotation {
    "id": int,
    "segmentations": List[RLE],  # One per frame
    "bboxes": List[List[int, int, int, int]],
    "areas": List[int],
    "iscrowd": int,
    "video_id": str,
    "height": int,
    "width": int,
    "category_id": int,          # Maps to categories
    "noun_phrase": str
}
```
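Since `segmentations` holds one entry per frame, frames where the masklet is not visible can be detected by their empty entries. The sketch below assumes (as in YTVIS) that such entries are `None`; the RLE payloads here are placeholder strings, not real COCO RLE.

```python
def visible_frames(annotation):
    """Frame indices where the masklet has a segmentation entry.

    Assumes YTVIS-style per-frame lists with None for invisible frames.
    """
    return [i for i, seg in enumerate(annotation["segmentations"]) if seg is not None]

# Mock annotation record: visible in frames 1, 2, and 4 of a 5-frame video.
ann = {
    "id": 7,
    "segmentations": [None, "RLE_1", "RLE_2", None, "RLE_3"],
    "noun_phrase": "red car",
}
print(visible_frames(ann))  # [1, 2, 4]
```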
Global noun phrase ID mapping across all 3 domains:

```
category {
    "id": int,
    "name": str  # The noun phrase
}
```
All video-NP pairs (positive and negative):

```
video_np_pair {
    "id": int,
    "video_id": str,     # Maps to videos.id
    "category_id": int,  # Maps to categories.id
    "noun_phrase": str,
    "num_masklets": int  # > 0 = positive, 0 = negative
}
```
- `num_masklets > 0`: positive pair (the object exists in the video)
- `num_masklets = 0`: negative pair (the object does not exist)
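The positive/negative convention above makes it easy to partition the pairs when loading an annotation file. A minimal sketch with made-up records:

```python
# Mock video-NP pairs following the schema above.
pairs = [
    {"id": 1, "video_id": "1", "noun_phrase": "dog", "num_masklets": 2},
    {"id": 2, "video_id": "1", "noun_phrase": "zebra", "num_masklets": 0},
    {"id": 3, "video_id": "2", "noun_phrase": "cup", "num_masklets": 1},
]

# Partition by num_masklets: > 0 means the object exists, 0 means it does not.
positives = [p for p in pairs if p["num_masklets"] > 0]
negatives = [p for p in pairs if p["num_masklets"] == 0]
print(len(positives), len(negatives))  # 2 1
```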
### Example Visualization

See the annotation format and data visualization:

```shell
jupyter notebook examples/saco_veval_vis_example.ipynb
```
## Benchmark Results

### Performance Across Domains

| Model | SA-V cgF1 | SA-V pHOTA | YT1B cgF1 | YT1B pHOTA | SmartGlasses cgF1 | SmartGlasses pHOTA |
|-------|-----------|------------|-----------|------------|-------------------|--------------------|
| Human | 53.1 | 70.5 | 71.2 | 78.4 | 58.5 | 72.3 |
| SAM 3 | 30.3 | 58.0 | 50.8 | 69.9 | 36.4 | 63.6 |
### Additional Benchmarks

SAM 3 also achieves strong performance on standard video segmentation benchmarks:

| Benchmark | Metric | SAM 3 Score |
|-----------|--------|-------------|
| LVVIS test | mAP | 36.3 |
| BURST test | HOTA | 44.5 |
pHOTA (Promptable Higher Order Tracking Accuracy) measures both detection and tracking quality over time, accounting for text prompt conditioning.
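For intuition: the standard HOTA metric combines a detection accuracy term (DetA) and an association accuracy term (AssA) as a geometric mean, so a tracker must do well on both to score high. Whether pHOTA uses exactly this combination is an assumption here; the snippet is a toy illustration of the HOTA-style structure, not the benchmark's implementation.

```python
import math

def hota(det_a, ass_a):
    """Geometric mean of detection and association accuracy (HOTA-style)."""
    return math.sqrt(det_a * ass_a)

# A tracker strong at detection but weak at association is penalized:
print(round(hota(0.64, 0.49), 3))  # 0.56
```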
## Running Offline Evaluation

Two evaluation methods are provided:

### 1. Example Notebook

Load eval results or run evaluation on the fly:

```shell
jupyter notebook examples/saco_veval_eval_example.ipynb
```
### 2. Evaluation Script

The script `sam3/eval/saco_veval_eval.py` supports two modes:
**Single Dataset**

```shell
python sam3/eval/saco_veval_eval.py one \
    --gt_annot_file data/annotation/saco_veval_sav_test.json \
    --pred_file data/predictions/saco_veval_sav_test_pred.json \
    --eval_res_file data/results/saco_veval_sav_test_eval_res.json
```

Parameters: `gt_annot_file` (ground truth annotation file), `pred_file` (model predictions in the same format), `eval_res_file` (output file for evaluation results).

**All Datasets**

```shell
python sam3/eval/saco_veval_eval.py all \
    --gt_annot_dir data/annotation \
    --pred_dir data/predictions \
    --eval_res_dir data/results
```

This evaluates all 6 SA-Co/VEval datasets: SA-V (test + val), YT-Temporal-1B (test + val), and SmartGlasses (test + val). Parameters: `gt_annot_dir` (directory containing all GT annotation files), `pred_dir` (directory containing all prediction files), `eval_res_dir` (directory where evaluation results will be written).
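The six GT/prediction file pairs implied by the `all` mode can be enumerated from the annotation filenames. The `_pred` suffix below follows the single-dataset example above, but the script's actual naming expectations may differ; `eval_file_pairs` is a hypothetical helper for illustration.

```python
import os

DOMAINS = ["sav", "yt1b", "smartglasses"]
SPLITS = ["val", "test"]

def eval_file_pairs(gt_dir, pred_dir):
    """Enumerate (GT annotation, prediction) file pairs for all 6 datasets,
    assuming the saco_veval_{domain}_{split}.json naming convention."""
    pairs = []
    for domain in DOMAINS:
        for split in SPLITS:
            stem = f"saco_veval_{domain}_{split}"
            pairs.append((
                os.path.join(gt_dir, f"{stem}.json"),
                os.path.join(pred_dir, f"{stem}_pred.json"),
            ))
    return pairs

print(len(eval_file_pairs("data/annotation", "data/predictions")))  # 6
```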
## Next Steps

- **Run Evaluations**: learn how to evaluate SAM 3 on SA-Co/VEval
- **Overview**: return to the evaluation overview