SA-Co/VEval is an evaluation dataset for promptable concept segmentation and tracking in videos. The benchmark comprises 3 domains, each with validation and test splits, covering diverse video sources from general videos to egocentric footage.

Overview

SA-Co/VEval evaluates models on their ability to segment and track objects specified by noun phrases throughout video sequences. The annotation format is similar to YTVIS format.
Frame Shifting Alert for YT-Temporal-1B: Due to the nature of YouTube videos, re-downloaded videos may not be exactly the same as those used during annotation. Additionally, ffmpeg may produce inconsistent frame extraction across environments, which can affect evaluation reproducibility.

Dataset Composition

3 Video Domains

SA-V

General video frames from Segment Anything Video dataset
  • License: CC-BY-NC 4.0
  • Diverse everyday scenarios

YT-Temporal-1B

YouTube videos from YT-Temporal-1B dataset
  • License: CC-BY-NC 4.0
  • Web video diversity

SmartGlasses

Egocentric videos from smart glasses
  • License: CC-BY-4.0
  • First-person perspective

Installation

Install the dependencies required for SA-Co/VEval:
pip install -e ".[veval]"
This enables:
  • scripts/eval/veval/saco_yt1b_downloader.py - Preparing YT-Temporal-1B frames
  • examples/saco_veval_eval_example.ipynb - Running offline evaluator
  • examples/saco_veval_vis_example.ipynb - Loading and visualizing data

Expected Folder Structure

data/
├── annotation/
│   ├── saco_veval_sav_test.json
│   ├── saco_veval_sav_val.json
│   ├── saco_veval_smartglasses_test.json
│   ├── saco_veval_smartglasses_val.json
│   ├── saco_veval_yt1b_test.json
│   └── saco_veval_yt1b_val.json
└── media/
    ├── saco_sav/
    │   └── JPEGImages_24fps/
    ├── saco_sg/
    │   └── JPEGImages_6fps/
    └── saco_yt1b/
        └── JPEGImages_6fps/

Download Dataset

Annotations

Download GT annotations from Hugging Face:
# All annotation files are in annotation/ directory:
# - saco_veval_sav_test.json / saco_veval_sav_val.json
# - saco_veval_yt1b_test.json / saco_veval_yt1b_val.json
# - saco_veval_smartglasses_test.json / saco_veval_smartglasses_val.json

Ready-to-Use Data (Roboflow)

Download preprocessed data from Roboflow:

Download via Preprocessing Steps

Download SA-V Videos

Follow the download instructions on the SA-V dataset page. You only need:
  • sav_test.tar
  • sav_val.tar

Extract and Merge

cd data/media/saco_sav

# Download using dynamic links from SA-V website
wget -O sav_test.tar <sav_test.tar download link>
wget -O sav_val.tar <sav_val.tar download link>

# Extract
tar -xf sav_test.tar
tar -xf sav_val.tar

# Merge JPEGImages_24fps folders
mkdir JPEGImages_24fps
chmod -R u+w sav_test/ sav_val/
mv sav_test/JPEGImages_24fps/* JPEGImages_24fps/
mv sav_val/JPEGImages_24fps/* JPEGImages_24fps/
Ignore the Annotations_6fps folders - those contain SAM 2 annotations, not SA-Co/VEval annotations.

Annotation Format

The format is similar to YTVIS format.

JSON Structure

data {
    "info": info,
    "videos": [video],
    "annotations": [annotation],  # Only positive masklets
    "categories": [category],  # Global noun phrase ID map
    "video_np_pairs": [video_np_pair]  # Both positive and negative
}

Field Definitions

info {
  "version": "v1",
  "date": "2025-09-24",
  "description": "SA-Co/VEval SA-V Test"
}
video {
    "id": int,
    "video_name": str,  # e.g., "sav_000000"
    "file_names": List[str],
    "height": int,
    "width": int,
    "length": int
}
List of positive masklets only:
annotation {
    "id": int,
    "segmentations": List[RLE],  # One per frame
    "bboxes": List[List[int, int, int, int]],
    "areas": List[int],
    "iscrowd": int,
    "video_id": str,
    "height": int,
    "width": int,
    "category_id": int,  # Maps to categories
    "noun_phrase": str
}
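The `segmentations`, `bboxes`, and `areas` lists are parallel: index `t` holds the mask, box, and area on frame `t`. A sketch with a synthetic annotation; we assume (as in YTVIS-style annotations) that frames where the masklet is absent hold `None`, and real `segmentations` entries are COCO-style RLE dicts:

```python
def frames_with_object(annotation):
    """Return the frame indices where the masklet is present.

    Assumes absent frames are marked with a None segmentation entry.
    """
    return [t for t, seg in enumerate(annotation["segmentations"]) if seg is not None]

# Synthetic annotation: present on frames 0 and 2, absent on frame 1.
ann = {
    "segmentations": [{"size": [4, 4], "counts": "44"}, None,
                      {"size": [4, 4], "counts": "44"}],
    "bboxes": [[0, 0, 2, 2], None, [1, 1, 2, 2]],
    "areas": [4, None, 4],
}
```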
Global noun phrase ID mapping across all 3 domains:
category {
    "id": int,
    "name": str  # The noun phrase
}
All video-NP pairs (positive and negative):
video_np_pair {
    "id": int,
    "video_id": str,  # Maps to videos.id
    "category_id": int,  # Maps to categories.id
    "noun_phrase": str,
    "num_masklets": int  # > 0 = positive, 0 = negative
}
  • num_masklets > 0: Positive pair (object exists)
  • num_masklets = 0: Negative pair (object doesn’t exist)
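Since `num_masklets` alone distinguishes the two cases, splitting the pairs is a one-pass filter. A minimal sketch; the helper name is ours:

```python
def split_pairs(video_np_pairs):
    """Split video-NP pairs into positive (object present) and negative."""
    positive = [p for p in video_np_pairs if p["num_masklets"] > 0]
    negative = [p for p in video_np_pairs if p["num_masklets"] == 0]
    return positive, negative
```

Negative pairs matter for evaluation: a model should predict no masklets for them, and hallucinated masks on negative pairs lower the score.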

Example Visualization

To explore the annotation format and visualize the data, run:
jupyter notebook examples/saco_veval_vis_example.ipynb

Benchmark Results

Performance Across Domains

| Model | SA-V cgF1 | SA-V pHOTA | YT1B cgF1 | YT1B pHOTA | SmartGlasses cgF1 | SmartGlasses pHOTA |
|---|---|---|---|---|---|---|
| Human | 53.1 | 70.5 | 71.2 | 78.4 | 58.5 | 72.3 |
| SAM 3 | 30.3 | 58.0 | 50.8 | 69.9 | 36.4 | 63.6 |

Additional Benchmarks

SAM 3 also achieves strong performance on standard video segmentation benchmarks:
| Benchmark | Metric | SAM 3 Score |
|---|---|---|
| LVVIS test | mAP | 36.3 |
| BURST test | HOTA | 44.5 |
pHOTA (Promptable Higher Order Tracking Accuracy) measures both detection and tracking quality over time, accounting for text prompt conditioning.

Running Offline Evaluation

Two evaluation methods are provided:

1. Example Notebook

# Load eval results or run evaluation on the fly
jupyter notebook examples/saco_veval_eval_example.ipynb

2. Evaluation Script

The script sam3/eval/saco_veval_eval.py supports two modes:
python sam3/eval/saco_veval_eval.py one \
  --gt_annot_file data/annotation/saco_veval_sav_test.json \
  --pred_file data/predictions/saco_veval_sav_test_pred.json \
  --eval_res_file data/results/saco_veval_sav_test_eval_res.json
Parameters:
  • gt_annot_file - Ground truth annotation file
  • pred_file - Model predictions in the same format
  • eval_res_file - Output file for evaluation results
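To evaluate all three domains for a split, the single-file mode can simply be looped over the annotation files. A sketch assuming the folder layout above; the helper and the looping are ours, while the `one` subcommand and flags come from the script:

```python
import subprocess
from pathlib import Path

DOMAINS = ["sav", "yt1b", "smartglasses"]

def eval_commands(split="test", data_root="data"):
    """Build a saco_veval_eval.py command line for each domain."""
    root = Path(data_root)
    cmds = []
    for d in DOMAINS:
        cmds.append([
            "python", "sam3/eval/saco_veval_eval.py", "one",
            "--gt_annot_file", str(root / "annotation" / f"saco_veval_{d}_{split}.json"),
            "--pred_file", str(root / "predictions" / f"saco_veval_{d}_{split}_pred.json"),
            "--eval_res_file", str(root / "results" / f"saco_veval_{d}_{split}_eval_res.json"),
        ])
    return cmds

# To actually run the evaluations (predictions must exist first):
# for cmd in eval_commands():
#     subprocess.run(cmd, check=True)
```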

Next Steps

Run Evaluations

Learn how to evaluate SAM 3 on SA-Co/VEval

Overview

Return to evaluation overview
