Overview

The cgF1 (Comprehensive Grounding F1) metric evaluates segmentation models in realistic downstream application settings. It combines instance-level matching with image-level binary classification.

Metric Definition

cgF1 is computed as:
cgF1 = positive_micro_F1 × IL_MCC
Where:
  • positive_micro_F1: Micro-averaged F1 score across images with ground truth objects
  • IL_MCC: Image-level Matthews Correlation Coefficient
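
Given the two components, the final score is their product. A minimal sketch (the function name here is illustrative, not part of the library's API):

```python
def cgf1(positive_micro_f1: float, il_mcc: float) -> float:
    # cgF1 = positive_micro_F1 × IL_MCC (see Metric Definition above).
    # positive_micro_F1 lies in [0, 1]; IL_MCC lies in [-1, 1].
    return positive_micro_f1 * il_mcc

print(cgf1(0.8, 0.5))  # 0.4
```

A strong instance-level F1 is thus discounted whenever the model cannot reliably tell object-present images from empty ones.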

CGF1Evaluator

Class Initialization

from sam3.eval.cgf1_eval import CGF1Evaluator

evaluator = CGF1Evaluator(
    gt_path,
    iou_type="segm",
    verbose=False
)

Parameters

gt_path
str | list[str]
required
Path(s) to ground truth COCO JSON file(s). Multiple paths enable oracle evaluation.
iou_type
str
default: "segm"
Type of IoU evaluation: "segm" for masks or "bbox" for boxes.
verbose
bool
default: False
Whether to print detailed evaluation progress.

evaluate

Run cgF1 evaluation on predictions.
results = evaluator.evaluate(pred_file)
pred_file
str
required
Path to predictions COCO JSON file.
results
dict
Dictionary of metric values with keys like:
  • cgF1_eval_segm_cgF1: Main cgF1 score
  • cgF1_eval_segm_precision: Precision
  • cgF1_eval_segm_recall: Recall
  • cgF1_eval_segm_F1: Standard F1
  • cgF1_eval_segm_IL_F1: Image-level F1
  • cgF1_eval_segm_IL_MCC: Image-level MCC
  • Additional metrics at IoU 0.5 and 0.75

Metrics Explained

Instance-Level Metrics

These metrics are computed at the instance (object) level:
  • cgF1: Main metric combining instance- and image-level performance
  • precision: True Positives / (True Positives + False Positives)
  • recall: True Positives / (True Positives + False Negatives)
  • F1: Harmonic mean of precision and recall
  • positive_micro_F1: Micro-averaged F1 on images with ground truth (excludes true negatives)
  • positive_micro_precision: Micro-averaged precision on positive images
  • positive_macro_F1: Macro-averaged F1 across positive images
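
The precision/recall/F1 definitions above can be sketched directly from matched-instance counts (a hedged illustration, not the evaluator's internal code):

```python
def instance_metrics(tp: int, fp: int, fn: int) -> dict:
    """Instance-level metrics from true positive, false positive,
    and false negative counts after prediction-to-GT matching."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall, "F1": f1}

print(instance_metrics(tp=8, fp=2, fn=2))
```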

Image-Level Metrics

These treat each image as a binary classification (has objects vs. no objects):
  • IL_precision: Image-level precision
  • IL_recall: Image-level recall
  • IL_F1: Image-level F1 score
  • IL_FPR: Image-level false positive rate
  • IL_MCC: Image-level Matthews Correlation Coefficient
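
A sketch of these image-level quantities from a binary confusion matrix, using the standard MCC formula (illustrative helper, not the library's API):

```python
import math

def image_level_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Image-level binary classification: 'image has objects' vs. 'no objects'."""
    il_precision = tp / (tp + fp) if (tp + fp) else 0.0
    il_recall = tp / (tp + fn) if (tp + fn) else 0.0
    il_f1 = (2 * il_precision * il_recall / (il_precision + il_recall)
             if (il_precision + il_recall) else 0.0)
    il_fpr = fp / (fp + tn) if (fp + tn) else 0.0
    # Matthews Correlation Coefficient: stays informative even when
    # positive and negative images are heavily imbalanced
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    il_mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"IL_precision": il_precision, "IL_recall": il_recall,
            "IL_F1": il_f1, "IL_FPR": il_fpr, "IL_MCC": il_mcc}

print(image_level_metrics(tp=5, fp=0, fn=0, tn=5))
```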

IoU Thresholds

Metrics are reported at different IoU thresholds:
  • No suffix: Averaged over IoU ∈ [0.5, 0.95] (step 0.05)
  • @0.5: IoU threshold = 0.5
  • @0.75: IoU threshold = 0.75
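
The suffix-free values are means over the COCO-style threshold grid; the grid itself is easy to reproduce (a small illustration, assuming the standard 10-step grid):

```python
# IoU thresholds 0.5, 0.55, ..., 0.95 (step 0.05)
thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]
print(thresholds)
# e.g. a suffix-free F1 would be the mean of F1@t over these thresholds
```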

Example Usage

Basic Evaluation

from sam3.eval.cgf1_eval import CGF1Evaluator

# Initialize evaluator
evaluator = CGF1Evaluator(
    gt_path="ground_truth.json",
    iou_type="segm",
    verbose=True
)

# Run evaluation
results = evaluator.evaluate("predictions.json")

# Print main metric
print(f"cgF1: {results['cgF1_eval_segm_cgF1']:.3f}")
print(f"Precision: {results['cgF1_eval_segm_precision']:.3f}")
print(f"Recall: {results['cgF1_eval_segm_recall']:.3f}")

Oracle Evaluation

Oracle evaluation uses multiple ground truth annotations and picks the best match:
# Multiple annotators provided different ground truths
evaluator = CGF1Evaluator(
    gt_path=[
        "annotator1.json",
        "annotator2.json",
        "annotator3.json"
    ],
    iou_type="segm"
)

results = evaluator.evaluate("predictions.json")

All Metrics

results = evaluator.evaluate("predictions.json")

# Instance-level metrics (averaged over IoU thresholds)
print(f"cgF1: {results['cgF1_eval_segm_cgF1']:.3f}")
print(f"Precision: {results['cgF1_eval_segm_precision']:.3f}")
print(f"Recall: {results['cgF1_eval_segm_recall']:.3f}")
print(f"F1: {results['cgF1_eval_segm_F1']:.3f}")
print(f"Positive Micro F1: {results['cgF1_eval_segm_positive_micro_F1']:.3f}")

# Image-level metrics
print(f"IL Precision: {results['cgF1_eval_segm_IL_precision']:.3f}")
print(f"IL Recall: {results['cgF1_eval_segm_IL_recall']:.3f}")
print(f"IL F1: {results['cgF1_eval_segm_IL_F1']:.3f}")
print(f"IL MCC: {results['cgF1_eval_segm_IL_MCC']:.3f}")

# Metrics at specific IoU thresholds
print(f"cgF1@0.5: {results['cgF1_eval_segm_cgF1@0.5']:.3f}")
print(f"cgF1@0.75: {results['cgF1_eval_segm_cgF1@0.75']:.3f}")

Ground Truth Format

Ground truth file must be in COCO format with an additional field:
{
  "images": [
    {
      "id": 1,
      "width": 640,
      "height": 480,
      "is_instance_exhaustive": true  // Required: whether all instances are annotated
    }
  ],
  "annotations": [
    {
      "id": 1,
      "image_id": 1,
      "category_id": 1,
      "segmentation": {...},  // RLE format
      "area": 5000,
      "bbox": [x, y, w, h]
    }
  ],
  "categories": [...]
}
Important: Only images with "is_instance_exhaustive": true are evaluated.
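
A minimal sketch of that filtering step (the helper name is illustrative, not the evaluator's internal code):

```python
def evaluable_image_ids(gt: dict) -> set:
    """Ids of images eligible for cgF1 scoring.

    Only images flagged "is_instance_exhaustive": true are evaluated;
    everything else is skipped.
    """
    return {img["id"] for img in gt["images"]
            if img.get("is_instance_exhaustive", False)}

gt = {"images": [
    {"id": 1, "is_instance_exhaustive": True},
    {"id": 2, "is_instance_exhaustive": False},
]}
print(evaluable_image_ids(gt))  # {1}
```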

Predictions Format

Predictions must be in COCO result format:
[
  {
    "image_id": 1,
    "category_id": 1,
    "segmentation": {...},  // RLE format
    "score": 0.95
  }
]
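
Writing such a file is plain JSON serialization. In the sketch below the `"counts"` value is a placeholder — real RLE strings come from an encoder such as `pycocotools.mask.encode`:

```python
import json

# Illustrative predictions list; replace "<RLE string>" with a real
# RLE encoding of the predicted mask.
predictions = [
    {
        "image_id": 1,
        "category_id": 1,
        "segmentation": {"size": [480, 640], "counts": "<RLE string>"},
        "score": 0.95,
    },
]

with open("predictions.json", "w") as f:
    json.dump(predictions, f)
```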

Understanding cgF1

Why cgF1?

Traditional metrics like COCO AP have limitations:
  1. Insensitive to false positives: High AP even with many false detections
  2. No penalty for hallucinations: Predicting objects on empty images is not penalized
  3. Not application-focused: Doesn’t reflect downstream task performance
cgF1 addresses these by:
  • Penalizing false positives through precision
  • Evaluating image-level binary classification (object present or not)
  • Combining both aspects into a single metric

When to Use cgF1

Use cgF1 when:
  • Evaluating open-vocabulary or referring expression models
  • False positives are costly in your application
  • You need to detect when objects are absent
  • Comparing models on realistic downstream tasks
Use COCO AP when:
  • You need traditional object detection evaluation
  • All images contain objects (no negatives)
  • You care purely about localization quality

Advanced Usage

Custom IoU Threshold

The evaluator computes metrics across IoU thresholds [0.5, 0.95] by default. Access specific thresholds:
results = evaluator.evaluate("predictions.json")

# IoU threshold 0.5 (easier)
print(f"cgF1@0.5: {results['cgF1_eval_segm_cgF1@0.5']:.3f}")

# IoU threshold 0.75 (harder)
print(f"cgF1@0.75: {results['cgF1_eval_segm_cgF1@0.75']:.3f}")

Box Evaluation

# Evaluate bounding boxes instead of masks
evaluator = CGF1Evaluator(
    gt_path="ground_truth.json",
    iou_type="bbox"  # Use box IoU
)

results = evaluator.evaluate("box_predictions.json")

Implementation Notes

  • Uses Hungarian matching for instance assignment
  • Confidence threshold is fixed at 0.5 (can be modified in the CGF1Eval class)
  • Image-level metrics use binary classification (any object vs. no object)
  • Excludes images marked as not exhaustively annotated
  • For oracle evaluation, selects best F1 among multiple ground truths per image
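
To illustrate the matching step: an optimal one-to-one assignment maximizes total IoU, discarding pairs below the IoU cutoff. The brute-force version below is exponential and for tiny inputs only — a real implementation uses Hungarian matching (e.g. `scipy.optimize.linear_sum_assignment`):

```python
from itertools import permutations

def best_matching(iou, iou_thresh=0.5):
    """Brute-force the one-to-one prediction/GT assignment that
    maximizes total IoU. Pairs below iou_thresh are discarded."""
    n_pred, n_gt = len(iou), len(iou[0])
    k = min(n_pred, n_gt)
    best, best_pairs = -1.0, []
    for preds in permutations(range(n_pred), k):
        for gts in permutations(range(n_gt), k):
            pairs = [(p, g) for p, g in zip(preds, gts)
                     if iou[p][g] >= iou_thresh]
            total = sum(iou[p][g] for p, g in pairs)
            if total > best:
                best, best_pairs = total, pairs
    return best_pairs

iou = [[0.9, 0.1],
       [0.2, 0.8]]
print(best_matching(iou))  # [(0, 0), (1, 1)]
```

Matched pairs become true positives; unmatched predictions become false positives and unmatched ground-truth instances become false negatives.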

See Also

  • COCO Evaluation - Standard COCO metrics
  • SAM 3 paper for detailed cgF1 definition and motivation
