Overview

The cgF1 (Comprehensive Grounding F1) metric evaluates segmentation models in realistic downstream application settings. It combines instance-level matching with image-level binary classification.

Metric Definition

cgF1 is computed as:
cgF1 = positive_micro_F1 × IL_MCC
Where:
  • positive_micro_F1: Micro-averaged F1 score across images with ground truth objects
  • IL_MCC: Image-level Matthews Correlation Coefficient
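
Given the two components, the final score is their product. A minimal sketch (the function name here is illustrative, not part of the library's API):

```python
def cgf1(positive_micro_f1: float, il_mcc: float) -> float:
    # cgF1 = positive_micro_F1 × IL_MCC (see Metric Definition above).
    # positive_micro_F1 lies in [0, 1]; IL_MCC lies in [-1, 1].
    return positive_micro_f1 * il_mcc

print(cgf1(0.8, 0.5))  # 0.4
```

A strong instance-level F1 is thus discounted whenever the model cannot reliably tell object-present images from empty ones.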

CGF1Evaluator

Class Initialization

from sam3.eval.cgf1_eval import CGF1Evaluator

evaluator = CGF1Evaluator(
    gt_path,
    iou_type="segm",
    verbose=False
)

Parameters

gt_path
str | list[str]
required
Path(s) to ground truth COCO JSON file(s). Multiple paths enable oracle evaluation.
iou_type
str
default: "segm"
Type of IoU evaluation: "segm" for masks or "bbox" for boxes.
verbose
bool
default: False
Whether to print detailed evaluation progress.

evaluate

Run cgF1 evaluation on predictions.
results = evaluator.evaluate(pred_file)
pred_file
str
required
Path to predictions COCO JSON file.
results
dict
Dictionary of metric values with keys like:
  • cgF1_eval_segm_cgF1: Main cgF1 score
  • cgF1_eval_segm_precision: Precision
  • cgF1_eval_segm_recall: Recall
  • cgF1_eval_segm_F1: Standard F1
  • cgF1_eval_segm_IL_F1: Image-level F1
  • cgF1_eval_segm_IL_MCC: Image-level MCC
  • Additional metrics at IoU 0.5 and 0.75

Metrics Explained

Instance-Level Metrics

These metrics are computed at the instance (object) level:
  • cgF1: Main metric combining instance- and image-level performance
  • precision: True Positives / (True Positives + False Positives)
  • recall: True Positives / (True Positives + False Negatives)
  • F1: Harmonic mean of precision and recall
  • positive_micro_F1: Micro-averaged F1 on images with ground truth (excludes true negatives)
  • positive_micro_precision: Micro-averaged precision on positive images
  • positive_macro_F1: Macro-averaged F1 across positive images
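
The precision/recall/F1 definitions above can be sketched directly from matched-instance counts (a hedged illustration, not the evaluator's internal code):

```python
def instance_metrics(tp: int, fp: int, fn: int) -> dict:
    """Instance-level metrics from true positive, false positive,
    and false negative counts after prediction-to-GT matching."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall, "F1": f1}

print(instance_metrics(tp=8, fp=2, fn=2))
```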

Image-Level Metrics

These treat each image as a binary classification (has objects vs. no objects):
  • IL_precision: Image-level precision
  • IL_recall: Image-level recall
  • IL_F1: Image-level F1 score
  • IL_FPR: Image-level false positive rate
  • IL_MCC: Image-level Matthews Correlation Coefficient
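
A sketch of these image-level quantities from a binary confusion matrix, using the standard MCC formula (illustrative helper, not the library's API):

```python
import math

def image_level_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Image-level binary classification: 'image has objects' vs. 'no objects'."""
    il_precision = tp / (tp + fp) if (tp + fp) else 0.0
    il_recall = tp / (tp + fn) if (tp + fn) else 0.0
    il_f1 = (2 * il_precision * il_recall / (il_precision + il_recall)
             if (il_precision + il_recall) else 0.0)
    il_fpr = fp / (fp + tn) if (fp + tn) else 0.0
    # Matthews Correlation Coefficient: stays informative even when
    # positive and negative images are heavily imbalanced
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    il_mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"IL_precision": il_precision, "IL_recall": il_recall,
            "IL_F1": il_f1, "IL_FPR": il_fpr, "IL_MCC": il_mcc}

print(image_level_metrics(tp=5, fp=0, fn=0, tn=5))
```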

IoU Thresholds

Metrics are reported at different IoU thresholds:
  • No suffix: Averaged over IoU ∈ [0.5, 0.95] (step 0.05)
  • @0.5: IoU threshold = 0.5
  • @0.75: IoU threshold = 0.75
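
The suffix-free values are means over the COCO-style threshold grid; the grid itself is easy to reproduce (a small illustration, assuming the standard 10-step grid):

```python
# IoU thresholds 0.5, 0.55, ..., 0.95 (step 0.05)
thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]
print(thresholds)
# e.g. a suffix-free F1 would be the mean of F1@t over these thresholds
```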

Example Usage

Basic Evaluation

from sam3.eval.cgf1_eval import CGF1Evaluator

# Initialize evaluator
evaluator = CGF1Evaluator(
    gt_path="ground_truth.json",
    iou_type="segm",
    verbose=True
)

# Run evaluation
results = evaluator.evaluate("predictions.json")

# Print main metric
print(f"cgF1: {results['cgF1_eval_segm_cgF1']:.3f}")
print(f"Precision: {results['cgF1_eval_segm_precision']:.3f}")
print(f"Recall: {results['cgF1_eval_segm_recall']:.3f}")

Oracle Evaluation

Oracle evaluation uses multiple ground truth annotations and picks the best match:
# Multiple annotators provided different ground truths
evaluator = CGF1Evaluator(
    gt_path=[
        "annotator1.json",
        "annotator2.json",
        "annotator3.json"
    ],
    iou_type="segm"
)

results = evaluator.evaluate("predictions.json")

All Metrics

results = evaluator.evaluate("predictions.json")

# Instance-level metrics (averaged over IoU thresholds)
print(f"cgF1: {results['cgF1_eval_segm_cgF1']:.3f}")
print(f"Precision: {results['cgF1_eval_segm_precision']:.3f}")
print(f"Recall: {results['cgF1_eval_segm_recall']:.3f}")
print(f"F1: {results['cgF1_eval_segm_F1']:.3f}")
print(f"Positive Micro F1: {results['cgF1_eval_segm_positive_micro_F1']:.3f}")

# Image-level metrics
print(f"IL Precision: {results['cgF1_eval_segm_IL_precision']:.3f}")
print(f"IL Recall: {results['cgF1_eval_segm_IL_recall']:.3f}")
print(f"IL F1: {results['cgF1_eval_segm_IL_F1']:.3f}")
print(f"IL MCC: {results['cgF1_eval_segm_IL_MCC']:.3f}")

# Metrics at specific IoU thresholds
print(f"cgF1@0.5: {results['cgF1_eval_segm_cgF1@0.5']:.3f}")
print(f"cgF1@0.75: {results['cgF1_eval_segm_cgF1@0.75']:.3f}")

Ground Truth Format

Ground truth file must be in COCO format with an additional field:
{
  "images": [
    {
      "id": 1,
      "width": 640,
      "height": 480,
      "is_instance_exhaustive": true  // Required: whether all instances are annotated
    }
  ],
  "annotations": [
    {
      "id": 1,
      "image_id": 1,
      "category_id": 1,
      "segmentation": {...},  // RLE format
      "area": 5000,
      "bbox": [x, y, w, h]
    }
  ],
  "categories": [...]
}
Important: Only images with "is_instance_exhaustive": true are evaluated.
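
A minimal sketch of that filtering step (the helper name is illustrative, not the evaluator's internal code):

```python
def evaluable_image_ids(gt: dict) -> set:
    """Ids of images eligible for cgF1 scoring.

    Only images flagged "is_instance_exhaustive": true are evaluated;
    everything else is skipped.
    """
    return {img["id"] for img in gt["images"]
            if img.get("is_instance_exhaustive", False)}

gt = {"images": [
    {"id": 1, "is_instance_exhaustive": True},
    {"id": 2, "is_instance_exhaustive": False},
]}
print(evaluable_image_ids(gt))  # {1}
```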

Predictions Format

Predictions must be in COCO result format:
[
  {
    "image_id": 1,
    "category_id": 1,
    "segmentation": {...},  // RLE format
    "score": 0.95
  }
]
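
Writing such a file is plain JSON serialization. In the sketch below the `"counts"` value is a placeholder — real RLE strings come from an encoder such as `pycocotools.mask.encode`:

```python
import json

# Illustrative predictions list; replace "<RLE string>" with a real
# RLE encoding of the predicted mask.
predictions = [
    {
        "image_id": 1,
        "category_id": 1,
        "segmentation": {"size": [480, 640], "counts": "<RLE string>"},
        "score": 0.95,
    },
]

with open("predictions.json", "w") as f:
    json.dump(predictions, f)
```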

Understanding cgF1

Why cgF1?

Traditional metrics like COCO AP have limitations:
  1. Insensitive to false positives: High AP even with many false detections
  2. No penalty for hallucinations: Predicting objects on empty images is not penalized
  3. Not application-focused: Doesn’t reflect downstream task performance
cgF1 addresses these by:
  • Penalizing false positives through precision
  • Evaluating image-level binary classification (object present or not)
  • Combining both aspects into a single metric

When to Use cgF1

Use cgF1 when:
  • Evaluating open-vocabulary or referring expression models
  • False positives are costly in your application
  • You need to detect when objects are absent
  • Comparing models on realistic downstream tasks
Use COCO AP when:
  • You need traditional object detection evaluation
  • All images contain objects (no negatives)
  • You care purely about localization quality

Advanced Usage

Custom IoU Threshold

The evaluator computes metrics across IoU thresholds [0.5, 0.95] by default. Access specific thresholds:
results = evaluator.evaluate("predictions.json")

# IoU threshold 0.5 (easier)
print(f"cgF1@0.5: {results['cgF1_eval_segm_cgF1@0.5']:.3f}")

# IoU threshold 0.75 (harder)
print(f"cgF1@0.75: {results['cgF1_eval_segm_cgF1@0.75']:.3f}")

Box Evaluation

# Evaluate bounding boxes instead of masks
evaluator = CGF1Evaluator(
    gt_path="ground_truth.json",
    iou_type="bbox"  # Use box IoU
)

results = evaluator.evaluate("box_predictions.json")

Implementation Notes

  • Uses Hungarian matching for instance assignment
  • Confidence threshold is fixed at 0.5 (can be modified in the CGF1Eval class)
  • Image-level metrics use binary classification (any object vs. no object)
  • Excludes images marked as not exhaustively annotated
  • For oracle evaluation, selects best F1 among multiple ground truths per image
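
To illustrate the matching step: an optimal one-to-one assignment maximizes total IoU, discarding pairs below the IoU cutoff. The brute-force version below is exponential and for tiny inputs only — a real implementation uses Hungarian matching (e.g. `scipy.optimize.linear_sum_assignment`):

```python
from itertools import permutations

def best_matching(iou, iou_thresh=0.5):
    """Brute-force the one-to-one prediction/GT assignment that
    maximizes total IoU. Pairs below iou_thresh are discarded."""
    n_pred, n_gt = len(iou), len(iou[0])
    k = min(n_pred, n_gt)
    best, best_pairs = -1.0, []
    for preds in permutations(range(n_pred), k):
        for gts in permutations(range(n_gt), k):
            pairs = [(p, g) for p, g in zip(preds, gts)
                     if iou[p][g] >= iou_thresh]
            total = sum(iou[p][g] for p, g in pairs)
            if total > best:
                best, best_pairs = total, pairs
    return best_pairs

iou = [[0.9, 0.1],
       [0.2, 0.8]]
print(best_matching(iou))  # [(0, 0), (1, 1)]
```

Matched pairs become true positives; unmatched predictions become false positives and unmatched ground-truth instances become false negatives.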

See Also

  • COCO Evaluation - Standard COCO metrics
  • SAM 3 paper for detailed cgF1 definition and motivation
