Overview

The CocoEvaluator class provides standard COCO evaluation metrics (AP, AR) for segmentation and detection tasks with distributed training support.

CocoEvaluator

Class Initialization

from sam3.eval.coco_eval import CocoEvaluator

evaluator = CocoEvaluator(
    coco_gt,
    iou_types=["segm"],
    useCats=False,
    dump_dir=None,
    postprocessor=None,
    average_by_rarity=False,
    use_normalized_areas=True,
    maxdets=[1, 10, 100],
    exhaustive_only=False,
    all_exhaustive_only=True
)

Parameters

coco_gt
COCO | list[COCO]
required
COCO API object(s) containing ground truth annotations. Can be a single COCO object or a list of COCO objects for oracle evaluation.
iou_types
list[str]
required
Types of IoU to evaluate: ["segm"] for masks, ["bbox"] for boxes, or both.
useCats
bool
required
Whether to use categories for evaluation. Set False for open-vocabulary tasks.
dump_dir
str | None
required
Directory to dump predictions. If None, predictions are not saved.
postprocessor
object
required
Postprocessor module to convert model outputs to COCO format.
average_by_rarity
bool
default:"False"
Whether to compute AP separately for different object rarity buckets and average.
use_normalized_areas
bool
default:"True"
Whether object areas are normalized by image area. Affects size bucket definitions.
maxdets
list[int]
default:"[1, 10, 100]"
Maximum number of detections to evaluate per image.
exhaustive_only
bool
default:"False"
Whether to restrict evaluation to exhaustively annotated images only.
all_exhaustive_only
bool
default:"True"
Whether to require all ground truth sources to be exhaustive (for oracle evaluation).

Methods

update

Update evaluator with model outputs.
evaluator.update(
    model_outputs,
    targets,
    image_ids
)

synchronize_between_processes

Synchronize predictions across distributed processes.
evaluator.synchronize_between_processes()

accumulate

Accumulate evaluation results.
evaluator.accumulate(imgIds=None)

summarize

Compute and print summary metrics.
results = evaluator.summarize()
results
dict
Dictionary containing COCO metrics:
  • coco_eval_masks_AP: Mask AP (averaged over IoU thresholds)
  • coco_eval_masks_AP_50: Mask AP @ IoU=0.5
  • coco_eval_masks_AP_75: Mask AP @ IoU=0.75
  • coco_eval_masks_AP_{size}: AP by size (tiny/small/medium/large/huge/whole_image)
  • coco_eval_masks_AR: Average Recall
  • Similar metrics for bbox if enabled

compute_synced

Run full evaluation pipeline (sync + accumulate + summarize).
results = evaluator.compute_synced()

Example Usage

Basic Evaluation

from pycocotools.coco import COCO
from sam3.eval.coco_eval import CocoEvaluator

# Load ground truth
coco_gt = COCO("annotations.json")

# Initialize evaluator
evaluator = CocoEvaluator(
    coco_gt=coco_gt,
    iou_types=["segm"],
    useCats=False,  # Open-vocabulary
    dump_dir="./predictions",
    postprocessor=my_postprocessor
)

# During evaluation loop
for batch in dataloader:
    outputs = model(batch)
    evaluator.update(outputs, batch["targets"], batch["image_ids"])

# Compute final metrics
results = evaluator.compute_synced()

print(f"Mask AP: {results['coco_eval_masks_AP']:.3f}")
print(f"Mask AP50: {results['coco_eval_masks_AP_50']:.3f}")
print(f"Mask AP75: {results['coco_eval_masks_AP_75']:.3f}")

Distributed Training

import torch.distributed as dist

# Initialize evaluator on all ranks
evaluator = CocoEvaluator(
    coco_gt=coco_gt,
    iou_types=["segm"],
    useCats=True,
    dump_dir="./predictions",
    postprocessor=postprocessor
)

# Each rank processes its data
for batch in dataloader:
    outputs = model(batch)
    evaluator.update(outputs, batch["targets"], batch["image_ids"])

# Synchronize across ranks
evaluator.synchronize_between_processes()

# Only rank 0 accumulates and prints metrics
if dist.get_rank() == 0:
    evaluator.accumulate()
    results = evaluator.summarize()

Box and Mask Evaluation

# Evaluate both boxes and masks
evaluator = CocoEvaluator(
    coco_gt=coco_gt,
    iou_types=["bbox", "segm"],
    useCats=True,
    dump_dir=None,
    postprocessor=postprocessor
)

# ... run evaluation ...

results = evaluator.compute_synced()

print(f"Box AP: {results['coco_eval_bbox_AP']:.3f}")
print(f"Mask AP: {results['coco_eval_masks_AP']:.3f}")

Custom Max Detections

# Evaluate with different max detection thresholds
evaluator = CocoEvaluator(
    coco_gt=coco_gt,
    iou_types=["segm"],
    useCats=False,
    dump_dir=None,
    postprocessor=postprocessor,
    maxdets=[1, 10, 300]  # Custom thresholds
)

Normalized Areas

# When object areas are normalized by image area
evaluator = CocoEvaluator(
    coco_gt=coco_gt,
    iou_types=["segm"],
    useCats=False,
    dump_dir=None,
    postprocessor=postprocessor,
    use_normalized_areas=True  # Adjusts size buckets
)

# Size buckets become:
# - tiny: [0, 0.001]
# - small: [0.001, 0.01]
# - medium: [0.01, 0.1]
# - large: [0.1, 0.5]
# - huge: [0.5, 0.95]
# - whole_image: [0.95, inf]
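
The bucket boundaries above can be expressed as a small lookup helper. This is an illustrative sketch (the helper name and structure are ours, not part of the CocoEvaluator API); it only encodes the normalized-area boundaries listed in the comment above.

```python
# Size buckets for normalized areas (object area as a fraction of the
# image area), matching the boundaries listed above.
SIZE_BUCKETS = [
    ("tiny", 0.0, 0.001),
    ("small", 0.001, 0.01),
    ("medium", 0.01, 0.1),
    ("large", 0.1, 0.5),
    ("huge", 0.5, 0.95),
    ("whole_image", 0.95, float("inf")),
]

def size_bucket(object_area: float, image_area: float) -> str:
    """Return the size-bucket name for an object, given raw pixel areas."""
    frac = object_area / image_area
    for name, lo, hi in SIZE_BUCKETS:
        if lo <= frac < hi:
            return name
    return "whole_image"
```

For example, a 3,200-pixel object in a 640x480 image covers about 1% of the image and falls in the "medium" bucket.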

Metrics Explained

Average Precision (AP)

AP - Mean AP over IoU thresholds [0.5, 0.95] with step 0.05
AP_50 - AP at IoU threshold 0.5 (loose localization)
AP_75 - AP at IoU threshold 0.75 (strict localization)
AP_{size} - AP for specific object sizes:
  • tiny: Very small objects (area < 0.1% of image)
  • small: Small objects (0.1% - 1% of image)
  • medium: Medium objects (1% - 10% of image)
  • large: Large objects (10% - 50% of image)
  • huge: Very large objects (50% - 95% of image)
  • whole_image: Nearly entire image (> 95%)
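
The averaging behind the primary AP metric can be sketched in a few lines. The per-threshold AP values below are hypothetical numbers purely for illustration; only the 10-threshold schedule (0.50 to 0.95, step 0.05) reflects the metric definition.

```python
# COCO's primary AP is the mean of per-threshold AP over 10 IoU
# thresholds, 0.50 to 0.95 in steps of 0.05. AP_50 and AP_75 are the
# entries at IoU=0.50 and IoU=0.75.
iou_thresholds = [round(0.50 + 0.05 * i, 2) for i in range(10)]

# Hypothetical per-threshold AP values, for illustration only:
ap_per_iou = [0.70 - 0.04 * i for i in range(10)]

primary_ap = sum(ap_per_iou) / len(ap_per_iou)
ap_50 = ap_per_iou[iou_thresholds.index(0.50)]  # AP at IoU=0.50
ap_75 = ap_per_iou[iou_thresholds.index(0.75)]  # AP at IoU=0.75
```

Because AP_50 only requires 50% overlap while the primary AP averages up to IoU=0.95, AP_50 is always at least as high as the primary AP.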

Average Recall (AR)

AR - Mean recall at the maximum detections threshold
AR_50 - AR at maxDets=50 (if maxdets includes 50)
AR_75 - AR at maxDets=75 (if maxdets includes 75)
AR_{size} - Recall by object size

Postprocessor Requirements

The postprocessor must implement:
class MyPostprocessor:
    def process_results(self, outputs, targets, image_ids):
        """
        Convert model outputs to COCO prediction format.
        
        Returns:
            dict: {image_id: {"masks": ..., "boxes": ..., "scores": ..., "labels": ...}}
        """
        predictions = {}
        for img_id, output in zip(image_ids, outputs):
            predictions[img_id] = {
                "masks": output["masks"],  # (N, H, W) binary masks
                "boxes": output["boxes"],  # (N, 4) boxes in XYXY format
                "scores": output["scores"],  # (N,) confidence scores
                "labels": output["labels"],  # (N,) category IDs
            }
        return predictions

COCO Format Requirements

Ground Truth

{
  "images": [
    {"id": 1, "width": 640, "height": 480, "file_name": "image.jpg"}
  ],
  "annotations": [
    {
      "id": 1,
      "image_id": 1,
      "category_id": 1,
      "segmentation": {"size": [480, 640], "counts": "..."},  // RLE
      "area": 5000,
      "bbox": [x, y, w, h],
      "iscrowd": 0
    }
  ],
  "categories": [
    {"id": 1, "name": "person", "supercategory": "person"}
  ]
}

Predictions

Predictions are automatically converted to:
[
  {
    "image_id": 1,
    "category_id": 1,
    "segmentation": {"size": [480, 640], "counts": "..."},
    "score": 0.95,
    "area": 5000
  }
]
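
The "counts" string above is compressed RLE, normally produced with pycocotools (pycocotools.mask.encode on a Fortran-ordered uint8 array). To show what the encoding represents, here is a hand-rolled sketch of the *uncompressed* counts form; the function name is ours and this is not how CocoEvaluator encodes masks internally.

```python
def binary_mask_to_uncompressed_rle(mask):
    """Encode a binary mask (list of rows of 0/1) as uncompressed COCO RLE.

    COCO RLE scans the mask in column-major (Fortran) order and stores
    run lengths, always starting with the count of leading zeros
    (which may be 0 if the mask starts with a foreground pixel).
    """
    h, w = len(mask), len(mask[0])
    # Flatten column-major, as pycocotools does.
    flat = [mask[r][c] for c in range(w) for r in range(h)]
    counts, prev, run = [], 0, 0
    for px in flat:
        if px == prev:
            run += 1
        else:
            counts.append(run)
            prev, run = px, 1
    counts.append(run)
    return {"size": [h, w], "counts": counts}
```

For a 2x2 mask [[0, 1], [1, 0]], the column-major scan is [0, 1, 1, 0], giving counts [1, 2, 1].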

Notes

  • Uses pycocotools internally
  • Supports distributed evaluation across multiple GPUs
  • Predictions can be dumped to disk for later analysis
  • Size buckets automatically adjusted for normalized areas
  • Compatible with COCO, LVIS, and custom datasets in COCO format
  • For open-vocabulary tasks, set useCats=False
