Overview

PatchCore evaluation computes both image-level and pixel-level anomaly detection performance on test datasets. The evaluation script loads pretrained models and generates comprehensive metrics including AUROC and PRO scores.

Evaluation Script

Use bin/load_and_evaluate_patchcore.py to evaluate saved models:
python bin/load_and_evaluate_patchcore.py \
  --gpu 0 \
  --seed 0 \
  evaluated_results/IM224_WR50 \
  patch_core_loader \
    -p results/models/mvtec_bottle \
    -p results/models/mvtec_cable \
    --faiss_on_gpu \
  dataset \
    --resize 256 \
    --imagesize 224 \
    -d bottle \
    -d cable \
    mvtec /path/to/mvtec/data
1. Specify the output directory: the first positional argument (evaluated_results/IM224_WR50) sets where results will be saved.
2. Configure the model loader: use patch_core_loader with a -p flag per subdataset to specify each saved model path.
3. Configure the dataset: use the dataset command with -d flags to specify which subdatasets to evaluate.

Complete Evaluation Example

Here’s a complete evaluation script for all MVTec AD categories (from sample_evaluation.sh:1):
#!/bin/bash

datapath=/path/to/data/from/mvtec
loadpath=/path/to/pretrained/patchcore/model

modelfolder=IM320_Ensemble_L2-3_P001_D1024-384_PS-3_AN-1
savefolder=evaluated_results'/'$modelfolder

datasets=('bottle'  'cable'  'capsule'  'carpet'  'grid'  'hazelnut' \
          'leather'  'metal_nut'  'pill' 'screw' 'tile' 'toothbrush' \
          'transistor' 'wood' 'zipper')
          
model_flags=($(for dataset in "${datasets[@]}"; do \
  echo '-p '$loadpath'/'$modelfolder'/models/mvtec_'$dataset; done))
  
dataset_flags=($(for dataset in "${datasets[@]}"; do \
  echo '-d '$dataset; done))

python bin/load_and_evaluate_patchcore.py --gpu 0 --seed 0 $savefolder \
patch_core_loader "${model_flags[@]}" --faiss_on_gpu \
dataset --resize 366 --imagesize 320 "${dataset_flags[@]}" mvtec $datapath

Evaluation Metrics

The evaluation computes three key metrics for each dataset:

Image-Level AUROC

# Compute image-level anomaly detection (source:load_and_evaluate_patchcore.py:148)
auroc = patchcore.metrics.compute_imagewise_retrieval_metrics(
    scores, anomaly_labels
)["auroc"]
Measures the ability to classify entire images as normal or anomalous. Uses the maximum anomaly score across all patches in an image.
AUROC (Area Under the Receiver Operating Characteristic Curve):
  • Ranges from 0 to 1 (1.0 = perfect)
  • Measures discrimination ability across all thresholds
  • Robust to class imbalance
  • Standard metric for anomaly detection
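The image-level computation can be illustrated with a small, self-contained sketch. This is not the repository's `compute_imagewise_retrieval_metrics` implementation; it is an equivalent pure-NumPy AUROC based on the Mann-Whitney U statistic, with hypothetical example scores:

```python
import numpy as np

def auroc(scores, labels):
    """AUROC as the probability that a randomly chosen anomalous
    image outscores a randomly chosen normal one (ties count half)."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Hypothetical per-image scores (e.g. max patch distance) and labels
scores = np.array([0.12, 0.95, 0.30, 0.88])  # higher = more anomalous
labels = np.array([0, 1, 0, 1])              # 1 = anomalous image

print(auroc(scores, labels))  # 1.0: every anomalous image outscores every normal one
```

A score of 1.0 means perfect ranking; 0.5 is chance level, which is why the metric is robust to class imbalance.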

Pixel-Level AUROC (Full)

Computed on all test images including normal images:
# Full pixel-level AUROC (source:load_and_evaluate_patchcore.py:154)
pixel_scores = patchcore.metrics.compute_pixelwise_retrieval_metrics(
    segmentations, masks_gt
)
full_pixel_auroc = pixel_scores["auroc"]
Measures pixel-level localization accuracy across the entire test set.
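Pixel-level AUROC treats every pixel as an independent sample. A minimal sketch of the preprocessing, assuming `segmentations` and `masks_gt` are lists of same-shaped 2D arrays (the tiny 2x2 maps here are illustrative):

```python
import numpy as np

# Hypothetical anomaly heatmaps and matching binary ground-truth masks
segmentations = [np.array([[0.1, 0.9], [0.2, 0.8]])]
masks_gt = [np.array([[0, 1], [0, 1]])]

# Flatten all maps into one long vector of per-pixel scores and labels
flat_scores = np.concatenate([s.ravel() for s in segmentations])
flat_labels = np.concatenate([m.ravel() for m in masks_gt])

# flat_scores / flat_labels can now be fed to any AUROC routine,
# e.g. sklearn.metrics.roc_auc_score(flat_labels, flat_scores)
print(flat_scores.shape)  # (4,)
```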

Pixel-Level AUROC (Anomaly Only)

Computed only on images containing anomalies:
# Select only images with anomalies (source:load_and_evaluate_patchcore.py:160)
sel_idxs = []
for i in range(len(masks_gt)):
    if np.sum(masks_gt[i]) > 0:
        sel_idxs.append(i)
        
pixel_scores = patchcore.metrics.compute_pixelwise_retrieval_metrics(
    [segmentations[i] for i in sel_idxs], 
    [masks_gt[i] for i in sel_idxs]
)
anomaly_pixel_auroc = pixel_scores["auroc"]
Provides a more focused measure of localization performance on defective regions.

Example Output

During evaluation, you’ll see progress and results like this:
Evaluating dataset [mvtec_bottle] (1/15)...
Embedding test data with models (1/1)
Inferring...: 100%|████████████| 83/83 [00:24<00:00,  3.38it/s]
Computing evaluation metrics.
instance_auroc: 1.000
full_pixel_auroc: 0.988
anomaly_pixel_auroc: 0.984

-----

Evaluating dataset [mvtec_cable] (2/15)...
...
The Inferring... progress bar shows prediction progress. Evaluation metrics are computed after all predictions are complete.

Understanding the Evaluation Process

The evaluation pipeline (source:load_and_evaluate_patchcore.py:58):
1. Load models and datasets: iterate through dataset/model pairs, loading one PatchCore model per subdataset.

2. Generate predictions: run inference on all test images to get anomaly scores and segmentation maps.

scores, segmentations, labels_gt, masks_gt = PatchCore.predict(
    dataloaders["testing"]
)

3. Normalize scores to the [0, 1] range for consistent thresholding.

min_scores = scores.min(axis=-1).reshape(-1, 1)
max_scores = scores.max(axis=-1).reshape(-1, 1)
scores = (scores - min_scores) / (max_scores - min_scores)

4. Compute metrics: calculate AUROC for image-level detection and pixel-level localization.

5. Save results: store metrics in CSV format with a per-dataset breakdown.
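The normalize/compute/save steps above can be condensed into a sketch. The helper name `evaluate` and the simplified single-metric CSV are hypothetical; the real script spreads this logic across `load_and_evaluate_patchcore.py`:

```python
import csv
import numpy as np

def evaluate(scores, labels_gt, out_csv="results.csv", dataset="mvtec_bottle"):
    """Normalize per-image scores to [0, 1], compute AUROC by pairwise
    ranking, and write one CSV row -- a sketch of steps 3-5 above."""
    scores = np.asarray(scores, dtype=float)
    scores = (scores - scores.min()) / (scores.max() - scores.min())
    pos = scores[np.asarray(labels_gt) == 1]
    neg = scores[np.asarray(labels_gt) == 0]
    auroc = ((pos[:, None] > neg[None, :]).mean()
             + 0.5 * (pos[:, None] == neg[None, :]).mean())
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["dataset_name", "instance_auroc"])
        writer.writerow([dataset, f"{auroc:.3f}"])
    return auroc

result = evaluate([0.1, 0.9, 0.2, 0.8], [0, 1, 0, 1])
print(result)  # 1.0 for this toy input
```

Note that min-max normalization is rank-preserving, so it does not change the AUROC itself; it matters for thresholding and for comparing score maps across models.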

Ensemble Evaluation

For ensemble models, predictions from multiple models are aggregated:
# Aggregate predictions from ensemble members (source:load_and_evaluate_patchcore.py:75)
aggregator = {"scores": [], "segmentations": []}
for i, PatchCore in enumerate(PatchCore_list):
    scores, segmentations, labels_gt, masks_gt = PatchCore.predict(
        dataloaders["testing"]
    )
    aggregator["scores"].append(scores)
    aggregator["segmentations"].append(segmentations)

# Average ensemble predictions after per-member normalization
scores = np.mean(aggregator["scores"], axis=0)
segmentations = np.mean(aggregator["segmentations"], axis=0)
Each ensemble member is normalized independently before averaging. This prevents models with different score ranges from dominating the ensemble.
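A sketch of why per-member normalization matters, assuming `aggregator["scores"]` holds one 1-D score array per ensemble member (the concrete values are made up):

```python
import numpy as np

aggregator = {"scores": [np.array([1.0, 5.0, 3.0]),      # member A, range ~[1, 5]
                         np.array([10.0, 90.0, 50.0])]}  # member B, range ~[10, 90]

normalized = []
for member_scores in aggregator["scores"]:
    lo, hi = member_scores.min(), member_scores.max()
    normalized.append((member_scores - lo) / (hi - lo))  # each member now in [0, 1]

scores = np.mean(normalized, axis=0)  # members contribute equally after scaling
print(scores)  # averages to [0, 1, 0.5]
```

Without the per-member scaling, member B's raw scores (up to 90) would swamp member A's (up to 5) in the average.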

Results Storage

Evaluation results are saved in the specified output directory:
evaluated_results/IM224_WR50/
├── results.csv                    # Aggregate metrics for all datasets
└── segmentation_images/          # (optional) Visual results
    ├── mvtec_bottle/
    ├── mvtec_cable/
    └── ...
The results.csv file contains:
dataset_name,instance_auroc,full_pixel_auroc,anomaly_pixel_auroc
mvtec_bottle,1.000,0.988,0.984
mvtec_cable,0.993,0.982,0.978
...
mean,0.996,0.984,0.980
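For post-processing, the standard csv module is enough to parse results.csv. A sketch using an inline sample that mirrors the layout shown above (column names are taken from that header):

```python
import csv
import io

# Inline sample mirroring the results.csv layout
sample = """dataset_name,instance_auroc,full_pixel_auroc,anomaly_pixel_auroc
mvtec_bottle,1.000,0.988,0.984
mvtec_cable,0.993,0.982,0.978
"""

rows = list(csv.DictReader(io.StringIO(sample)))
mean_auroc = sum(float(r["instance_auroc"]) for r in rows) / len(rows)
print(f"{mean_auroc:.4f}")  # (1.000 + 0.993) / 2 = 0.9965
```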

Saving Segmentation Images

To visualize predictions, use the --save_segmentation_images flag:
python bin/load_and_evaluate_patchcore.py \
  --gpu 0 \
  --save_segmentation_images \
  ...
This generates overlay images showing:
  • Original input image
  • Ground truth mask (if available)
  • Predicted anomaly heatmap
  • Anomaly score

Expected Performance

State-of-the-art pretrained models achieve:
Model            Image AUROC   Pixel AUROC   PRO Score
WR50-baseline    99.2%         98.1%         94.4%
Ensemble         99.6%         98.2%         94.9%
Results may vary slightly due to hardware differences and software versions, but should be within 0.1-0.3% of reported values.

Next Steps

Load Models

Learn how to load and initialize saved models

Batch Prediction

Run inference on custom image datasets
