
Overview

PatchCore supports a wide range of pretrained CNN and Transformer backbone architectures for feature extraction. All backbones ship with ImageNet-family pretrained weights (most on ImageNet-1k, the BiT variants on ImageNet-21k), and multiple backbones can be combined into ensembles for improved performance.

Selecting a Backbone

Specify a backbone using the -b (--backbone_names) flag:

```bash
python bin/run_patchcore.py ... \
  patch_core -b wideresnet50 -le layer2 -le layer3 ...
```

For ensemble models, pass the flag multiple times:

```bash
patch_core -b wideresnet101 -b resnext101 -b densenet201 \
  -le 0.layer2 -le 0.layer3 -le 1.layer2 -le 1.layer3 \
  -le 2.features.denseblock2 -le 2.features.denseblock3 ...
```

Available Backbones

All 40+ supported architectures are defined in src/patchcore/backbones.py.

ResNet Family

Standard ResNet architectures from torchvision and timm.
| Backbone Name | Architecture | Layers | Parameters | Source |
|---|---|---|---|---|
| resnet50 | ResNet-50 | 50 | 25.6M | torchvision |
| resnet101 | ResNet-101 | 101 | 44.5M | torchvision |
| resnext101 | ResNeXt-101-32x8d | 101 | 88.8M | torchvision |
| resnet200 | ResNet-200 | 200 | ~64M | timm |
| resnest50 | ResNeSt-50d | 50 | 27.5M | timm |
Common Feature Layers: layer1, layer2, layer3, layer4
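
To see what these stages actually produce, the sketch below (assuming torchvision is installed) registers forward hooks on each stage of a ResNet-50 and prints the feature-map shapes; mid-network stages like layer2 and layer3 keep useful spatial resolution, which is why they are the usual choices.

```python
# Sketch: inspect the spatial resolution of each ResNet stage.
import torch
import torchvision.models as models

model = models.resnet50(pretrained=False).eval()  # weights don't matter for shapes

features = {}

def hook(name):
    def _hook(module, inputs, output):
        features[name] = output.shape
    return _hook

for name in ["layer1", "layer2", "layer3", "layer4"]:
    getattr(model, name).register_forward_hook(hook(name))

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))

print(features)
# For a 224x224 input:
# layer1: (1, 256, 56, 56), layer2: (1, 512, 28, 28),
# layer3: (1, 1024, 14, 14), layer4: (1, 2048, 7, 7)
```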

Wide ResNet

Recommended for baseline models - excellent balance of performance and efficiency.
| Backbone Name | Architecture | Width Multiplier | Parameters |
|---|---|---|---|
| wideresnet50 | Wide ResNet-50-2 | 2x | 68.9M |
| wideresnet101 | Wide ResNet-101-2 | 2x | 126.9M |

Feature Layers: layer1, layer2, layer3, layer4

Best Practice:

```bash
-b wideresnet50 -le layer2 -le layer3
```
Wide ResNet-50 (wideresnet50) is the recommended baseline backbone, achieving 99.2% AUROC on MVTec AD.

ResNetV2 (BiT)

Big Transfer (BiT) models pretrained on ImageNet-21k, or on ImageNet with an improved training recipe.
| Backbone Name | Architecture | Pretraining | Parameters |
|---|---|---|---|
| resnetv2_50_bit | ResNetV2-50x3 | ImageNet BiT | ~100M |
| resnetv2_50_21k | ResNetV2-50x3 | ImageNet-21k | ~100M |
| resnetv2_101_bit | ResNetV2-101x3 | ImageNet BiT | ~150M |
| resnetv2_101_21k | ResNetV2-101x3 | ImageNet-21k | ~150M |
| resnetv2_152_bit | ResNetV2-152x4 | ImageNet BiT | ~212M |
| resnetv2_152_21k | ResNetV2-152x4 | ImageNet-21k | ~212M |
| resnetv2_152_384 | ResNetV2-152x2 Teacher | ImageNet 384px | ~236M |
| resnetv2_101 | ResNetV2-101 | ImageNet | 44.5M |
Feature Layers: Named stages vary by model - inspect with model.named_modules()

VGG Networks

Classic VGG architectures - parameter-heavy and less efficient than ResNets.
| Backbone Name | Architecture | Batch Norm | Parameters |
|---|---|---|---|
| vgg11 | VGG-11 | No | 132.9M |
| vgg19 | VGG-19 | No | 143.7M |
| vgg19_bn | VGG-19 | Yes | 143.7M |
Feature Layers: features[X] where X is the layer index
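
Because VGG exposes its layers as an indexed nn.Sequential rather than named stages, a quick way to pick an index is to enumerate model.features; a minimal sketch:

```python
# Sketch: enumerate the indexed layers inside VGG's `features` module
# so you can pick a valid X for features[X].
import torchvision.models as models

model = models.vgg19_bn(pretrained=False)  # structure only, no weights needed
for idx, module in enumerate(model.features):
    print(idx, module.__class__.__name__)
# e.g. 0 Conv2d, 1 BatchNorm2d, 2 ReLU, ..., 6 MaxPool2d, ...
```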

DenseNet

Densely connected networks with efficient parameter usage.
| Backbone Name | Architecture | Growth Rate | Parameters |
|---|---|---|---|
| densenet121 | DenseNet-121 | 32 | 8.0M |
| densenet201 | DenseNet-201 | 32 | 20.0M |
Feature Layers: features.denseblock1, features.denseblock2, features.denseblock3, features.denseblock4

Example for Ensemble (densenet201 as the third backbone, index 2):

```bash
-b densenet201 -le 2.features.denseblock2 -le 2.features.denseblock3
```

EfficientNet

Efficient architectures that use compound scaling to grow depth, width, and input resolution together.
| Backbone Name | Architecture | Input Size | Parameters |
|---|---|---|---|
| efficientnet_b1 | EfficientNet-B1 | 240x240 | 7.8M |
| efficientnet_b3 | EfficientNet-B3 | 300x300 | 12.0M |
| efficientnet_b5 | EfficientNet-B5 | 456x456 | 30.0M |
| efficientnet_b7 | EfficientNet-B7 | 600x600 | 66.0M |
| efficientnet_b3a | EfficientNet-B3a | 320x320 | 12.0M |
| efficientnetv2_m | EfficientNetV2-M | 480x480 | 54.1M |
| efficientnetv2_l | EfficientNetV2-L | 480x480 | 119.5M |
Feature Layers: blocks[X] where X is the block index (0-6)
EfficientNet models are optimized for different input resolutions. Ensure your image preprocessing matches the expected input size.
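
One way to check the expected resolution is timm's data-config helpers, which read the model's pretrained configuration and build a matching transform. A sketch, assuming the timm package and its efficientnet_b5 model name:

```python
# Sketch: query the input resolution a timm EfficientNet expects
# and build matching preprocessing.
import timm
from timm.data import resolve_data_config, create_transform

model = timm.create_model("efficientnet_b5", pretrained=True)
config = resolve_data_config({}, model=model)
print(config["input_size"])  # e.g. (3, 456, 456)

transform = create_transform(**config)  # resize/crop/normalize to match
```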

Vision Transformers (ViT)

Transformer-based architectures for vision tasks.

Standard ViT

| Backbone Name | Architecture | Patch Size | Parameters |
|---|---|---|---|
| vit_small | ViT-Small | 16x16 | 22M |
| vit_base | ViT-Base | 16x16 | 86M |
| vit_large | ViT-Large | 16x16 | 304M |
| vit_r50 | ViT-Large + ResNet-50 | hybrid | 329M |

DeiT (Data-efficient ViT)

| Backbone Name | Architecture | Distillation | Parameters |
|---|---|---|---|
| vit_deit_base | DeiT-Base | No | 86M |
| vit_deit_distilled | DeiT-Base | Yes | 87M |

Swin Transformer

| Backbone Name | Architecture | Window Size | Parameters |
|---|---|---|---|
| vit_swin_base | Swin-Base | 7x7 | 88M |
| vit_swin_large | Swin-Large | 7x7 | 197M |
Feature Layers: Varies by architecture - typically blocks[X] or hierarchical stages
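
A quick, version-agnostic way to find the valid indices or stages is to inspect the model structure directly; the sketch below assumes timm's vit_base_patch16_224 and swin_base_patch4_window7_224 model names:

```python
# Sketch: inspect block indices / stages on timm transformers.
# pretrained=False is enough just to look at the structure.
import timm

vit = timm.create_model("vit_base_patch16_224", pretrained=False)
print(len(vit.blocks))  # 12 -> valid indices are blocks[0] ... blocks[11]

swin = timm.create_model("swin_base_patch4_window7_224", pretrained=False)
print([name for name, _ in swin.named_children()])  # hierarchical stages
```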

MNASNet

Mobile Neural Architecture Search optimized networks.
| Backbone Name | Architecture | Multiplier | Parameters |
|---|---|---|---|
| mnasnet_100 | MNASNet | 1.0x | 4.4M |
| mnasnet_a1 | MNASNet-A1 | 1.0x | 3.9M |
| mnasnet_b1 | MNASNet-B1 | 1.0x | 4.4M |
Feature Layers: layers[X]

Inception

| Backbone Name | Architecture | Parameters |
|---|---|---|
| inception_v4 | Inception V4 | 42.7M |
Feature Layers: Named mixed layers like features.mixed_6a, features.mixed_7a

Legacy Networks

| Backbone Name | Architecture | Parameters |
|---|---|---|
| alexnet | AlexNet | 61.1M |
| bninception | BN-Inception | 11.3M |

Feature Layer Selection

Extract features from specific layers using the -le (--layers_to_extract_from) flag.

Single Backbone

```bash
-b wideresnet50 -le layer2 -le layer3
```
Extracts and aggregates features from both layer2 and layer3.
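
Conceptually, aggregation upsamples the deeper, coarser map to the shallower map's resolution and concatenates along channels. The sketch below illustrates that idea in simplified form (it omits PatchCore's local neighborhood pooling), using torchvision's wide_resnet50_2:

```python
# Minimal sketch of multi-layer aggregation (not PatchCore's exact code):
# grab layer2/layer3 maps with hooks, upsample the deeper map to the
# shallower map's spatial size, and concatenate along channels.
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.wide_resnet50_2(pretrained=True).eval()
feats = {}
model.layer2.register_forward_hook(lambda m, i, o: feats.update(layer2=o))
model.layer3.register_forward_hook(lambda m, i, o: feats.update(layer3=o))

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))

up = F.interpolate(feats["layer3"], size=feats["layer2"].shape[-2:],
                   mode="bilinear", align_corners=False)
patch_features = torch.cat([feats["layer2"], up], dim=1)
print(patch_features.shape)  # (1, 512 + 1024, 28, 28) for wide_resnet50_2
```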

Multiple Backbones (Ensemble)

Prefix layer names with the backbone index (0-based); a parsing sketch follows the list below:

```bash
-b wideresnet101 -b resnext101 -b densenet201 \
-le 0.layer2 -le 0.layer3 \
-le 1.layer2 -le 1.layer3 \
-le 2.features.denseblock2 -le 2.features.denseblock3
```
  • 0.* refers to first backbone (wideresnet101)
  • 1.* refers to second backbone (resnext101)
  • 2.* refers to third backbone (densenet201)
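
The prefix is simply an index into the -b list; a minimal, illustrative parsing sketch (not PatchCore's exact logic):

```python
# Sketch: resolve prefixed layer names to (backbone, layer) pairs.
backbones = ["wideresnet101", "resnext101", "densenet201"]
layers = ["0.layer2", "0.layer3", "1.layer2", "1.layer3",
          "2.features.denseblock2", "2.features.denseblock3"]

for entry in layers:
    idx, layer = entry.split(".", 1)  # split only on the first dot
    print(backbones[int(idx)], "->", layer)
# wideresnet101 -> layer2 ... densenet201 -> features.denseblock3
```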

Best Single Backbone

```bash
-b wideresnet50 -le layer2 -le layer3
```
Performance: 99.2% Image AUROC, 98.1% Pixel AUROC

Best Ensemble

```bash
-b wideresnet101 -b resnext101 -b densenet201 \
-le 0.layer2 -le 0.layer3 \
-le 1.layer2 -le 1.layer3 \
-le 2.features.denseblock2 -le 2.features.denseblock3
```
Performance: 99.6% Image AUROC, 98.2% Pixel AUROC

Memory-Efficient

```bash
-b resnet50 -le layer2 -le layer3
```
Lower memory footprint with competitive performance.

High-Resolution Images

```bash
-b wideresnet50 -le layer2 -le layer3
```

With dataset flags: --resize 366 --imagesize 320
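
Assuming these flags follow the common resize-then-center-crop pattern with ImageNet normalization (as in PatchCore's MVTec data loading), the equivalent torchvision preprocessing would look like:

```python
# Sketch of preprocessing matching --resize 366 --imagesize 320,
# under the resize-then-center-crop assumption.
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(366),       # resize shorter side to 366
    transforms.CenterCrop(320),   # crop to the working resolution
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```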

Implementation Details

Backbone Loading

Backbones are loaded from the backbones.py module, which maps each name to a constructor string and evaluates it on demand:

```python
# src/patchcore/backbones.py
import timm
import torchvision.models as models

_BACKBONES = {
    "wideresnet50": "models.wide_resnet50_2(pretrained=True)",
    "vit_base": 'timm.create_model("vit_base_patch16_224", pretrained=True)',
    # ... more backbones
}

def load(name):
    return eval(_BACKBONES[name])
```
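
Loading is then a single call; note that because load() indexes the registry, names must match the keys exactly (unknown names raise a KeyError):

```python
# Usage sketch: resolve a registered name to a pretrained model.
from patchcore import backbones

model = backbones.load("wideresnet50")
model.eval()  # inference only; PatchCore never fine-tunes the backbone
```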

Sources

  • torchvision.models: Standard PyTorch models (ResNet, VGG, WideResNet, etc.)
  • timm: PyTorch Image Models library (ViT, EfficientNet, DeiT, Swin, etc.)
  • pretrainedmodels: Legacy models (BN-Inception)

Inspecting Feature Layers

To find available layers for a backbone:
```python
from patchcore import backbones

model = backbones.load("wideresnet50")
for name, module in model.named_modules():
    print(name)
```
Common patterns:
  • ResNet family: layer1, layer2, layer3, layer4
  • DenseNet: features.denseblock1 through features.denseblock4
  • EfficientNet: blocks[0] through blocks[6]
  • ViT: blocks[0] through blocks[11] (base)

Performance Considerations

GPU Memory Usage

Approximate memory requirements (224x224 images, batch size 1):
  • ResNet-50: ~4GB
  • Wide ResNet-50: ~6GB
  • Wide ResNet-101: ~8GB
  • Ensemble (3 networks): ~10-11GB
  • ViT-Large: ~8GB
  • EfficientNet-B7: ~12GB

Inference Speed

Relative speeds on RTX 3090 (images/sec):
  • ResNet-50: ~100
  • Wide ResNet-50: ~80
  • DenseNet-201: ~60
  • EfficientNet-B5: ~50
  • ViT-Base: ~70
  • Ensemble (3 networks): ~30
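
These figures vary with batch size, resolution, and driver stack. To reproduce rough throughput and peak-memory numbers on your own GPU, a probe like the following (a sketch using torchvision's wide_resnet50_2) is enough:

```python
# Rough throughput / peak-memory probe; requires a CUDA device.
import time
import torch
import torchvision.models as models

model = models.wide_resnet50_2(pretrained=True).cuda().eval()
x = torch.randn(8, 3, 224, 224, device="cuda")

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    for _ in range(10):          # warm-up iterations
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(50):
        model(x)
    torch.cuda.synchronize()

elapsed = time.time() - start
print(f"{50 * x.shape[0] / elapsed:.1f} images/sec")
print(f"peak memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```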

Accuracy vs. Efficiency

Recommended backbones by use case:
  • Highest accuracy: Ensemble of WideResNet101 + ResNeXt101 + DenseNet201
  • Best balance: WideResNet-50 (single backbone)
  • Fastest: ResNet-50
  • Most memory-efficient: MNASNet-100 or EfficientNet-B1
For production deployments, WideResNet-50 offers the best trade-off between accuracy (99.2% AUROC) and computational efficiency.
