Multi-scale feature extraction with pretrained backbones
PatchCore leverages pretrained CNN backbones to extract rich, hierarchical features from images. The feature extraction process is central to the algorithm’s success, enabling it to capture both semantic and spatial information at multiple scales.
The NetworkFeatureAggregator class efficiently extracts features from multiple backbone layers:
common.py
```python
class NetworkFeatureAggregator(torch.nn.Module):
    """Efficient extraction of network features."""

    def __init__(self, backbone, layers_to_extract_from, device):
        super(NetworkFeatureAggregator, self).__init__()
        """Extraction of network features.

        Runs a network only to the last layer of the list of layers
        where network features should be extracted from.

        Args:
            backbone: torchvision.model
            layers_to_extract_from: [list of str]
        """
        self.layers_to_extract_from = layers_to_extract_from
        self.backbone = backbone
        self.device = device
        # ... setup forward hooks
```
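The hook setup elided above can be sketched as follows. This is a hypothetical minimal version, not the class's actual implementation: it registers a forward hook on each named layer, runs one forward pass, and collects the intermediate outputs. `TinyBackbone` and `extract_layer_outputs` are illustrative names, standing in for a torchvision backbone and the aggregator's hook machinery.

```python
import torch
import torch.nn as nn


class TinyBackbone(nn.Module):
    """Stand-in for a torchvision backbone with named stages."""

    def __init__(self):
        super().__init__()
        self.layer1 = nn.Conv2d(3, 8, 3, stride=2, padding=1)
        self.layer2 = nn.Conv2d(8, 16, 3, stride=2, padding=1)
        self.layer3 = nn.Conv2d(16, 32, 3, stride=2, padding=1)

    def forward(self, x):
        return self.layer3(self.layer2(self.layer1(x)))


def extract_layer_outputs(backbone, layer_names, images):
    """Capture intermediate outputs of the named layers via forward hooks."""
    outputs, handles = {}, []
    for name, module in backbone.named_modules():
        if name in layer_names:
            def make_hook(layer_name):
                def hook(mod, inp, out):
                    outputs[layer_name] = out
                return hook
            handles.append(module.register_forward_hook(make_hook(name)))
    with torch.no_grad():
        backbone(images)
    for h in handles:
        h.remove()  # detach hooks so later forward passes are unaffected
    return outputs


feats = extract_layer_outputs(
    TinyBackbone(), ["layer2", "layer3"], torch.randn(1, 3, 64, 64)
)
# feats["layer2"] is a [1, 16, 16, 16] map, feats["layer3"] a coarser [1, 32, 8, 8] map
```

Hooks let the aggregator pull features from arbitrary layers without modifying the backbone itself.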
The core feature embedding process happens in the _embed() method:
patchcore.py
```python
def _embed(self, images, detach=True, provide_patch_shapes=False):
    """Returns feature embeddings for images."""

    def _detach(features):
        if detach:
            return [x.detach().cpu().numpy() for x in features]
        return features

    _ = self.forward_modules["feature_aggregator"].eval()
    with torch.no_grad():
        features = self.forward_modules["feature_aggregator"](images)

    features = [features[layer] for layer in self.layers_to_extract_from]

    # Convert to patches
    features = [
        self.patch_maker.patchify(x, return_spatial_info=True) for x in features
    ]
    patch_shapes = [x[1] for x in features]
    features = [x[0] for x in features]
    ref_num_patches = patch_shapes[0]
```
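The patchify step can be approximated with `torch.nn.functional.unfold`, which extracts overlapping `patchsize × patchsize` neighborhoods around every spatial location. The function below is a hedged sketch of what `patch_maker.patchify` does, not the repository's exact implementation:

```python
import torch
import torch.nn.functional as F


def patchify(features, patchsize=3, stride=1):
    """Sketch of PatchMaker.patchify: one patch per spatial location.

    Returns patches of shape [B, n_patches, C, patchsize, patchsize]
    plus the patch-grid shape, mirroring return_spatial_info=True.
    """
    padding = (patchsize - 1) // 2  # "same" padding keeps one patch per pixel
    b, c, h, w = features.shape
    # unfolded: [B, C * patchsize * patchsize, n_patches]
    unfolded = F.unfold(features, kernel_size=patchsize, stride=stride, padding=padding)
    n_patches_h = (h + 2 * padding - patchsize) // stride + 1
    n_patches_w = (w + 2 * padding - patchsize) // stride + 1
    patches = unfolded.reshape(b, c, patchsize, patchsize, -1).permute(0, 4, 1, 2, 3)
    return patches, (n_patches_h, n_patches_w)


x = torch.randn(1, 4, 8, 8)
patches, grid = patchify(x)
# patches: [1, 64, 4, 3, 3]; grid: (8, 8) — with stride 1, n_patches == H * W
```

With stride 1 and same-padding, a `[B, 512, 56, 56]` feature map yields `B * 56 * 56` patches, matching the dimension flow shown later in this section.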
Bilinear interpolation is used to resize lower-resolution features to match the highest resolution layer. This preserves spatial correspondence across scales.
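In PyTorch this resizing is a single `F.interpolate` call. The sketch below (with assumed layer shapes for a WideResNet-50 at 224×224 input) shows a coarser layer3 map being upsampled to layer2's resolution so the two can be combined patch-for-patch:

```python
import torch
import torch.nn.functional as F

# Assumed shapes for illustration: layer2 is the highest-resolution extracted layer.
layer2_feats = torch.randn(1, 512, 56, 56)   # higher-resolution layer
layer3_feats = torch.randn(1, 1024, 28, 28)  # lower-resolution layer

# Upsample layer3 to layer2's spatial size; bilinear keeps values smooth
# so each upsampled location still corresponds to the same image region.
layer3_up = F.interpolate(
    layer3_feats, size=layer2_feats.shape[-2:], mode="bilinear", align_corners=False
)
# layer3_up now has shape [1, 1024, 56, 56], spatially aligned with layer2
```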
The Aggregator combines features from all layers into a single vector:
common.py
```python
class Aggregator(torch.nn.Module):
    def __init__(self, target_dim):
        super(Aggregator, self).__init__()
        self.target_dim = target_dim

    def forward(self, features):
        """Returns reshaped and average pooled features."""
        # batchsize x number_of_layers x input_dim -> batchsize x target_dim
        features = features.reshape(len(features), 1, -1)
        features = F.adaptive_avg_pool1d(features, self.target_dim)
        return features.reshape(len(features), -1)
```
The aggregator uses adaptive average pooling, which automatically handles varying input dimensions and produces a fixed-size output.
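A small standalone demo makes the pooling behavior concrete. Here two layers' features (each already preprocessed to 1024 dimensions, as in the pipeline below) are flattened to length 2048 and pooled down to a fixed `target_dim` of 1024; since 2048 divides evenly, each output element is the mean of two adjacent inputs:

```python
import torch
import torch.nn.functional as F

target_dim = 1024
# 10 patches, 2 layers, 1024 dims per layer (illustrative sizes)
features = torch.randn(10, 2, 1024)

# Flatten the per-layer features into one vector per patch, then pool.
flat = features.reshape(len(features), 1, -1)            # [10, 1, 2048]
pooled = F.adaptive_avg_pool1d(flat, target_dim)         # [10, 1, 1024]
pooled = pooled.reshape(len(features), -1)               # [10, 1024]

# With 2048 -> 1024, output i is the mean of inputs 2i and 2i+1.
```

Because the pooling is adaptive, the same `Aggregator` works unchanged if a different backbone or layer set produces a different concatenated length.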
Here’s how feature dimensions transform through the pipeline:
```
Input Image: [B, 3, 224, 224]
        |
        v  (Backbone layer2, WideResNet-50)
[B, 512, 56, 56]
        |
        v  (Patchify with patchsize=3)
[B*56*56, 512, 3, 3]
        |
        v  (Preprocessing to 1024)
[B*56*56, 1024]
        |
        v  (Aggregator to 1024)
[B*56*56, 1024]   <- Final patch embeddings
```