
Overview

PreprocessCfg is a dataclass that encapsulates all preprocessing parameters for CLIP image transforms. It provides type-safe configuration for image resizing, normalization, and color mode settings.

Class Definition

from dataclasses import dataclass
from typing import Tuple, Union

from open_clip.constants import OPENAI_DATASET_MEAN, OPENAI_DATASET_STD


@dataclass
class PreprocessCfg:
    size: Union[int, Tuple[int, int]] = 224
    mode: str = 'RGB'
    mean: Tuple[float, ...] = OPENAI_DATASET_MEAN
    std: Tuple[float, ...] = OPENAI_DATASET_STD
    interpolation: str = 'bicubic'
    resize_mode: str = 'shortest'
    fill_color: int = 0

Fields

size
Type: Union[int, Tuple[int, int]]
Default: 224
Target size for preprocessed images.
  • int: square images (e.g., 224 → 224×224)
  • Tuple[int, int]: rectangular images as (height, width) (e.g., (384, 224))
Common sizes:
  • 224: ViT-B, ViT-L (base resolution)
  • 336: ViT-L/14@336px (high resolution)
  • 384: some larger variants (e.g., SigLIP models)
mode
Type: str
Default: 'RGB'
Color mode for image conversion. Currently only 'RGB' is supported; images are automatically converted to RGB with 3 channels.
mean
Type: Tuple[float, ...]
Default: (0.48145466, 0.4578275, 0.40821073)
Mean values for normalization, one per channel (R, G, B). Default values are from the OpenAI CLIP dataset:
  • R: 0.48145466
  • G: 0.4578275
  • B: 0.40821073
Used in: normalized = (image - mean) / std
std
Type: Tuple[float, ...]
Default: (0.26862954, 0.26130258, 0.27577711)
Standard deviation values for normalization, one per channel (R, G, B). Default values are from the OpenAI CLIP dataset:
  • R: 0.26862954
  • G: 0.26130258
  • B: 0.27577711
Used in: normalized = (image - mean) / std
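As a concrete sketch of the formula above, the per-channel normalization can be reproduced directly (pure Python for illustration; the library applies the same arithmetic via a torchvision Normalize transform):

```python
# Per-channel CLIP normalization: normalized = (pixel - mean) / std
# Pixel values are assumed to already be scaled to [0, 1].
OPENAI_DATASET_MEAN = (0.48145466, 0.4578275, 0.40821073)
OPENAI_DATASET_STD = (0.26862954, 0.26130258, 0.27577711)

def normalize_pixel(rgb):
    """Normalize one (r, g, b) pixel with the OpenAI CLIP statistics."""
    return tuple(
        (value - m) / s
        for value, m, s in zip(rgb, OPENAI_DATASET_MEAN, OPENAI_DATASET_STD)
    )

# A mid-gray pixel lands close to zero in every channel
print(normalize_pixel((0.5, 0.5, 0.5)))
```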
interpolation
Type: str
Default: 'bicubic'
Interpolation method for resizing images. Options:
  • 'bicubic': high-quality interpolation (recommended; default for CLIP)
  • 'bilinear': faster but lower quality
  • 'random': randomly choose between bicubic and bilinear (training only)
resize_mode
Type: str
Default: 'shortest'
Strategy for resizing images to the target size. Options:
  • 'shortest': resize the shortest edge to the target size, then center crop
  • 'longest': resize the longest edge to the target size, then pad to the target
  • 'squash': resize to the exact target size (may distort the aspect ratio)
fill_color
Type: int
Default: 0
Fill value (0-255) for padding when using resize_mode='longest'.
  • 0: black padding (default)
  • 255: white padding
  • Other values: shades of gray

Properties

num_channels

@property
def num_channels(self) -> int:
    return 3
Returns the number of image channels (always 3 for RGB).

input_size

@property
def input_size(self) -> Tuple[int, int, int]:
    return (self.num_channels,) + to_2tuple(self.size)
Returns the expected input tensor shape: (channels, height, width).
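The property depends on a to_2tuple helper that broadcasts a scalar size to a (height, width) pair. A minimal sketch of the assumed behavior (the library uses a timm-style utility; this is not its exact implementation):

```python
def to_2tuple(x):
    # Timm-style helper (sketch): scalars broadcast to a pair,
    # sequences pass through as a tuple.
    if isinstance(x, (tuple, list)):
        return tuple(x)
    return (x, x)

# How input_size is assembled from num_channels and size
num_channels = 3
print((num_channels,) + to_2tuple(224))         # (3, 224, 224)
print((num_channels,) + to_2tuple((384, 224)))  # (3, 384, 224)
```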

Examples

Create default config

from open_clip import PreprocessCfg

# Use all defaults (224x224, OpenAI CLIP normalization)
cfg = PreprocessCfg()
print(cfg.input_size)  # (3, 224, 224)

Custom image size

# Square images
cfg_224 = PreprocessCfg(size=224)
cfg_336 = PreprocessCfg(size=336)

# Rectangular images
cfg_rect = PreprocessCfg(size=(384, 224))  # height, width
print(cfg_rect.input_size)  # (3, 384, 224)

Custom normalization

# Use ImageNet statistics instead of the CLIP defaults
cfg = PreprocessCfg(
    size=224,
    mean=(0.485, 0.456, 0.406),
    std=(0.229, 0.224, 0.225)
)

Different resize modes

# Shortest edge resize (default)
cfg_shortest = PreprocessCfg(size=224, resize_mode='shortest')

# Longest edge with padding
cfg_longest = PreprocessCfg(
    size=224,
    resize_mode='longest',
    fill_color=128  # Gray padding
)

# Squash to exact size
cfg_squash = PreprocessCfg(size=224, resize_mode='squash')

High-resolution config

# ViT-L-14@336px configuration
cfg_high_res = PreprocessCfg(
    size=336,
    interpolation='bicubic',
    resize_mode='shortest'
)

Use with image_transform_v2

import open_clip
from PIL import Image

# Create config
preprocess_cfg = open_clip.PreprocessCfg(
    size=224,
    interpolation='bicubic'
)

# Create transform
transform = open_clip.image_transform_v2(
    cfg=preprocess_cfg,
    is_train=False
)

# Apply to image
image = Image.open('photo.jpg')
tensor = transform(image)
print(tensor.shape)  # torch.Size([3, 224, 224])

Extract from model

import open_clip

# create_model_and_transforms returns (model, train_transform, val_transform),
# not the config itself
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32',
    pretrained='laion2b_s34b_b79k'
)

# In recent versions the config is attached to the vision tower as a dict
preprocess_cfg = open_clip.PreprocessCfg(**model.visual.preprocess_cfg)

print(f"Size: {preprocess_cfg.size}")
print(f"Mean: {preprocess_cfg.mean}")
print(f"Std: {preprocess_cfg.std}")

Merge configs

from open_clip import PreprocessCfg
from open_clip.transform import merge_preprocess_dict

# Start with a base config
base_cfg = PreprocessCfg(size=224)

# Override specific fields; merge_preprocess_dict returns a plain dict
overlay = {'size': 336, 'interpolation': 'bilinear'}
merged = merge_preprocess_dict(base_cfg, overlay)

# Create a new config from the merged dict
new_cfg = PreprocessCfg(**merged)

Resize Mode Comparison

Mode       Behavior                              Use Case
shortest   Resize shortest edge, center crop     Default; preserves aspect ratio
longest    Resize longest edge, pad to target    Keeps the entire image visible
squash     Resize to exact dimensions            Fastest; may distort aspect ratio
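The geometric effect of each mode can be sketched without an imaging library. This is a simplified model of the behavior described above (rounding details may differ from the actual transforms):

```python
def pre_crop_dims(width, height, target, mode):
    """Size after the resize step, before any crop or pad (sketch only)."""
    if mode == 'squash':
        # Exact target size; aspect ratio may change
        return target, target
    if mode == 'shortest':
        # Scale so the shortest edge reaches the target, then center crop
        scale = target / min(width, height)
    elif mode == 'longest':
        # Scale so the longest edge reaches the target, then pad with fill_color
        scale = target / max(width, height)
    else:
        raise ValueError(f"unknown resize_mode: {mode}")
    return round(width * scale), round(height * scale)

# A 640x480 photo with target size 224:
print(pre_crop_dims(640, 480, 224, 'shortest'))  # (299, 224) -> center crop to 224x224
print(pre_crop_dims(640, 480, 224, 'longest'))   # (224, 168) -> pad to 224x224
print(pre_crop_dims(640, 480, 224, 'squash'))    # (224, 224), aspect ratio distorted
```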

Common Configurations

ViT-B/32

PreprocessCfg(
    size=224,
    interpolation='bicubic'
)

ViT-L/14

PreprocessCfg(
    size=224,
    interpolation='bicubic'
)

ViT-L/14@336

PreprocessCfg(
    size=336,
    interpolation='bicubic'
)

ViT-H/14

PreprocessCfg(
    size=224,
    interpolation='bicubic'
)

Notes

  • Always use PreprocessCfg instead of passing individual parameters to transforms
  • The config object is returned by create_model_and_transforms()
  • Mean and std values should match the model’s training data statistics
  • Use bicubic interpolation for best quality (matches CLIP training)
  • resize_mode='shortest' is the standard CLIP preprocessing approach
