
Overview

PreprocessCfg is a dataclass that encapsulates all preprocessing parameters for CLIP image transforms. It provides type-safe configuration for image resizing, normalization, and color mode settings.

Class Definition

from dataclasses import dataclass
from typing import Tuple, Union

from open_clip.constants import OPENAI_DATASET_MEAN, OPENAI_DATASET_STD


@dataclass
class PreprocessCfg:
    size: Union[int, Tuple[int, int]] = 224
    mode: str = 'RGB'
    mean: Tuple[float, ...] = OPENAI_DATASET_MEAN
    std: Tuple[float, ...] = OPENAI_DATASET_STD
    interpolation: str = 'bicubic'
    resize_mode: str = 'shortest'
    fill_color: int = 0

Fields

size
Type: Union[int, Tuple[int, int]]
Default: 224
Target size for preprocessed images.
  • int: square images (e.g., 224 → 224×224)
  • Tuple[int, int]: rectangular images as (height, width) (e.g., (384, 224))
Common sizes:
  • 224: ViT-B, ViT-L (base resolution)
  • 336: ViT-L/14@336px (high resolution)
  • 384: some larger variants (e.g., SigLIP models)
mode
Type: str
Default: 'RGB'
Color mode for image conversion. Currently only 'RGB' is supported; images are automatically converted to RGB with 3 channels.
mean
Type: Tuple[float, ...]
Default: (0.48145466, 0.4578275, 0.40821073)
Mean values for normalization, one per channel (R, G, B). Default values are from the OpenAI CLIP dataset:
  • R: 0.48145466
  • G: 0.4578275
  • B: 0.40821073
Used in: normalized = (image - mean) / std
std
Type: Tuple[float, ...]
Default: (0.26862954, 0.26130258, 0.27577711)
Standard deviation values for normalization, one per channel (R, G, B). Default values are from the OpenAI CLIP dataset:
  • R: 0.26862954
  • G: 0.26130258
  • B: 0.27577711
Used in: normalized = (image - mean) / std
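As a concrete sketch of the formula above, the per-channel normalization can be reproduced directly (pure Python for illustration; the library applies the same arithmetic via a torchvision Normalize transform):

```python
# Per-channel CLIP normalization: normalized = (pixel - mean) / std
# Pixel values are assumed to already be scaled to [0, 1].
OPENAI_DATASET_MEAN = (0.48145466, 0.4578275, 0.40821073)
OPENAI_DATASET_STD = (0.26862954, 0.26130258, 0.27577711)

def normalize_pixel(rgb):
    """Normalize one (r, g, b) pixel with the OpenAI CLIP statistics."""
    return tuple(
        (value - m) / s
        for value, m, s in zip(rgb, OPENAI_DATASET_MEAN, OPENAI_DATASET_STD)
    )

# A mid-gray pixel lands close to zero in every channel
print(normalize_pixel((0.5, 0.5, 0.5)))
```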
interpolation
Type: str
Default: 'bicubic'
Interpolation method for resizing images. Options:
  • 'bicubic': high-quality interpolation (recommended; default for CLIP)
  • 'bilinear': faster but lower quality
  • 'random': randomly choose between bicubic and bilinear (training only)
resize_mode
Type: str
Default: 'shortest'
Strategy for resizing images to the target size. Options:
  • 'shortest': resize the shortest edge to the target size, then center crop
  • 'longest': resize the longest edge to the target size, then pad to the target
  • 'squash': resize to the exact target size (may distort the aspect ratio)
fill_color
Type: int
Default: 0
Fill value (0-255) for padding when using resize_mode='longest'.
  • 0: black padding (default)
  • 255: white padding
  • Other values: shades of gray

Properties

num_channels

@property
def num_channels(self) -> int:
    return 3
Returns the number of image channels (always 3 for RGB).

input_size

@property
def input_size(self) -> Tuple[int, int, int]:
    return (self.num_channels,) + to_2tuple(self.size)
Returns the expected input tensor shape: (channels, height, width).
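The property depends on a to_2tuple helper that broadcasts a scalar size to a (height, width) pair. A minimal sketch of the assumed behavior (the library uses a timm-style utility; this is not its exact implementation):

```python
def to_2tuple(x):
    # Timm-style helper (sketch): scalars broadcast to a pair,
    # sequences pass through as a tuple.
    if isinstance(x, (tuple, list)):
        return tuple(x)
    return (x, x)

# How input_size is assembled from num_channels and size
num_channels = 3
print((num_channels,) + to_2tuple(224))         # (3, 224, 224)
print((num_channels,) + to_2tuple((384, 224)))  # (3, 384, 224)
```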

Examples

Create default config

from open_clip import PreprocessCfg

# Use all defaults (224x224, OpenAI CLIP normalization)
cfg = PreprocessCfg()
print(cfg.input_size)  # (3, 224, 224)

Custom image size

# Square images
cfg_224 = PreprocessCfg(size=224)
cfg_336 = PreprocessCfg(size=336)

# Rectangular images
cfg_rect = PreprocessCfg(size=(384, 224))  # height, width
print(cfg_rect.input_size)  # (3, 384, 224)

Custom normalization

# Use ImageNet statistics instead of the CLIP defaults
cfg = PreprocessCfg(
    size=224,
    mean=(0.485, 0.456, 0.406),
    std=(0.229, 0.224, 0.225)
)

Different resize modes

# Shortest edge resize (default)
cfg_shortest = PreprocessCfg(size=224, resize_mode='shortest')

# Longest edge with padding
cfg_longest = PreprocessCfg(
    size=224,
    resize_mode='longest',
    fill_color=128  # Gray padding
)

# Squash to exact size
cfg_squash = PreprocessCfg(size=224, resize_mode='squash')

High-resolution config

# ViT-L-14@336px configuration
cfg_high_res = PreprocessCfg(
    size=336,
    interpolation='bicubic',
    resize_mode='shortest'
)

Use with image_transform_v2

import open_clip
from PIL import Image

# Create config
preprocess_cfg = open_clip.PreprocessCfg(
    size=224,
    interpolation='bicubic'
)

# Create transform
transform = open_clip.image_transform_v2(
    cfg=preprocess_cfg,
    is_train=False
)

# Apply to image
image = Image.open('photo.jpg')
tensor = transform(image)
print(tensor.shape)  # torch.Size([3, 224, 224])

Extract from model

import open_clip

# create_model_and_transforms returns (model, train_transform, val_transform),
# not the config itself
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32',
    pretrained='laion2b_s34b_b79k'
)

# In recent versions the config is attached to the vision tower as a dict
preprocess_cfg = open_clip.PreprocessCfg(**model.visual.preprocess_cfg)

print(f"Size: {preprocess_cfg.size}")
print(f"Mean: {preprocess_cfg.mean}")
print(f"Std: {preprocess_cfg.std}")

Merge configs

from open_clip import PreprocessCfg
from open_clip.transform import merge_preprocess_dict

# Start with a base config
base_cfg = PreprocessCfg(size=224)

# Override specific fields; merge_preprocess_dict returns a plain dict
overlay = {'size': 336, 'interpolation': 'bilinear'}
merged = merge_preprocess_dict(base_cfg, overlay)

# Create a new config from the merged dict
new_cfg = PreprocessCfg(**merged)

Resize Mode Comparison

Mode       Behavior                              Use Case
shortest   Resize shortest edge, center crop     Default; preserves aspect ratio
longest    Resize longest edge, pad to target    Keeps the entire image visible
squash     Resize to exact dimensions            Fastest; may distort aspect ratio
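The geometric effect of each mode can be sketched without an imaging library. This is a simplified model of the behavior described above (rounding details may differ from the actual transforms):

```python
def pre_crop_dims(width, height, target, mode):
    """Size after the resize step, before any crop or pad (sketch only)."""
    if mode == 'squash':
        # Exact target size; aspect ratio may change
        return target, target
    if mode == 'shortest':
        # Scale so the shortest edge reaches the target, then center crop
        scale = target / min(width, height)
    elif mode == 'longest':
        # Scale so the longest edge reaches the target, then pad with fill_color
        scale = target / max(width, height)
    else:
        raise ValueError(f"unknown resize_mode: {mode}")
    return round(width * scale), round(height * scale)

# A 640x480 photo with target size 224:
print(pre_crop_dims(640, 480, 224, 'shortest'))  # (299, 224) -> center crop to 224x224
print(pre_crop_dims(640, 480, 224, 'longest'))   # (224, 168) -> pad to 224x224
print(pre_crop_dims(640, 480, 224, 'squash'))    # (224, 224), aspect ratio distorted
```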

Common Configurations

ViT-B/32

PreprocessCfg(
    size=224,
    interpolation='bicubic'
)

ViT-L/14

PreprocessCfg(
    size=224,
    interpolation='bicubic'
)

ViT-L/14@336

PreprocessCfg(
    size=336,
    interpolation='bicubic'
)

ViT-H/14

PreprocessCfg(
    size=224,
    interpolation='bicubic'
)

Notes

  • Always use PreprocessCfg instead of passing individual parameters to transforms
  • The config object is returned by create_model_and_transforms()
  • Mean and std values should match the model’s training data statistics
  • Use bicubic interpolation for best quality (matches CLIP training)
  • resize_mode='shortest' is the standard CLIP preprocessing approach
