Overview

Creates torchvision transform pipelines for preprocessing images before passing them to CLIP vision encoders. Accepts a PreprocessCfg configuration object for clean, declarative transform configuration.

Function Signature

def image_transform_v2(
    cfg: PreprocessCfg,
    is_train: bool,
    aug_cfg: Optional[Union[Dict[str, Any], AugmentationCfg]] = None,
) -> Compose

Parameters

cfg
PreprocessCfg
required
Preprocessing configuration object containing:
  • size: Target image size (int or tuple)
  • mean: Normalization mean values
  • std: Normalization std values
  • interpolation: Resize interpolation method
  • resize_mode: How to resize images ('shortest', 'longest', 'squash')
  • fill_color: Padding fill color
See PreprocessCfg for details.
is_train
bool
required
Whether to create training (with augmentation) or inference (deterministic) transforms.
aug_cfg
Union[Dict, AugmentationCfg]
default: None
Augmentation configuration for training. Only used when is_train=True. See AugmentationCfg for options.
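aug_cfg accepts either an AugmentationCfg instance or a plain dict with the same keys; a dict is coerced into the dataclass internally. A rough sketch of that coercion pattern, using a stand-in dataclass rather than the real AugmentationCfg:

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple, Union

# Stand-in for open_clip's AugmentationCfg (illustrative fields only)
@dataclass
class AugCfg:
    scale: Tuple[float, float] = (0.9, 1.0)
    color_jitter: Optional[float] = None

def coerce_aug_cfg(aug_cfg: Optional[Union[Dict[str, Any], AugCfg]]) -> AugCfg:
    """Accept None, a dict, or a dataclass instance, like image_transform_v2."""
    if aug_cfg is None:
        return AugCfg()           # fall back to defaults
    if isinstance(aug_cfg, dict):
        return AugCfg(**aug_cfg)  # dict -> dataclass
    return aug_cfg                # already a dataclass

cfg = coerce_aug_cfg({'scale': (0.08, 1.0), 'color_jitter': 0.4})
```

Either form ends up as the same configuration object, so pick whichever is more convenient for your config plumbing.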

Returns

transform
torchvision.transforms.Compose
Composed transform pipeline that can be applied to PIL Images. Includes:
  • Training: Random crops, color jitter, normalization
  • Inference: Resize, center crop, normalization

Examples

Create inference transform

import open_clip
from PIL import Image

# Create preprocessing config
preprocess_cfg = open_clip.PreprocessCfg(
    size=224,
    mean=(0.48145466, 0.4578275, 0.40821073),
    std=(0.26862954, 0.26130258, 0.27577711),
    interpolation='bicubic',
    resize_mode='shortest'
)

# Create inference transform
transform = open_clip.image_transform_v2(
    cfg=preprocess_cfg,
    is_train=False
)

# Apply to image
image = Image.open('cat.jpg')
image_tensor = transform(image)
print(image_tensor.shape)  # torch.Size([3, 224, 224])

Create training transform with augmentation

import open_clip

# Preprocessing config
preprocess_cfg = open_clip.PreprocessCfg(
    size=224,
    interpolation='bicubic'
)

# Augmentation config
aug_cfg = open_clip.AugmentationCfg(
    scale=(0.08, 1.0),
    ratio=(0.75, 1.33),
    color_jitter=(0.4, 0.4, 0.4, 0.1),
    color_jitter_prob=0.8,
    gray_scale_prob=0.2
)

# Create training transform
train_transform = open_clip.image_transform_v2(
    cfg=preprocess_cfg,
    is_train=True,
    aug_cfg=aug_cfg
)
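The returned train_transform is an ordinary callable, so it plugs straight into a dataset's __getitem__. A minimal, framework-free sketch (load_image is a stand-in for something like PIL.Image.open; in practice you would subclass torch.utils.data.Dataset):

```python
from typing import Callable, List

def load_image(path: str) -> str:
    # Stand-in loader for this sketch; in practice:
    # PIL.Image.open(path).convert('RGB')
    return f"<image:{path}>"

class ImageListDataset:
    """Applies a transform (e.g. train_transform above) to each item."""

    def __init__(self, paths: List[str], transform: Callable):
        self.paths = paths
        self.transform = transform

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, idx: int):
        return self.transform(load_image(self.paths[idx]))
```

Because training transforms are random, each __getitem__ call on the same index can yield a differently augmented tensor.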

Use with model preprocess config

import open_clip
import torch
from PIL import Image

# Load model; rebuild its preprocessing config from the attributes
# create_model_and_transforms() stores on the vision tower
model, _, preprocess_val = open_clip.create_model_and_transforms(
    'ViT-B-32',
    pretrained='laion2b_s34b_b79k',
    device='cuda'
)
preprocess_cfg = open_clip.PreprocessCfg(**model.visual.preprocess_cfg)

# Create transform from model's config
transform = open_clip.image_transform_v2(
    cfg=preprocess_cfg,
    is_train=False
)

# Process image
image = Image.open('photo.jpg')
image_tensor = transform(image).unsqueeze(0).to('cuda')

# Encode
with torch.no_grad():
    image_features = model.encode_image(image_tensor)

Different resize modes

import open_clip

# Shortest edge resize (default)
cfg_shortest = open_clip.PreprocessCfg(size=224, resize_mode='shortest')
transform_shortest = open_clip.image_transform_v2(cfg_shortest, is_train=False)

# Longest edge resize with padding
cfg_longest = open_clip.PreprocessCfg(size=224, resize_mode='longest', fill_color=0)
transform_longest = open_clip.image_transform_v2(cfg_longest, is_train=False)

# Squash to exact size (may distort aspect ratio)
cfg_squash = open_clip.PreprocessCfg(size=(224, 224), resize_mode='squash')
transform_squash = open_clip.image_transform_v2(cfg_squash, is_train=False)

Training with timm augmentations

import open_clip

# Use timm's RandAugment and other advanced augmentations
preprocess_cfg = open_clip.PreprocessCfg(size=224)
aug_cfg = open_clip.AugmentationCfg(
    scale=(0.08, 1.0),
    color_jitter=0.4,
    re_prob=0.25,  # Random erasing probability
    re_count=1,
    use_timm=True  # Enable timm augmentations
)

train_transform = open_clip.image_transform_v2(
    cfg=preprocess_cfg,
    is_train=True,
    aug_cfg=aug_cfg
)

Non-square images

import open_clip

# Rectangular image preprocessing
preprocess_cfg = open_clip.PreprocessCfg(
    size=(384, 224),  # height, width
    resize_mode='shortest'
)

transform = open_clip.image_transform_v2(cfg=preprocess_cfg, is_train=False)

Transform Pipeline

Inference Mode (is_train=False)

  1. Resize based on resize_mode:
    • shortest: Resize shortest edge to target size
    • longest: Resize longest edge to target size
    • squash: Resize to exact dimensions
  2. Center Crop (or pad if needed)
  3. Convert to RGB
  4. To Tensor
  5. Normalize with mean/std
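The resize-target arithmetic behind step 1 can be sketched as follows (assumed behavior for a square integer target; the real logic lives in open_clip's transform code and also handles tuple targets):

```python
def resize_dims(w: int, h: int, target: int, mode: str):
    """Compute the intermediate resize size for each resize_mode."""
    if mode == 'squash':
        return target, target  # exact size; aspect ratio may distort
    # 'shortest' scales the short edge to target, 'longest' the long edge
    scale_edge = min(w, h) if mode == 'shortest' else max(w, h)
    scale = target / scale_edge
    return round(w * scale), round(h * scale)

# A 640x480 image with target 224:
# 'shortest' -> (299, 224), then center crop to 224x224
# 'longest'  -> (224, 168), then pad to 224x224
```

All three modes therefore end at the same final tensor shape; they differ in whether content is cropped away, padded with fill_color, or distorted.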

Training Mode (is_train=True)

  1. Random Resized Crop with scale and ratio
  2. Convert to RGB
  3. Color Jitter (optional, based on aug_cfg)
  4. Grayscale (optional, based on aug_cfg)
  5. To Tensor
  6. Normalize with mean/std
  7. Random Erasing (optional, if using timm)
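The final tensor steps in both pipelines are simple per-channel arithmetic: To Tensor scales 8-bit pixel values into [0, 1], then Normalize applies (x - mean) / std. In plain Python, for a single channel:

```python
def normalize_pixel(value: int, mean: float, std: float) -> float:
    """To Tensor (byte -> [0, 1]) followed by Normalize ((x - mean) / std)."""
    x = value / 255.0
    return (x - mean) / std

# Using the red-channel statistics from the examples above
z = normalize_pixel(124, mean=0.48145466, std=0.26862954)
```

This is why mean and std must match the values the model was trained with: the encoder sees these normalized values, not raw pixels.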

Notes

  • For most use cases, use the transforms returned by create_model_and_transforms()
  • PreprocessCfg provides type-safe configuration compared to passing individual parameters
  • Training transforms include random augmentation for better generalization
  • Inference transforms are deterministic and optimized for consistent preprocessing
  • Images are automatically converted to RGB mode
  • Normalization defaults to the OpenAI CLIP statistics (the mean/std values shown in the examples above); override mean/std for models trained with different statistics
