Overview

Creates torchvision transform pipelines for preprocessing images before passing them to CLIP vision encoders. Accepts a PreprocessCfg configuration object for clean, declarative transform configuration.

Function Signature

def image_transform_v2(
    cfg: PreprocessCfg,
    is_train: bool,
    aug_cfg: Optional[Union[Dict[str, Any], AugmentationCfg]] = None,
) -> Compose

Parameters

cfg
PreprocessCfg
required
Preprocessing configuration object containing:
  • size: Target image size (int or tuple)
  • mean: Normalization mean values
  • std: Normalization std values
  • interpolation: Resize interpolation method
  • resize_mode: How to resize images ('shortest', 'longest', 'squash')
  • fill_color: Padding fill color
See PreprocessCfg for details.
is_train
bool
required
Whether to create training (with augmentation) or inference (deterministic) transforms.
aug_cfg
Union[Dict, AugmentationCfg]
default: None
Augmentation configuration for training. Only used when is_train=True. See AugmentationCfg for options.
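aug_cfg accepts either an AugmentationCfg instance or a plain dict with the same keys; a dict is coerced into the dataclass internally. A rough sketch of that coercion pattern, using a stand-in dataclass rather than the real AugmentationCfg:

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple, Union

# Stand-in for open_clip's AugmentationCfg (illustrative fields only)
@dataclass
class AugCfg:
    scale: Tuple[float, float] = (0.9, 1.0)
    color_jitter: Optional[float] = None

def coerce_aug_cfg(aug_cfg: Optional[Union[Dict[str, Any], AugCfg]]) -> AugCfg:
    """Accept None, a dict, or a dataclass instance, like image_transform_v2."""
    if aug_cfg is None:
        return AugCfg()           # fall back to defaults
    if isinstance(aug_cfg, dict):
        return AugCfg(**aug_cfg)  # dict -> dataclass
    return aug_cfg                # already a dataclass

cfg = coerce_aug_cfg({'scale': (0.08, 1.0), 'color_jitter': 0.4})
```

Either form ends up as the same configuration object, so pick whichever is more convenient for your config plumbing.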

Returns

transform
torchvision.transforms.Compose
Composed transform pipeline that can be applied to PIL Images. Includes:
  • Training: Random crops, color jitter, normalization
  • Inference: Resize, center crop, normalization

Examples

Create inference transform

import open_clip
from PIL import Image

# Create preprocessing config
preprocess_cfg = open_clip.PreprocessCfg(
    size=224,
    mean=(0.48145466, 0.4578275, 0.40821073),
    std=(0.26862954, 0.26130258, 0.27577711),
    interpolation='bicubic',
    resize_mode='shortest'
)

# Create inference transform
transform = open_clip.image_transform_v2(
    cfg=preprocess_cfg,
    is_train=False
)

# Apply to image
image = Image.open('cat.jpg')
image_tensor = transform(image)
print(image_tensor.shape)  # torch.Size([3, 224, 224])

Create training transform with augmentation

import open_clip

# Preprocessing config
preprocess_cfg = open_clip.PreprocessCfg(
    size=224,
    interpolation='bicubic'
)

# Augmentation config
aug_cfg = open_clip.AugmentationCfg(
    scale=(0.08, 1.0),
    ratio=(0.75, 1.33),
    color_jitter=(0.4, 0.4, 0.4, 0.1),
    color_jitter_prob=0.8,
    gray_scale_prob=0.2
)

# Create training transform
train_transform = open_clip.image_transform_v2(
    cfg=preprocess_cfg,
    is_train=True,
    aug_cfg=aug_cfg
)
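The returned train_transform is an ordinary callable, so it plugs straight into a dataset's __getitem__. A minimal, framework-free sketch (load_image is a stand-in for something like PIL.Image.open; in practice you would subclass torch.utils.data.Dataset):

```python
from typing import Callable, List

def load_image(path: str) -> str:
    # Stand-in loader for this sketch; in practice:
    # PIL.Image.open(path).convert('RGB')
    return f"<image:{path}>"

class ImageListDataset:
    """Applies a transform (e.g. train_transform above) to each item."""

    def __init__(self, paths: List[str], transform: Callable):
        self.paths = paths
        self.transform = transform

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, idx: int):
        return self.transform(load_image(self.paths[idx]))
```

Because training transforms are random, each __getitem__ call on the same index can yield a differently augmented tensor.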

Use with model preprocess config

import open_clip
import torch
from PIL import Image

# Load model; rebuild its preprocessing config from the attributes
# create_model_and_transforms() stores on the vision tower
model, _, preprocess_val = open_clip.create_model_and_transforms(
    'ViT-B-32',
    pretrained='laion2b_s34b_b79k',
    device='cuda'
)
preprocess_cfg = open_clip.PreprocessCfg(**model.visual.preprocess_cfg)

# Create transform from model's config
transform = open_clip.image_transform_v2(
    cfg=preprocess_cfg,
    is_train=False
)

# Process image
image = Image.open('photo.jpg')
image_tensor = transform(image).unsqueeze(0).to('cuda')

# Encode
with torch.no_grad():
    image_features = model.encode_image(image_tensor)

Different resize modes

import open_clip

# Shortest edge resize (default)
cfg_shortest = open_clip.PreprocessCfg(size=224, resize_mode='shortest')
transform_shortest = open_clip.image_transform_v2(cfg_shortest, is_train=False)

# Longest edge resize with padding
cfg_longest = open_clip.PreprocessCfg(size=224, resize_mode='longest', fill_color=0)
transform_longest = open_clip.image_transform_v2(cfg_longest, is_train=False)

# Squash to exact size (may distort aspect ratio)
cfg_squash = open_clip.PreprocessCfg(size=(224, 224), resize_mode='squash')
transform_squash = open_clip.image_transform_v2(cfg_squash, is_train=False)

Training with timm augmentations

import open_clip

# Use timm's RandAugment and other advanced augmentations
preprocess_cfg = open_clip.PreprocessCfg(size=224)
aug_cfg = open_clip.AugmentationCfg(
    scale=(0.08, 1.0),
    color_jitter=0.4,
    re_prob=0.25,  # Random erasing probability
    re_count=1,
    use_timm=True  # Enable timm augmentations
)

train_transform = open_clip.image_transform_v2(
    cfg=preprocess_cfg,
    is_train=True,
    aug_cfg=aug_cfg
)

Non-square images

import open_clip

# Rectangular image preprocessing
preprocess_cfg = open_clip.PreprocessCfg(
    size=(384, 224),  # height, width
    resize_mode='shortest'
)

transform = open_clip.image_transform_v2(cfg=preprocess_cfg, is_train=False)

Transform Pipeline

Inference Mode (is_train=False)

  1. Resize based on resize_mode:
    • shortest: Resize shortest edge to target size
    • longest: Resize longest edge to target size
    • squash: Resize to exact dimensions
  2. Center Crop (or pad if needed)
  3. Convert to RGB
  4. To Tensor
  5. Normalize with mean/std
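The resize-target arithmetic behind step 1 can be sketched as follows (assumed behavior for a square integer target; the real logic lives in open_clip's transform code and also handles tuple targets):

```python
def resize_dims(w: int, h: int, target: int, mode: str):
    """Compute the intermediate resize size for each resize_mode."""
    if mode == 'squash':
        return target, target  # exact size; aspect ratio may distort
    # 'shortest' scales the short edge to target, 'longest' the long edge
    scale_edge = min(w, h) if mode == 'shortest' else max(w, h)
    scale = target / scale_edge
    return round(w * scale), round(h * scale)

# A 640x480 image with target 224:
# 'shortest' -> (299, 224), then center crop to 224x224
# 'longest'  -> (224, 168), then pad to 224x224
```

All three modes therefore end at the same final tensor shape; they differ in whether content is cropped away, padded with fill_color, or distorted.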

Training Mode (is_train=True)

  1. Random Resized Crop with scale and ratio
  2. Convert to RGB
  3. Color Jitter (optional, based on aug_cfg)
  4. Grayscale (optional, based on aug_cfg)
  5. To Tensor
  6. Normalize with mean/std
  7. Random Erasing (optional, if using timm)
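The final tensor steps in both pipelines are simple per-channel arithmetic: To Tensor scales 8-bit pixel values into [0, 1], then Normalize applies (x - mean) / std. In plain Python, for a single channel:

```python
def normalize_pixel(value: int, mean: float, std: float) -> float:
    """To Tensor (byte -> [0, 1]) followed by Normalize ((x - mean) / std)."""
    x = value / 255.0
    return (x - mean) / std

# Using the red-channel statistics from the examples above
z = normalize_pixel(124, mean=0.48145466, std=0.26862954)
```

This is why mean and std must match the values the model was trained with: the encoder sees these normalized values, not raw pixels.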

Notes

  • For most use cases, use the transforms returned by create_model_and_transforms()
  • PreprocessCfg provides type-safe configuration compared to passing individual parameters
  • Training transforms include random augmentation for better generalization
  • Inference transforms are deterministic and optimized for consistent preprocessing
  • Images are automatically converted to RGB mode
  • Normalization defaults to the OpenAI CLIP statistics (the mean/std values shown in the examples above); override mean/std for models trained with different statistics
