Signature
Parameters
Model identifier, potentially with schema prefix:
'ViT-B-32': Built-in model name. pretrained specifies CLIP weights source.
'hf-hub:org/repo': Loads config/weights from HuggingFace Hub.
'local-dir:/path/to/folder': Loads config/weights from local directory.
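The three identifier forms can be told apart by their prefix. The following is a minimal sketch of that dispatch, not the library's own parser: `split_model_identifier` and `pretrained_is_consulted` are hypothetical helpers, and the only schemas assumed are the two named in this parameter description.

```python
from typing import Optional, Tuple

# Assumption: these are the only two schema prefixes, as listed in the text.
KNOWN_SCHEMAS = ("hf-hub", "local-dir")


def split_model_identifier(model_name: str) -> Tuple[Optional[str], str]:
    """Return (schema, remainder); schema is None for a built-in model name."""
    for schema in KNOWN_SCHEMAS:
        prefix = schema + ":"
        if model_name.startswith(prefix):
            return schema, model_name[len(prefix):]
    return None, model_name


def pretrained_is_consulted(model_name: str) -> bool:
    """The weights source argument only applies when model_name has no schema."""
    schema, _ = split_model_identifier(model_name)
    return schema is None


examples = ["ViT-B-32", "hf-hub:org/repo", "local-dir:/path/to/folder"]
parsed = [split_model_identifier(name) for name in examples]
for name, (schema, rest) in zip(examples, parsed):
    print(f"{name!r}: schema={schema!r}, remainder={rest!r}")
```

In this sketch only the built-in name leaves the schema as None, which is the one case where a separate weights tag or file path would be looked at.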
Source for CLIP weights (tag or file path). Used ONLY if model_name has no schema.
If True, load the resolved pretrained weights; otherwise use random init or tower overrides only.
Model precision. Options: 'fp32', 'fp16', 'bf16', 'pure_fp16', 'pure_bf16'.
Device to load model on.
If True, JIT compile the model.
Force use of QuickGELU activation in model config.
Force use of custom text encoder architecture.
Override patch dropout value in model config.
Override image size in model config.
Override context length in text config.
Override default image normalization mean values (per channel). Example: (0.48145466, 0.4578275, 0.40821073).
Override default image normalization std values (per channel). Example: (0.26862954, 0.26130258, 0.27577711).
Override default interpolation method for image resizing. Options: 'bicubic', 'bilinear', 'nearest'.
Override resize mode for preprocessing. Options:
'squash': Resize to exact dimensions (may distort aspect ratio).
'shortest': Resize shortest edge to target size, then crop.
'longest': Resize longest edge to target size, then crop or pad to fit.
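The normalization and resize overrides come down to simple arithmetic. The sketch below uses the example mean and std values quoted in the descriptions and shows the size an image would have after the resize step alone for each mode. The helper names `normalize_pixel` and `resized_size` are hypothetical, the final crop or pad to the target size is not modeled, and rounding to the nearest integer is an assumption.

```python
from typing import Sequence, Tuple

# Example values quoted in the parameter descriptions.
MEAN = (0.48145466, 0.4578275, 0.40821073)
STD = (0.26862954, 0.26130258, 0.27577711)


def normalize_pixel(rgb: Sequence[float],
                    mean: Sequence[float] = MEAN,
                    std: Sequence[float] = STD) -> Tuple[float, ...]:
    """Per-channel (value - mean) / std for one RGB pixel already scaled to [0, 1]."""
    return tuple((v - m) / s for v, m, s in zip(rgb, mean, std))


def resized_size(width: int, height: int, target: int, mode: str) -> Tuple[int, int]:
    """Size after the resize step only, before any crop or pad to target x target."""
    if mode == "squash":
        # Both edges are forced to the target, so the aspect ratio may change.
        return target, target
    if mode == "shortest":
        scale = target / min(width, height)
    elif mode == "longest":
        scale = target / max(width, height)
    else:
        raise ValueError(f"unknown resize mode: {mode!r}")
    # Aspect ratio is preserved; rounding to the nearest pixel is an assumption.
    return round(width * scale), round(height * scale)


for mode in ("squash", "shortest", "longest"):
    print(mode, resized_size(640, 480, 224, mode))
print(normalize_pixel((1.0, 1.0, 1.0)))
```

For a 640x480 input and a 224 target, 'shortest' leaves one edge larger than the target and 'longest' leaves one edge smaller, which is why a follow-up step is needed to reach the final size.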
Augmentation configuration for training transforms. Can be dict or AugmentationCfg object. Controls random crop, color jitter, etc. If None, uses model defaults. Example dict: {'scale': (0.9, 1.0), 'ratio': (1.0, 1.0), 'color_jitter': 0.4}
Load default base weights for image tower at creation if no CLIP weights loaded.
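To make the example augmentation dict concrete, here is a hedged sketch of how 'scale' and 'ratio' could drive a random crop. It assumes the common random-resized-crop interpretation: 'scale' bounds the cropped fraction of the image area and 'ratio' bounds the width/height aspect ratio. That interpretation and the helper `sample_crop` are assumptions for illustration, not taken from this page, and 'color_jitter' is not modeled.

```python
import math
import random
from typing import Tuple


def sample_crop(width: int, height: int,
                scale: Tuple[float, float], ratio: Tuple[float, float],
                rng: random.Random, attempts: int = 10) -> Tuple[int, int]:
    """Sample a crop size (w, h): area fraction from `scale`, aspect ratio from `ratio`."""
    area = width * height
    log_lo, log_hi = math.log(ratio[0]), math.log(ratio[1])
    for _ in range(attempts):
        target_area = area * rng.uniform(scale[0], scale[1])
        aspect = math.exp(rng.uniform(log_lo, log_hi))  # log-uniform w/h ratio
        w = round(math.sqrt(target_area * aspect))
        h = round(math.sqrt(target_area / aspect))
        if 0 < w <= width and 0 < h <= height:
            return w, h
    # Fallback when no sampled crop fits inside the image: use the whole image.
    return width, height


aug_cfg = {"scale": (0.9, 1.0), "ratio": (1.0, 1.0), "color_jitter": 0.4}
rng = random.Random(0)  # fixed seed so repeated runs give the same crops
crops = [sample_crop(224, 224, aug_cfg["scale"], aug_cfg["ratio"], rng) for _ in range(5)]
print(crops)
```

With 'ratio' fixed at (1.0, 1.0) every sampled crop is square, and with 'scale' at (0.9, 1.0) each crop keeps roughly 90 to 100 percent of the image area, so this configuration is a mild augmentation.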
Load default base weights for text tower at creation if no CLIP weights loaded.
Path to load weights specifically into image tower after creation.
Path to load weights specifically into text tower after creation.
Cache directory for downloads.
If True and model supports it, return dict output.
Use weights_only=True for torch.load (safer).
Additional keyword arguments for model constructor.
Returns
The created model instance.
Image preprocessing transform for training (includes augmentation like random crop, color jitter).
Image preprocessing transform for validation/inference (no augmentation, deterministic).
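A short usage sketch follows, assuming the function documented here is `open_clip.create_model_and_transforms` (the page itself does not name it). Apart from `model_name` and `pretrained`, the keyword names `precision` and `device` are assumed from the descriptions above, and the weight tag is only an example. The import sits inside `load` so the rest of the block runs without the package installed.

```python
from typing import Any, Dict


def build_call_kwargs() -> Dict[str, Any]:
    """Keyword arguments for the call; names other than model_name/pretrained are assumed."""
    return {
        "model_name": "ViT-B-32",           # built-in name, so `pretrained` is consulted
        "pretrained": "laion2b_s34b_b79k",  # assumed weight tag; substitute your own
        "precision": "fp32",
        "device": "cpu",
    }


def load(kwargs: Dict[str, Any]):
    """Create the model and both transforms (needs open_clip; may download weights)."""
    import open_clip  # imported lazily; assumed to be the library this page documents

    model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms(**kwargs)
    model.eval()  # for inference, pair the model with the deterministic validation transform
    return model, preprocess_train, preprocess_val


call_kwargs = build_call_kwargs()
print(call_kwargs)
```

Calling `load(call_kwargs)` would return the three values described under Returns, in that order. Use the training transform only when building a training data pipeline, since it applies random augmentation.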
