OpenCLIP provides a comprehensive collection of pretrained CLIP models trained on various datasets at different scales.
## Listing Available Models

### List All Pretrained Models

```python
import open_clip

# Get all pretrained model/tag combinations
pretrained = open_clip.list_pretrained()
for model_name, tag in pretrained:
    print(f"{model_name}:{tag}")

# Get the same list as "model:tag" strings
pretrained_str = open_clip.list_pretrained(as_str=True)
# Returns: ['RN50:openai', 'ViT-B-32:laion2b_s34b_b79k', ...]
```
### List Model Architectures

```python
import open_clip

# Get all available model architectures
architectures = open_clip.list_models()
print(architectures)
# ['RN50', 'RN101', 'RN50x4', 'ViT-B-32', 'ViT-L-14', ...]
```
### Query Specific Models

```python
from open_clip import list_pretrained_tags_by_model, list_pretrained_models_by_tag

# Get all pretrained tags for a specific architecture
tags = list_pretrained_tags_by_model('ViT-B-32')
print(tags)
# ['openai', 'laion400m_e31', 'laion2b_s34b_b79k', 'datacomp_xl_s13b_b90k', ...]

# Get all architectures with a specific tag
models = list_pretrained_models_by_tag('openai')
print(models)
# ['RN50', 'RN101', 'ViT-B-32', 'ViT-L-14', ...]
```
## Model Zoo Overview

### ViT Models

ViT models provide excellent accuracy-efficiency tradeoffs:

| Model | Parameters | FLOPs | Best Pretrained Tag | ImageNet Zero-Shot |
|---|---|---|---|---|
| ViT-B-32 | 88M | 4.4G | datacomp_xl_s13b_b90k | 72.8% |
| ViT-B-16 | 86M | 17.6G | datacomp_xl_s13b_b90k | 73.5% |
| ViT-L-14 | 304M | 81.1G | datacomp_xl_s13b_b90k | 79.2% |
| ViT-H-14 | 632M | 169.1G | laion2b_s32b_b79k | 78.0% |
| ViT-g-14 | 1.01B | 257.4G | laion2b_s34b_b88k | 76.6% |
| ViT-bigG-14 | 1.84B | 571.3G | laion2b_s39b_b160k | 80.1% |
### ResNet Models

Classic CNN-based architectures:

| Model | Parameters | ImageNet Zero-Shot | Best Tag |
|---|---|---|---|
| RN50 | 38M | 59.6% | openai |
| RN101 | 56M | 62.3% | openai |
| RN50x4 | 87M | 68.3% | openai |
| RN50x16 | 167M | 73.2% | openai |
| RN50x64 | 623M | 76.6% | openai |
### ConvNeXt Models

Modern CNN architectures with competitive performance:

| Model | Parameters | ImageNet Zero-Shot | Best Tag |
|---|---|---|---|
| ConvNeXt-Base | 88M | 71.5% | laion2b_s13b_b82k_augreg |
| ConvNeXt-Base-W | 89M | 72.1% | laion2b_s13b_b82k_augreg |
| ConvNeXt-Large-D | 200M | 76.9% | laion2b_s29b_b131k_ft |
| ConvNeXt-XXLarge | 846M | 79.5% | laion2b_s34b_b82k_augreg |
### SigLIP Models

Models trained with the SigLIP loss function:

| Model | Resolution | ImageNet Zero-Shot | Tag |
|---|---|---|---|
| ViT-B-16-SigLIP | 256 | 76.9% | webli |
| ViT-B-16-SigLIP | 384 | 78.4% | webli |
| ViT-L-16-SigLIP | 256 | 79.8% | webli |
| ViT-L-16-SigLIP | 384 | 81.5% | webli |
| ViT-SO400M-14-SigLIP | 224 | 82.0% | webli |
| ViT-SO400M-14-SigLIP | 384 | 83.1% | webli |
### EVA Models

State-of-the-art EVA-CLIP models:

| Model | ImageNet Zero-Shot | Tag |
|---|---|---|
| EVA02-B-16 | 74.7% | merged2b_s8b_b131k |
| EVA02-L-14 | 79.8% | merged2b_s4b_b131k |
| EVA02-L-14-336 | 80.4% | merged2b_s6b_b61k |
| EVA02-E-14 | 81.9% | laion2b_s4b_b115k |
| EVA02-E-14-plus | 82.7% | laion2b_s9b_b144k |
## OpenAI WIT

```python
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32',
    pretrained='openai'
)
```

- Dataset: WebImageText (WIT), 400M image-text pairs
- Models: original OpenAI CLIP models
- Tags: `openai`
## LAION Datasets

```python
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-L-14',
    pretrained='laion2b_s32b_b82k'
)
```

- LAION-400M: 400M English image-text pairs
- LAION-2B: 2B English image-text pairs
- LAION-5B: 5B multilingual image-text pairs
- Tags: `laion400m_*`, `laion2b_*`, `laion5b_*`
## DataComp

```python
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-L-14',
    pretrained='datacomp_xl_s13b_b90k'
)
```

- DataComp-XL: 1.4B filtered image-text pairs
- DataComp-L: 1.0B pairs
- DataComp-M: 128M pairs
- DataComp-S: 13M pairs
- Tags: `datacomp_xl_*`, `datacomp_l_*`, `datacomp_m_*`, `datacomp_s_*`
## WebLI (SigLIP)

```python
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-SO400M-14-SigLIP-384',
    pretrained='webli'
)
```

- Dataset: Google's WebLI dataset
- Models: SigLIP models with sigmoid loss
- Tags: `webli`
## Model Selection Guide

### Speed Priority

Fastest inference:

```python
# Small and fast
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32',  # 4.4 GFLOPs
    pretrained='laion2b_s34b_b79k'
)
```

- ViT-B-32: Best speed/accuracy tradeoff
- RN50: Classic CNN option
- MobileCLIP-S1: Ultra-fast mobile deployment

### Accuracy Priority

Best zero-shot accuracy:

```python
# State-of-the-art accuracy
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-bigG-14',  # 80.1% ImageNet
    pretrained='laion2b_s39b_b160k'
)
```

- ViT-bigG-14: 80.1% ImageNet zero-shot
- ViT-SO400M-14-SigLIP-384: 83.1% (SigLIP)
- EVA02-E-14-plus: 82.7% (EVA-CLIP)

### Balanced

Good balance of speed and accuracy:

```python
# Recommended for most use cases
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-L-14',  # 79.2% ImageNet
    pretrained='datacomp_xl_s13b_b90k'
)
```

- ViT-L-14: Excellent performance at reasonable cost
- ViT-B-16: Good accuracy, moderate speed
- ConvNeXt-Base: CNN alternative
## Loading from HuggingFace Hub

Many models are available on the HuggingFace Hub:

```python
import open_clip

# Direct HuggingFace loading
model, _, preprocess = open_clip.create_model_and_transforms(
    'hf-hub:laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K',
    device='cuda'
)

# Alternative: use standard loading
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-L-14',
    pretrained='datacomp_xl_s13b_b90k',
    device='cuda'
)
```
### Popular HuggingFace Models

- `laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K`
- `laion/CLIP-ViT-H-14-laion2B-s32B-b79K`
- `laion/CLIP-ViT-bigG-14-laion2B-39B-b160k`
- `apple/DFN5B-CLIP-ViT-H-14`
- `timm/ViT-SO400M-14-SigLIP-384`

HuggingFace models automatically download their config and weights. The `hf-hub:` prefix explicitly specifies HuggingFace as the source.
## Multilingual Models

Models trained on multilingual datasets:

```python
import open_clip

# XLM-RoBERTa based text tower
model, _, preprocess = open_clip.create_model_and_transforms(
    'xlm-roberta-base-ViT-B-32',
    pretrained='laion5b_s13b_b90k',
    device='cuda'
)

# The matching tokenizer supports 100+ languages
tokenizer = open_clip.get_tokenizer('xlm-roberta-base-ViT-B-32')
```

Available multilingual models:

- `xlm-roberta-base-ViT-B-32`
- `xlm-roberta-large-ViT-H-14`
- `ViT-B-16-SigLIP-i18n-256` (SigLIP multilingual)
## Specialized Models

### CoCa (Captioning)

Models with generative captioning capabilities:

```python
import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    'coca_ViT-L-14',
    pretrained='mscoco_finetuned_laion2b_s13b_b90k'
)

# Generate a caption for an image
im = preprocess(Image.open("cat.jpg")).unsqueeze(0)
with torch.no_grad(), torch.cuda.amp.autocast():
    generated = model.generate(im)
print(open_clip.decode(generated[0]))
```
### DFN Models

Models trained on data curated with Apple's Data Filtering Networks (DFN):

```python
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-H-14',
    pretrained='dfn5b',  # 83.4% ImageNet
    device='cuda'
)
```
For detailed zero-shot performance across 38 datasets, see the full results CSV.
## Benchmark Datasets

- ImageNet: primary zero-shot benchmark
- ImageNet variants: ImageNet-A, -R, -Sketch, -V2
- Object recognition: CIFAR-10/100, STL-10, Food101
- Fine-grained: Flowers102, Pets, Cars, Aircraft
- Scenes: SUN397, Places365
- Actions: UCF101, Kinetics700
## Example: Finding the Best Model

```python
import open_clip

# Find all ViT-L-14 pretrained versions
tags = open_clip.list_pretrained_tags_by_model('ViT-L-14')
print(f"Available ViT-L-14 checkpoints: {tags}")

# Load the best-performing one
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-L-14',
    pretrained='datacomp_xl_s13b_b90k',  # 79.2% ImageNet
    device='cuda',
    precision='fp16'
)
model.eval()

print(f"Model loaded with {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```