OpenCLIP provides a comprehensive collection of pretrained CLIP models trained on various datasets at different scales.
## Listing Available Models

### List All Pretrained Models

```python
import open_clip

# Get all pretrained model/tag combinations
pretrained = open_clip.list_pretrained()
for model_name, tag in pretrained:
    print(f"{model_name}:{tag}")

# Get the same list as "model:tag" strings
pretrained_str = open_clip.list_pretrained(as_str=True)
# Returns: ['RN50:openai', 'ViT-B-32:laion2b_s34b_b79k', ...]
```
### List Model Architectures

```python
import open_clip

# Get all available model architectures
architectures = open_clip.list_models()
print(architectures)
# ['RN50', 'RN101', 'RN50x4', 'ViT-B-32', 'ViT-L-14', ...]
```
### Query Specific Models

```python
from open_clip import list_pretrained_tags_by_model, list_pretrained_models_by_tag

# Get all pretrained tags for a specific architecture
tags = list_pretrained_tags_by_model('ViT-B-32')
print(tags)
# ['openai', 'laion400m_e31', 'laion2b_s34b_b79k', 'datacomp_xl_s13b_b90k', ...]

# Get all architectures with a specific tag
models = list_pretrained_models_by_tag('openai')
print(models)
# ['RN50', 'RN101', 'ViT-B-32', 'ViT-L-14', ...]
```
## Model Zoo Overview

### ViT Models

ViT models provide excellent accuracy-efficiency tradeoffs:

| Model | Parameters | FLOPs | Best Pretrained Tag | ImageNet Zero-Shot |
|---|---|---|---|---|
| ViT-B-32 | 88M | 4.4G | datacomp_xl_s13b_b90k | 72.8% |
| ViT-B-16 | 86M | 17.6G | datacomp_xl_s13b_b90k | 73.5% |
| ViT-L-14 | 304M | 81.1G | datacomp_xl_s13b_b90k | 79.2% |
| ViT-H-14 | 632M | 169.1G | laion2b_s32b_b79k | 78.0% |
| ViT-g-14 | 1.01B | 257.4G | laion2b_s34b_b88k | 76.6% |
| ViT-bigG-14 | 1.84B | 571.3G | laion2b_s39b_b160k | 80.1% |
### ResNet Models

Classic CNN-based architectures:

| Model | Parameters | ImageNet Zero-Shot | Best Tag |
|---|---|---|---|
| RN50 | 38M | 59.6% | openai |
| RN101 | 56M | 62.3% | openai |
| RN50x4 | 87M | 68.3% | openai |
| RN50x16 | 167M | 73.2% | openai |
| RN50x64 | 623M | 76.6% | openai |
### ConvNeXt Models

Modern CNN architectures with competitive performance:

| Model | Parameters | ImageNet Zero-Shot | Best Tag |
|---|---|---|---|
| ConvNeXt-Base | 88M | 71.5% | laion2b_s13b_b82k_augreg |
| ConvNeXt-Base-W | 89M | 72.1% | laion2b_s13b_b82k_augreg |
| ConvNeXt-Large-D | 200M | 76.9% | laion2b_s29b_b131k_ft |
| ConvNeXt-XXLarge | 846M | 79.5% | laion2b_s34b_b82k_augreg |
### SigLIP Models

Models trained with the SigLIP loss function:

| Model | Resolution | ImageNet Zero-Shot | Tag |
|---|---|---|---|
| ViT-B-16-SigLIP | 256 | 76.9% | webli |
| ViT-B-16-SigLIP | 384 | 78.4% | webli |
| ViT-L-16-SigLIP | 256 | 79.8% | webli |
| ViT-L-16-SigLIP | 384 | 81.5% | webli |
| ViT-SO400M-14-SigLIP | 224 | 82.0% | webli |
| ViT-SO400M-14-SigLIP | 384 | 83.1% | webli |
### EVA Models

State-of-the-art EVA-CLIP models:

| Model | ImageNet Zero-Shot | Tag |
|---|---|---|
| EVA02-B-16 | 74.7% | merged2b_s8b_b131k |
| EVA02-L-14 | 79.8% | merged2b_s4b_b131k |
| EVA02-L-14-336 | 80.4% | merged2b_s6b_b61k |
| EVA02-E-14 | 81.9% | laion2b_s4b_b115k |
| EVA02-E-14-plus | 82.7% | laion2b_s9b_b144k |
## OpenAI WIT

```python
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32',
    pretrained='openai'
)
```

- Dataset: WebImageText (WIT), 400M image-text pairs
- Models: original OpenAI CLIP models
- Tags: `openai`
## LAION Datasets

```python
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-L-14',
    pretrained='laion2b_s32b_b82k'
)
```

- LAION-400M: 400M English image-text pairs
- LAION-2B: 2B English image-text pairs
- LAION-5B: 5B multilingual image-text pairs
- Tags: `laion400m_*`, `laion2b_*`, `laion5b_*`
## DataComp

```python
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-L-14',
    pretrained='datacomp_xl_s13b_b90k'
)
```

- DataComp-XL: 1.4B filtered image-text pairs
- DataComp-L: 1.0B pairs
- DataComp-M: 128M pairs
- DataComp-S: 13M pairs
- Tags: `datacomp_xl_*`, `datacomp_l_*`, `datacomp_m_*`, `datacomp_s_*`
## WebLI (SigLIP)

```python
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-SO400M-14-SigLIP-384',
    pretrained='webli'
)
```

- Dataset: Google's WebLI dataset
- Models: SigLIP models with sigmoid loss
- Tags: `webli`
## Model Selection Guide

### Speed Priority

Fastest inference:

```python
# Small and fast
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32',  # 4.4 GFLOPs
    pretrained='laion2b_s34b_b79k'
)
```

- ViT-B-32: Best speed/accuracy tradeoff
- RN50: Classic CNN option
- MobileCLIP-S1: Ultra-fast mobile deployment

### Accuracy Priority

Best zero-shot accuracy:

```python
# State-of-the-art accuracy
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-bigG-14',  # 80.1% ImageNet
    pretrained='laion2b_s39b_b160k'
)
```

- ViT-bigG-14: 80.1% ImageNet zero-shot
- ViT-SO400M-14-SigLIP-384: 83.1% (SigLIP)
- EVA02-E-14-plus: 82.7% (EVA-CLIP)

### Balanced

Good balance of speed and accuracy:

```python
# Recommended for most use cases
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-L-14',  # 79.2% ImageNet
    pretrained='datacomp_xl_s13b_b90k'
)
```

- ViT-L-14: Excellent performance at reasonable cost
- ViT-B-16: Good accuracy, moderate speed
- ConvNeXt-Base: CNN alternative
## Loading from HuggingFace Hub

Many models are available on the HuggingFace Hub:

```python
import open_clip

# Direct HuggingFace loading
model, _, preprocess = open_clip.create_model_and_transforms(
    'hf-hub:laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K',
    device='cuda'
)

# Alternative: use standard loading
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-L-14',
    pretrained='datacomp_xl_s13b_b90k',
    device='cuda'
)
```
### Popular HuggingFace Models

- `laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K`
- `laion/CLIP-ViT-H-14-laion2B-s32B-b79K`
- `laion/CLIP-ViT-bigG-14-laion2B-39B-b160k`
- `apple/DFN5B-CLIP-ViT-H-14`
- `timm/ViT-SO400M-14-SigLIP-384`

HuggingFace models automatically download their config and weights. The `hf-hub:` prefix explicitly specifies HuggingFace as the source.
## Multilingual Models

Models trained on multilingual datasets:

```python
import open_clip

# XLM-RoBERTa based text tower
model, _, preprocess = open_clip.create_model_and_transforms(
    'xlm-roberta-base-ViT-B-32',
    pretrained='laion5b_s13b_b90k',
    device='cuda'
)

# The matching tokenizer supports 100+ languages
tokenizer = open_clip.get_tokenizer('xlm-roberta-base-ViT-B-32')
```

Available multilingual models:

- `xlm-roberta-base-ViT-B-32`
- `xlm-roberta-large-ViT-H-14`
- `ViT-B-16-SigLIP-i18n-256` (SigLIP multilingual)
## Specialized Models

### CoCa (Captioning)

Models with generative captioning capabilities:

```python
import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    'coca_ViT-L-14',
    pretrained='mscoco_finetuned_laion2b_s13b_b90k'
)

# Generate a caption for an image
im = preprocess(Image.open("cat.jpg")).unsqueeze(0)
with torch.no_grad(), torch.cuda.amp.autocast():
    generated = model.generate(im)
print(open_clip.decode(generated[0]))
```
### DFN Models

Models trained on data curated with Apple's Data Filtering Networks (DFN):

```python
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-H-14',
    pretrained='dfn5b',  # 83.4% ImageNet
    device='cuda'
)
```
For detailed zero-shot performance across 38 datasets, see the full results CSV.
## Benchmark Datasets

- ImageNet: primary zero-shot benchmark
- ImageNet variants: ImageNet-A, -R, -Sketch, -V2
- Object recognition: CIFAR-10/100, STL-10, Food101
- Fine-grained: Flowers102, Pets, Cars, Aircraft
- Scenes: SUN397, Places365
- Actions: UCF101, Kinetics700
## Example: Finding the Best Model

```python
import open_clip

# Find all ViT-L-14 pretrained versions
tags = open_clip.list_pretrained_tags_by_model('ViT-L-14')
print(f"Available ViT-L-14 checkpoints: {tags}")

# Load the best-performing one
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-L-14',
    pretrained='datacomp_xl_s13b_b90k',  # 79.2% ImageNet
    device='cuda',
    precision='fp16'
)
model.eval()

print(f"Model loaded with {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```