OpenCLIP provides a comprehensive collection of pretrained CLIP models trained on various datasets at different scales.

Listing Available Models

List All Pretrained Models

import open_clip

# Get all pretrained model/tag combinations
pretrained = open_clip.list_pretrained()
for model_name, tag in pretrained:
    print(f"{model_name}:{tag}")

# Get as formatted strings
pretrained_str = open_clip.list_pretrained(as_str=True)
# Returns: ['RN50:openai', 'ViT-B-32:laion2b_s34b_b79k', ...]
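The `as_str` format is just `model:tag` joined with a colon, so it can be split back into pairs with plain Python. A minimal sketch, using hard-coded sample strings from the output above rather than a live `list_pretrained()` call:

```python
# Split "model:tag" strings (the list_pretrained(as_str=True) format)
# back into (model, tag) pairs. Sample strings hard-coded from the
# example output above so the snippet runs standalone.
pretrained_str = ["RN50:openai", "ViT-B-32:laion2b_s34b_b79k"]

# maxsplit=1 keeps any further colons inside the tag intact
pairs = [tuple(s.split(":", 1)) for s in pretrained_str]
print(pairs)  # [('RN50', 'openai'), ('ViT-B-32', 'laion2b_s34b_b79k')]
```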

List Model Architectures

import open_clip

# Get all available model architectures
architectures = open_clip.list_models()
print(architectures)
# ['RN50', 'RN101', 'RN50x4', 'ViT-B-32', 'ViT-L-14', ...]

Query Specific Models

from open_clip import list_pretrained_tags_by_model, list_pretrained_models_by_tag

# Get all pretrained tags for a specific model
tags = list_pretrained_tags_by_model('ViT-B-32')
print(tags)
# ['openai', 'laion400m_e31', 'laion2b_s34b_b79k', 'datacomp_xl_s13b_b90k', ...]

# Get all models with a specific tag
models = list_pretrained_models_by_tag('openai')
print(models)
# ['RN50', 'RN101', 'ViT-B-32', 'ViT-L-14', ...]
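For repeated lookups it can be convenient to build a model-to-tags index once instead of calling the query helpers per model. An illustrative sketch; in practice the pairs would come from `open_clip.list_pretrained()`, but a small hard-coded sample is used here so the snippet runs standalone:

```python
from collections import defaultdict

# Group (model, tag) pairs into a model -> tags index.
# Sample pairs hard-coded; real code would use open_clip.list_pretrained().
pairs = [
    ("RN50", "openai"),
    ("ViT-B-32", "openai"),
    ("ViT-B-32", "laion2b_s34b_b79k"),
]

tags_by_model = defaultdict(list)
for model_name, tag in pairs:
    tags_by_model[model_name].append(tag)

print(dict(tags_by_model))
# {'RN50': ['openai'], 'ViT-B-32': ['openai', 'laion2b_s34b_b79k']}
```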

Model Zoo Overview

Vision Transformer (ViT) Models

ViT models provide excellent accuracy-efficiency tradeoffs:
| Model | Parameters | FLOPs | Best Pretrained Tag | ImageNet Zero-Shot |
| --- | --- | --- | --- | --- |
| ViT-B-32 | 88M | 4.4G | datacomp_xl_s13b_b90k | 72.8% |
| ViT-B-16 | 86M | 17.6G | datacomp_xl_s13b_b90k | 73.5% |
| ViT-L-14 | 304M | 81.1G | datacomp_xl_s13b_b90k | 79.2% |
| ViT-H-14 | 632M | 169.1G | laion2b_s32b_b79k | 78.0% |
| ViT-g-14 | 1.01B | 257.4G | laion2b_s34b_b88k | 76.6% |
| ViT-bigG-14 | 1.84B | 571.3G | laion2b_s39b_b160k | 80.1% |
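The accuracy-efficiency tradeoff can be turned into a simple selection rule. A sketch that picks the highest zero-shot ViT model under a GFLOPs budget, using the numbers from the table above:

```python
# (model, GFLOPs, ImageNet zero-shot %) taken from the ViT table above.
vit_models = [
    ("ViT-B-32", 4.4, 72.8),
    ("ViT-B-16", 17.6, 73.5),
    ("ViT-L-14", 81.1, 79.2),
    ("ViT-H-14", 169.1, 78.0),
    ("ViT-g-14", 257.4, 76.6),
    ("ViT-bigG-14", 571.3, 80.1),
]

def best_under_budget(models, max_gflops):
    """Return the most accurate model whose FLOPs fit the budget."""
    candidates = [m for m in models if m[1] <= max_gflops]
    return max(candidates, key=lambda m: m[2]) if candidates else None

print(best_under_budget(vit_models, 100))   # ('ViT-L-14', 81.1, 79.2)
print(best_under_budget(vit_models, 1000))  # ('ViT-bigG-14', 571.3, 80.1)
```

Note that accuracy is not monotone in compute here: ViT-L-14 with the DataComp-XL tag outperforms the larger ViT-H-14 and ViT-g-14 checkpoints.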

ResNet Models

Classic CNN-based architectures:
| Model | Parameters | ImageNet Zero-Shot | Best Tag |
| --- | --- | --- | --- |
| RN50 | 38M | 59.6% | openai |
| RN101 | 56M | 62.3% | openai |
| RN50x4 | 87M | 68.3% | openai |
| RN50x16 | 167M | 73.2% | openai |
| RN50x64 | 623M | 76.6% | openai |

ConvNeXt Models

Modern CNN architectures with competitive performance:
| Model | Parameters | ImageNet Zero-Shot | Best Tag |
| --- | --- | --- | --- |
| ConvNeXt-Base | 88M | 71.5% | laion2b_s13b_b82k_augreg |
| ConvNeXt-Base-W | 89M | 72.1% | laion2b_s13b_b82k_augreg |
| ConvNeXt-Large-D | 200M | 76.9% | laion2b_s29b_b131k_ft |
| ConvNeXt-XXLarge | 846M | 79.5% | laion2b_s34b_b82k_augreg |

SigLIP Models

Models trained with SigLIP loss function:
| Model | Resolution | ImageNet Zero-Shot | Tag |
| --- | --- | --- | --- |
| ViT-B-16-SigLIP | 256 | 76.9% | webli |
| ViT-B-16-SigLIP | 384 | 78.4% | webli |
| ViT-L-16-SigLIP | 256 | 79.8% | webli |
| ViT-L-16-SigLIP | 384 | 81.5% | webli |
| ViT-SO400M-14-SigLIP | 224 | 82.0% | webli |
| ViT-SO400M-14-SigLIP | 384 | 83.1% | webli |

EVA Models

State-of-the-art EVA-CLIP models:
| Model | ImageNet Zero-Shot | Tag |
| --- | --- | --- |
| EVA02-B-16 | 74.7% | merged2b_s8b_b131k |
| EVA02-L-14 | 79.8% | merged2b_s4b_b131k |
| EVA02-L-14-336 | 80.4% | merged2b_s6b_b61k |
| EVA02-E-14 | 81.9% | laion2b_s4b_b115k |
| EVA02-E-14-plus | 82.7% | laion2b_s9b_b144k |

Training Dataset Information

OpenAI WIT

model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32', 
    pretrained='openai'
)
  • Dataset: WebImageText (WIT) - 400M image-text pairs
  • Models: Original OpenAI CLIP models
  • Tags: openai

LAION Datasets

model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-L-14',
    pretrained='laion2b_s32b_b82k'
)
  • LAION-400M: 400M English image-text pairs
  • LAION-2B: 2B English image-text pairs
  • LAION-5B: 5B multilingual image-text pairs
  • Tags: laion400m_*, laion2b_*, laion5b_*

DataComp

model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-L-14',
    pretrained='datacomp_xl_s13b_b90k'
)
  • DataComp-XL: 1.4B filtered image-text pairs
  • DataComp-L: 1.0B pairs
  • DataComp-M: 128M pairs
  • DataComp-S: 13M pairs
  • Tags: datacomp_xl_*, datacomp_l_*, datacomp_m_*, datacomp_s_*
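Many of the tags above follow a common naming convention: a suffix like `s13b_b90k` encodes the number of samples seen during training (`s13b` ≈ 13B) and the global batch size (`b90k` ≈ 90K). A hedged sketch of a parser for that convention (tags using other suffixes, such as epoch-style `laion400m_e31`, are deliberately left unparsed):

```python
import re

# Parse tags of the form "<dataset>_s<N>b_b<M>k", e.g. "datacomp_xl_s13b_b90k":
# sNb  -> N billion samples seen during training
# bMk  -> global batch size of M thousand
def parse_tag(tag):
    match = re.match(r"(?P<dataset>.+)_s(?P<samples>\d+)b_b(?P<batch>\d+)k$", tag)
    if match is None:
        return None  # tag uses a different convention (e.g. "openai", "laion400m_e31")
    return {
        "dataset": match.group("dataset"),
        "samples_seen_billions": int(match.group("samples")),
        "batch_size_thousands": int(match.group("batch")),
    }

print(parse_tag("datacomp_xl_s13b_b90k"))
# {'dataset': 'datacomp_xl', 'samples_seen_billions': 13, 'batch_size_thousands': 90}
```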

WebLI (SigLIP)

model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-SO400M-14-SigLIP-384',
    pretrained='webli'
)
  • Dataset: Google’s WebLI dataset
  • Models: SigLIP models with sigmoid loss
  • Tags: webli

Model Selection Guide

Fastest inference:
# Small and fast
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32',  # 4.4 GFLOPs
    pretrained='laion2b_s34b_b79k'
)
  • ViT-B-32: Best speed/accuracy tradeoff
  • RN50: Classic CNN option
  • MobileCLIP-S1: Ultra-fast mobile deployment
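The guide above can be condensed into a small lookup table. An illustrative sketch; the `(architecture, tag)` choices are assumptions drawn from the tables on this page, not an official mapping:

```python
# Map a deployment constraint to a suggested (architecture, pretrained tag)
# pair, following the selection guide above. Choices are illustrative.
RECOMMENDATIONS = {
    "fastest": ("ViT-B-32", "laion2b_s34b_b79k"),
    "balanced": ("ViT-L-14", "datacomp_xl_s13b_b90k"),
    "best_accuracy": ("ViT-bigG-14", "laion2b_s39b_b160k"),
}

def recommend(constraint):
    return RECOMMENDATIONS[constraint]

print(recommend("fastest"))  # ('ViT-B-32', 'laion2b_s34b_b79k')
```

The returned pair can be passed straight to `open_clip.create_model_and_transforms(arch, pretrained=tag)`.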

Loading from HuggingFace Hub

Many models are available on HuggingFace Hub:
import open_clip

# Direct HuggingFace loading
model, _, preprocess = open_clip.create_model_and_transforms(
    'hf-hub:laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K',
    device='cuda'
)

# Alternative: use standard loading
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-L-14',
    pretrained='datacomp_xl_s13b_b90k',
    device='cuda'
)
  • laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K
  • laion/CLIP-ViT-H-14-laion2B-s32B-b79K
  • laion/CLIP-ViT-bigG-14-laion2B-39B-b160k
  • apple/DFN5B-CLIP-ViT-H-14
  • timm/ViT-SO400M-14-SigLIP-384
HuggingFace models automatically download config and weights. The hf-hub: prefix explicitly specifies HuggingFace as the source.

Multilingual Models

Models trained on multilingual datasets:
# XLM-RoBERTa based models
model, _, preprocess = open_clip.create_model_and_transforms(
    'xlm-roberta-base-ViT-B-32',
    pretrained='laion5b_s13b_b90k',
    device='cuda'
)

# Supports 100+ languages
tokenizer = open_clip.get_tokenizer('xlm-roberta-base-ViT-B-32')
Available multilingual models:
  • xlm-roberta-base-ViT-B-32
  • xlm-roberta-large-ViT-H-14
  • ViT-B-16-SigLIP-i18n-256 (SigLIP multilingual)

Specialized Models

CoCa (Captioning)

Models with generative capabilities:
model, _, preprocess = open_clip.create_model_and_transforms(
    'coca_ViT-L-14',
    pretrained='mscoco_finetuned_laion2b_s13b_b90k'
)

# Generate captions
import torch
from PIL import Image

im = preprocess(Image.open("cat.jpg")).unsqueeze(0)
with torch.no_grad(), torch.cuda.amp.autocast():
    generated = model.generate(im)
print(open_clip.decode(generated[0]))

DFN Models

Apple’s Data Filtering Networks (DFN):
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-H-14-quickgelu',  # DFN checkpoints use the QuickGELU variant
    pretrained='dfn5b',  # 83.4% ImageNet
    device='cuda'
)

Performance Metrics

For detailed zero-shot performance across 38 datasets, see the full results CSV.

Benchmark Datasets

  • ImageNet: Primary zero-shot benchmark
  • ImageNet variants: -A, -R, -Sketch, -V2
  • Object recognition: CIFAR-10/100, STL-10, Food101
  • Fine-grained: Flowers102, Pets, Cars, Aircraft
  • Scene: SUN397, Places365
  • Action: UCF101, Kinetics700

Example: Finding Best Model

import open_clip

# Find all ViT-L-14 pretrained versions
tags = open_clip.list_pretrained_tags_by_model('ViT-L-14')
print(f"Available ViT-L-14 models: {tags}")

# Load the best performing one
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-L-14',
    pretrained='datacomp_xl_s13b_b90k',  # 79.2% ImageNet
    device='cuda',
    precision='fp16'
)

model.eval()
print(f"Model loaded with {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
