
Quickstart

This guide will get you up and running with OpenCLIP in just a few minutes. You’ll learn how to load a pretrained model and perform zero-shot image classification.

Prerequisites

Make sure you have OpenCLIP installed:
pip install open_clip_torch

Basic Usage

Here’s a complete example of loading a model and classifying an image:
Step 1: Import libraries

Import OpenCLIP and required dependencies:
import torch
from PIL import Image
import open_clip
Step 2: Load model and preprocessing

Create a model with pretrained weights and get the preprocessing transform:
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32',
    pretrained='laion2b_s34b_b79k'
)
model.eval()  # Set to evaluation mode

# Get the tokenizer for text
tokenizer = open_clip.get_tokenizer('ViT-B-32')

Note: Models are in training mode by default, which affects BatchNorm and dropout layers. Always call model.eval() for inference.
Step 3: Prepare image and text

Load and preprocess an image, then tokenize text labels:
# Load and preprocess an image
image = preprocess(Image.open("path/to/image.jpg")).unsqueeze(0)

# Define candidate labels
text = tokenizer(["a diagram", "a dog", "a cat"])
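The tokenizer turns each string into a fixed-length row of integer token IDs (CLIP-style models use a context length of 77), so `text` is a `(3, 77)` tensor. The padding step can be sketched in plain Python; the token IDs below are illustrative placeholders, not guaranteed to match the real vocabulary:

```python
# Sketch: pad variable-length token-ID lists to a fixed context length,
# the way a CLIP tokenizer produces a (batch, 77) batch of IDs.
CONTEXT_LENGTH = 77

def pad_to_context(token_ids, context_length=CONTEXT_LENGTH):
    """Truncate or zero-pad a list of token IDs to context_length."""
    ids = token_ids[:context_length]
    return ids + [0] * (context_length - len(ids))

# Illustrative IDs: <start> word(s) <end>, then zero padding
batch = [pad_to_context([49406, 320, 1929, 49407]),   # "a dog"
         pad_to_context([49406, 320, 2368, 49407])]   # "a cat"

print(len(batch), len(batch[0]))  # 2 rows, each 77 IDs long
```

Every row has the same length regardless of caption length, which is what lets the model batch captions together.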
Step 4: Compute embeddings and similarity

Run inference to get image-text similarity scores:
with torch.no_grad():  # add torch.autocast("cuda") once model and inputs are on the GPU
    # Encode image and text
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    
    # Normalize features
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    
    # Calculate similarity and get probabilities
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probabilities:", text_probs)
# Output: Label probabilities: tensor([[0.9927, 0.0038, 0.0035]])
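The `100.0 *` factor plays the role of CLIP's logit scale: cosine similarities between normalized embeddings land in a narrow range, and scaling them before the softmax is what produces a peaked distribution. A plain-Python sketch with made-up similarity scores shows the effect:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative cosine similarities for ["a diagram", "a dog", "a cat"]
sims = [0.28, 0.22, 0.21]

raw = softmax(sims)                           # unscaled: nearly uniform
scaled = softmax([100.0 * s for s in sims])   # scaled: sharply peaked

print([round(p, 3) for p in raw])
print([round(p, 3) for p in scaled])
```

Without the scale, a 0.06 gap in cosine similarity barely moves the probabilities; with it, the top label dominates, which is why the quickstart output is so confident.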
Step 5: Interpret results

Get the most likely label:
# Get the top prediction
labels = ["a diagram", "a dog", "a cat"]
top_prob, top_idx = text_probs[0].max(dim=0)

print(f"Predicted: {labels[top_idx]} ({top_prob.item():.1%} confidence)")
# Output: Predicted: a diagram (99.3% confidence)

Complete Example

Here’s the full code in one block:
import torch
from PIL import Image
import open_clip

# Load model and preprocessing
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32',
    pretrained='laion2b_s34b_b79k'
)
model.eval()
tokenizer = open_clip.get_tokenizer('ViT-B-32')

# Prepare inputs
image = preprocess(Image.open("docs/CLIP.png")).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat"])

# Run inference
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # the "a diagram" probability should dominate

Exploring Available Models

OpenCLIP provides 80+ pretrained models. List them all:
import open_clip

# Get all available models and their pretrained variants
models = open_clip.list_pretrained()

# Print first 10 (model_name, pretrained_tag) pairs
for model_name, pretrained in models[:10]:
    print(f"{model_name}: {pretrained}")
Example output:
ViT-B-32: openai
ViT-B-32: laion400m_e31
ViT-B-32: laion400m_e32
ViT-B-32: laion2b_e16
ViT-B-32: laion2b_s34b_b79k
ViT-B-16: openai
ViT-B-16: laion400m_e31
ViT-B-16: laion400m_e32
ViT-L-14: openai
ViT-L-14: laion400m_e31
Each model can have multiple pretrained versions trained on different datasets (OpenAI, LAION-400M, LAION-2B, DataComp) with different training configurations.
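Because `list_pretrained()` returns `(model_name, pretrained_tag)` pairs, grouping the tags by model is a one-liner with a dict. A sketch using the pairs from the example output above:

```python
from collections import defaultdict

# (model_name, pretrained_tag) pairs, as in the example output above
pairs = [
    ("ViT-B-32", "openai"),
    ("ViT-B-32", "laion400m_e31"),
    ("ViT-B-32", "laion2b_s34b_b79k"),
    ("ViT-B-16", "openai"),
    ("ViT-L-14", "openai"),
]

# Group pretrained tags under their model architecture
tags_by_model = defaultdict(list)
for model_name, tag in pairs:
    tags_by_model[model_name].append(tag)

print(dict(tags_by_model))
```

Recent versions of OpenCLIP also ship helpers such as `open_clip.list_pretrained_tags_by_model(...)`; check the API of your installed version before relying on it.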

Using Different Models

Switch to a different model architecture or pretrained variant:
# Original OpenAI CLIP ViT-L/14
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-L-14',
    pretrained='openai'
)

GPU Acceleration

For faster inference, move the model to GPU:
import torch
from PIL import Image
import open_clip

device = "cuda" if torch.cuda.is_available() else "cpu"

model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32',
    pretrained='openai',
    device=device
)
model.eval()

tokenizer = open_clip.get_tokenizer('ViT-B-32')

# Prepare inputs and move to GPU
image = preprocess(Image.open("image.jpg")).unsqueeze(0).to(device)
text = tokenizer(["a dog", "a cat"]).to(device)

# Run inference with automatic mixed precision
with torch.no_grad(), torch.autocast(device):
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(similarity)

Loading Local Checkpoints

You can load models from local files instead of downloading:
# Load from local checkpoint
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32',
    pretrained='/path/to/checkpoint.pt'
)

Loading from Hugging Face

Load models directly from the Hugging Face Hub:
# Download and load from Hugging Face
model, _, preprocess = open_clip.create_model_and_transforms(
    'hf-hub:laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K'
)
The first time you load a model, it will be downloaded and cached. Subsequent loads will use the cached version.

Common Use Cases

Text-to-Image Retrieval

Find the most similar image from a collection:
import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32', pretrained='openai'
)
model.eval()
tokenizer = open_clip.get_tokenizer('ViT-B-32')

# Encode a text query
text = tokenizer(["a photo of a sunset"])
with torch.no_grad():
    text_features = model.encode_text(text)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# Encode multiple images
images = [preprocess(Image.open(f"image_{i}.jpg")).unsqueeze(0) for i in range(5)]
images = torch.cat(images)

with torch.no_grad():
    image_features = model.encode_image(images)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    
    # Compute similarity
    similarity = (100.0 * text_features @ image_features.T)
    
# Get most similar image
best_idx = similarity.argmax().item()
print(f"Most similar image: image_{best_idx}.jpg")
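The retrieval step above boils down to a dot product between L2-normalized vectors followed by an argmax. A self-contained sketch with tiny made-up embeddings standing in for CLIP features:

```python
import math

def normalize(v):
    """L2-normalize a vector so dot products become cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Made-up 3-d embeddings standing in for CLIP features
query = normalize([0.2, 0.9, 0.1])                     # text: "a photo of a sunset"
gallery = [normalize(v) for v in ([0.9, 0.1, 0.1],     # image_0
                                  [0.1, 0.8, 0.2],     # image_1
                                  [0.3, 0.3, 0.9])]    # image_2

# Cosine similarity of the query against each gallery image
scores = [dot(query, img) for img in gallery]
best_idx = max(range(len(scores)), key=scores.__getitem__)
print(f"Most similar image: image_{best_idx}.jpg")
```

Because both sides are normalized, the dot product is the cosine similarity, so the `100.0 *` scale in the real code changes nothing about which image wins the argmax.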

Zero-Shot Classification

Classify images without training examples:
from PIL import Image
import open_clip
import torch

model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32', pretrained='openai'
)
model.eval()
tokenizer = open_clip.get_tokenizer('ViT-B-32')

# Define your custom classes
classes = [
    "a photo of a cat",
    "a photo of a dog", 
    "a photo of a bird",
    "a photo of a fish",
    "a photo of a horse"
]

image = preprocess(Image.open("animal.jpg")).unsqueeze(0)
text = tokenizer(classes)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)[0]

# Print results
for class_name, prob in zip(classes, probs):
    print(f"{class_name}: {prob.item():.2%}")
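The "a photo of a ..." phrasing above is a prompt template: wrapping bare class names in a natural-language sentence typically improves zero-shot accuracy over using the raw names. A sketch for generating the prompt list from class names (the template wording is an example; adapt it to your domain):

```python
# Build prompt strings from bare class names using a template.
# "a photo of a {}" matches the labels used above; variations such as
# "a blurry photo of a {}" can be combined for prompt ensembling.
template = "a photo of a {}"
class_names = ["cat", "dog", "bird", "fish", "horse"]

prompts = [template.format(name) for name in class_names]
print(prompts)
```

The resulting strings are what you pass to the tokenizer in place of hand-written labels, e.g. `text = tokenizer(prompts)`.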

Next Steps

Now that you understand the basics, explore more advanced topics:
  • Model Zoo: Browse all available pretrained models and their performance
  • Fine-tuning: Learn how to fine-tune models on your own datasets
  • Training: Train CLIP models from scratch on custom data
  • Advanced Usage: Batch processing, custom preprocessing, and optimization techniques
For computing billions of embeddings efficiently, check out clip-retrieval which has OpenCLIP support.
