
Quickstart

This guide will get you up and running with OpenCLIP in just a few minutes. You’ll learn how to load a pretrained model and perform zero-shot image classification.

Prerequisites

Make sure you have OpenCLIP installed:
pip install open_clip_torch

Basic Usage

Here’s a complete example of loading a model and classifying an image:
Step 1: Import libraries

Import OpenCLIP and required dependencies:
import torch
from PIL import Image
import open_clip
Step 2: Load model and preprocessing

Create a model with pretrained weights and get the preprocessing transform:
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32',
    pretrained='laion2b_s34b_b79k'
)
model.eval()  # Set to evaluation mode

# Get the tokenizer for text
tokenizer = open_clip.get_tokenizer('ViT-B-32')

Note: Models are in training mode by default, which affects BatchNorm and dropout layers. Always call model.eval() for inference.
Step 3: Prepare image and text

Load and preprocess an image, then tokenize text labels:
# Load and preprocess an image
image = preprocess(Image.open("path/to/image.jpg")).unsqueeze(0)

# Define candidate labels
text = tokenizer(["a diagram", "a dog", "a cat"])
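The tokenizer turns each string into a fixed-length row of integer token IDs (CLIP-style models use a context length of 77), so `text` is a `(3, 77)` tensor. The padding step can be sketched in plain Python; the token IDs below are illustrative placeholders, not guaranteed to match the real vocabulary:

```python
# Sketch: pad variable-length token-ID lists to a fixed context length,
# the way a CLIP tokenizer produces a (batch, 77) batch of IDs.
CONTEXT_LENGTH = 77

def pad_to_context(token_ids, context_length=CONTEXT_LENGTH):
    """Truncate or zero-pad a list of token IDs to context_length."""
    ids = token_ids[:context_length]
    return ids + [0] * (context_length - len(ids))

# Illustrative IDs: <start> word(s) <end>, then zero padding
batch = [pad_to_context([49406, 320, 1929, 49407]),   # "a dog"
         pad_to_context([49406, 320, 2368, 49407])]   # "a cat"

print(len(batch), len(batch[0]))  # 2 rows, each 77 IDs long
```

Every row has the same length regardless of caption length, which is what lets the model batch captions together.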
Step 4: Compute embeddings and similarity

Run inference to get image-text similarity scores:
with torch.no_grad():  # add torch.autocast("cuda") once model and inputs are on the GPU
    # Encode image and text
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    
    # Normalize features
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    
    # Calculate similarity and get probabilities
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probabilities:", text_probs)
# Output: Label probabilities: tensor([[0.9927, 0.0038, 0.0035]])
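The `100.0 *` factor plays the role of CLIP's logit scale: cosine similarities between normalized embeddings land in a narrow range, and scaling them before the softmax is what produces a peaked distribution. A plain-Python sketch with made-up similarity scores shows the effect:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative cosine similarities for ["a diagram", "a dog", "a cat"]
sims = [0.28, 0.22, 0.21]

raw = softmax(sims)                           # unscaled: nearly uniform
scaled = softmax([100.0 * s for s in sims])   # scaled: sharply peaked

print([round(p, 3) for p in raw])
print([round(p, 3) for p in scaled])
```

Without the scale, a 0.06 gap in cosine similarity barely moves the probabilities; with it, the top label dominates, which is why the quickstart output is so confident.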
Step 5: Interpret results

Get the most likely label:
# Get the top prediction
labels = ["a diagram", "a dog", "a cat"]
top_prob, top_idx = text_probs[0].max(dim=0)

print(f"Predicted: {labels[top_idx]} ({top_prob.item():.1%} confidence)")
# Output: Predicted: a diagram (99.3% confidence)

Complete Example

Here’s the full code in one block:
import torch
from PIL import Image
import open_clip

# Load model and preprocessing
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32',
    pretrained='laion2b_s34b_b79k'
)
model.eval()
tokenizer = open_clip.get_tokenizer('ViT-B-32')

# Prepare inputs
image = preprocess(Image.open("docs/CLIP.png")).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat"])

# Run inference
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # the "a diagram" probability should dominate

Exploring Available Models

OpenCLIP provides 80+ pretrained models. List them all:
import open_clip

# Get all available models and their pretrained variants
models = open_clip.list_pretrained()

# Print first 10 (model_name, pretrained_tag) pairs
for model_name, pretrained in models[:10]:
    print(f"{model_name}: {pretrained}")
Example output:
ViT-B-32: openai
ViT-B-32: laion400m_e31
ViT-B-32: laion400m_e32
ViT-B-32: laion2b_e16
ViT-B-32: laion2b_s34b_b79k
ViT-B-16: openai
ViT-B-16: laion400m_e31
ViT-B-16: laion400m_e32
ViT-L-14: openai
ViT-L-14: laion400m_e31
Each model can have multiple pretrained versions trained on different datasets (OpenAI, LAION-400M, LAION-2B, DataComp) with different training configurations.
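Because `list_pretrained()` returns `(model_name, pretrained_tag)` pairs, grouping the tags by model is a one-liner with a dict. A sketch using the pairs from the example output above:

```python
from collections import defaultdict

# (model_name, pretrained_tag) pairs, as in the example output above
pairs = [
    ("ViT-B-32", "openai"),
    ("ViT-B-32", "laion400m_e31"),
    ("ViT-B-32", "laion2b_s34b_b79k"),
    ("ViT-B-16", "openai"),
    ("ViT-L-14", "openai"),
]

# Group pretrained tags under their model architecture
tags_by_model = defaultdict(list)
for model_name, tag in pairs:
    tags_by_model[model_name].append(tag)

print(dict(tags_by_model))
```

Recent versions of OpenCLIP also ship helpers such as `open_clip.list_pretrained_tags_by_model(...)`; check the API of your installed version before relying on it.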

Using Different Models

Switch to a different model architecture or pretrained variant:
# Original OpenAI CLIP ViT-L/14
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-L-14',
    pretrained='openai'
)

GPU Acceleration

For faster inference, move the model to GPU:
import torch
from PIL import Image
import open_clip

device = "cuda" if torch.cuda.is_available() else "cpu"

model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32',
    pretrained='openai',
    device=device
)
model.eval()

tokenizer = open_clip.get_tokenizer('ViT-B-32')

# Prepare inputs and move to GPU
image = preprocess(Image.open("image.jpg")).unsqueeze(0).to(device)
text = tokenizer(["a dog", "a cat"]).to(device)

# Run inference with automatic mixed precision
with torch.no_grad(), torch.autocast(device):
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(similarity)

Loading Local Checkpoints

You can load models from local files instead of downloading:
# Load from local checkpoint
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32',
    pretrained='/path/to/checkpoint.pt'
)

Loading from Hugging Face

Load models directly from the Hugging Face Hub:
# Download and load from Hugging Face
model, _, preprocess = open_clip.create_model_and_transforms(
    'hf-hub:laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K'
)
The first time you load a model, it will be downloaded and cached. Subsequent loads will use the cached version.

Common Use Cases

Text-to-Image Retrieval

Find the most similar image from a collection:
import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32', pretrained='openai'
)
model.eval()
tokenizer = open_clip.get_tokenizer('ViT-B-32')

# Encode a text query
text = tokenizer(["a photo of a sunset"])
with torch.no_grad():
    text_features = model.encode_text(text)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# Encode multiple images
images = [preprocess(Image.open(f"image_{i}.jpg")).unsqueeze(0) for i in range(5)]
images = torch.cat(images)

with torch.no_grad():
    image_features = model.encode_image(images)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    
    # Compute similarity
    similarity = (100.0 * text_features @ image_features.T)
    
# Get most similar image
best_idx = similarity.argmax().item()
print(f"Most similar image: image_{best_idx}.jpg")
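The retrieval step above boils down to a dot product between L2-normalized vectors followed by an argmax. A self-contained sketch with tiny made-up embeddings standing in for CLIP features:

```python
import math

def normalize(v):
    """L2-normalize a vector so dot products become cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Made-up 3-d embeddings standing in for CLIP features
query = normalize([0.2, 0.9, 0.1])                     # text: "a photo of a sunset"
gallery = [normalize(v) for v in ([0.9, 0.1, 0.1],     # image_0
                                  [0.1, 0.8, 0.2],     # image_1
                                  [0.3, 0.3, 0.9])]    # image_2

# Cosine similarity of the query against each gallery image
scores = [dot(query, img) for img in gallery]
best_idx = max(range(len(scores)), key=scores.__getitem__)
print(f"Most similar image: image_{best_idx}.jpg")
```

Because both sides are normalized, the dot product is the cosine similarity, so the `100.0 *` scale in the real code changes nothing about which image wins the argmax.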

Zero-Shot Classification

Classify images without training examples:
from PIL import Image
import open_clip
import torch

model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32', pretrained='openai'
)
model.eval()
tokenizer = open_clip.get_tokenizer('ViT-B-32')

# Define your custom classes
classes = [
    "a photo of a cat",
    "a photo of a dog", 
    "a photo of a bird",
    "a photo of a fish",
    "a photo of a horse"
]

image = preprocess(Image.open("animal.jpg")).unsqueeze(0)
text = tokenizer(classes)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)[0]

# Print results
for class_name, prob in zip(classes, probs):
    print(f"{class_name}: {prob.item():.2%}")
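The "a photo of a ..." phrasing above is a prompt template: wrapping bare class names in a natural-language sentence typically improves zero-shot accuracy over using the raw names. A sketch for generating the prompt list from class names (the template wording is an example; adapt it to your domain):

```python
# Build prompt strings from bare class names using a template.
# "a photo of a {}" matches the labels used above; variations such as
# "a blurry photo of a {}" can be combined for prompt ensembling.
template = "a photo of a {}"
class_names = ["cat", "dog", "bird", "fish", "horse"]

prompts = [template.format(name) for name in class_names]
print(prompts)
```

The resulting strings are what you pass to the tokenizer in place of hand-written labels, e.g. `text = tokenizer(prompts)`.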

Next Steps

Now that you understand the basics, explore more advanced topics:
  • Model Zoo: Browse all available pretrained models and their performance
  • Fine-tuning: Learn how to fine-tune models on your own datasets
  • Training: Train CLIP models from scratch on custom data
  • Advanced Usage: Batch processing, custom preprocessing, and optimization techniques
For computing billions of embeddings efficiently, check out clip-retrieval which has OpenCLIP support.
