Overview

The zero_shot_metadata module provides pre-defined text templates and class names for zero-shot image classification. These templates are used with build_zero_shot_classifier to create robust classifiers without training. Source: src/open_clip/zero_shot_metadata.py

Template Collections

OPENAI_IMAGENET_TEMPLATES

OPENAI_IMAGENET_TEMPLATES: Tuple[Callable[[str], str], ...]
A comprehensive collection of 80 text templates derived from OpenAI’s CLIP research. These templates provide diverse contextual variations to improve classification robustness. Examples of templates:
  • lambda c: f'a photo of a {c}.'
  • lambda c: f'a bad photo of a {c}.'
  • lambda c: f'a photo of many {c}.'
  • lambda c: f'a sculpture of a {c}.'
  • lambda c: f'a low resolution photo of the {c}.'
  • lambda c: f'a rendering of a {c}.'
  • lambda c: f'graffiti of a {c}.'
  • lambda c: f'a cropped photo of the {c}.'
  • lambda c: f'a bright photo of a {c}.'
  • lambda c: f'a dark photo of the {c}.'
  • lambda c: f'a black and white photo of the {c}.'
  • lambda c: f'a painting of the {c}.'
  • lambda c: f'a {c} in a video game.'
  • lambda c: f'itap of a {c}.' (“I took a picture of”)
Total: 80 templates covering various visual styles, conditions, and contexts. Source: src/open_clip/zero_shot_metadata.py:2

SIMPLE_IMAGENET_TEMPLATES

SIMPLE_IMAGENET_TEMPLATES: Tuple[Callable[[str], str], ...]
A smaller, curated subset of 7 templates from the OpenAI CLIP Prompt Engineering notebook. This provides a good balance between accuracy and computational efficiency. Templates:
SIMPLE_IMAGENET_TEMPLATES = (
    lambda c: f'itap of a {c}.',
    lambda c: f'a bad photo of the {c}.',
    lambda c: f'a origami {c}.',
    lambda c: f'a photo of the large {c}.',
    lambda c: f'a {c} in a video game.',
    lambda c: f'art of the {c}.',
    lambda c: f'a photo of the small {c}.',
)
Source: src/open_clip/zero_shot_metadata.py:88 Reference: OpenAI CLIP Prompt Engineering Notebook

Class Names

IMAGENET_CLASSNAMES

IMAGENET_CLASSNAMES: Tuple[str, ...]
Complete list of 1,000 ImageNet class names in the standard ImageNet-1K order. These are human-readable labels corresponding to ImageNet synsets. Examples:
  • "tench", "goldfish", "great white shark"
  • "tabby cat", "tiger cat", "Persian cat"
  • "golden retriever", "labrador retriever"
  • "laptop computer", "desktop computer"
  • "pizza", "cheeseburger", "ice cream"
Total: 1,000 class names covering animals, objects, vehicles, food, and more. Source: src/open_clip/zero_shot_metadata.py:99
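Because the class names are stored in a plain tuple in a fixed order, building a name-to-index lookup is a one-liner. The snippet below uses a tiny stand-in tuple so it runs anywhere; with open_clip installed, substitute IMAGENET_CLASSNAMES directly.

```python
# Stand-in for IMAGENET_CLASSNAMES (the real tuple has 1,000 entries
# in standard ImageNet-1K order).
CLASSNAMES = ("tench", "goldfish", "great white shark")

# Map each human-readable label to its ImageNet-1K index.
name_to_index = {name: i for i, name in enumerate(CLASSNAMES)}

print(name_to_index["goldfish"])  # 1
```

The index recovered this way matches the column position of that class in a classifier built from the same tuple.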

Usage Examples

Using SIMPLE_IMAGENET_TEMPLATES

import open_clip
from open_clip.zero_shot_classifier import build_zero_shot_classifier
from open_clip.zero_shot_metadata import SIMPLE_IMAGENET_TEMPLATES

# Load model
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='openai')
tokenizer = open_clip.get_tokenizer('ViT-B-32')
model = model.to('cuda')

# Define custom classes with simple templates
my_classes = ['cat', 'dog', 'car', 'airplane']
classifier = build_zero_shot_classifier(
    model,
    tokenizer,
    my_classes,
    SIMPLE_IMAGENET_TEMPLATES,
    device='cuda'
)

print(f"Created classifier with {len(SIMPLE_IMAGENET_TEMPLATES)} templates")
# Output: Created classifier with 7 templates

Full ImageNet Classification

import torch
import open_clip
from PIL import Image
from open_clip.zero_shot_classifier import build_zero_shot_classifier
from open_clip.zero_shot_metadata import (
    IMAGENET_CLASSNAMES,
    OPENAI_IMAGENET_TEMPLATES
)

# Setup model
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-L-14',
    pretrained='openai'
)
tokenizer = open_clip.get_tokenizer('ViT-L-14')
device = 'cuda'
model = model.to(device)

# Build full ImageNet classifier
print(f"Building classifier for {len(IMAGENET_CLASSNAMES)} classes...")
classifier = build_zero_shot_classifier(
    model,
    tokenizer,
    IMAGENET_CLASSNAMES,
    OPENAI_IMAGENET_TEMPLATES,
    num_classes_per_batch=50,
    device=device,
    use_tqdm=True
)

print(f"Classifier shape: {classifier.shape}")  # (768, 1000) for ViT-L-14

# Classify an image
image = preprocess(Image.open('cat.jpg')).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image, normalize=True)
    logits = 100.0 * image_features @ classifier
    probs = logits.softmax(dim=-1)

# Get top 5 predictions
top5_probs, top5_indices = probs[0].topk(5)
for i, (prob, idx) in enumerate(zip(top5_probs, top5_indices)):
    print(f"{i+1}. {IMAGENET_CLASSNAMES[idx]}: {prob.item():.2%}")

Comparing Template Sets

import torch
import open_clip
from open_clip.zero_shot_classifier import build_zero_shot_classifier
from open_clip.zero_shot_metadata import (
    SIMPLE_IMAGENET_TEMPLATES,
    OPENAI_IMAGENET_TEMPLATES
)

model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='openai')
tokenizer = open_clip.get_tokenizer('ViT-B-32')
model = model.to('cuda')

classes = ['cat', 'dog', 'bird']

# Simple templates (7 templates)
simple_classifier = build_zero_shot_classifier(
    model, tokenizer, classes, SIMPLE_IMAGENET_TEMPLATES, device='cuda'
)

# Full templates (80 templates)
full_classifier = build_zero_shot_classifier(
    model, tokenizer, classes, OPENAI_IMAGENET_TEMPLATES, device='cuda'
)

print(f"Simple templates: {len(SIMPLE_IMAGENET_TEMPLATES)}")
print(f"Full templates: {len(OPENAI_IMAGENET_TEMPLATES)}")
# Simple templates: 7
# Full templates: 80

Custom Classes with Pre-defined Templates

from open_clip.zero_shot_classifier import build_zero_shot_classifier
from open_clip.zero_shot_metadata import SIMPLE_IMAGENET_TEMPLATES
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-16', pretrained='openai')
tokenizer = open_clip.get_tokenizer('ViT-B-16')
model = model.to('cuda')  # model must be on the same device passed to the classifier builder

# Domain-specific classes with general templates
medical_classes = [
    'chest x-ray',
    'brain MRI',
    'ultrasound',
    'CT scan'
]

classifier = build_zero_shot_classifier(
    model,
    tokenizer,
    medical_classes,
    SIMPLE_IMAGENET_TEMPLATES,  # Works for domain-specific classes too
    device='cuda'
)

Creating Custom Template Variants

from open_clip.zero_shot_metadata import SIMPLE_IMAGENET_TEMPLATES

# Convert to list for modification
custom_templates = list(SIMPLE_IMAGENET_TEMPLATES)

# Add domain-specific templates
custom_templates.extend([
    lambda c: f'a high quality photo of a {c}.',
    lambda c: f'a professional photo of a {c}.',
])

print(f"Total templates: {len(custom_templates)}")
# Total templates: 9

# model, tokenizer, and classnames are assumed to be defined as in the earlier examples
classifier = build_zero_shot_classifier(
    model,
    tokenizer,
    classnames,
    custom_templates,
    device='cuda'
)

Inspecting Template Output

from open_clip.zero_shot_metadata import SIMPLE_IMAGENET_TEMPLATES

# See what prompts are generated for a class
class_name = "cat"
print(f"Prompts for '{class_name}':")
for i, template in enumerate(SIMPLE_IMAGENET_TEMPLATES, 1):
    print(f"{i}. {template(class_name)}")

# Output:
# Prompts for 'cat':
# 1. itap of a cat.
# 2. a bad photo of the cat.
# 3. a origami cat.
# 4. a photo of the large cat.
# 5. a cat in a video game.
# 6. art of the cat.
# 7. a photo of the small cat.

Template Design Notes

Why Multiple Templates?

Using multiple templates improves classification robustness by:
  1. Handling ambiguity - Different phrasings capture different aspects of a concept
  2. Averaging out noise - Multiple templates reduce sensitivity to specific wording
  3. Covering variations - Templates account for different visual presentations (size, quality, style)
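The averaging idea above can be sketched without any model at all. The snippet below uses NumPy with random unit vectors standing in for real text embeddings (`fake_encode_text` is a hypothetical stand-in, not an open_clip API); the point is the mechanics — embed every prompt for a class, average the embeddings, then re-normalize, since the mean of unit vectors is not itself unit length.

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_encode_text(prompt: str) -> np.ndarray:
    # Stand-in for a real text encoder: returns a unit-norm 512-d vector.
    v = rng.standard_normal(512)
    return v / np.linalg.norm(v)

templates = (
    lambda c: f'a photo of a {c}.',
    lambda c: f'a bad photo of a {c}.',
    lambda c: f'a {c} in a video game.',
)

# One embedding per template-generated prompt for the class "cat".
embeddings = np.stack([fake_encode_text(t('cat')) for t in templates])

# Average over templates, then re-normalize to unit length.
class_embedding = embeddings.mean(axis=0)
class_embedding /= np.linalg.norm(class_embedding)

print(class_embedding.shape)  # (512,)
```

Stacking one such averaged vector per class gives the (embed_dim, num_classes) weight matrix used in the classification examples above.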

Template Selection

  • SIMPLE_IMAGENET_TEMPLATES: Use for faster inference with good accuracy (7 templates)
  • OPENAI_IMAGENET_TEMPLATES: Use for best accuracy when computation time allows (80 templates)
  • Custom templates: Create domain-specific templates for specialized applications

Performance Trade-offs

# Computation scales with number of templates
num_prompts = len(classnames) * len(templates)

# Example:
# 1000 classes × 7 templates = 7,000 text encodings
# 1000 classes × 80 templates = 80,000 text encodings
For large-scale applications, SIMPLE_IMAGENET_TEMPLATES is usually the better starting point: it cuts text-encoding cost by more than 10x relative to the full set, at a modest accuracy cost.
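The prompt counts above translate directly into the number of text-encoder forward passes. A quick back-of-envelope sketch (the batch size of 256 is an illustrative assumption, not an open_clip default):

```python
import math

num_classes = 1000
batch_size = 256  # assumed text-encoder batch size

def num_text_batches(num_templates: int) -> int:
    """Forward passes needed to encode every class/template prompt."""
    prompts = num_classes * num_templates
    return math.ceil(prompts / batch_size)

print(num_text_batches(7))   # 28  batches for SIMPLE_IMAGENET_TEMPLATES
print(num_text_batches(80))  # 313 batches for OPENAI_IMAGENET_TEMPLATES
```

Note this cost is paid once per classifier build; after that, inference cost depends only on the number of classes, not the number of templates.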
