Overview

The build_zero_shot_classifier function creates zero-shot classification weights by encoding class names with multiple text templates. This is a core component of CLIP’s zero-shot classification capabilities, allowing you to classify images into arbitrary categories without training.

Function Signature

def build_zero_shot_classifier(
    model,
    tokenizer,
    classnames: Sequence[str],
    templates: Sequence[Union[Callable, str]],
    num_classes_per_batch: Optional[int] = 10,
    device: Union[str, torch.device] = 'cpu',
    use_tqdm: bool = False,
) -> torch.Tensor
Source: src/open_clip/zero_shot_classifier.py:21

Parameters

model
CLIP model instance
required
The CLIP model instance used for encoding text. Must have an encode_text method.
tokenizer
tokenizer instance
required
The CLIP tokenizer instance for converting text to tokens.
classnames
Sequence[str]
required
A sequence of class (label) names to create the classifier for. For example: ["cat", "dog", "bird"]
templates
Sequence[Union[Callable, str]]
required
A sequence of callables or format-friendly strings to produce text prompts per class name.
  • If strings: Use format syntax like "a photo of a {}"
  • If callables: Lambda functions like lambda c: f"a photo of a {c}"
See zero_shot_metadata for pre-defined template sets.
num_classes_per_batch
Optional[int]
default: 10
The number of classes to batch together in each forward pass. Set to None to process all classes at once. Batching is useful for managing memory usage with large numbers of classes.
device
Union[str, torch.device]
default: 'cpu'
Device to use for computation. Examples: 'cpu', 'cuda', 'cuda:0'
use_tqdm
bool
default: False
Enable TQDM progress bar to track processing of class batches.
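As a quick illustration of the two template forms described above, string templates are expanded with str.format while callables are applied directly; both produce identical prompts for a given class name:

```python
# String templates vs. callable templates: two ways to express the same prompts.
string_templates = ['a photo of a {}', 'a picture of a {}']
callable_templates = [lambda c: f'a photo of a {c}', lambda c: f'a picture of a {c}']

name = 'cat'
from_strings = [t.format(name) for t in string_templates]
from_callables = [t(name) for t in callable_templates]
print(from_strings == from_callables)  # prints: True
```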

Returns

zeroshot_weights
torch.Tensor
A tensor of shape (embedding_dim, num_classes) containing the normalized classifier weights. Each column represents the averaged and normalized text embeddings for one class across all templates.

How It Works

  1. Template Expansion: Each class name is combined with each template to create multiple text prompts
  2. Batch Processing: Classes are processed in batches for memory efficiency
  3. Text Encoding: Each text prompt is tokenized and encoded using the model’s text encoder
  4. Averaging: For each class, embeddings from all templates are averaged
  5. Normalization: The averaged embeddings are L2-normalized
  6. Transposition: The result is transposed to shape (embedding_dim, num_classes) for use as classifier weights
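The steps above can be sketched in plain Python. This is a toy walk-through, not the library implementation: toy_encode_text stands in for the model's real text encoder, and lists stand in for tensors, but the expansion, averaging, normalization, and transposition mirror the real flow and shapes.

```python
# Toy sketch of the classifier-building pipeline. toy_encode_text is a
# stand-in for model.encode_text (deterministic pseudo-embeddings).
import hashlib
import math

EMBED_DIM = 8

def toy_encode_text(prompt):
    # Stand-in encoder: derive a deterministic vector from the prompt text,
    # then L2-normalize it (mirroring per-prompt normalization).
    digest = hashlib.sha256(prompt.encode()).digest()
    vec = [b / 255.0 for b in digest[:EMBED_DIM]]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def build_toy_classifier(classnames, templates):
    columns = []
    for name in classnames:
        # 1. Template expansion: one prompt per template.
        prompts = [t.format(name) for t in templates]
        # 3. Encode each prompt (normalized inside the encoder).
        embs = [toy_encode_text(p) for p in prompts]
        # 4. Average the template embeddings per class.
        mean = [sum(dim_vals) / len(embs) for dim_vals in zip(*embs)]
        # 5. Re-normalize the averaged embedding.
        norm = math.sqrt(sum(v * v for v in mean))
        columns.append([v / norm for v in mean])
    # 6. Transpose: rows = embedding dims, columns = classes.
    return [list(row) for row in zip(*columns)]

weights = build_toy_classifier(['cat', 'dog'], ['a photo of a {}', 'a sketch of a {}'])
print(len(weights), len(weights[0]))  # prints: 8 2  -> (embedding_dim, num_classes)
```

Each column of the result has unit norm, which is what makes a plain matrix multiplication with normalized image features equivalent to cosine similarity.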

Usage Examples

Basic Usage

import torch
import open_clip
from open_clip.zero_shot_classifier import build_zero_shot_classifier

# Load model and tokenizer
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='openai')
tokenizer = open_clip.get_tokenizer('ViT-B-32')

# Define classes and templates
classnames = ['cat', 'dog', 'bird', 'fish']
templates = [
    'a photo of a {}',
    'a picture of a {}',
    'an image of a {}',
]

# Build classifier
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device)
classifier = build_zero_shot_classifier(
    model,
    tokenizer,
    classnames,
    templates,
    device=device
)

print(f"Classifier shape: {classifier.shape}")  # (embedding_dim, 4)

Using Pre-defined Templates

from open_clip.zero_shot_metadata import SIMPLE_IMAGENET_TEMPLATES, IMAGENET_CLASSNAMES
from open_clip.zero_shot_classifier import build_zero_shot_classifier
import open_clip

# Load model
model, _, preprocess = open_clip.create_model_and_transforms('ViT-L-14', pretrained='openai')
tokenizer = open_clip.get_tokenizer('ViT-L-14')
device = 'cuda'
model = model.to(device)

# Build ImageNet classifier with simple templates
classifier = build_zero_shot_classifier(
    model,
    tokenizer,
    IMAGENET_CLASSNAMES,
    SIMPLE_IMAGENET_TEMPLATES,
    num_classes_per_batch=50,  # Process 50 classes at a time
    device=device,
    use_tqdm=True  # Show progress
)

Zero-Shot Classification Pipeline

import torch
import open_clip
from PIL import Image
from open_clip.zero_shot_classifier import build_zero_shot_classifier

# Setup
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='openai')
tokenizer = open_clip.get_tokenizer('ViT-B-32')
device = 'cuda'
model = model.to(device)

# Build classifier
classnames = ['cat', 'dog', 'bird', 'fish', 'horse']
templates = ['a photo of a {}']
classifier = build_zero_shot_classifier(
    model, tokenizer, classnames, templates, device=device
)

# Load and preprocess image
image = preprocess(Image.open('image.jpg')).unsqueeze(0).to(device)

# Encode image and compute similarities
with torch.no_grad():
    image_features = model.encode_image(image, normalize=True)
    logits = 100.0 * image_features @ classifier
    probs = logits.softmax(dim=-1)

# Get predictions
top_prob, top_idx = probs[0].topk(3)
for i, (prob, idx) in enumerate(zip(top_prob, top_idx)):
    print(f"{i+1}. {classnames[idx]}: {prob.item():.2%}")

Memory-Efficient Processing

# For very large class sets, use smaller batch sizes
classifier = build_zero_shot_classifier(
    model,
    tokenizer,
    classnames=list_of_10000_classes,
    templates=SIMPLE_IMAGENET_TEMPLATES,
    num_classes_per_batch=20,  # Small batch size for memory efficiency
    device='cuda',
    use_tqdm=True
)

Notes

  • The function automatically detects whether templates are strings or callables based on the first template
  • All templates are applied to all class names, creating len(classnames) × len(templates) total prompts
  • Each prompt embedding is L2-normalized after encoding, and the per-class average is normalized again before transposition, ensuring proper cosine-similarity computation
  • The resulting classifier can be used directly with normalized image features via matrix multiplication
  • For large numbers of classes, adjust num_classes_per_batch based on available GPU memory

Legacy Version

The module also includes build_zero_shot_classifier_legacy which processes classes one at a time instead of in batches. This is slower but may be useful for compatibility:
def build_zero_shot_classifier_legacy(
    model,
    tokenizer,
    classnames: Sequence[str],
    templates: Sequence[Union[Callable, str]],
    device: Union[str, torch.device] = 'cpu',
    use_tqdm: bool = False,
) -> torch.Tensor
The legacy version does not support num_classes_per_batch and processes each class sequentially.
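To see why batching matters, a minimal sketch (not the library code) of the number of text-encoder forward passes each variant needs, assuming the batched version packs num_classes_per_batch classes into each call:

```python
# Forward-pass count: legacy encodes one class per pass; the batched
# version packs num_classes_per_batch classes into each pass.
def forward_passes(num_classes, num_classes_per_batch):
    if num_classes_per_batch is None:
        return 1  # everything encoded in a single pass
    # Ceiling division: the last batch may be partially filled.
    return -(-num_classes // num_classes_per_batch)

print(forward_passes(1000, 1))    # prints: 1000  (legacy: one class per pass)
print(forward_passes(1000, 10))   # prints: 100   (batched default)
print(forward_passes(1000, None)) # prints: 1     (all classes at once)
```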