Overview

The build_zero_shot_classifier function creates zero-shot classification weights by encoding class names with multiple text templates. This is a core component of CLIP’s zero-shot classification capabilities, allowing you to classify images into arbitrary categories without training.

Function Signature

def build_zero_shot_classifier(
    model,
    tokenizer,
    classnames: Sequence[str],
    templates: Sequence[Union[Callable, str]],
    num_classes_per_batch: Optional[int] = 10,
    device: Union[str, torch.device] = 'cpu',
    use_tqdm: bool = False,
) -> torch.Tensor
Source: src/open_clip/zero_shot_classifier.py:21

Parameters

model
CLIP model instance
required
The CLIP model instance used for encoding text. Must have an encode_text method.
tokenizer
tokenizer instance
required
The CLIP tokenizer instance for converting text to tokens.
classnames
Sequence[str]
required
A sequence of class (label) names to create the classifier for. For example: ["cat", "dog", "bird"]
templates
Sequence[Union[Callable, str]]
required
A sequence of callables or format-friendly strings to produce text prompts per class name.
  • If strings: Use format syntax like "a photo of a {}"
  • If callables: Lambda functions like lambda c: f"a photo of a {c}"
See zero_shot_metadata for pre-defined template sets.
num_classes_per_batch
Optional[int]
default: 10
The number of classes to batch together in each forward pass. Set to None to process all classes at once. Batching is useful for managing memory usage with large numbers of classes.
device
Union[str, torch.device]
default: 'cpu'
Device to use for computation. Examples: 'cpu', 'cuda', 'cuda:0'
use_tqdm
bool
default: False
Enable TQDM progress bar to track processing of class batches.
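As a quick illustration of the two template forms described above, string templates are expanded with str.format while callables are applied directly; both produce identical prompts for a given class name:

```python
# String templates vs. callable templates: two ways to express the same prompts.
string_templates = ['a photo of a {}', 'a picture of a {}']
callable_templates = [lambda c: f'a photo of a {c}', lambda c: f'a picture of a {c}']

name = 'cat'
from_strings = [t.format(name) for t in string_templates]
from_callables = [t(name) for t in callable_templates]
print(from_strings == from_callables)  # prints: True
```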

Returns

zeroshot_weights
torch.Tensor
A tensor of shape (embedding_dim, num_classes) containing the normalized classifier weights. Each column represents the averaged and normalized text embeddings for one class across all templates.

How It Works

  1. Template Expansion: Each class name is combined with each template to create multiple text prompts
  2. Batch Processing: Classes are processed in batches for memory efficiency
  3. Text Encoding: Each text prompt is tokenized and encoded using the model’s text encoder
  4. Averaging: For each class, embeddings from all templates are averaged
  5. Normalization: The averaged embeddings are L2-normalized
  6. Transposition: The result is transposed to shape (embedding_dim, num_classes) for use as classifier weights
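The steps above can be sketched in plain Python. This is a toy walk-through, not the library implementation: toy_encode_text stands in for the model's real text encoder, and lists stand in for tensors, but the expansion, averaging, normalization, and transposition mirror the real flow and shapes.

```python
# Toy sketch of the classifier-building pipeline. toy_encode_text is a
# stand-in for model.encode_text (deterministic pseudo-embeddings).
import hashlib
import math

EMBED_DIM = 8

def toy_encode_text(prompt):
    # Stand-in encoder: derive a deterministic vector from the prompt text,
    # then L2-normalize it (mirroring per-prompt normalization).
    digest = hashlib.sha256(prompt.encode()).digest()
    vec = [b / 255.0 for b in digest[:EMBED_DIM]]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def build_toy_classifier(classnames, templates):
    columns = []
    for name in classnames:
        # 1. Template expansion: one prompt per template.
        prompts = [t.format(name) for t in templates]
        # 3. Encode each prompt (normalized inside the encoder).
        embs = [toy_encode_text(p) for p in prompts]
        # 4. Average the template embeddings per class.
        mean = [sum(dim_vals) / len(embs) for dim_vals in zip(*embs)]
        # 5. Re-normalize the averaged embedding.
        norm = math.sqrt(sum(v * v for v in mean))
        columns.append([v / norm for v in mean])
    # 6. Transpose: rows = embedding dims, columns = classes.
    return [list(row) for row in zip(*columns)]

weights = build_toy_classifier(['cat', 'dog'], ['a photo of a {}', 'a sketch of a {}'])
print(len(weights), len(weights[0]))  # prints: 8 2  -> (embedding_dim, num_classes)
```

Each column of the result has unit norm, which is what makes a plain matrix multiplication with normalized image features equivalent to cosine similarity.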

Usage Examples

Basic Usage

import torch
import open_clip
from open_clip.zero_shot_classifier import build_zero_shot_classifier

# Load model and tokenizer
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='openai')
tokenizer = open_clip.get_tokenizer('ViT-B-32')

# Define classes and templates
classnames = ['cat', 'dog', 'bird', 'fish']
templates = [
    'a photo of a {}',
    'a picture of a {}',
    'an image of a {}',
]

# Build classifier
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device)
classifier = build_zero_shot_classifier(
    model,
    tokenizer,
    classnames,
    templates,
    device=device
)

print(f"Classifier shape: {classifier.shape}")  # (embedding_dim, 4)

Using Pre-defined Templates

from open_clip.zero_shot_metadata import SIMPLE_IMAGENET_TEMPLATES, IMAGENET_CLASSNAMES
from open_clip.zero_shot_classifier import build_zero_shot_classifier
import open_clip

# Load model
model, _, preprocess = open_clip.create_model_and_transforms('ViT-L-14', pretrained='openai')
tokenizer = open_clip.get_tokenizer('ViT-L-14')
device = 'cuda'
model = model.to(device)

# Build ImageNet classifier with simple templates
classifier = build_zero_shot_classifier(
    model,
    tokenizer,
    IMAGENET_CLASSNAMES,
    SIMPLE_IMAGENET_TEMPLATES,
    num_classes_per_batch=50,  # Process 50 classes at a time
    device=device,
    use_tqdm=True  # Show progress
)

Zero-Shot Classification Pipeline

import torch
import open_clip
from PIL import Image
from open_clip.zero_shot_classifier import build_zero_shot_classifier

# Setup
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='openai')
tokenizer = open_clip.get_tokenizer('ViT-B-32')
device = 'cuda'
model = model.to(device)

# Build classifier
classnames = ['cat', 'dog', 'bird', 'fish', 'horse']
templates = ['a photo of a {}']
classifier = build_zero_shot_classifier(
    model, tokenizer, classnames, templates, device=device
)

# Load and preprocess image
image = preprocess(Image.open('image.jpg')).unsqueeze(0).to(device)

# Encode image and compute similarities
with torch.no_grad():
    image_features = model.encode_image(image, normalize=True)
    logits = 100.0 * image_features @ classifier
    probs = logits.softmax(dim=-1)

# Get predictions
top_prob, top_idx = probs[0].topk(3)
for i, (prob, idx) in enumerate(zip(top_prob, top_idx)):
    print(f"{i+1}. {classnames[idx]}: {prob.item():.2%}")

Memory-Efficient Processing

# For very large class sets, use smaller batch sizes
classifier = build_zero_shot_classifier(
    model,
    tokenizer,
    classnames=list_of_10000_classes,
    templates=SIMPLE_IMAGENET_TEMPLATES,
    num_classes_per_batch=20,  # Small batch size for memory efficiency
    device='cuda',
    use_tqdm=True
)

Notes

  • The function automatically detects whether templates are strings or callables based on the first template
  • All templates are applied to all class names, creating len(classnames) × len(templates) total prompts
  • Each prompt embedding is L2-normalized after encoding, and the per-class average is normalized again before transposition, ensuring proper cosine-similarity computation
  • The resulting classifier can be used directly with normalized image features via matrix multiplication
  • For large numbers of classes, adjust num_classes_per_batch based on available GPU memory

Legacy Version

The module also includes build_zero_shot_classifier_legacy which processes classes one at a time instead of in batches. This is slower but may be useful for compatibility:
def build_zero_shot_classifier_legacy(
    model,
    tokenizer,
    classnames: Sequence[str],
    templates: Sequence[Union[Callable, str]],
    device: Union[str, torch.device] = 'cpu',
    use_tqdm: bool = False,
) -> torch.Tensor
The legacy version does not support num_classes_per_batch and processes each class sequentially.
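To see why batching matters, a minimal sketch (not the library code) of the number of text-encoder forward passes each variant needs, assuming the batched version packs num_classes_per_batch classes into each call:

```python
# Forward-pass count: legacy encodes one class per pass; the batched
# version packs num_classes_per_batch classes into each pass.
def forward_passes(num_classes, num_classes_per_batch):
    if num_classes_per_batch is None:
        return 1  # everything encoded in a single pass
    # Ceiling division: the last batch may be partially filled.
    return -(-num_classes // num_classes_per_batch)

print(forward_passes(1000, 1))    # prints: 1000  (legacy: one class per pass)
print(forward_passes(1000, 10))   # prints: 100   (batched default)
print(forward_passes(1000, None)) # prints: 1     (all classes at once)
```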