Overview
The build_zero_shot_classifier function creates zero-shot classification weights by encoding each class name with multiple text templates. It is a core component of CLIP’s zero-shot classification, letting you classify images into arbitrary categories without additional training.
Function Signature
src/open_clip/zero_shot_classifier.py:21
Parameters
- model: The CLIP model instance used for encoding text. Must have an encode_text method.
- tokenizer: The CLIP tokenizer instance for converting text to tokens.
- classnames: A sequence of class (label) names to create the classifier for. For example: ["cat", "dog", "bird"]
- templates: A sequence of callables or format-friendly strings to produce text prompts per class name.
  - If strings: Use format syntax like "a photo of a {}"
  - If callables: Lambda functions like lambda c: f"a photo of a {c}"
- num_classes_per_batch: The number of classes to batch together in each forward pass. Set to None to process all classes at once. Batching is useful for managing memory usage with large numbers of classes.
- device: Device to use for computation. Examples: 'cpu', 'cuda', 'cuda:0'
- use_tqdm: Enable a TQDM progress bar to track processing of class batches.
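As a rough illustration of how the two template styles turn class names into prompts, the sketch below expands every class name with every template and picks the mode (format string or callable) from the first template. The helper name expand_templates is hypothetical and is not part of open_clip.

```python
from typing import Callable, List, Sequence, Union

Template = Union[str, Callable[[str], str]]


def expand_templates(classnames: Sequence[str], templates: Sequence[Template]) -> List[str]:
    """Hypothetical helper: build one prompt per (class, template) pair."""
    # Decide once, from the first template, whether templates are format strings.
    use_format = isinstance(templates[0], str)
    return [
        template.format(c) if use_format else template(c)
        for c in classnames
        for template in templates
    ]


string_prompts = expand_templates(["cat", "dog"], ["a photo of a {}", "a sketch of a {}"])
callable_prompts = expand_templates(["cat", "dog"], [lambda c: f"a photo of a {c}"])
print(string_prompts)
print(callable_prompts)
```

Prompts are grouped by class, so all prompts for one class sit next to each other before encoding.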
Returns
A tensor of shape (embedding_dim, num_classes) containing the normalized classifier weights. Each column represents the averaged and normalized text embeddings for one class across all templates.
How It Works
- Template Expansion: Each class name is combined with each template to create multiple text prompts
- Batch Processing: Classes are processed in batches for memory efficiency
- Text Encoding: Each text prompt is tokenized and encoded using the model’s text encoder
- Averaging: For each class, embeddings from all templates are averaged
- Normalization: The averaged embeddings are L2-normalized
- Transposition: The result is transposed to shape (embedding_dim, num_classes) for use as classifier weights
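The steps above can be sketched in plain Python. This is only an illustration under stated assumptions: toy_encode_text is a made-up stand-in for the model's text encoder, build_classifier_sketch is a hypothetical name, lists replace tensors, batching is left out, and the per-prompt normalization before averaging is an assumption about the real implementation.

```python
import math
from typing import List, Sequence


def toy_encode_text(prompt: str, dim: int = 4) -> List[float]:
    """Stand-in for the model's text encoder: a deterministic toy embedding (assumption)."""
    vec = [0.0] * dim
    for i, ch in enumerate(prompt):
        vec[i % dim] += ord(ch) / 100.0
    return vec


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def build_classifier_sketch(classnames, templates, encode=toy_encode_text):
    rows = []  # one averaged, normalized embedding per class
    for name in classnames:
        # Template expansion and text encoding for this class.
        # Assumption: each prompt embedding is normalized before averaging.
        embeddings = [l2_normalize(encode(t.format(name))) for t in templates]
        # Averaging across templates, dimension by dimension.
        mean = [sum(col) / len(embeddings) for col in zip(*embeddings)]
        # L2 normalization of the averaged embedding.
        rows.append(l2_normalize(mean))
    # Transposition: (num_classes, embedding_dim) -> (embedding_dim, num_classes).
    return [list(col) for col in zip(*rows)]


weights = build_classifier_sketch(
    ["cat", "dog", "bird"],
    ["a photo of a {}", "a drawing of a {}"],
)
print(len(weights), len(weights[0]))  # embedding_dim, num_classes
```

Each column of weights is a unit vector, which is what lets a dot product with a normalized image feature act as a cosine similarity.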
Usage Examples
Basic Usage
Using Pre-defined Templates
Zero-Shot Classification Pipeline
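A minimal sketch of the final classification step, assuming hand-written toy values in place of real model outputs. The classifier matrix has shape (embedding_dim, num_classes), the image features are already normalized, and the matmul helper is a hypothetical plain-Python substitute for a tensor product.

```python
def matmul(a, b):
    """Plain-Python matrix product: (n, d) x (d, k) -> (n, k)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Toy classifier weights, shape (embedding_dim=2, num_classes=3); columns are unit vectors.
classifier = [
    [1.0, 0.0, 0.6],
    [0.0, 1.0, 0.8],
]
classnames = ["cat", "dog", "bird"]

# Two already-normalized image feature vectors, shape (num_images=2, embedding_dim=2).
image_features = [
    [0.0, 1.0],
    [0.8, 0.6],
]

# Cosine similarity of every image with every class, shape (2, 3).
logits = matmul(image_features, classifier)

# Pick the class with the highest similarity for each image.
predictions = [classnames[max(range(len(row)), key=row.__getitem__)] for row in logits]
print(predictions)
```

With PyTorch tensors the same product is typically written as image_features @ classifier, followed by an argmax or softmax over the class dimension.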
Memory-Efficient Processing
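To get a feel for how num_classes_per_batch bounds the work per forward pass, the sketch below chunks a list of class names and reports how many prompts each batch would contain. The helper iter_class_batches and the template count of 80 are illustrative assumptions, not values taken from open_clip.

```python
from typing import Iterator, List, Optional, Sequence


def iter_class_batches(
    classnames: Sequence[str],
    num_classes_per_batch: Optional[int],
) -> Iterator[List[str]]:
    """Hypothetical helper: yield class names in chunks; None means one single batch."""
    if num_classes_per_batch is None:
        yield list(classnames)
        return
    for start in range(0, len(classnames), num_classes_per_batch):
        yield list(classnames[start:start + num_classes_per_batch])


classnames = [f"class_{i}" for i in range(10)]
num_templates = 80  # illustrative template count (assumption)

for batch in iter_class_batches(classnames, num_classes_per_batch=4):
    # Each forward pass encodes len(batch) * num_templates prompts.
    print(len(batch), "classes ->", len(batch) * num_templates, "prompts")
```

A smaller batch size lowers peak memory at the cost of more forward passes; None trades memory for a single pass.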
Notes
- The function automatically detects whether templates are strings or callables based on the first template
- All templates are applied to all class names, creating len(classnames) × len(templates) total prompts
- Embeddings are normalized per prompt before averaging and again after averaging, before transposition, to ensure proper similarity computation
- The resulting classifier can be used directly with normalized image features via matrix multiplication
- For large numbers of classes, adjust num_classes_per_batch based on available GPU memory
Related Functions
- build_zero_shot_classifier_legacy - Original implementation that processes classes one at a time
- Zero-Shot Metadata - Pre-defined templates and class names
Legacy Version
The module also includes build_zero_shot_classifier_legacy, which is slower but may be useful for compatibility. It does not take num_classes_per_batch and processes each class sequentially instead of in batches.