Zero-Shot Classification
One of CLIP’s most powerful capabilities is zero-shot classification: the ability to classify images into categories the model has never been explicitly trained on. This is achieved by comparing image embeddings with text embeddings of potential class labels.
Core Concept
Instead of learning a fixed classifier head for specific categories, CLIP:
Encodes the image into an embedding vector
Encodes candidate text labels (e.g., “a photo of a dog”) into embedding vectors
Computes similarity scores between the image and each text embedding
Selects the highest scoring label as the prediction
Key Insight: Classification becomes a similarity search problem in the joint embedding space, not a traditional softmax over learned weights.
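The four steps above can be sketched with toy vectors, using plain Python in place of CLIP's encoders (the embeddings here are made-up stand-ins, not real model outputs):

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Made-up stand-ins for CLIP's image and text encoders
image_embedding = normalize([0.9, 0.1, 0.2])
label_embeddings = {
    "a photo of a dog": normalize([0.8, 0.2, 0.1]),
    "a photo of a cat": normalize([0.1, 0.9, 0.3]),
}

# Classification = similarity search: dot product against every candidate label
scores = {label: sum(a * b for a, b in zip(image_embedding, emb))
          for label, emb in label_embeddings.items()}
prediction = max(scores, key=scores.get)
print(prediction)  # a photo of a dog
```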
How It Works
Step 1: Prepare Text Prompts
Convert class names into descriptive text prompts using templates:
```python
classnames = ["dog", "cat", "bird"]
templates = [
    "a photo of a {}.",
    "a picture of a {}.",
    "an image of a {}.",
]

# Generated prompts (one per class/template pair)
prompts = [
    "a photo of a dog.", "a picture of a dog.", "an image of a dog.",
    "a photo of a cat.", "a picture of a cat.", "an image of a cat.",
    "a photo of a bird.", "a picture of a bird.", "an image of a bird.",
]
```
Why templates? Context matters! “a photo of a dog” provides more semantic information than just “dog”, leading to better embeddings.
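The full prompt list can be generated from the class names and templates with a single comprehension, using the same class-major ordering that `build_zero_shot_classifier` uses below:

```python
classnames = ["dog", "cat", "bird"]
templates = [
    "a photo of a {}.",
    "a picture of a {}.",
    "an image of a {}.",
]

# Class-major ordering: all templates for one class before the next class
prompts = [template.format(c) for c in classnames for template in templates]
print(len(prompts))  # 9
print(prompts[0])    # a photo of a dog.
```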
Step 2: Build Zero-Shot Classifier Weights
From src/open_clip/zero_shot_classifier.py:21-68:
```python
def build_zero_shot_classifier(
    model,
    tokenizer,
    classnames: Sequence[str],
    templates: Sequence[Union[Callable, str]],
    num_classes_per_batch: Optional[int] = 10,
    device: Union[str, torch.device] = 'cpu',
    use_tqdm: bool = False,
):
    """Build zero-shot classifier weights by iterating over class names in batches."""
    use_format = isinstance(templates[0], str)  # str templates use .format(); callables are invoked
    num_templates = len(templates)

    def _process_batch(batch_classnames):
        num_batch_classes = len(batch_classnames)
        # Generate all text prompts for this batch
        texts = [template.format(c) if use_format else template(c)
                 for c in batch_classnames for template in templates]
        # Tokenize and encode
        texts = tokenizer(texts).to(device)
        class_embeddings = model.encode_text(texts, normalize=True)
        # Average embeddings across templates
        class_embeddings = class_embeddings.reshape(
            num_batch_classes, num_templates, -1
        ).mean(dim=1)
        # Re-normalize after averaging
        class_embeddings = class_embeddings / class_embeddings.norm(dim=1, keepdim=True)
        class_embeddings = class_embeddings.T  # Shape: [embed_dim, num_batch_classes]
        return class_embeddings

    with torch.no_grad():
        if num_classes_per_batch:
            batched_embeds = [_process_batch(batch)
                              for batch in batched(classnames, num_classes_per_batch)]
            zeroshot_weights = torch.cat(batched_embeds, dim=1)
        else:
            zeroshot_weights = _process_batch(classnames)
    return zeroshot_weights
```
Key steps:
Generate prompts for each class using multiple templates
Encode all prompts to get text embeddings
Average embeddings across templates for each class (ensemble)
Normalize to unit length
Transpose to shape [embed_dim, num_classes]
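The ensemble-then-renormalize sequence matters because averaging unit vectors generally yields a vector shorter than unit length. A toy sketch in plain Python (the two-dimensional template embeddings are made up):

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Two made-up unit-length template embeddings for the same class
emb_a = normalize([1.0, 0.0])
emb_b = normalize([0.0, 1.0])

# Averaging unit vectors shrinks the norm (here to sqrt(0.5) ≈ 0.707) ...
mean = [(a + b) / 2 for a, b in zip(emb_a, emb_b)]
mean_norm = math.sqrt(sum(x * x for x in mean))

# ... so the classifier re-normalizes before using the weights
class_weight = normalize(mean)
weight_norm = math.sqrt(sum(x * x for x in class_weight))
print(round(mean_norm, 3), round(weight_norm, 3))  # 0.707 1.0
```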
Step 3: Classify Images
From src/open_clip_train/zero_shot.py:17-42:
```python
def run(model, classifier, dataloader, args):
    """Run zero-shot classification on a dataset."""
    device = torch.device(args.device)
    with torch.inference_mode():
        top1, top5, n = 0., 0., 0.
        for images, target in tqdm(dataloader, unit_scale=args.batch_size):
            images = images.to(device=device)
            target = target.to(device)

            with autocast():
                # Encode image
                output = model(image=images)
                image_features = output['image_features'] if isinstance(output, dict) else output[0]
                # Compute similarity with classifier weights
                logits = 100. * image_features @ classifier

            # Measure accuracy
            acc1, acc5 = accuracy(logits, target, topk=(1, 5))
            top1 += acc1
            top5 += acc5
            n += images.size(0)

    return top1 / n, top5 / n
```
Classification process:
Encode image → normalized embedding vector
Matrix multiply with classifier weights: logits = image_features @ zeroshot_weights
Scale by 100 (the logit scale, acting as an inverse temperature)
Argmax to get predicted class
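The matrix multiply above can be traced with toy numbers in plain Python (the 2-dimensional image embedding and identity-matrix classifier are made up purely to keep shapes visible):

```python
# Shapes: image_features [batch, embed_dim] @ classifier [embed_dim, num_classes]
#      -> logits [batch, num_classes]
image_features = [[0.6, 0.8]]  # one image, embed_dim = 2, already unit length
classifier = [[1.0, 0.0],      # embed_dim rows
              [0.0, 1.0]]      # one column per class

logits = [[100.0 * sum(f[k] * classifier[k][j] for k in range(2))
           for j in range(2)]
          for f in image_features]
print(logits)  # [[60.0, 80.0]] -> argmax picks class 1
```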
Temperature Scaling and Similarity Computation
Cosine Similarity
Since both image and text embeddings are L2-normalized, their dot product equals their cosine similarity:

```
similarity = image_features @ text_features.T
           = ||image|| * ||text|| * cos(θ)
           = 1 * 1 * cos(θ)        [both norms are 1 after normalization]
           = cos(θ)
```
Values range from -1 (opposite) to +1 (identical).
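A quick numeric check in plain Python: for unit vectors, the dot product recovers cos(θ) exactly:

```python
import math

theta = math.pi / 3  # 60 degrees
u = [1.0, 0.0]                           # unit vector
v = [math.cos(theta), math.sin(theta)]   # unit vector at angle theta from u

# Dot product of two unit vectors equals the cosine of the angle between them
dot = sum(a * b for a, b in zip(u, v))
print(round(dot, 6), round(math.cos(theta), 6))  # 0.5 0.5
```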
Temperature Scaling
The scaling factor (100.0 in the example) is the logit scale, which acts as an inverse temperature and controls prediction confidence:

```
logits = logit_scale * image_features @ classifier
```

Higher logit scale (lower temperature) → sharper probability distribution, more confident predictions
Lower logit scale (higher temperature) → softer distribution, less confident predictions
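Scaling matters because cosine similarities live in [-1, 1]: without it, the softmax over even close scores is nearly uniform. A toy 3-way example (made-up similarity values):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

sims = [0.30, 0.28, 0.10]  # cosine similarities: close together, weak raw signal

p_unscaled = softmax(sims)                     # nearly uniform
p_scaled = softmax([100.0 * s for s in sims])  # sharply peaked

print(round(max(p_unscaled), 2))  # 0.36
print(round(max(p_scaled), 2))    # 0.88
```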
From the CLIP model (src/open_clip/model.py:274-298):
```python
class CLIP(nn.Module):
    def __init__(
        self,
        embed_dim: int,
        ...
        init_logit_scale: float = np.log(1 / 0.07),  # ≈ 2.66
        ...
    ):
        self.logit_scale = nn.Parameter(torch.ones([]) * init_logit_scale)
```
During training, `logit_scale` is learned. At inference:

```python
scale = self.logit_scale.exp()  # exp(2.66) ≈ 14.3 at init; typically ~100 after training
logits = scale * image_features @ text_features.T
```
Softmax Probabilities
To get class probabilities:
```python
probs = logits.softmax(dim=-1)
# probs[0, i] = probability that the image belongs to class i
```
Real Example from Codebase
ImageNet Zero-Shot Evaluation
From src/open_clip_train/zero_shot.py:45-86:
```python
def zero_shot_eval(model, data, epoch, args, tokenizer=None):
    """Evaluate zero-shot ImageNet classification during training."""
    if tokenizer is None:
        tokenizer = get_tokenizer(args.model)

    logging.info('Building zero-shot classifier')
    device = torch.device(args.device)
    with autocast():
        # Build classifier using ImageNet-1K class names
        classifier = build_zero_shot_classifier(
            model,
            tokenizer=tokenizer,
            classnames=IMAGENET_CLASSNAMES,       # 1000 classes
            templates=OPENAI_IMAGENET_TEMPLATES,  # 80 templates
            num_classes_per_batch=10,
            device=device,
            use_tqdm=True,
        )

    logging.info('Using classifier')
    results = {}
    if 'imagenet-val' in data:
        top1, top5 = run(model, classifier, data['imagenet-val'].dataloader, args)
        results['imagenet-zeroshot-val-top1'] = top1
        results['imagenet-zeroshot-val-top5'] = top5
    return results
```
What’s happening:
Model has never seen ImageNet classification task during training
Build a classifier from the 1000 ImageNet class names using 80 prompt templates
Evaluate on ImageNet validation set
Achieve competitive accuracy without task-specific fine-tuning!
OpenAI’s ImageNet Templates
Used in the original CLIP paper:
```python
OPENAI_IMAGENET_TEMPLATES = [
    'a bad photo of a {}.',
    'a photo of many {}.',
    'a sculpture of a {}.',
    'a photo of the hard to see {}.',
    'a low resolution photo of the {}.',
    'a rendering of a {}.',
    'graffiti of a {}.',
    'a bad photo of the {}.',
    'a cropped photo of the {}.',
    # ... (80 templates in total)
]
```
Multiple templates help capture diverse visual contexts.
Practical Usage Example
Custom Classification
Classify an image into custom categories:
```python
import torch
import open_clip
from PIL import Image

# Load model
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32', pretrained='laion2b_s34b_b79k'
)
model.eval()
tokenizer = open_clip.get_tokenizer('ViT-B-32')

# Define custom classes
classnames = ['dog', 'cat', 'car', 'airplane']
templates = ['a photo of a {}.']

# Build zero-shot classifier
with torch.no_grad():
    text_prompts = [template.format(c) for c in classnames for template in templates]
    text_tokens = tokenizer(text_prompts)
    text_features = model.encode_text(text_tokens)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Classifier weights: [embed_dim, num_classes]
    classifier = text_features.T

# Classify image
image = preprocess(Image.open('dog.jpg')).unsqueeze(0)
with torch.no_grad():
    image_features = model.encode_image(image)
    image_features /= image_features.norm(dim=-1, keepdim=True)

    # Compute similarities
    logits = 100.0 * image_features @ classifier
    probs = logits.softmax(dim=-1)

# Get prediction
pred_idx = probs.argmax()
print(f"Predicted: {classnames[pred_idx]} ({probs[0, pred_idx]:.2%})")
# Example output: Predicted: dog (94.23%)
```
Zero-Shot vs Fine-Tuning
Zero-Shot (No Fine-Tuning)
✅ Advantages:
Works on any categories without training data
Instant deployment to new tasks
No overfitting to specific datasets
Leverages large-scale pretraining
❌ Limitations:
Lower accuracy than fine-tuned models on specific tasks
Sensitive to prompt engineering
May struggle with fine-grained distinctions
With Fine-Tuning
✅ Advantages:
Higher accuracy on target task
Adapts to specific visual distributions
Can learn task-specific features
❌ Limitations:
Requires labeled training data
May lose zero-shot generalization
Risk of overfitting
For fine-tuning CLIP, see the WiSE-FT repository which implements robust fine-tuning techniques.
Advanced Techniques
Prompt Engineering
Better prompts → better performance:
```python
# Generic
prompts = ["dog", "cat"]

# Better: add context
prompts = ["a photo of a dog", "a photo of a cat"]

# Best: diverse templates
templates = [
    "a photo of a {}",
    "a picture of a {}",
    "an image showing a {}",
    "a rendering of a {}",
]
```
Ensemble Multiple Templates
Averaging embeddings across templates improves robustness (already done in build_zero_shot_classifier).
Hierarchical Classification
For fine-grained tasks, use two-stage classification:
Coarse categories: “bird”, “mammal”, “vehicle”
Fine-grained: “golden retriever”, “labrador”, “poodle”
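A minimal sketch of the two-stage idea, with a hypothetical `classify` helper standing in for the CLIP similarity search shown earlier, and made-up similarity scores for one image:

```python
# Hypothetical helper standing in for CLIP's similarity search:
# picks the candidate label with the highest similarity score.
def classify(candidates, scores):
    return max(candidates, key=lambda c: scores.get(c, 0.0))

# Made-up hierarchy and made-up per-label similarity scores
HIERARCHY = {
    "mammal": ["golden retriever", "labrador", "poodle"],
    "vehicle": ["sedan", "truck"],
}
scores = {
    "mammal": 0.7, "vehicle": 0.2,
    "golden retriever": 0.8, "labrador": 0.6, "poodle": 0.3,
}

coarse = classify(HIERARCHY.keys(), scores)  # stage 1: broad category
fine = classify(HIERARCHY[coarse], scores)   # stage 2: within that category
print(coarse, fine)  # mammal golden retriever
```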
From the README, OpenCLIP models achieve strong zero-shot ImageNet accuracy:
| Model | Training Data | Zero-Shot ImageNet Acc |
|---|---|---|
| ViT-B-16 | DataComp-1B | 73.5% |
| ViT-L-14 | DataComp-1B | 79.2% |
| ViT-bigG-14 | LAION-2B | 80.1% |
| ViT-SO400M (SigLIP) | WebLI | 82.0% |
| ViT-gopt-16 (SigLIP2) | WebLI | 85.0% |
Without any ImageNet-specific training!
Key Takeaways
Zero-shot = similarity search: classification as nearest-neighbor lookup in the joint embedding space
Prompts matter: “a photo of a dog” > “dog”
Template ensembling: average across multiple prompts for robustness
Temperature scaling: controls prediction sharpness
No training data needed: instant deployment to new categories
Trade-off: convenience vs. accuracy (compared to fine-tuning)
Reference Files
src/open_clip/zero_shot_classifier.py - Classifier building logic
src/open_clip_train/zero_shot.py - Zero-shot evaluation during training
src/open_clip/zero_shot_metadata.py - ImageNet classnames and templates
Further Reading
CLIP Overview: understanding the dual-encoder architecture
Contrastive Learning: how CLIP learns aligned embeddings