
Zero-Shot Classification

One of CLIP’s most powerful capabilities is zero-shot classification: the ability to classify images into categories the model has never been explicitly trained on. This is achieved by comparing image embeddings with text embeddings of potential class labels.

Core Concept

Instead of learning a fixed classifier head for specific categories, CLIP:
  1. Encodes the image into an embedding vector
  2. Encodes candidate text labels (e.g., “a photo of a dog”) into embedding vectors
  3. Computes similarity scores between the image and each text embedding
  4. Selects the highest scoring label as the prediction
Key Insight: Classification becomes a similarity search problem in the joint embedding space, not a traditional softmax over learned weights.
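As a minimal illustration of the four steps, with hand-made unit vectors standing in for real CLIP embeddings, the whole procedure reduces to a dot-product search:

```python
import numpy as np

def normalize(v):
    # Project onto the unit sphere, as CLIP does with its embeddings
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

# Toy stand-ins for the text embeddings of three candidate labels
classnames = ["dog", "cat", "bird"]
text_embeds = np.eye(3)                    # one orthogonal unit vector per label

# Toy image embedding that points mostly in the "cat" direction
image_embed = normalize([0.1, 0.9, 0.2])

# Steps 3-4: similarity scores, then pick the best-scoring label
scores = image_embed @ text_embeds.T
print(classnames[int(np.argmax(scores))])  # cat
```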

How It Works

Step 1: Prepare Text Prompts

Convert class names into descriptive text prompts using templates:
classnames = ["dog", "cat", "bird"]
templates = [
    "a photo of a {}.",
    "a picture of a {}.",
    "an image of a {}.",
]

# Generate prompts
prompts = [
    "a photo of a dog.", "a picture of a dog.", "an image of a dog.",
    "a photo of a cat.", "a picture of a cat.", "an image of a cat.",
    "a photo of a bird.", "a picture of a bird.", "an image of a bird.",
]
Why templates? Context matters! “a photo of a dog” provides more semantic information than just “dog”, leading to better embeddings.
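The nine prompts above need not be written out by hand; a nested comprehension over classes and templates produces the same list, grouped by class:

```python
classnames = ["dog", "cat", "bird"]
templates = [
    "a photo of a {}.",
    "a picture of a {}.",
    "an image of a {}.",
]

# One prompt per (class, template) pair, grouped by class
prompts = [t.format(c) for c in classnames for t in templates]

print(len(prompts))   # 9 prompts = 3 classes x 3 templates
print(prompts[0])     # a photo of a dog.
```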

Step 2: Build Zero-Shot Classifier Weights

From src/open_clip/zero_shot_classifier.py:21-68:
def build_zero_shot_classifier(
        model,
        tokenizer,
        classnames: Sequence[str],
        templates: Sequence[Union[Callable, str]],
        num_classes_per_batch: Optional[int] = 10,
        device: Union[str, torch.device] = 'cpu',
        use_tqdm: bool = False,
):
    """ Build zero-shot classifier weights by iterating over class names in batches """
    
    def _process_batch(batch_classnames):
        num_batch_classes = len(batch_classnames)
        
        # Generate all text prompts for this batch
        texts = [template.format(c) if use_format else template(c) 
                 for c in batch_classnames for template in templates]
        
        # Tokenize and encode
        texts = tokenizer(texts).to(device)
        class_embeddings = model.encode_text(texts, normalize=True)
        
        # Average embeddings across templates
        class_embeddings = class_embeddings.reshape(
            num_batch_classes, num_templates, -1
        ).mean(dim=1)
        
        # Re-normalize after averaging
        class_embeddings = class_embeddings / class_embeddings.norm(dim=1, keepdim=True)
        class_embeddings = class_embeddings.T  # Shape: [embed_dim, num_batch_classes]
        return class_embeddings

    with torch.no_grad():
        if num_classes_per_batch:
            batched_embeds = [_process_batch(batch) 
                            for batch in batched(classnames, num_classes_per_batch)]
            zeroshot_weights = torch.cat(batched_embeds, dim=1)
        else:
            zeroshot_weights = _process_batch(classnames)
    
    return zeroshot_weights
Key steps:
  1. Generate prompts for each class using multiple templates
  2. Encode all prompts to get text embeddings
  3. Average embeddings across templates for each class (ensemble)
  4. Normalize to unit length
  5. Transpose to shape [embed_dim, num_classes]
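The averaging-then-renormalizing steps can be checked with a toy sketch, where random unit vectors stand in for the output of model.encode_text:

```python
import numpy as np

num_classes, num_templates, embed_dim = 3, 4, 8
rng = np.random.default_rng(0)

# Stand-in for encoded prompts: one unit vector per (class, template) pair
embeds = rng.normal(size=(num_classes * num_templates, embed_dim))
embeds /= np.linalg.norm(embeds, axis=1, keepdims=True)

# Step 3: average across templates for each class (template ensemble)
class_embeds = embeds.reshape(num_classes, num_templates, -1).mean(axis=1)

# Averaging distinct unit vectors shrinks the norm below 1, hence step 4
norms = np.linalg.norm(class_embeds, axis=1, keepdims=True)
print(bool(np.all(norms < 1.0)))   # True
class_embeds /= norms

# Step 5: transpose into classifier weights of shape [embed_dim, num_classes]
classifier = class_embeds.T
print(classifier.shape)            # (8, 3)
```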

Step 3: Classify Images

From src/open_clip_train/zero_shot.py:17-42:
def run(model, classifier, dataloader, args):
    """ Run zero-shot classification on a dataset """
    device = torch.device(args.device)
    autocast = get_autocast(args.precision, device_type=device.type)
    
    with torch.inference_mode():
        top1, top5, n = 0., 0., 0.
        for images, target in tqdm(dataloader, unit_scale=args.batch_size):
            images = images.to(device=device)
            target = target.to(device)
            
            with autocast():
                # Encode image
                output = model(image=images)
                image_features = output['image_features'] if isinstance(output, dict) else output[0]
                
                # Compute similarity with classifier weights
                logits = 100. * image_features @ classifier
            
            # Measure accuracy
            acc1, acc5 = accuracy(logits, target, topk=(1, 5))
            top1 += acc1
            top5 += acc5
            n += images.size(0)
    
    return top1 / n, top5 / n
Classification process:
  1. Encode image → normalized embedding vector
  2. Matrix multiply with classifier weights: logits = image_features @ zeroshot_weights
  3. Scale by 100 (temperature scaling)
  4. Argmax to get predicted class
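In terms of shapes, the whole classification step is one matrix multiply; a sketch with random stand-in features:

```python
import numpy as np

B, D, C = 2, 8, 5                  # batch size, embed dim, number of classes
rng = np.random.default_rng(0)

# Normalized image features [B, D] and classifier weights [D, C]
image_features = rng.normal(size=(B, D))
image_features /= np.linalg.norm(image_features, axis=1, keepdims=True)
classifier = rng.normal(size=(D, C))
classifier /= np.linalg.norm(classifier, axis=0, keepdims=True)

logits = 100.0 * image_features @ classifier   # scaled cosine similarities, [B, C]
preds = logits.argmax(axis=1)                  # one predicted class per image

print(logits.shape, preds.shape)               # (2, 5) (2,)
```

Since both factors are unit vectors, every logit is bounded by the scale factor: |logit| ≤ 100.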

Temperature Scaling and Similarity Computation

Cosine Similarity

Since both image and text embeddings are L2-normalized, their dot product equals cosine similarity:
similarity = image_features @ text_features.T
           = ||image|| * ||text|| * cos(θ)
           = 1 * 1 * cos(θ)  [since normalized]
           = cos(θ)
Values range from -1 (opposite) to +1 (identical).
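This identity is easy to verify numerically:

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# Cosine similarity from the definition
cos_theta = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# After L2 normalization, the plain dot product gives the same value
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)

print(round(float(cos_theta), 4), round(float(a_hat @ b_hat), 4))  # 0.96 0.96
```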

Temperature Scaling

The scaling factor (100.0 in the example) controls prediction confidence:
logits = logit_scale * image_features @ classifier
  • Larger logit_scale → sharper probability distribution, more confident predictions
  • Smaller logit_scale → softer distribution, less confident predictions
Note that this multiplier is an inverse softmax temperature: scaling the logits by 100 is equivalent to dividing by a softmax temperature of 0.01.
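The effect is easy to see on raw cosine similarities, which typically cluster in a narrow range (the values below are chosen for illustration):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax
    e = np.exp(x - x.max())
    return e / e.sum()

sims = np.array([0.28, 0.25, 0.20])   # raw cosine similarities: close together

print(np.round(softmax(sims), 3))           # unscaled: nearly uniform
print(np.round(softmax(100.0 * sims), 3))   # scaled by 100: the top class dominates
```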
From the CLIP model (src/open_clip/model.py:274-298):
class CLIP(nn.Module):
    def __init__(
            self,
            embed_dim: int,
            ...
            init_logit_scale: float = np.log(1 / 0.07),  # ≈ 2.66
            ...
    ):
        self.logit_scale = nn.Parameter(torch.ones([]) * init_logit_scale)
During training, logit_scale is learned (and clamped so that its exponential stays at most 100). At inference:
scale = self.logit_scale.exp()  # exp(2.66) ≈ 14.3 at init; typically ~100 after training
logits = scale * image_features @ text_features.T

Softmax Probabilities

To get class probabilities:
probs = logits.softmax(dim=-1)
# probs[i] = probability that image belongs to class i

Real Example from Codebase

ImageNet Zero-Shot Evaluation

From src/open_clip_train/zero_shot.py:45-86:
def zero_shot_eval(model, data, epoch, args, tokenizer=None):
    """ Evaluate zero-shot ImageNet classification during training """
    
    if tokenizer is None:
        tokenizer = get_tokenizer(args.model)

    logging.info('Building zero-shot classifier')
    device = torch.device(args.device)
    
    with autocast():
        # Build classifier using ImageNet-1K class names
        classifier = build_zero_shot_classifier(
            model,
            tokenizer=tokenizer,
            classnames=IMAGENET_CLASSNAMES,  # 1000 classes
            templates=OPENAI_IMAGENET_TEMPLATES,  # 80 templates
            num_classes_per_batch=10,
            device=device,
            use_tqdm=True,
        )

    logging.info('Using classifier')
    results = {}
    
    if 'imagenet-val' in data:
        top1, top5 = run(model, classifier, data['imagenet-val'].dataloader, args)
        results['imagenet-zeroshot-val-top1'] = top1
        results['imagenet-zeroshot-val-top5'] = top5
    
    return results
What’s happening:
  1. Model has never seen ImageNet classification task during training
  2. Build classifier from 1000 ImageNet class names using 80 prompt templates
  3. Evaluate on ImageNet validation set
  4. Achieve competitive accuracy without task-specific fine-tuning!

OpenAI’s ImageNet Templates

Used in the original CLIP paper:
OPENAI_IMAGENET_TEMPLATES = [
    'a bad photo of a {}.',
    'a photo of many {}.',
    'a sculpture of a {}.',
    'a photo of the hard to see {}.',
    'a low resolution photo of the {}.',
    'a rendering of a {}.',
    'graffiti of a {}.',
    'a bad photo of the {}.',
    'a cropped photo of the {}.',
    # ... (80 templates in total)
]
Multiple templates help capture diverse visual contexts.

Practical Usage Example

Custom Classification

Classify an image into custom categories:
import torch
import open_clip
from PIL import Image

# Load model
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32', pretrained='laion2b_s34b_b79k'
)
model.eval()
tokenizer = open_clip.get_tokenizer('ViT-B-32')

# Define custom classes
classnames = ['dog', 'cat', 'car', 'airplane']
templates = ['a photo of a {}.']

# Build zero-shot classifier
with torch.no_grad():
    text_prompts = [template.format(c) for c in classnames for template in templates]
    text_tokens = tokenizer(text_prompts)
    text_features = model.encode_text(text_tokens)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    
    # Classifier weights: [embed_dim, num_classes]
    classifier = text_features.T

# Classify image
image = preprocess(Image.open('dog.jpg')).unsqueeze(0)
with torch.no_grad():
    image_features = model.encode_image(image)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    
    # Compute similarities
    logits = 100.0 * image_features @ classifier
    probs = logits.softmax(dim=-1)
    
    # Get prediction
    pred_idx = probs.argmax()
    print(f"Predicted: {classnames[pred_idx]} ({probs[0, pred_idx]:.2%})")
    # Example output (exact numbers vary by model and image): Predicted: dog (94.23%)

Zero-Shot vs Fine-Tuning

Zero-Shot (No Fine-Tuning)

Advantages:
  • Works on any categories without training data
  • Instant deployment to new tasks
  • No overfitting to specific datasets
  • Leverages large-scale pretraining
Limitations:
  • Lower accuracy than fine-tuned models on specific tasks
  • Sensitive to prompt engineering
  • May struggle with fine-grained distinctions

With Fine-Tuning

Advantages:
  • Higher accuracy on target task
  • Adapts to specific visual distributions
  • Can learn task-specific features
Limitations:
  • Requires labeled training data
  • May lose zero-shot generalization
  • Risk of overfitting
For fine-tuning CLIP, see the WiSE-FT repository, which implements robust fine-tuning techniques.

Advanced Techniques

Prompt Engineering

Better prompts → better performance:
# Generic
prompts = ["dog", "cat"]

# Better: add context
prompts = ["a photo of a dog", "a photo of a cat"]

# Best: diverse templates
templates = [
    "a photo of a {}",
    "a picture of a {}", 
    "an image showing a {}",
    "a rendering of a {}",
]

Ensemble Multiple Templates

Averaging embeddings across templates improves robustness (already done in build_zero_shot_classifier).

Hierarchical Classification

For fine-grained tasks, use two-stage classification:
  1. Coarse categories: “bird”, “mammal”, “vehicle”
  2. Fine-grained: “golden retriever”, “labrador”, “poodle”
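A sketch of the two-stage idea, using hypothetical hand-made embeddings in place of model.encode_text (the category names and toy vectors below are illustrative, not part of the OpenCLIP API):

```python
import numpy as np

def normalize(v):
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

# Hypothetical stand-in embeddings: coarse categories on orthogonal axes,
# fine classes perturbed around their parent category's axis
coarse_embeds = {"bird": [1, 0, 0], "mammal": [0, 1, 0], "vehicle": [0, 0, 1]}
fine_embeds = {
    "bird": {"sparrow": [1, 0.1, 0], "eagle": [1, -0.1, 0]},
    "mammal": {"golden retriever": [0.1, 1, 0], "labrador": [-0.1, 1, 0]},
    "vehicle": {"car": [0, 0.1, 1], "truck": [0, -0.1, 1]},
}

def nearest(image_embed, label_embeds):
    # The same zero-shot step at either level: highest cosine similarity wins
    return max(label_embeds, key=lambda name: image_embed @ normalize(label_embeds[name]))

image_embed = normalize([-0.08, 1.0, 0.02])       # an image resembling a labrador

coarse = nearest(image_embed, coarse_embeds)      # stage 1 picks the coarse category
fine = nearest(image_embed, fine_embeds[coarse])  # stage 2 compares only within it
print(coarse, "->", fine)                         # mammal -> labrador
```

Stage 2 only ever compares classes within the chosen coarse category, which keeps fine-grained labels from competing with unrelated ones.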

Performance Benchmarks

From the README, OpenCLIP models achieve strong zero-shot ImageNet accuracy:
| Model | Training Data | Zero-Shot ImageNet Acc |
| --- | --- | --- |
| ViT-B-16 | DataComp-1B | 73.5% |
| ViT-L-14 | DataComp-1B | 79.2% |
| ViT-bigG-14 | LAION-2B | 80.1% |
| ViT-SO400M (SigLIP) | WebLI | 82.0% |
| ViT-gopt-16 (SigLIP2) | WebLI | 85.0% |
Without any ImageNet-specific training!

Key Takeaways

  1. Zero-shot = Similarity search: Classification as nearest neighbor in embedding space
  2. Prompts matter: “a photo of a dog” > “dog”
  3. Template ensembling: Average across multiple prompts for robustness
  4. Temperature scaling: Controls prediction sharpness
  5. No training data needed: Instant deployment to new categories
  6. Trade-off: Convenience vs accuracy (compared to fine-tuning)

Reference Files

  • src/open_clip/zero_shot_classifier.py - Classifier building logic
  • src/open_clip_train/zero_shot.py - Zero-shot evaluation during training
  • src/open_clip/zero_shot_metadata.py - ImageNet classnames and templates
