
Zero-Shot Classification

One of CLIP’s most powerful capabilities is zero-shot classification: the ability to classify images into categories the model has never been explicitly trained on. This is achieved by comparing image embeddings with text embeddings of potential class labels.

Core Concept

Instead of learning a fixed classifier head for specific categories, CLIP:
  1. Encodes the image into an embedding vector
  2. Encodes candidate text labels (e.g., “a photo of a dog”) into embedding vectors
  3. Computes similarity scores between the image and each text embedding
  4. Selects the highest scoring label as the prediction
Key Insight: Classification becomes a similarity search problem in the joint embedding space, not a traditional softmax over learned weights.
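As a minimal illustration of the four steps, with hand-made unit vectors standing in for real CLIP embeddings, the whole procedure reduces to a dot-product search:

```python
import numpy as np

def normalize(v):
    # Project onto the unit sphere, as CLIP does with its embeddings
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

# Toy stand-ins for the text embeddings of three candidate labels
classnames = ["dog", "cat", "bird"]
text_embeds = np.eye(3)                    # one orthogonal unit vector per label

# Toy image embedding that points mostly in the "cat" direction
image_embed = normalize([0.1, 0.9, 0.2])

# Steps 3-4: similarity scores, then pick the best-scoring label
scores = image_embed @ text_embeds.T
print(classnames[int(np.argmax(scores))])  # cat
```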

How It Works

Step 1: Prepare Text Prompts

Convert class names into descriptive text prompts using templates:
classnames = ["dog", "cat", "bird"]
templates = [
    "a photo of a {}.",
    "a picture of a {}.",
    "an image of a {}.",
]

# Generate prompts
prompts = [
    "a photo of a dog.", "a picture of a dog.", "an image of a dog.",
    "a photo of a cat.", "a picture of a cat.", "an image of a cat.",
    "a photo of a bird.", "a picture of a bird.", "an image of a bird.",
]
Why templates? Context matters! “a photo of a dog” provides more semantic information than just “dog”, leading to better embeddings.
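The nine prompts above need not be written out by hand; a nested comprehension over classes and templates produces the same list, grouped by class:

```python
classnames = ["dog", "cat", "bird"]
templates = [
    "a photo of a {}.",
    "a picture of a {}.",
    "an image of a {}.",
]

# One prompt per (class, template) pair, grouped by class
prompts = [t.format(c) for c in classnames for t in templates]

print(len(prompts))   # 9 prompts = 3 classes x 3 templates
print(prompts[0])     # a photo of a dog.
```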

Step 2: Build Zero-Shot Classifier Weights

From src/open_clip/zero_shot_classifier.py:21-68:
def build_zero_shot_classifier(
        model,
        tokenizer,
        classnames: Sequence[str],
        templates: Sequence[Union[Callable, str]],
        num_classes_per_batch: Optional[int] = 10,
        device: Union[str, torch.device] = 'cpu',
        use_tqdm: bool = False,
):
    """ Build zero-shot classifier weights by iterating over class names in batches """
    
    def _process_batch(batch_classnames):
        num_batch_classes = len(batch_classnames)
        
        # Generate all text prompts for this batch
        texts = [template.format(c) if use_format else template(c) 
                 for c in batch_classnames for template in templates]
        
        # Tokenize and encode
        texts = tokenizer(texts).to(device)
        class_embeddings = model.encode_text(texts, normalize=True)
        
        # Average embeddings across templates
        class_embeddings = class_embeddings.reshape(
            num_batch_classes, num_templates, -1
        ).mean(dim=1)
        
        # Re-normalize after averaging
        class_embeddings = class_embeddings / class_embeddings.norm(dim=1, keepdim=True)
        class_embeddings = class_embeddings.T  # Shape: [embed_dim, num_batch_classes]
        return class_embeddings

    with torch.no_grad():
        if num_classes_per_batch:
            batched_embeds = [_process_batch(batch) 
                            for batch in batched(classnames, num_classes_per_batch)]
            zeroshot_weights = torch.cat(batched_embeds, dim=1)
        else:
            zeroshot_weights = _process_batch(classnames)
    
    return zeroshot_weights
Key steps:
  1. Generate prompts for each class using multiple templates
  2. Encode all prompts to get text embeddings
  3. Average embeddings across templates for each class (ensemble)
  4. Normalize to unit length
  5. Transpose to shape [embed_dim, num_classes]
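The averaging-then-renormalizing steps can be checked with a toy sketch, where random unit vectors stand in for the output of model.encode_text:

```python
import numpy as np

num_classes, num_templates, embed_dim = 3, 4, 8
rng = np.random.default_rng(0)

# Stand-in for encoded prompts: one unit vector per (class, template) pair
embeds = rng.normal(size=(num_classes * num_templates, embed_dim))
embeds /= np.linalg.norm(embeds, axis=1, keepdims=True)

# Step 3: average across templates for each class (template ensemble)
class_embeds = embeds.reshape(num_classes, num_templates, -1).mean(axis=1)

# Averaging distinct unit vectors shrinks the norm below 1, hence step 4
norms = np.linalg.norm(class_embeds, axis=1, keepdims=True)
print(bool(np.all(norms < 1.0)))   # True
class_embeds /= norms

# Step 5: transpose into classifier weights of shape [embed_dim, num_classes]
classifier = class_embeds.T
print(classifier.shape)            # (8, 3)
```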

Step 3: Classify Images

From src/open_clip_train/zero_shot.py:17-42:
def run(model, classifier, dataloader, args):
    """ Run zero-shot classification on a dataset """
    device = torch.device(args.device)
    autocast = get_autocast(args.precision, device_type=device.type)
    
    with torch.inference_mode():
        top1, top5, n = 0., 0., 0.
        for images, target in tqdm(dataloader, unit_scale=args.batch_size):
            images = images.to(device=device)
            target = target.to(device)
            
            with autocast():
                # Encode image
                output = model(image=images)
                image_features = output['image_features'] if isinstance(output, dict) else output[0]
                
                # Compute similarity with classifier weights
                logits = 100. * image_features @ classifier
            
            # Measure accuracy
            acc1, acc5 = accuracy(logits, target, topk=(1, 5))
            top1 += acc1
            top5 += acc5
            n += images.size(0)
    
    return top1 / n, top5 / n
Classification process:
  1. Encode image → normalized embedding vector
  2. Matrix multiply with classifier weights: logits = image_features @ zeroshot_weights
  3. Scale by 100 (temperature scaling)
  4. Argmax to get predicted class
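In terms of shapes, the whole classification step is one matrix multiply; a sketch with random stand-in features:

```python
import numpy as np

B, D, C = 2, 8, 5                  # batch size, embed dim, number of classes
rng = np.random.default_rng(0)

# Normalized image features [B, D] and classifier weights [D, C]
image_features = rng.normal(size=(B, D))
image_features /= np.linalg.norm(image_features, axis=1, keepdims=True)
classifier = rng.normal(size=(D, C))
classifier /= np.linalg.norm(classifier, axis=0, keepdims=True)

logits = 100.0 * image_features @ classifier   # scaled cosine similarities, [B, C]
preds = logits.argmax(axis=1)                  # one predicted class per image

print(logits.shape, preds.shape)               # (2, 5) (2,)
```

Since both factors are unit vectors, every logit is bounded by the scale factor: |logit| ≤ 100.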

Temperature Scaling and Similarity Computation

Cosine Similarity

Since both image and text embeddings are L2-normalized, their dot product equals cosine similarity:
similarity = image_features @ text_features.T
           = ||image|| * ||text|| * cos(θ)
           = 1 * 1 * cos(θ)  [since normalized]
           = cos(θ)
Values range from -1 (opposite) to +1 (identical).
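This identity is easy to verify numerically:

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# Cosine similarity from the definition
cos_theta = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# After L2 normalization, the plain dot product gives the same value
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)

print(round(float(cos_theta), 4), round(float(a_hat @ b_hat), 4))  # 0.96 0.96
```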

Temperature Scaling

The scaling factor (100.0 in the example) controls prediction confidence:
logits = logit_scale * image_features @ classifier
  • Larger logit_scale → sharper probability distribution, more confident predictions
  • Smaller logit_scale → softer distribution, less confident predictions
Note that this multiplier is an inverse softmax temperature: scaling the logits by 100 is equivalent to dividing by a softmax temperature of 0.01.
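The effect is easy to see on raw cosine similarities, which typically cluster in a narrow range (the values below are chosen for illustration):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax
    e = np.exp(x - x.max())
    return e / e.sum()

sims = np.array([0.28, 0.25, 0.20])   # raw cosine similarities: close together

print(np.round(softmax(sims), 3))           # unscaled: nearly uniform
print(np.round(softmax(100.0 * sims), 3))   # scaled by 100: the top class dominates
```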
From the CLIP model (src/open_clip/model.py:274-298):
class CLIP(nn.Module):
    def __init__(
            self,
            embed_dim: int,
            ...
            init_logit_scale: float = np.log(1 / 0.07),  # ≈ 2.66
            ...
    ):
        self.logit_scale = nn.Parameter(torch.ones([]) * init_logit_scale)
During training, logit_scale is learned (and clamped so that its exponential stays at most 100). At inference:
scale = self.logit_scale.exp()  # exp(2.66) ≈ 14.3 at init; typically ~100 after training
logits = scale * image_features @ text_features.T

Softmax Probabilities

To get class probabilities:
probs = logits.softmax(dim=-1)
# probs[i] = probability that image belongs to class i

Real Example from Codebase

ImageNet Zero-Shot Evaluation

From src/open_clip_train/zero_shot.py:45-86:
def zero_shot_eval(model, data, epoch, args, tokenizer=None):
    """ Evaluate zero-shot ImageNet classification during training """
    
    if tokenizer is None:
        tokenizer = get_tokenizer(args.model)

    logging.info('Building zero-shot classifier')
    device = torch.device(args.device)
    
    with autocast():
        # Build classifier using ImageNet-1K class names
        classifier = build_zero_shot_classifier(
            model,
            tokenizer=tokenizer,
            classnames=IMAGENET_CLASSNAMES,  # 1000 classes
            templates=OPENAI_IMAGENET_TEMPLATES,  # 80 templates
            num_classes_per_batch=10,
            device=device,
            use_tqdm=True,
        )

    logging.info('Using classifier')
    results = {}
    
    if 'imagenet-val' in data:
        top1, top5 = run(model, classifier, data['imagenet-val'].dataloader, args)
        results['imagenet-zeroshot-val-top1'] = top1
        results['imagenet-zeroshot-val-top5'] = top5
    
    return results
What’s happening:
  1. Model has never seen ImageNet classification task during training
  2. Build classifier from 1000 ImageNet class names using 80 prompt templates
  3. Evaluate on ImageNet validation set
  4. Achieve competitive accuracy without task-specific fine-tuning!

OpenAI’s ImageNet Templates

Used in the original CLIP paper:
OPENAI_IMAGENET_TEMPLATES = [
    'a bad photo of a {}.',
    'a photo of many {}.',
    'a sculpture of a {}.',
    'a photo of the hard to see {}.',
    'a low resolution photo of the {}.',
    'a rendering of a {}.',
    'graffiti of a {}.',
    'a bad photo of the {}.',
    'a cropped photo of the {}.',
    # ... (80 templates in total)
]
Multiple templates help capture diverse visual contexts.

Practical Usage Example

Custom Classification

Classify an image into custom categories:
import torch
import open_clip
from PIL import Image

# Load model
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32', pretrained='laion2b_s34b_b79k'
)
model.eval()
tokenizer = open_clip.get_tokenizer('ViT-B-32')

# Define custom classes
classnames = ['dog', 'cat', 'car', 'airplane']
templates = ['a photo of a {}.']

# Build zero-shot classifier
with torch.no_grad():
    text_prompts = [template.format(c) for c in classnames for template in templates]
    text_tokens = tokenizer(text_prompts)
    text_features = model.encode_text(text_tokens)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    
    # Classifier weights: [embed_dim, num_classes]
    classifier = text_features.T

# Classify image
image = preprocess(Image.open('dog.jpg')).unsqueeze(0)
with torch.no_grad():
    image_features = model.encode_image(image)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    
    # Compute similarities
    logits = 100.0 * image_features @ classifier
    probs = logits.softmax(dim=-1)
    
    # Get prediction
    pred_idx = probs.argmax()
    print(f"Predicted: {classnames[pred_idx]} ({probs[0, pred_idx]:.2%})")
    # Example output (exact numbers vary by model and image): Predicted: dog (94.23%)

Zero-Shot vs Fine-Tuning

Zero-Shot (No Fine-Tuning)

Advantages:
  • Works on any categories without training data
  • Instant deployment to new tasks
  • No overfitting to specific datasets
  • Leverages large-scale pretraining
Limitations:
  • Lower accuracy than fine-tuned models on specific tasks
  • Sensitive to prompt engineering
  • May struggle with fine-grained distinctions

With Fine-Tuning

Advantages:
  • Higher accuracy on target task
  • Adapts to specific visual distributions
  • Can learn task-specific features
Limitations:
  • Requires labeled training data
  • May lose zero-shot generalization
  • Risk of overfitting
For fine-tuning CLIP, see the WiSE-FT repository, which implements robust fine-tuning techniques.

Advanced Techniques

Prompt Engineering

Better prompts → better performance:
# Generic
prompts = ["dog", "cat"]

# Better: add context
prompts = ["a photo of a dog", "a photo of a cat"]

# Best: diverse templates
templates = [
    "a photo of a {}",
    "a picture of a {}", 
    "an image showing a {}",
    "a rendering of a {}",
]

Ensemble Multiple Templates

Averaging embeddings across templates improves robustness (already done in build_zero_shot_classifier).

Hierarchical Classification

For fine-grained tasks, use two-stage classification:
  1. Coarse categories: “bird”, “mammal”, “vehicle”
  2. Fine-grained: “golden retriever”, “labrador”, “poodle”
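A sketch of the two-stage idea, using hypothetical hand-made embeddings in place of model.encode_text (the category names and toy vectors below are illustrative, not part of the OpenCLIP API):

```python
import numpy as np

def normalize(v):
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

# Hypothetical stand-in embeddings: coarse categories on orthogonal axes,
# fine classes perturbed around their parent category's axis
coarse_embeds = {"bird": [1, 0, 0], "mammal": [0, 1, 0], "vehicle": [0, 0, 1]}
fine_embeds = {
    "bird": {"sparrow": [1, 0.1, 0], "eagle": [1, -0.1, 0]},
    "mammal": {"golden retriever": [0.1, 1, 0], "labrador": [-0.1, 1, 0]},
    "vehicle": {"car": [0, 0.1, 1], "truck": [0, -0.1, 1]},
}

def nearest(image_embed, label_embeds):
    # The same zero-shot step at either level: highest cosine similarity wins
    return max(label_embeds, key=lambda name: image_embed @ normalize(label_embeds[name]))

image_embed = normalize([-0.08, 1.0, 0.02])       # an image resembling a labrador

coarse = nearest(image_embed, coarse_embeds)      # stage 1 picks the coarse category
fine = nearest(image_embed, fine_embeds[coarse])  # stage 2 compares only within it
print(coarse, "->", fine)                         # mammal -> labrador
```

Stage 2 only ever compares classes within the chosen coarse category, which keeps fine-grained labels from competing with unrelated ones.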

Performance Benchmarks

From the README, OpenCLIP models achieve strong zero-shot ImageNet accuracy:
| Model | Training Data | Zero-Shot ImageNet Acc |
| --- | --- | --- |
| ViT-B-16 | DataComp-1B | 73.5% |
| ViT-L-14 | DataComp-1B | 79.2% |
| ViT-bigG-14 | LAION-2B | 80.1% |
| ViT-SO400M (SigLIP) | WebLI | 82.0% |
| ViT-gopt-16 (SigLIP2) | WebLI | 85.0% |
Without any ImageNet-specific training!

Key Takeaways

  1. Zero-shot = Similarity search: Classification as nearest neighbor in embedding space
  2. Prompts matter: “a photo of a dog” > “dog”
  3. Template ensembling: Average across multiple prompts for robustness
  4. Temperature scaling: Controls prediction sharpness
  5. No training data needed: Instant deployment to new categories
  6. Trade-off: Convenience vs accuracy (compared to fine-tuning)

Reference Files

  • src/open_clip/zero_shot_classifier.py - Classifier building logic
  • src/open_clip_train/zero_shot.py - Zero-shot evaluation during training
  • src/open_clip/zero_shot_metadata.py - ImageNet classnames and templates
