Zero-Shot Evaluation

OpenCLIP supports zero-shot evaluation, where models are tested on classification tasks without any task-specific fine-tuning. This is one of CLIP’s key capabilities.

Zero-Shot Evaluation During Training

You can run zero-shot ImageNet evaluation automatically during training using the --zeroshot-frequency flag.

Setup

To enable zero-shot evaluation during training, you need:
  1. ImageNet validation set: Path to the validation split (not training set)
  2. Zeroshot frequency: How often to run evaluation (in epochs)

Command Example

python -m open_clip_train.main \
    --save-frequency 1 \
    --zeroshot-frequency 1 \
    --report-to tensorboard \
    --train-data="/path/to/train_data.csv" \
    --val-data="/path/to/validation_data.csv" \
    --csv-img-key filepath \
    --csv-caption-key title \
    --imagenet-val=/path/to/imagenet/root/val/ \
    --warmup 10000 \
    --batch-size=128 \
    --lr=1e-3 \
    --wd=0.1 \
    --epochs=30 \
    --workers=8 \
    --model RN50
The --imagenet-val path should point to the validation set of ImageNet, not the training set. The validation folder must contain one subfolder per class; if it doesn't, use the standard valprep.sh script to organize it.
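Once organized, the directory passed to --imagenet-val should look roughly like this (class folders are WordNet synset IDs; the file names below are illustrative):

val/
    n01440764/
        ILSVRC2012_val_00000293.JPEG
        ...
    n01443537/
        ...
    (1,000 class folders in total)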

Parameters

  • --zeroshot-frequency N: Run zero-shot evaluation every N epochs. Set to 0 to disable.
  • --imagenet-val PATH: Path to ImageNet validation set for zero-shot evaluation
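  • --imagenet-v2 PATH: Path to the ImageNet-V2 test set to additionally evaluate on (optional)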

How Zero-Shot Evaluation Works

The zero-shot evaluation process in OpenCLIP follows these steps:

1. Building the Zero-Shot Classifier

The classifier is built by encoding text prompts for all ImageNet classes:
from open_clip import build_zero_shot_classifier, IMAGENET_CLASSNAMES, OPENAI_IMAGENET_TEMPLATES

# Build classifier using class names and prompt templates
classifier = build_zero_shot_classifier(
    model,
    tokenizer=tokenizer,
    classnames=IMAGENET_CLASSNAMES,
    templates=OPENAI_IMAGENET_TEMPLATES,
    num_classes_per_batch=10,
    device=device,
    use_tqdm=True,
)
The classifier uses OpenAI's prompt templates (e.g., "a photo of a {class}", "a drawing of a {class}", etc.) to create multiple text descriptions for each class. The text embeddings for all templates of a class are averaged and re-normalized to form that class's classifier weight.
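A minimal sketch of that ensembling step, using a hypothetical three-class subset and two templates instead of the full ImageNet label set:

import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k')
tokenizer = open_clip.get_tokenizer('ViT-B-32')
model.eval()

classnames = ['goldfish', 'tabby cat', 'golden retriever']   # illustrative subset
templates = [lambda c: f'a photo of a {c}.', lambda c: f'a drawing of a {c}.']

with torch.no_grad():
    weights = []
    for name in classnames:
        texts = tokenizer([t(name) for t in templates])             # tokenize every prompt for this class
        embeddings = model.encode_text(texts, normalize=True)       # (num_templates, embed_dim)
        class_embedding = embeddings.mean(dim=0)                     # average over templates
        class_embedding = class_embedding / class_embedding.norm()  # re-normalize the ensemble
        weights.append(class_embedding)
    classifier = torch.stack(weights, dim=1)                         # (embed_dim, num_classes)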

2. Computing Image Features

For each image in the validation set:
# Extract image features (the model forward returns L2-normalized features)
with torch.no_grad():
    output = model(image=images)
    image_features = output['image_features']

    # Cosine similarity with the text classifier of shape (embed_dim, num_classes), scaled by 100
    logits = 100. * image_features @ classifier

3. Computing Accuracy

Accuracy is computed using top-1 and top-5 metrics:
def accuracy(output, target, topk=(1, 5)):
    pred = output.topk(max(topk), 1, True, True)[1].t()
    correct = pred.eq(target.view(1, -1).expand_as(pred))
    return [float(correct[:k].reshape(-1).float().sum(0)) for k in topk]
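Putting steps 2 and 3 together, the evaluation loop looks roughly like the run() function in open_clip_train/zero_shot.py. The sketch below is illustrative: the helper name evaluate_zero_shot is made up, and the dataloader and device are assumed to come from the surrounding training setup.

import torch

def evaluate_zero_shot(model, classifier, dataloader, device):
    top1, top5, n = 0.0, 0.0, 0.0
    with torch.no_grad():
        for images, target in dataloader:
            images, target = images.to(device), target.to(device)
            output = model(image=images)
            image_features = output['image_features'] if isinstance(output, dict) else output[0]
            logits = 100. * image_features @ classifier
            acc1, acc5 = accuracy(logits, target, topk=(1, 5))
            top1 += acc1
            top5 += acc5
            n += images.size(0)
    # Fractions in [0, 1], e.g. 0.6332 for 63.32% top-1
    return top1 / n, top5 / n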

Evaluating Pre-trained Checkpoints

Local Checkpoint

Evaluate a local checkpoint on ImageNet:
python -m open_clip_train.main \
    --imagenet-val /path/to/imagenet/validation \
    --model RN101 \
    --pretrained /path/to/checkpoints/epoch_K.pt

Hosted Checkpoint

Evaluate a pre-trained model from the model zoo:
python -m open_clip_train.main \
    --imagenet-val /path/to/imagenet/validation \
    --model ViT-B-32-quickgelu \
    --pretrained laion400m_e32
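To see which model/pretrained-tag combinations are available in the model zoo, you can list them from Python:

import open_clip

# Prints (model_name, pretrained_tag) pairs, e.g. ('ViT-B-32-quickgelu', 'laion400m_e32')
for model_name, pretrained_tag in open_clip.list_pretrained():
    print(model_name, pretrained_tag)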

Hugging Face Checkpoint

You can also evaluate checkpoints downloaded from the Hugging Face Hub (e.g., an open_clip_pytorch_model.bin file):
python -m open_clip_train.main \
    --imagenet-val /path/to/imagenet/validation \
    --model ViT-L-14 \
    --pretrained /path/to/open_clip_pytorch_model.bin

Systematic Evaluation with CLIP_benchmark

For comprehensive evaluation across multiple datasets, we recommend using CLIP_benchmark:
# Install CLIP_benchmark
pip install clip-benchmark

# Run evaluation on multiple datasets (names are space-separated;
# the {dataset} placeholder writes one result file per dataset)
clip_benchmark eval --model ViT-B-32 --pretrained laion2b_s34b_b79k \
    --dataset imagenet1k cifar10 cifar100 \
    --output "benchmark_{dataset}.json"
CLIP_benchmark supports:
  • 40+ datasets for zero-shot classification
  • Retrieval tasks (image-to-text and text-to-image)
  • Multiple languages for multilingual models
  • Standardized metrics for fair comparison

Evaluation Metrics

During zero-shot evaluation, the following metrics are logged:
  • imagenet-zeroshot-val-top1: Top-1 accuracy on the ImageNet validation set
  • imagenet-zeroshot-val-top5: Top-5 accuracy on the ImageNet validation set
  • imagenetv2-zeroshot-val-top1: Top-1 accuracy on ImageNet-V2 (if available)
  • imagenetv2-zeroshot-val-top5: Top-5 accuracy on ImageNet-V2 (if available)

Example Results During Training

When training with zero-shot evaluation enabled, you’ll see output like:
Starting zero-shot imagenet.
Building zero-shot classifier
Using classifier
Finished zero-shot imagenet.
imagenet-zeroshot-val-top1: 0.6332
imagenet-zeroshot-val-top5: 0.8758
These metrics are automatically logged to your chosen logging backend (TensorBoard, Weights & Biases, etc.).

Best Practices

Frequency: For large-scale training, evaluate every 1-5 epochs. Running zero-shot evaluation over the 50,000 ImageNet validation images adds little overhead relative to a training epoch on a large dataset.
Validation Set: Always use the ImageNet validation set, not the training set. The validation set has 50,000 images across 1,000 classes.
Data Loading: The validation folder should be organized with one subfolder per class. Use the valprep script if needed.
Memory: Zero-shot evaluation requires storing the text classifier in memory. For models with large text encoders, this may require additional GPU memory.
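For a sense of scale, the classifier itself is small: with a 768-dimensional embedding (ViT-L-14), 1,000 classes in fp32 occupy roughly 768 × 1,000 × 4 bytes ≈ 3 MB. The larger transient cost is encoding all class prompts (1,000 classes × 80 templates) through the text encoder, which build_zero_shot_classifier keeps bounded by processing num_classes_per_batch classes at a time.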

Implementation Details

Source Code

The zero-shot evaluation implementation is in src/open_clip_train/zero_shot.py. Key functions:
  • zero_shot_eval(): Main evaluation function called during training
  • run(): Runs inference on the validation set
  • accuracy(): Computes top-k accuracy
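For reference, the training loop's evaluate step calls into this module roughly as follows (a sketch based on the current source; argument names may differ slightly between versions):

from open_clip_train.zero_shot import zero_shot_eval

# 'data' is the dict of dataloaders built for training; zero_shot_eval only runs
# when it contains an 'imagenet-val' (or 'imagenet-v2') entry and the current
# epoch matches --zeroshot-frequency.
metrics = {}
metrics.update(zero_shot_eval(model, data, epoch, args, tokenizer=tokenizer))
# e.g. {'imagenet-zeroshot-val-top1': 0.6332, 'imagenet-zeroshot-val-top5': 0.8758}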

Supported Datasets

By default, zero-shot evaluation supports:
  • ImageNet-1k (ILSVRC2012 validation set)
  • ImageNet-V2 (matched frequency variant)
For evaluation on additional datasets, use CLIP_benchmark.
