
Evaluation Metrics

OpenCLIP uses several standard metrics to evaluate model performance on zero-shot classification and retrieval tasks.

Classification Metrics

Top-1 Accuracy

Top-1 accuracy is the primary metric for classification tasks. It measures the percentage of samples where the model’s highest-confidence prediction matches the ground truth label.
def accuracy(output, target, topk=(1,)):
    """Return the number of correct predictions for each k in topk."""
    # pred: [max_k, batch] -- top-k predicted class indices per sample, transposed
    pred = output.topk(max(topk), 1, True, True)[1].t()
    # correct: [max_k, batch] -- True where a prediction matches the target
    correct = pred.eq(target.view(1, -1).expand_as(pred))
    # Count samples with a match within their first k predictions
    return [float(correct[:k].reshape(-1).float().sum(0)) for k in topk]
Formula:
Top-1 Accuracy = (Number of correct predictions) / (Total number of samples)
Example: If a model correctly classifies 633 out of 1000 images, its top-1 accuracy is 63.3%.

Top-5 Accuracy

Top-5 accuracy is more lenient: it counts a prediction as correct if the ground truth label appears anywhere in the model's top 5 predictions. Formula:
Top-5 Accuracy = (Samples with correct label in top 5) / (Total number of samples)
Use Case: Top-5 is useful when:
  • Classes are visually similar (e.g., dog breeds)
  • The task has high inherent ambiguity
  • Comparing models that might have similar top-1 but different top-5 performance
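Because the `accuracy` helper above returns correct-prediction counts for each k, both metrics fall out of a single call. A minimal sketch on toy logits (the tensor values here are illustrative):

```python
import torch

def accuracy(output, target, topk=(1,)):
    """Return the number of correct predictions for each k in topk."""
    pred = output.topk(max(topk), 1, True, True)[1].t()
    correct = pred.eq(target.view(1, -1).expand_as(pred))
    return [float(correct[:k].reshape(-1).float().sum(0)) for k in topk]

# Toy batch: 4 samples, 6 classes (values chosen so ranks are unambiguous)
logits = torch.tensor([
    [0.10, 0.90, 0.05, 0.04, 0.03, 0.02],  # top-1 = class 1 (correct)
    [0.80, 0.10, 0.05, 0.04, 0.03, 0.02],  # top-1 wrong, true class in top 5
    [0.01, 0.02, 0.70, 0.20, 0.04, 0.03],  # top-1 = class 2 (correct)
    [0.01, 0.02, 0.03, 0.04, 0.10, 0.80],  # true class 0 not even in top 5
])
targets = torch.tensor([1, 1, 2, 0])

top1_count, top5_count = accuracy(logits, targets, topk=(1, 5))
top1 = top1_count / len(targets)  # 2/4 = 0.5
top5 = top5_count / len(targets)  # 3/4 = 0.75
```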

Per-Class vs. Overall Accuracy

OpenCLIP reports overall accuracy averaged across all samples. For class-imbalanced datasets, you might also want to compute per-class accuracy and take the mean.
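A minimal sketch of mean per-class accuracy (the helper name `mean_per_class_accuracy` is illustrative, not an OpenCLIP API):

```python
import torch

def mean_per_class_accuracy(preds, targets, num_classes):
    """Average each class's own accuracy; robust to class imbalance."""
    accs = []
    for c in range(num_classes):
        mask = targets == c
        if mask.any():  # skip classes absent from this split
            accs.append((preds[mask] == c).float().mean())
    return torch.stack(accs).mean().item()

# Imbalanced toy example: class 0 has 4 samples, class 1 has only 1
preds   = torch.tensor([0, 0, 0, 0, 0])
targets = torch.tensor([0, 0, 0, 0, 1])

overall = (preds == targets).float().mean().item()          # 4/5 = 0.8
per_class = mean_per_class_accuracy(preds, targets, 2)      # (1.0 + 0.0) / 2 = 0.5
```

The gap between the two numbers shows how overall accuracy can hide poor performance on rare classes.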

Zero-Shot Accuracy Computation

Zero-shot accuracy in OpenCLIP is computed as follows:

1. Text Classifier Construction

For each class, generate multiple text embeddings using prompt templates:
from open_clip import IMAGENET_CLASSNAMES, OPENAI_IMAGENET_TEMPLATES

# Example class: "golden retriever"
classname = "golden retriever"
templates = [
    "a photo of a {}.",
    "a picture of a {}.",
    "an image of a {}.",
    # ... 80 templates total
]

# Encode all templates for this class
text_features = []
for template in templates:
    text = template.format(classname)
    text_features.append(model.encode_text(tokenize(text)))

# Average the features
class_embedding = torch.mean(torch.stack(text_features), dim=0)
class_embedding /= class_embedding.norm()  # L2 normalize
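Repeating this for every class and stacking the results yields the `class_embeddings` matrix used in step 3. A sketch with random tensors standing in for `model.encode_text` outputs (this sketch also L2-normalizes each template embedding before averaging, as OpenCLIP's classifier builder does):

```python
import torch

# Illustrative sizes: 3 classes, 4 prompt templates, embedding dim 8
num_classes, num_templates, dim = 3, 4, 8

class_embeddings = []
for _ in range(num_classes):
    feats = torch.randn(num_templates, dim)            # stand-in for encode_text outputs
    feats = feats / feats.norm(dim=-1, keepdim=True)   # normalize each template embedding
    emb = feats.mean(dim=0)                            # average over templates
    class_embeddings.append(emb / emb.norm())          # renormalize the mean
class_embeddings = torch.stack(class_embeddings)       # [num_classes, dim]
```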

2. Image Encoding

Encode the test image:
image_features = model.encode_image(preprocess(image).unsqueeze(0))  # add batch dimension
image_features /= image_features.norm(dim=-1, keepdim=True)  # L2 normalize

3. Similarity Computation

Compute cosine similarity between image and all class embeddings:
logits = 100.0 * image_features @ class_embeddings.T
probs = logits.softmax(dim=-1)
predicted_class = probs.argmax()
The temperature scaling factor of 100.0 corresponds to CLIP's learned logit scale (exp(logit_scale), clamped to 100 during training) and sharpens the probability distribution.
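The effect is easy to see on a toy similarity vector: cosine similarities live in [-1, 1], so the unscaled softmax is nearly uniform, while scaling by 100 concentrates the mass on the top class. A small sketch:

```python
import torch

sims = torch.tensor([0.30, 0.25, 0.20])  # typical raw cosine similarities

unscaled = sims.softmax(dim=-1)          # nearly uniform, max prob ~0.35
scaled = (100.0 * sims).softmax(dim=-1)  # sharply peaked, max prob > 0.99
```

The ranking of classes is unchanged by scaling; only the confidence of the distribution changes.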

4. Accuracy Calculation

Compare predictions to ground truth:
top1_count = 0
top5_count = 0
total = 0

for image, label in dataloader:
    logits = compute_logits(model, image, class_embeddings)
    
    # Top-1
    pred = logits.argmax(dim=-1)
    top1_count += (pred == label).sum().item()
    
    # Top-5
    top5_preds = logits.topk(5, dim=-1)[1]
    top5_count += (top5_preds == label.unsqueeze(-1)).any(dim=-1).sum().item()
    
    total += len(label)

top1_accuracy = top1_count / total
top5_accuracy = top5_count / total

Retrieval Metrics

For image-text retrieval tasks (like Flickr30k and MSCOCO), OpenCLIP uses standard retrieval metrics:

Recall@K

Recall@K measures the percentage of queries where the correct item appears in the top K retrieved results. Formula:
Recall@K = (Queries with correct item in top K) / (Total queries)
Common values:
  • R@1: Strictest metric (correct item must be rank 1)
  • R@5: Correct item in top 5
  • R@10: Correct item in top 10
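Given a full query-by-item similarity matrix, Recall@K takes only a few lines. A sketch (the `recall_at_k` helper is illustrative, not an OpenCLIP API):

```python
import torch

def recall_at_k(similarity, ground_truth, k):
    """Fraction of queries whose correct item is among the top-k scores.

    similarity:   [num_queries, num_items] score matrix
    ground_truth: [num_queries] index of the correct item per query
    """
    top_k = similarity.topk(k, dim=1)[1]                    # [Q, K] item indices
    hits = (top_k == ground_truth.unsqueeze(1)).any(dim=1)  # [Q] bool
    return hits.float().mean().item()

# Toy: 3 queries over 4 items; correct items are 0, 1, 2
sim = torch.tensor([
    [0.9, 0.1, 0.2, 0.3],  # correct item ranked 1st
    [0.8, 0.7, 0.1, 0.2],  # correct item ranked 2nd
    [0.9, 0.8, 0.1, 0.7],  # correct item ranked 4th
])
gt = torch.tensor([0, 1, 2])

r1 = recall_at_k(sim, gt, 1)  # 1/3
r2 = recall_at_k(sim, gt, 2)  # 2/3
r4 = recall_at_k(sim, gt, 4)  # 1.0
```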

Image-to-Text Retrieval

Given an image, retrieve relevant text captions:
image_features = model.encode_image(images)  # [B, D]
text_features = model.encode_text(texts)      # [N, D]

# Compute similarities
similarity = image_features @ text_features.T  # [B, N]

# For each image, find top K texts
top_k_indices = similarity.topk(k=5, dim=1)[1]  # [B, K]

# Check if correct caption is in top K
recall_at_5 = (top_k_indices == ground_truth_idx.unsqueeze(1)).any(dim=1).float().mean()

Text-to-Image Retrieval

Given a text query, retrieve relevant images:
# Transpose similarity matrix
similarity = text_features @ image_features.T  # [N, B]

# For each text, find top K images
top_k_indices = similarity.topk(k=5, dim=1)[1]  # [N, K]

# Check if correct image is in top K
recall_at_5 = (top_k_indices == ground_truth_idx.unsqueeze(1)).any(dim=1).float().mean()

Mean Rank

Mean rank measures the average position of the correct item in the ranked list:
ranks = []
for query_idx in range(len(queries)):
    similarity_scores = compute_similarity(query_idx, all_items)
    rank = (similarity_scores.argsort(descending=True) == ground_truth[query_idx]).nonzero()[0]
    ranks.append(rank.item() + 1)  # 1-indexed

mean_rank = sum(ranks) / len(ranks)
Lower mean rank is better.
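When the full similarity matrix fits in memory, the per-query loop can also be vectorized. A sketch (the `mean_rank` helper is illustrative):

```python
import torch

def mean_rank(similarity, ground_truth):
    """Mean 1-indexed rank of the correct item across all queries."""
    # order[q] lists item indices for query q, best score first
    order = similarity.argsort(dim=1, descending=True)
    # Exactly one True per row; nonzero() returns (row, col) pairs in row
    # order, so the column index is the 0-indexed rank of the correct item
    ranks = (order == ground_truth.unsqueeze(1)).nonzero()[:, 1] + 1
    return ranks.float().mean().item()

sim = torch.tensor([[0.9, 0.1],   # correct item 1 is ranked 2nd
                    [0.2, 0.8]])  # correct item 1 is ranked 1st
gt = torch.tensor([1, 1])
rank = mean_rank(sim, gt)  # (2 + 1) / 2 = 1.5
```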

Aggregate Metrics

Average Performance Across Datasets

The “Average perf. on 38 datasets” metric in our results is computed as:
average_performance = sum(accuracy_per_dataset) / num_datasets
This provides a single number summarizing model performance across the diverse evaluation suite.

Weighted Average

Some benchmarks use weighted averages where larger datasets have more influence:
weighted_avg = sum(acc * size for acc, size in zip(accuracies, sizes)) / sum(sizes)
OpenCLIP typically reports unweighted averages to give equal importance to each dataset.

Logging Metrics

OpenCLIP automatically logs metrics to your configured logging backend during training.

TensorBoard

Enable TensorBoard logging:
python -m open_clip_train.main \
    --report-to tensorboard \
    --logs ./logs/tensorboard \
    # ... other args
View metrics:
tensorboard --logdir=logs/tensorboard/ --port=7777
Metrics logged:
  • train/loss: Training loss per step
  • train/learning_rate: Current learning rate
  • imagenet-zeroshot-val-top1: Zero-shot ImageNet top-1 accuracy
  • imagenet-zeroshot-val-top5: Zero-shot ImageNet top-5 accuracy

Weights & Biases (wandb)

Enable wandb logging:
python -m open_clip_train.main \
    --report-to wandb \
    --wandb-project-name my-clip-project \
    # ... other args
Metrics are automatically synced to your wandb dashboard with:
  • Real-time loss curves
  • Zero-shot accuracy over time
  • System metrics (GPU utilization, etc.)
For older runs (before PR #613), use the step variable instead of Step in wandb, as the latter was not properly set.

Custom Metrics

You can add custom metrics by modifying the training loop:
from open_clip_train.main import train_one_epoch

def custom_metrics(model, data, epoch):
    # Your custom evaluation
    custom_score = evaluate_custom_task(model, data)
    return {'custom_metric': custom_score}

# Metrics will be logged to your configured backend

Metric Interpretation

ImageNet Zero-Shot Accuracy

Accuracy Range    Model Quality
< 30%             Baseline / Random
30-50%            Early training / Small models
50-65%            Decent models (ViT-B scale)
65-75%            Strong models (ViT-L scale)
75-80%            State-of-the-art (ViT-H scale)
> 80%             Cutting-edge (ViT-G/SigLIP scale)

Top-1 vs Top-5 Gap

The gap between top-1 and top-5 accuracy indicates:
  • Small gap (< 15%): Model is confident and accurate
  • Large gap (> 25%): Model often has correct answer in top 5 but not top 1, suggesting uncertainty or ambiguous classes

Cross-Dataset Performance

Strong models should maintain performance across datasets:
  • Consistent: Good performance across all 38 datasets
  • Specialized: High performance on some datasets but lower on others
  • Overfit: High ImageNet but low on distribution shift datasets

Computing Your Own Metrics

Using CLIP Benchmark

from clip_benchmark.datasets.builder import build_dataset
from clip_benchmark.metrics import zeroshot_classification

# Load dataset
dataset = build_dataset('imagenet1k', root='/path/to/imagenet', split='val')

# Evaluate
metrics = zeroshot_classification.evaluate(
    model,
    dataset,
    tokenizer,
    batch_size=64,
    num_workers=4
)

print(f"Top-1: {metrics['acc1']:.2%}")
print(f"Top-5: {metrics['acc5']:.2%}")
print(f"Mean per-class: {metrics['mean_per_class_recall']:.2%}")

Custom Evaluation Loop

import torch
from tqdm import tqdm

def evaluate_zero_shot(model, dataloader, classifier):
    model.eval()
    top1_correct = 0
    top5_correct = 0
    total = 0
    
    with torch.no_grad():
        for images, targets in tqdm(dataloader):
            images = images.cuda()
            targets = targets.cuda()
            
            # Get image features
            image_features = model.encode_image(images)
            image_features /= image_features.norm(dim=-1, keepdim=True)
            
            # Compute logits
            logits = 100.0 * image_features @ classifier
            
            # Top-1
            pred = logits.argmax(dim=-1)
            top1_correct += (pred == targets).sum().item()
            
            # Top-5
            _, top5_pred = logits.topk(5, dim=-1)
            top5_correct += (top5_pred == targets.unsqueeze(-1)).any(dim=-1).sum().item()
            
            total += len(targets)
    
    top1_acc = 100.0 * top1_correct / total
    top5_acc = 100.0 * top5_correct / total
    
    return {'top1': top1_acc, 'top5': top5_acc}

Best Practices

Use Standard Metrics: Stick to top-1, top-5, and recall@K for comparability with other work.
Report Multiple Datasets: ImageNet alone doesn’t tell the full story. Report performance on distribution shift and specialized datasets.
Log Frequently: Use --zeroshot-frequency 1 to track metrics every epoch during training.
Avoid Test Set Leakage: Always evaluate on validation or test sets that weren’t seen during training.
