
Evaluation Metrics

OpenCLIP uses several standard metrics to evaluate model performance on zero-shot classification and retrieval tasks.

Classification Metrics

Top-1 Accuracy

Top-1 accuracy is the primary metric for classification tasks. It measures the percentage of samples where the model’s highest-confidence prediction matches the ground truth label.
def accuracy(output, target, topk=(1,)):
    """Return the number of correct predictions for each k in topk."""
    # pred: [max_k, batch] -- top-k predicted class indices per sample, transposed
    pred = output.topk(max(topk), 1, True, True)[1].t()
    # correct: [max_k, batch] -- True where a prediction matches the target
    correct = pred.eq(target.view(1, -1).expand_as(pred))
    # Count samples with a match within their first k predictions
    return [float(correct[:k].reshape(-1).float().sum(0)) for k in topk]
Formula:
Top-1 Accuracy = (Number of correct predictions) / (Total number of samples)
Example: If a model correctly classifies 633 out of 1000 images, its top-1 accuracy is 63.3%.

Top-5 Accuracy

Top-5 accuracy is more lenient: it counts a prediction as correct if the ground truth label appears anywhere in the model's top 5 predictions. Formula:
Top-5 Accuracy = (Samples with correct label in top 5) / (Total number of samples)
Use Case: Top-5 is useful when:
  • Classes are visually similar (e.g., dog breeds)
  • The task has high inherent ambiguity
  • Comparing models that might have similar top-1 but different top-5 performance
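Because the `accuracy` helper above returns correct-prediction counts for each k, both metrics fall out of a single call. A minimal sketch on toy logits (the tensor values here are illustrative):

```python
import torch

def accuracy(output, target, topk=(1,)):
    """Return the number of correct predictions for each k in topk."""
    pred = output.topk(max(topk), 1, True, True)[1].t()
    correct = pred.eq(target.view(1, -1).expand_as(pred))
    return [float(correct[:k].reshape(-1).float().sum(0)) for k in topk]

# Toy batch: 4 samples, 6 classes (values chosen so ranks are unambiguous)
logits = torch.tensor([
    [0.10, 0.90, 0.05, 0.04, 0.03, 0.02],  # top-1 = class 1 (correct)
    [0.80, 0.10, 0.05, 0.04, 0.03, 0.02],  # top-1 wrong, true class in top 5
    [0.01, 0.02, 0.70, 0.20, 0.04, 0.03],  # top-1 = class 2 (correct)
    [0.01, 0.02, 0.03, 0.04, 0.10, 0.80],  # true class 0 not even in top 5
])
targets = torch.tensor([1, 1, 2, 0])

top1_count, top5_count = accuracy(logits, targets, topk=(1, 5))
top1 = top1_count / len(targets)  # 2/4 = 0.5
top5 = top5_count / len(targets)  # 3/4 = 0.75
```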

Per-Class vs. Overall Accuracy

OpenCLIP reports overall accuracy averaged across all samples. For class-imbalanced datasets, you might also want to compute per-class accuracy and take the mean.
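A minimal sketch of mean per-class accuracy (the helper name `mean_per_class_accuracy` is illustrative, not an OpenCLIP API):

```python
import torch

def mean_per_class_accuracy(preds, targets, num_classes):
    """Average each class's own accuracy; robust to class imbalance."""
    accs = []
    for c in range(num_classes):
        mask = targets == c
        if mask.any():  # skip classes absent from this split
            accs.append((preds[mask] == c).float().mean())
    return torch.stack(accs).mean().item()

# Imbalanced toy example: class 0 has 4 samples, class 1 has only 1
preds   = torch.tensor([0, 0, 0, 0, 0])
targets = torch.tensor([0, 0, 0, 0, 1])

overall = (preds == targets).float().mean().item()          # 4/5 = 0.8
per_class = mean_per_class_accuracy(preds, targets, 2)      # (1.0 + 0.0) / 2 = 0.5
```

The gap between the two numbers shows how overall accuracy can hide poor performance on rare classes.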

Zero-Shot Accuracy Computation

Zero-shot accuracy in OpenCLIP is computed as follows:

1. Text Classifier Construction

For each class, generate multiple text embeddings using prompt templates:
from open_clip import IMAGENET_CLASSNAMES, OPENAI_IMAGENET_TEMPLATES

# Example class: "golden retriever"
classname = "golden retriever"
templates = [
    "a photo of a {}.",
    "a picture of a {}.",
    "an image of a {}.",
    # ... 80 templates total
]

# Encode all templates for this class
text_features = []
for template in templates:
    text = template.format(classname)
    text_features.append(model.encode_text(tokenize(text)))

# Average the features
class_embedding = torch.mean(torch.stack(text_features), dim=0)
class_embedding /= class_embedding.norm()  # L2 normalize
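Repeating this for every class and stacking the results yields the `class_embeddings` matrix used in step 3. A sketch with random tensors standing in for `model.encode_text` outputs (this sketch also L2-normalizes each template embedding before averaging, as OpenCLIP's classifier builder does):

```python
import torch

# Illustrative sizes: 3 classes, 4 prompt templates, embedding dim 8
num_classes, num_templates, dim = 3, 4, 8

class_embeddings = []
for _ in range(num_classes):
    feats = torch.randn(num_templates, dim)            # stand-in for encode_text outputs
    feats = feats / feats.norm(dim=-1, keepdim=True)   # normalize each template embedding
    emb = feats.mean(dim=0)                            # average over templates
    class_embeddings.append(emb / emb.norm())          # renormalize the mean
class_embeddings = torch.stack(class_embeddings)       # [num_classes, dim]
```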

2. Image Encoding

Encode the test image:
image_features = model.encode_image(preprocess(image).unsqueeze(0))  # add batch dimension
image_features /= image_features.norm(dim=-1, keepdim=True)  # L2 normalize

3. Similarity Computation

Compute cosine similarity between image and all class embeddings:
logits = 100.0 * image_features @ class_embeddings.T
probs = logits.softmax(dim=-1)
predicted_class = probs.argmax()
The temperature scaling factor of 100.0 corresponds to CLIP's learned logit scale (exp(logit_scale), clamped to 100 during training) and sharpens the probability distribution.
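The effect is easy to see on a toy similarity vector: cosine similarities live in [-1, 1], so the unscaled softmax is nearly uniform, while scaling by 100 concentrates the mass on the top class. A small sketch:

```python
import torch

sims = torch.tensor([0.30, 0.25, 0.20])  # typical raw cosine similarities

unscaled = sims.softmax(dim=-1)          # nearly uniform, max prob ~0.35
scaled = (100.0 * sims).softmax(dim=-1)  # sharply peaked, max prob > 0.99
```

The ranking of classes is unchanged by scaling; only the confidence of the distribution changes.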

4. Accuracy Calculation

Compare predictions to ground truth:
top1_count = 0
top5_count = 0
total = 0

for image, label in dataloader:
    logits = compute_logits(model, image, class_embeddings)
    
    # Top-1
    pred = logits.argmax(dim=-1)
    top1_count += (pred == label).sum().item()
    
    # Top-5
    top5_preds = logits.topk(5, dim=-1)[1]
    top5_count += (top5_preds == label.unsqueeze(-1)).any(dim=-1).sum().item()
    
    total += len(label)

top1_accuracy = top1_count / total
top5_accuracy = top5_count / total

Retrieval Metrics

For image-text retrieval tasks (like Flickr30k and MSCOCO), OpenCLIP uses standard retrieval metrics:

Recall@K

Recall@K measures the percentage of queries where the correct item appears in the top K retrieved results. Formula:
Recall@K = (Queries with correct item in top K) / (Total queries)
Common values:
  • R@1: Strictest metric (correct item must be rank 1)
  • R@5: Correct item in top 5
  • R@10: Correct item in top 10
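Given a full query-by-item similarity matrix, Recall@K takes only a few lines. A sketch (the `recall_at_k` helper is illustrative, not an OpenCLIP API):

```python
import torch

def recall_at_k(similarity, ground_truth, k):
    """Fraction of queries whose correct item is among the top-k scores.

    similarity:   [num_queries, num_items] score matrix
    ground_truth: [num_queries] index of the correct item per query
    """
    top_k = similarity.topk(k, dim=1)[1]                    # [Q, K] item indices
    hits = (top_k == ground_truth.unsqueeze(1)).any(dim=1)  # [Q] bool
    return hits.float().mean().item()

# Toy: 3 queries over 4 items; correct items are 0, 1, 2
sim = torch.tensor([
    [0.9, 0.1, 0.2, 0.3],  # correct item ranked 1st
    [0.8, 0.7, 0.1, 0.2],  # correct item ranked 2nd
    [0.9, 0.8, 0.1, 0.7],  # correct item ranked 4th
])
gt = torch.tensor([0, 1, 2])

r1 = recall_at_k(sim, gt, 1)  # 1/3
r2 = recall_at_k(sim, gt, 2)  # 2/3
r4 = recall_at_k(sim, gt, 4)  # 1.0
```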

Image-to-Text Retrieval

Given an image, retrieve relevant text captions:
image_features = model.encode_image(images)  # [B, D]
text_features = model.encode_text(texts)      # [N, D]

# Compute similarities
similarity = image_features @ text_features.T  # [B, N]

# For each image, find top K texts
top_k_indices = similarity.topk(k=5, dim=1)[1]  # [B, K]

# Check if correct caption is in top K
recall_at_5 = (top_k_indices == ground_truth_idx.unsqueeze(1)).any(dim=1).float().mean()

Text-to-Image Retrieval

Given a text query, retrieve relevant images:
# Transpose similarity matrix
similarity = text_features @ image_features.T  # [N, B]

# For each text, find top K images
top_k_indices = similarity.topk(k=5, dim=1)[1]  # [N, K]

# Check if correct image is in top K
recall_at_5 = (top_k_indices == ground_truth_idx.unsqueeze(1)).any(dim=1).float().mean()

Mean Rank

Mean rank measures the average position of the correct item in the ranked list:
ranks = []
for query_idx in range(len(queries)):
    similarity_scores = compute_similarity(query_idx, all_items)
    rank = (similarity_scores.argsort(descending=True) == ground_truth[query_idx]).nonzero()[0]
    ranks.append(rank.item() + 1)  # 1-indexed

mean_rank = sum(ranks) / len(ranks)
Lower mean rank is better.
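When the full similarity matrix fits in memory, the per-query loop can also be vectorized. A sketch (the `mean_rank` helper is illustrative):

```python
import torch

def mean_rank(similarity, ground_truth):
    """Mean 1-indexed rank of the correct item across all queries."""
    # order[q] lists item indices for query q, best score first
    order = similarity.argsort(dim=1, descending=True)
    # Exactly one True per row; nonzero() returns (row, col) pairs in row
    # order, so the column index is the 0-indexed rank of the correct item
    ranks = (order == ground_truth.unsqueeze(1)).nonzero()[:, 1] + 1
    return ranks.float().mean().item()

sim = torch.tensor([[0.9, 0.1],   # correct item 1 is ranked 2nd
                    [0.2, 0.8]])  # correct item 1 is ranked 1st
gt = torch.tensor([1, 1])
rank = mean_rank(sim, gt)  # (2 + 1) / 2 = 1.5
```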

Aggregate Metrics

Average Performance Across Datasets

The “Average perf. on 38 datasets” metric in our results is computed as:
average_performance = sum(accuracy_per_dataset) / num_datasets
This provides a single number summarizing model performance across the diverse evaluation suite.

Weighted Average

Some benchmarks use weighted averages where larger datasets have more influence:
weighted_avg = sum(acc * size for acc, size in zip(accuracies, sizes)) / sum(sizes)
OpenCLIP typically reports unweighted averages to give equal importance to each dataset.

Logging Metrics

OpenCLIP automatically logs metrics to your configured logging backend during training.

TensorBoard

Enable TensorBoard logging:
python -m open_clip_train.main \
    --report-to tensorboard \
    --logs ./logs/tensorboard \
    # ... other args
View metrics:
tensorboard --logdir=logs/tensorboard/ --port=7777
Metrics logged:
  • train/loss: Training loss per step
  • train/learning_rate: Current learning rate
  • imagenet-zeroshot-val-top1: Zero-shot ImageNet top-1 accuracy
  • imagenet-zeroshot-val-top5: Zero-shot ImageNet top-5 accuracy

Weights & Biases (wandb)

Enable wandb logging:
python -m open_clip_train.main \
    --report-to wandb \
    --wandb-project-name my-clip-project \
    # ... other args
Metrics are automatically synced to your wandb dashboard with:
  • Real-time loss curves
  • Zero-shot accuracy over time
  • System metrics (GPU utilization, etc.)
For older runs (before PR #613), use the step variable instead of Step in wandb, as the latter was not properly set.

Custom Metrics

You can add custom metrics by modifying the training loop:
from open_clip_train.main import train_one_epoch

def custom_metrics(model, data, epoch):
    # Your custom evaluation
    custom_score = evaluate_custom_task(model, data)
    return {'custom_metric': custom_score}

# Metrics will be logged to your configured backend

Metric Interpretation

ImageNet Zero-Shot Accuracy

Accuracy Range    Model Quality
< 30%             Baseline / Random
30-50%            Early training / Small models
50-65%            Decent models (ViT-B scale)
65-75%            Strong models (ViT-L scale)
75-80%            State-of-the-art (ViT-H scale)
> 80%             Cutting-edge (ViT-G/SigLIP scale)

Top-1 vs Top-5 Gap

The gap between top-1 and top-5 accuracy indicates:
  • Small gap (< 15%): Model is confident and accurate
  • Large gap (> 25%): Model often has correct answer in top 5 but not top 1, suggesting uncertainty or ambiguous classes

Cross-Dataset Performance

Strong models should maintain performance across datasets:
  • Consistent: Good performance across all 38 datasets
  • Specialized: High performance on some datasets but lower on others
  • Overfit: High ImageNet but low on distribution shift datasets

Computing Your Own Metrics

Using CLIP Benchmark

from clip_benchmark.datasets.builder import build_dataset
from clip_benchmark.metrics import zeroshot_classification

# Load dataset
dataset = build_dataset('imagenet1k', root='/path/to/imagenet', split='val')

# Evaluate
metrics = zeroshot_classification.evaluate(
    model,
    dataset,
    tokenizer,
    batch_size=64,
    num_workers=4
)

print(f"Top-1: {metrics['acc1']:.2%}")
print(f"Top-5: {metrics['acc5']:.2%}")
print(f"Mean per-class: {metrics['mean_per_class_recall']:.2%}")

Custom Evaluation Loop

import torch
from tqdm import tqdm

def evaluate_zero_shot(model, dataloader, classifier):
    model.eval()
    top1_correct = 0
    top5_correct = 0
    total = 0
    
    with torch.no_grad():
        for images, targets in tqdm(dataloader):
            images = images.cuda()
            targets = targets.cuda()
            
            # Get image features
            image_features = model.encode_image(images)
            image_features /= image_features.norm(dim=-1, keepdim=True)
            
            # Compute logits
            logits = 100.0 * image_features @ classifier
            
            # Top-1
            pred = logits.argmax(dim=-1)
            top1_correct += (pred == targets).sum().item()
            
            # Top-5
            _, top5_pred = logits.topk(5, dim=-1)
            top5_correct += (top5_pred == targets.unsqueeze(-1)).any(dim=-1).sum().item()
            
            total += len(targets)
    
    top1_acc = 100.0 * top1_correct / total
    top5_acc = 100.0 * top5_correct / total
    
    return {'top1': top1_acc, 'top5': top5_acc}

Best Practices

Use Standard Metrics: Stick to top-1, top-5, and recall@K for comparability with other work.
Report Multiple Datasets: ImageNet alone doesn’t tell the full story. Report performance on distribution shift and specialized datasets.
Log Frequently: Use --zeroshot-frequency 1 to track metrics every epoch during training.
Avoid Test Set Leakage: Always evaluate on validation or test sets that weren’t seen during training.
