Evaluation Metrics
OpenCLIP uses several standard metrics to evaluate model performance on zero-shot classification and retrieval tasks.
Classification Metrics
Top-1 Accuracy
Top-1 accuracy is the primary metric for classification tasks. It measures the percentage of samples where the model’s highest-confidence prediction matches the ground truth label.
```python
def accuracy(output, target, topk=(1,)):
    """Compute top-k correct counts for each k in `topk`."""
    # Indices of the top-k predictions, shape [max_k, batch]
    pred = output.topk(max(topk), 1, True, True)[1].t()
    correct = pred.eq(target.view(1, -1).expand_as(pred))
    return [float(correct[:k].reshape(-1).float().sum(0)) for k in topk]
```
Formula:
Top-1 Accuracy = (Number of correct predictions) / (Total number of samples)
Example: If a model correctly classifies 633 out of 1000 images, its top-1 accuracy is 63.3%.
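To see the helper in action, here is a quick sanity check on a toy batch (the function is restated so the snippet runs standalone; the logits and labels are invented for illustration):

```python
import torch

def accuracy(output, target, topk=(1,)):
    """Compute top-k correct counts for each k in `topk`."""
    pred = output.topk(max(topk), 1, True, True)[1].t()
    correct = pred.eq(target.view(1, -1).expand_as(pred))
    return [float(correct[:k].reshape(-1).float().sum(0)) for k in topk]

# Two samples, three classes; sample 0 is predicted correctly, sample 1 is not
logits = torch.tensor([[0.1, 0.7, 0.2],
                       [0.5, 0.3, 0.2]])
labels = torch.tensor([1, 1])

top1, top2 = accuracy(logits, labels, topk=(1, 2))
print(top1 / len(labels))  # 0.5  (1 of 2 correct at top-1)
print(top2 / len(labels))  # 1.0  (both labels appear in the top 2)
```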
Top-5 Accuracy
Top-5 accuracy is more lenient—it considers a prediction correct if the ground truth label appears in the model’s top 5 predictions.
Formula:
Top-5 Accuracy = (Samples with correct label in top 5) / (Total number of samples)
Use Case: Top-5 is useful when:
- Classes are visually similar (e.g., dog breeds)
- The task has high inherent ambiguity
- Comparing models that might have similar top-1 but different top-5 performance
Per-Class vs. Overall Accuracy
OpenCLIP reports overall accuracy averaged across all samples. For class-imbalanced datasets, you might also want to compute per-class accuracy and take the mean.
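A minimal plain-Python sketch of mean per-class accuracy, with hypothetical prediction and label lists, showing how it diverges from overall accuracy on imbalanced data:

```python
from collections import defaultdict

def mean_per_class_accuracy(preds, labels):
    """Average of each class's own accuracy; robust to class imbalance."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for p, y in zip(preds, labels):
        total[y] += 1
        correct[y] += int(p == y)
    per_class = [correct[y] / total[y] for y in total]
    return sum(per_class) / len(per_class)

# Imbalanced toy data: class 0 has 4 samples, class 1 has 1
preds  = [0, 0, 0, 0, 0]
labels = [0, 0, 0, 0, 1]
overall = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print(f"overall:        {overall:.2f}")                             # 0.80
print(f"mean per-class: {mean_per_class_accuracy(preds, labels):.2f}")  # 0.50
```

Predicting the majority class everywhere looks good overall (80%) but is exposed by the per-class mean (50%).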
Zero-Shot Accuracy Computation
Zero-shot accuracy in OpenCLIP is computed as follows:
1. Text Classifier Construction
For each class, generate multiple text embeddings using prompt templates:
```python
from open_clip import IMAGENET_CLASSNAMES, OPENAI_IMAGENET_TEMPLATES

# Example class: "golden retriever"
classname = "golden retriever"
templates = [
    "a photo of a {}.",
    "a picture of a {}.",
    "an image of a {}.",
    # ... 80 templates total
]

# Encode all templates for this class
text_features = []
for template in templates:
    text = template.format(classname)
    text_features.append(model.encode_text(tokenize(text)))

# Average the features
class_embedding = torch.mean(torch.stack(text_features), dim=0)
class_embedding /= class_embedding.norm()  # L2 normalize
```
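Repeating this averaging for every class and stacking the results yields a [D, num_classes] classifier matrix. A sketch of that step, where `fake_encode_text` is a hypothetical stand-in for `model.encode_text` (a deterministic random embedding) so the snippet runs without model weights:

```python
import zlib
import torch

EMBED_DIM = 8  # toy dimension; real CLIP models use 512+

def fake_encode_text(text):
    """Stand-in for model.encode_text: deterministic pseudo-random embedding."""
    g = torch.Generator().manual_seed(zlib.crc32(text.encode()))
    return torch.randn(EMBED_DIM, generator=g)

classnames = ["golden retriever", "tabby cat"]
templates = ["a photo of a {}.", "a picture of a {}."]

columns = []
for classname in classnames:
    feats = torch.stack([fake_encode_text(t.format(classname)) for t in templates])
    emb = feats.mean(dim=0)
    emb = emb / emb.norm()  # L2 normalize per class
    columns.append(emb)

classifier = torch.stack(columns, dim=1)  # [D, num_classes]
print(classifier.shape)  # torch.Size([8, 2])
```

Each column is a unit vector, so a matrix product with normalized image features directly gives cosine similarities.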
2. Image Encoding
Encode the test image:
```python
image_features = model.encode_image(preprocess(image).unsqueeze(0))  # add batch dim
image_features /= image_features.norm(dim=-1, keepdim=True)  # L2 normalize
```
3. Similarity Computation
Compute cosine similarity between image and all class embeddings:
```python
logits = 100.0 * image_features @ class_embeddings.T
probs = logits.softmax(dim=-1)
predicted_class = probs.argmax(dim=-1)
```
The scaling factor of 100.0 approximates CLIP's learned logit scale (an inverse temperature) and sharpens the probability distribution.
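To see why the scale matters, compare the softmax of two closely spaced cosine similarities with and without the factor (plain-Python sketch with made-up similarity values):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

sims = [0.30, 0.28]  # cosine similarities are small and closely spaced

p_raw = softmax(sims)                           # ~[0.505, 0.495] -- nearly uniform
p_scaled = softmax([100.0 * s for s in sims])   # ~[0.881, 0.119] -- sharpened
print(p_raw, p_scaled)
```

Without scaling, a 0.02 similarity gap barely moves the distribution; multiplied by 100 it becomes a decisive margin.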
4. Accuracy Calculation
Compare predictions to ground truth:
```python
top1_count = 0
top5_count = 0
total = 0

for image, label in dataloader:
    logits = compute_logits(model, image, class_embeddings)
    # Top-1
    pred = logits.argmax(dim=-1)
    top1_count += (pred == label).sum().item()
    # Top-5
    top5_preds = logits.topk(5, dim=-1)[1]
    top5_count += (top5_preds == label.unsqueeze(-1)).any(dim=-1).sum().item()
    total += len(label)

top1_accuracy = top1_count / total
top5_accuracy = top5_count / total
```
Retrieval Metrics
For image-text retrieval tasks (like Flickr30k and MSCOCO), OpenCLIP uses standard retrieval metrics:
Recall@K
Recall@K measures the percentage of queries where the correct item appears in the top K retrieved results.
Formula:
Recall@K = (Queries with correct item in top K) / (Total queries)
Common values:
- R@1: Strictest metric (correct item must be rank 1)
- R@5: Correct item in top 5
- R@10: Correct item in top 10
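Recall@K reduces to a membership test over the top-K ranked items. A plain-Python sketch with hypothetical score lists:

```python
def recall_at_k(score_rows, ground_truth, k):
    """score_rows[i]: scores of all items for query i;
    ground_truth[i]: index of the correct item for query i."""
    hits = 0
    for scores, gt in zip(score_rows, ground_truth):
        # Item indices sorted by score, best first
        top_k = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
        hits += gt in top_k
    return hits / len(score_rows)

scores = [
    [0.9, 0.1, 0.3],  # query 0: correct item 0 ranked 1st
    [0.2, 0.5, 0.4],  # query 1: correct item 2 ranked 2nd
]
gt = [0, 2]
print(recall_at_k(scores, gt, k=1))  # 0.5
print(recall_at_k(scores, gt, k=2))  # 1.0
```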
Image-to-Text Retrieval
Given an image, retrieve relevant text captions:
```python
image_features = model.encode_image(images)  # [B, D]
text_features = model.encode_text(texts)     # [N, D]

# Compute similarities
similarity = image_features @ text_features.T  # [B, N]

# For each image, find top K texts
top_k_indices = similarity.topk(k=5, dim=1)[1]  # [B, K]

# Check if correct caption is in top K
recall_at_5 = (top_k_indices == ground_truth_idx.unsqueeze(1)).any(dim=1).float().mean()
```
Text-to-Image Retrieval
Given a text query, retrieve relevant images:
```python
# Transpose the similarity computation
similarity = text_features @ image_features.T  # [N, B]

# For each text, find top K images
top_k_indices = similarity.topk(k=5, dim=1)[1]  # [N, K]

# Check if correct image is in top K
recall_at_5 = (top_k_indices == ground_truth_idx.unsqueeze(1)).any(dim=1).float().mean()
```
Mean Rank
Mean rank measures the average position of the correct item in the ranked list:
```python
ranks = []
for query_idx in range(len(queries)):
    similarity_scores = compute_similarity(query_idx, all_items)
    # Position of the ground-truth item in the descending ranking
    rank = (similarity_scores.argsort(descending=True) == ground_truth[query_idx]).nonzero()[0]
    ranks.append(rank.item() + 1)  # 1-indexed

mean_rank = sum(ranks) / len(ranks)
```
Lower mean rank is better.
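The same ranking machinery in plain Python, with invented score lists; the median rank, which retrieval benchmarks also commonly report, is less sensitive than the mean to a few very badly ranked queries:

```python
import statistics

def ranks_of_ground_truth(score_rows, ground_truth):
    """1-indexed rank of the correct item for each query."""
    ranks = []
    for scores, gt in zip(score_rows, ground_truth):
        order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        ranks.append(order.index(gt) + 1)
    return ranks

scores = [
    [0.9, 0.1, 0.3, 0.2],  # correct item 0 -> rank 1
    [0.2, 0.5, 0.4, 0.1],  # correct item 2 -> rank 2
    [0.1, 0.2, 0.3, 0.9],  # correct item 0 -> rank 4
]
gt = [0, 2, 0]
ranks = ranks_of_ground_truth(scores, gt)
print(sum(ranks) / len(ranks))   # mean rank ~2.33
print(statistics.median(ranks))  # median rank 2
```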
Aggregate Metrics
The “Average perf. on 38 datasets” metric in our results is computed as:
```python
average_performance = sum(accuracy_per_dataset) / num_datasets
```
This provides a single number summarizing model performance across the diverse evaluation suite.
Weighted Average
Some benchmarks use weighted averages where larger datasets have more influence:
```python
weighted_avg = sum(accuracy[d] * size[d] for d in datasets) / sum(size[d] for d in datasets)
```
OpenCLIP typically reports unweighted averages to give equal importance to each dataset.
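A two-dataset toy example (hypothetical numbers) showing how much the weighting choice can move the aggregate:

```python
accuracies = {"imagenet1k": 0.70, "small_niche_set": 0.40}
sizes      = {"imagenet1k": 50_000, "small_niche_set": 1_000}

unweighted = sum(accuracies.values()) / len(accuracies)
weighted = (sum(accuracies[d] * sizes[d] for d in accuracies)
            / sum(sizes.values()))
print(f"{unweighted:.3f}")  # 0.550
print(f"{weighted:.3f}")    # 0.694
```

The size-weighted average is dominated by the large dataset, while the unweighted average lets the small, hard dataset pull the score down; this is why equal weighting gives a more balanced picture of generality.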
Logging Metrics
OpenCLIP automatically logs metrics to your configured logging backend during training.
TensorBoard
Enable TensorBoard logging:
```bash
python -m open_clip_train.main \
  --report-to tensorboard \
  --logs ./logs/tensorboard \
  # ... other args
```
View metrics:
```bash
tensorboard --logdir=logs/tensorboard/ --port=7777
```
Metrics logged:
- `train/loss`: Training loss per step
- `train/learning_rate`: Current learning rate
- `imagenet-zeroshot-val-top1`: Zero-shot ImageNet top-1 accuracy
- `imagenet-zeroshot-val-top5`: Zero-shot ImageNet top-5 accuracy
Weights & Biases (wandb)
Enable wandb logging:
```bash
python -m open_clip_train.main \
  --report-to wandb \
  --wandb-project-name my-clip-project \
  # ... other args
```
Metrics are automatically synced to your wandb dashboard with:
- Real-time loss curves
- Zero-shot accuracy over time
- System metrics (GPU utilization, etc.)
For older runs (before PR #613), use the `step` variable instead of `Step` in wandb, as the latter was not properly set.
Custom Metrics
You can add custom metrics by modifying the training loop:
```python
from open_clip_train.main import train_one_epoch

def custom_metrics(model, data, epoch):
    # Your custom evaluation
    custom_score = evaluate_custom_task(model, data)
    return {'custom_metric': custom_score}

# Metrics returned here will be logged to your configured backend
```
Metric Interpretation
ImageNet Zero-Shot Accuracy
| Accuracy Range | Model Quality |
|---|---|
| < 30% | Baseline / Random |
| 30-50% | Early training / Small models |
| 50-65% | Decent models (ViT-B scale) |
| 65-75% | Strong models (ViT-L scale) |
| 75-80% | State-of-the-art (ViT-H scale) |
| > 80% | Cutting-edge (ViT-G/SigLIP scale) |
Top-1 vs Top-5 Gap
The gap between top-1 and top-5 accuracy indicates:
- Small gap (< 15%): Model is confident and accurate
- Large gap (> 25%): Model often has correct answer in top 5 but not top 1, suggesting uncertainty or ambiguous classes
Cross-Dataset Consistency
Performance profiles across the 38-dataset evaluation suite typically fall into one of three patterns:
- Consistent: Good performance across all 38 datasets
- Specialized: High performance on some datasets but lower on others
- Overfit: High ImageNet but low on distribution shift datasets
Computing Your Own Metrics
Using CLIP Benchmark
```python
from clip_benchmark.datasets.builder import build_dataset
from clip_benchmark.metrics import zeroshot_classification

# Load dataset
dataset = build_dataset('imagenet1k', root='/path/to/imagenet', split='val')

# Evaluate
metrics = zeroshot_classification.evaluate(
    model,
    dataset,
    tokenizer,
    batch_size=64,
    num_workers=4,
)

print(f"Top-1: {metrics['acc1']:.2%}")
print(f"Top-5: {metrics['acc5']:.2%}")
print(f"Mean per-class: {metrics['mean_per_class_recall']:.2%}")
```
Custom Evaluation Loop
```python
import torch
from tqdm import tqdm

def evaluate_zero_shot(model, dataloader, classifier):
    model.eval()
    top1_correct = 0
    top5_correct = 0
    total = 0

    with torch.no_grad():
        for images, targets in tqdm(dataloader):
            images = images.cuda()
            targets = targets.cuda()

            # Get image features
            image_features = model.encode_image(images)
            image_features /= image_features.norm(dim=-1, keepdim=True)

            # Compute logits against the [D, num_classes] classifier
            logits = 100.0 * image_features @ classifier

            # Top-1
            pred = logits.argmax(dim=-1)
            top1_correct += (pred == targets).sum().item()

            # Top-5
            _, top5_pred = logits.topk(5, dim=-1)
            top5_correct += (top5_pred == targets.unsqueeze(-1)).any(dim=-1).sum().item()

            total += len(targets)

    top1_acc = 100.0 * top1_correct / total
    top5_acc = 100.0 * top5_correct / total
    return {'top1': top1_acc, 'top5': top5_acc}
```
Best Practices
- Use Standard Metrics: Stick to top-1, top-5, and recall@K for comparability with other work.
- Report Multiple Datasets: ImageNet alone doesn't tell the full story. Report performance on distribution shift and specialized datasets.
- Log Frequently: Use `--zeroshot-frequency 1` to track metrics every epoch during training.
- Avoid Test Set Leakage: Always evaluate on validation or test sets that weren't seen during training.
Next Steps