
CLIP Overview

CLIP (Contrastive Language-Image Pre-training) is a neural network trained on image-text pairs that learns visual concepts from natural language supervision. OpenCLIP is an open-source implementation that enables training and using CLIP models at scale.

Architecture

CLIP uses a dual encoder architecture consisting of two separate towers:
  1. Vision Encoder - Processes images into fixed-dimensional feature vectors
  2. Text Encoder - Processes text into the same dimensional feature space
(Figure: CLIP architecture diagram)
Both encoders project their inputs into a shared embedding space where semantically similar images and text are close together.
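The practical consequence of the shared space is that comparison reduces to a dot product. A minimal sketch (hypothetical dimension; random vectors stand in for real encoder outputs):

```python
import torch

# Illustrative only: once both encoders emit unit-norm vectors in the same
# D-dimensional space, a dot product of the two IS their cosine similarity.
D = 512  # typical CLIP embedding width
image_embedding = torch.nn.functional.normalize(torch.randn(D), dim=0)
text_embedding = torch.nn.functional.normalize(torch.randn(D), dim=0)
similarity = image_embedding @ text_embedding  # scalar in [-1, 1]
```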

Vision Encoder

The vision encoder transforms images into feature embeddings. OpenCLIP supports multiple vision architectures:
  • Vision Transformer (ViT) - Default architecture, divides image into patches
  • ResNet variants - Convolutional neural networks (ModifiedResNet)
  • ConvNeXt - Modern convolutional architectures
  • Custom architectures - Via timm integration
From src/open_clip/model.py:133-206:
def _build_vision_tower(
        embed_dim: int,
        vision_cfg: CLIPVisionCfg,
        quick_gelu: bool = False,
        cast_dtype: Optional[torch.dtype] = None
):
    # Vision encoder can be ViT, ResNet, or timm model
    if vision_cfg.timm_model_name:
        visual = TimmModel(...)
    elif isinstance(vision_cfg.layers, (tuple, list)):
        visual = ModifiedResNet(...)  # ResNet architecture
    else:
        visual = VisionTransformer(...)  # ViT architecture
    return visual

Text Encoder

The text encoder processes text descriptions into embeddings. It uses a Transformer architecture with:
  • Token embeddings
  • Positional embeddings
  • Multi-head self-attention layers
  • Feed-forward networks
  • Optional HuggingFace models (RoBERTa, etc.)
From src/open_clip/model.py:209-262:
def _build_text_tower(
        embed_dim: int,
        text_cfg: CLIPTextCfg,
        quick_gelu: bool = False,
        cast_dtype: Optional[torch.dtype] = None,
):
    if text_cfg.hf_model_name:
        text = HFTextEncoder(...)  # HuggingFace models
    else:
        text = TextTransformer(...)  # Native transformer
    return text
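To make the listed components concrete, here is a toy encoder assembled from standard PyTorch modules. This is a simplified sketch, not OpenCLIP's `TextTransformer`: sizes are hypothetical, and where CLIP pools features at the end-of-text token, this sketch just takes the last position.

```python
import torch
import torch.nn as nn

class TinyTextEncoder(nn.Module):
    # Sketch of the ingredients above: token embeddings, positional
    # embeddings, self-attention blocks, and a projection to embed_dim.
    def __init__(self, vocab_size=49408, context_length=77,
                 width=512, heads=8, layers=2, embed_dim=512):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, width)
        self.positional_embedding = nn.Parameter(torch.zeros(context_length, width))
        block = nn.TransformerEncoderLayer(
            d_model=width, nhead=heads, dim_feedforward=width * 4, batch_first=True)
        self.transformer = nn.TransformerEncoder(block, num_layers=layers)
        self.projection = nn.Linear(width, embed_dim, bias=False)

    def forward(self, tokens):
        # tokens: (batch, context_length) integer ids
        x = self.token_embedding(tokens) + self.positional_embedding
        x = self.transformer(x)
        # Simplification: pool the last position (CLIP pools at the EOT token)
        return self.projection(x[:, -1])
```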

Contrastive Pre-training Objective

CLIP is trained using a contrastive learning objective that aligns image and text representations:
  1. Batch Construction - Sample N (image, text) pairs
  2. Encoding - Pass images through vision encoder, texts through text encoder
  3. Similarity Matrix - Compute N×N cosine similarities between all image-text pairs
  4. Contrastive Loss - Maximize similarity for correct pairs, minimize for incorrect pairs
The model learns to:
  • Push matching image-text pairs closer together in embedding space
  • Push non-matching pairs further apart
See Contrastive Learning for detailed loss function explanation.
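The four steps above can be sketched as a symmetric cross-entropy over the N×N similarity matrix, where the correct pairs sit on the diagonal. This is a minimal illustration; OpenCLIP's actual implementation lives in src/open_clip/loss.py:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale):
    # Both feature tensors: (N, D), assumed L2-normalized
    logits = logit_scale * image_features @ text_features.T  # (N, N)
    labels = torch.arange(logits.shape[0])  # matching pair i is at column i
    loss_i = F.cross_entropy(logits, labels)     # image -> text direction
    loss_t = F.cross_entropy(logits.T, labels)   # text -> image direction
    return (loss_i + loss_t) / 2
```

Minimizing this loss pushes each diagonal (matching) similarity up relative to its row and column, which is exactly the "pull matches together, push non-matches apart" behavior described above.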

Image-Text Similarity Scoring

Once trained, CLIP computes similarity between any image and text through:

1. Encode Both Modalities

# Self-contained version of the README example
import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32', pretrained='laion2b_s34b_b79k')
tokenizer = open_clip.get_tokenizer('ViT-B-32')

image = preprocess(Image.open("cat.jpg")).unsqueeze(0)
text = tokenizer(["a photo of a cat", "a photo of a dog"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Normalize to unit vectors
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

2. Compute Cosine Similarity

From src/open_clip/model.py:347-354:
def get_logits(self, image, text):
    image_features = self.encode_image(image, normalize=True)
    text_features = self.encode_text(text, normalize=True)

    # Scaled dot product (cosine similarity with temperature)
    image_logits = self.logit_scale.exp() * image_features @ text_features.T

    if self.logit_bias is not None:
        image_logits += self.logit_bias
    text_logits = image_logits.T
    return image_logits, text_logits

3. Temperature Scaling

The logit_scale parameter controls the sharpness of the similarity distribution:
  • Higher temperature (lower scale) → softer, more uniform probabilities
  • Lower temperature (higher scale) → sharper, more peaked probabilities
The parameter is initialized to np.log(1 / 0.07) ≈ 2.66 and learned during training; since the model applies exp() at use time (self.logit_scale.exp()), the initial multiplier on the cosine similarities is 1/0.07 ≈ 14.3.
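The effect of the scale on the resulting probability distribution can be seen numerically. The similarity values below are hypothetical; the scale is the exponentiated initial logit_scale:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical raw cosine similarities between one image and three captions
sims = np.array([0.30, 0.25, 0.10])

scale = np.exp(np.log(1 / 0.07))  # exp of initial logit_scale, ~14.29

probs_unscaled = softmax(sims)        # nearly uniform
probs_scaled = softmax(scale * sims)  # sharper, favors the best match
```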

Key Properties

Joint Embedding Space

Both vision and text encoders project into the same dimensional space (typically 512 or 768 dimensions), enabling direct comparison.

Zero-Shot Capability

CLIP can classify images into categories it wasn’t explicitly trained on by comparing image embeddings with text embeddings of class names. See Zero-Shot Classification.
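The zero-shot recipe can be sketched as a small helper. This function and its prompt template are hypothetical illustrations, not part of the OpenCLIP API; any model and tokenizer exposing encode_image/encode_text would do:

```python
import torch

def zero_shot_classify(model, tokenizer, image, class_names,
                       template="a photo of a {}"):
    # Hypothetical helper: embed the image and one prompt per class name,
    # then pick the class whose text embedding best matches the image.
    prompts = tokenizer([template.format(c) for c in class_names])
    with torch.no_grad():
        img_f = model.encode_image(image)
        txt_f = model.encode_text(prompts)
        img_f = img_f / img_f.norm(dim=-1, keepdim=True)
        txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
        probs = (100.0 * img_f @ txt_f.T).softmax(dim=-1)
    return class_names[probs.argmax().item()], probs
```

No class-specific training is needed: adding a new category is just adding one more prompt string.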

Flexibility

CLIP can be used for:
  • Image classification (zero-shot or with prompts)
  • Image-text retrieval
  • Visual question answering
  • Image captioning (with additional decoder, e.g., CoCa)

Training in OpenCLIP

OpenCLIP supports large-scale training with:
  • Multi-GPU distributed training - Up to 1024 GPUs tested
  • Multiple data sources - WebDataset format for billion-scale datasets
  • Efficient batch construction - Local loss and gradient gathering
  • Mixed precision - FP16/BF16 for faster training
Example training command:
torchrun --nproc_per_node 4 -m open_clip_train.main \
    --train-data '/data/cc12m/cc12m-train-{0000..2175}.tar' \
    --dataset-type webdataset \
    --batch-size 320 \
    --precision amp \
    --workers 4 \
    --model ViT-B-32

Reference

Original CLIP Paper: Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., … & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML 2021. arXiv:2103.00020

OpenCLIP Paper: Cherti, M., Beaumont, R., Wightman, R., Wortsman, M., Ilharco, G., Gordon, C., … & Jitsev, J. (2023). Reproducible scaling laws for contrastive language-image learning. CVPR 2023. arXiv:2212.07143

Next Steps

Contrastive Learning

Deep dive into the loss function and training objective

Zero-Shot Classification

Learn how CLIP classifies without task-specific training
