
Introduction

OpenCLIP provides a complete framework for training CLIP (Contrastive Language-Image Pre-training) models from scratch or fine-tuning existing models. This section covers everything you need to know about training CLIP models.

When to Train vs Use Pretrained Models

Use Pretrained Models When:

  • You need quick deployment for zero-shot classification
  • Your use case aligns with the training data of existing models
  • You have limited computational resources
  • You want to leverage models trained on billions of samples (LAION-2B, DataComp-1B)
OpenCLIP offers numerous pretrained models with varying sizes and capabilities, from small RN50 models to large ViT-bigG-14 models.

Train Custom Models When:

  • You have domain-specific data that differs significantly from public datasets
  • You need to optimize for specific image-text distributions
  • You want to experiment with new architectures or training techniques
  • You have access to large-scale compute and unique datasets
  • You need to comply with specific data governance requirements

Training Requirements

Hardware Requirements

Minimum for Experimentation:
  • 1 GPU with 16GB+ VRAM (e.g., V100, A100, RTX 3090)
  • Suitable for small models (RN50, ViT-B/32) on small datasets
Recommended for Production:
  • Multiple GPUs (4-32+) for efficient distributed training
  • A100 or H100 GPUs for best performance
  • High-bandwidth interconnect (NVLink, InfiniBand) for multi-node training
  • Fast storage (NVMe SSD) for data loading
Large-Scale Training:
  • OpenCLIP has been tested up to 1024 A100 GPUs
  • Requires SLURM or similar cluster management
  • See Multi-Node Training for details

Data Requirements

Data Format:
  • CSV format with image paths and captions
  • WebDataset format (.tar files) for large-scale training
  • See Data Preparation for details
Data Scale:
  • Small-scale experiments: 1M-10M pairs (CC3M, CC12M)
  • Medium-scale: 100M-400M pairs (LAION-400M)
  • Large-scale: 1B+ pairs (LAION-2B, DataComp-1B)
Storage:
  • Plan for 100GB-10TB+ depending on dataset size
  • WebDataset format is recommended for datasets over 10M samples

Software Requirements

python3 -m venv .env
source .env/bin/activate
pip install -U pip
pip install 'open_clip_torch[training]'
See the Installation Guide for detailed setup instructions.

Training Process Overview

1. Data Preparation

Prepare your image-text pairs in either CSV or WebDataset format:
# CSV format (small datasets)
filepath,title
/path/to/image1.jpg,"A photo of a cat"
/path/to/image2.jpg,"A dog playing in the park"
For large datasets, use img2dataset to convert to WebDataset format. Learn more: Data Preparation
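A CSV like the one above can be generated with Python's standard `csv` module. The column names below match OpenCLIP's defaults (`--csv-img-key filepath`, `--csv-caption-key title`); note that the trainer's default `--csv-separator` is a tab, so pass `--csv-separator ","` if you keep commas:

```python
import csv

# Write a minimal image-text training CSV with OpenCLIP's default columns
rows = [
    ("/path/to/image1.jpg", "A photo of a cat"),
    ("/path/to/image2.jpg", "A dog playing in the park"),
]
with open("train.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["filepath", "title"])  # header row: image path, caption
    writer.writerows(rows)
```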

2. Model Selection

Choose an architecture based on your compute budget and accuracy requirements:
  • RN50, RN101: ResNet-based models, good for initial experiments
  • ViT-B/32, ViT-B/16: Vision Transformer models, better accuracy
  • ViT-L/14, ViT-H/14: Large models for maximum performance
  • ConvNext variants: Modern ConvNet architectures

3. Configuration

Set up training hyperparameters:
python -m open_clip_train.main \
    --model ViT-B-32 \
    --batch-size 256 \
    --lr 1e-3 \
    --warmup 10000 \
    --epochs 32 \
    --workers 8
See Configuration for all available options.

4. Training Execution

Choose your training setup: single-node training with torchrun on one machine with one or more GPUs, or multi-node distributed training on a SLURM cluster. See Single-Node Training and Multi-Node Training for full walkthroughs.

5. Monitoring

Track training progress with:
  • TensorBoard: --report-to tensorboard
  • Weights & Biases: --report-to wandb
  • Zero-shot evaluation: automatic ImageNet validation during training when --imagenet-val points to the validation set
# View TensorBoard logs
tensorboard --logdir=logs/tensorboard/ --port=7777

6. Evaluation

Evaluate trained models on downstream tasks:
  • Zero-shot classification on ImageNet and other datasets
  • Retrieval tasks (image-to-text, text-to-image)
  • Fine-tuning on specific tasks (see Fine-tuning)
For comprehensive evaluation, use CLIP_benchmark which supports 40+ datasets.

Training Variants

Standard CLIP Training

Contrastive learning between image and text encoders:
python -m open_clip_train.main \
    --train-data "/data/train.csv" \
    --model ViT-B-32 \
    --batch-size 256
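The contrastive objective behind this command can be sketched in a few lines. This is a toy pure-Python illustration of the symmetric InfoNCE loss over L2-normalized features, not OpenCLIP's actual (PyTorch, batched) implementation:

```python
import math

def clip_loss(image_feats, text_feats, logit_scale=100.0):
    """Symmetric contrastive loss: matching image-text pairs sit on the
    diagonal of the similarity matrix and should dominate their row/column."""
    n = len(image_feats)
    # Cosine-similarity logits, scaled by the learned temperature
    logits = [[logit_scale * sum(a * b for a, b in zip(image_feats[i], text_feats[j]))
               for j in range(n)] for i in range(n)]

    def xent(rows):
        # Mean cross entropy with the correct class on the diagonal
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / n

    cols = [list(c) for c in zip(*logits)]  # text-to-image direction
    return 0.5 * (xent(logits) + xent(cols))
```

With perfectly matched features the loss approaches zero; with mismatched pairs it grows with the temperature-scaled similarity gap.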

CoCa Training

Contrastive Captioner models that combine contrastive and generative objectives:
python -m open_clip_train.main \
    --model coca_ViT-L-14 \
    --coca-contrastive-loss-weight 1.0 \
    --coca-caption-loss-weight 2.0
Learn more: CoCa Training

Fine-tuning

Continue training from pretrained checkpoints:
python -m open_clip_train.main \
    --resume /path/to/checkpoint.pt \
    --lr 1e-5 \
    --epochs 10
Learn more: Fine-tuning

Key Training Features

Memory Efficiency

  • Gradient Checkpointing: --grad-checkpointing reduces memory usage
  • Mixed Precision: --precision amp for faster training with lower memory
  • Local Loss: --local-loss for linear memory scaling in distributed training
  • Patch Dropout: Speed up ViT training by 2-3x without accuracy loss
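The patch-dropout idea is simple to sketch: drop a random subset of patch tokens before the transformer so attention runs over far fewer tokens. The function below is an illustration of the concept on plain lists, not OpenCLIP's tensor implementation:

```python
import random

def patch_dropout(tokens, keep_ratio=0.5, seed=None):
    """Keep the CLS token plus a random keep_ratio fraction of patch tokens.
    Attention cost is roughly quadratic in sequence length, so keeping half
    the patches cuts most of the compute."""
    rng = random.Random(seed)
    cls, patches = tokens[0], tokens[1:]
    n_keep = max(1, int(len(patches) * keep_ratio))
    kept = sorted(rng.sample(range(len(patches)), n_keep))  # preserve order
    return [cls] + [patches[i] for i in kept]
```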

Scalability

  • Distributed Data Parallel (DDP): Automatic multi-GPU training
  • Gradient Accumulation: --accum-freq to simulate larger batch sizes
  • Efficient Distributed Loss: --gather-with-grad --local-loss for O(n) vs O(n²) scaling
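How these flags compose into the batch size the contrastive loss actually sees is simple arithmetic (OpenCLIP's gradient accumulation caches features across steps, so the effective contrastive batch grows with --accum-freq):

```python
def effective_batch_size(per_gpu_batch, world_size, accum_freq=1):
    """Global batch size seen by the contrastive loss under DDP across
    world_size GPUs with --accum-freq gradient accumulation steps."""
    return per_gpu_batch * world_size * accum_freq

# e.g. --batch-size 256 on 8 GPUs with --accum-freq 4
# behaves like a global batch of 8192
```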

Flexibility

  • Multiple Data Sources: Combine datasets with :: separator
  • Data Upsampling: Weight different data sources with --train-data-upsampling-factors
  • Custom Architectures: Support for custom model configs
  • Remote Checkpointing: Save checkpoints to S3 or other remote storage
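The multiple-data-source flags compose as follows: each `::`-separated source in --train-data pairs with one factor in --train-data-upsampling-factors. This is a sketch of the pairing logic, not OpenCLIP's internal parser:

```python
def parse_data_sources(train_data, upsampling_factors=None):
    """Pair '::'-separated data sources with their sampling weights."""
    sources = train_data.split("::")
    if upsampling_factors is None:
        weights = [1.0] * len(sources)  # uniform by default
    else:
        weights = [float(w) for w in upsampling_factors.split("::")]
    if len(weights) != len(sources):
        raise ValueError("one upsampling factor per data source is required")
    return list(zip(sources, weights))
```

For example, `parse_data_sources("cc12m/{0000..2175}.tar::laion/{0000..9999}.tar", "2::1")` samples the first source twice as often relative to its size.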

Quick Start Example

Here’s a complete single-node training example:
cd open_clip/src
torchrun --nproc_per_node 4 -m open_clip_train.main \
    --train-data '/data/cc12m/cc12m-train-{0000..2175}.tar' \
    --train-num-samples 10968539 \
    --dataset-type webdataset \
    --batch-size 320 \
    --precision amp \
    --workers 4 \
    --imagenet-val /data/imagenet/validation/ \
    --warmup 10000 \
    --lr 1e-3 \
    --wd 0.1 \
    --epochs 32 \
    --model ViT-B-32 \
    --report-to tensorboard
This will train a ViT-B/32 model on CC12M using 4 GPUs with automatic mixed precision.

Next Steps

Single-Node Training

Train on a single machine with multiple GPUs using torchrun

Data Preparation

Learn how to prepare and format your training data

Configuration

Explore all training parameters and hyperparameters

Distributed Training

Advanced techniques for efficient large-scale training

Common Training Workflows

Workflow 1: Small-Scale Experimentation

  1. Prepare CSV dataset (1M-10M samples)
  2. Train on single GPU or small cluster
  3. Monitor with TensorBoard
  4. Evaluate on ImageNet zero-shot

Workflow 2: Large-Scale Training

  1. Convert dataset to WebDataset format with img2dataset
  2. Set up SLURM cluster with multiple nodes
  3. Use distributed training flags (--local-loss --gather-with-grad)
  4. Enable remote checkpoint syncing to S3
  5. Monitor with Weights & Biases
  6. Evaluate with CLIP_benchmark

Workflow 3: Fine-tuning

  1. Start from pretrained checkpoint
  2. Reduce learning rate (1e-5 to 1e-4)
  3. Train for fewer epochs (1-10)
  4. Optionally use WiSE-FT for robust fine-tuning
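The WiSE-FT step in this workflow is a weight-space interpolation between the zero-shot and fine-tuned checkpoints. The sketch below shows the formula on plain floats; real checkpoints hold torch tensors, but the operation is the same:

```python
def wise_ft(theta_zeroshot, theta_finetuned, alpha=0.5):
    """WiSE-FT: interpolate two checkpoints' parameters key by key.
    alpha=0 recovers the zero-shot model, alpha=1 the fine-tuned one;
    intermediate values often improve robustness under distribution shift."""
    return {k: (1 - alpha) * theta_zeroshot[k] + alpha * theta_finetuned[k]
            for k in theta_zeroshot}
```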

Performance Tips

For fastest training:
  • Use --precision amp for mixed precision
  • Enable --grad-checkpointing to trade compute for memory
  • Use WebDataset format for large datasets
  • Set --workers to number of CPU cores per GPU (typically 4-8)
  • Use --local-loss --gather-with-grad for multi-node training
Avoid common pitfalls:
  • Don’t use CSV format for datasets over 10M samples
  • Don’t forget to set --imagenet-val to the validation set (not training set)
  • Don’t use --accum-freq without trying other memory optimizations first
  • Monitor GPU utilization - low utilization often indicates data loading bottlenecks

Training Time Estimates

Model     Dataset     GPUs       Batch Size  Time per Epoch  Total Time (32 epochs)
RN50      CC3M        4x V100    1024        ~30 min         ~16 hours
ViT-B/32  CC12M       8x A100    2560        ~1 hour         ~32 hours
ViT-L/14  LAION-400M  64x A100   16384       ~8 hours        ~11 days
ViT-H/14  LAION-2B    256x A100  65536       ~20 hours       ~27 days
Note: These are approximate estimates. Actual times vary based on hardware, data loading, and configuration.
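You can produce your own back-of-envelope estimate from a measured per-GPU throughput. The throughput figure below is an assumption for illustration, not a benchmark:

```python
def epoch_hours(num_samples, samples_per_sec_per_gpu, num_gpus):
    """Rough epoch time, assuming data loading keeps the GPUs fed
    and throughput scales linearly with GPU count."""
    return num_samples / (samples_per_sec_per_gpu * num_gpus) / 3600.0

# e.g. CC12M (~11M pairs) on 8 GPUs at an assumed ~380 samples/s/GPU
# lands near the ~1 hour/epoch figure in the table above
print(round(epoch_hours(10_968_539, 380, 8), 2))
```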

Additional Resources

  • OpenCLIP Paper: Reproducible scaling laws for contrastive language-image learning
  • Original CLIP Paper: Learning Transferable Visual Models From Natural Language Supervision
  • LAION-5B Paper: An open large-scale dataset for training next generation image-text models
  • img2dataset: Tool for downloading and preprocessing image datasets
  • CLIP_benchmark: Systematic evaluation on 40+ datasets
