Introduction
OpenCLIP provides a complete framework for training CLIP (Contrastive Language-Image Pre-training) models from scratch or fine-tuning existing models. This section covers everything you need to know about training CLIP models.

When to Train vs Use Pretrained Models
Use Pretrained Models When:
- You need quick deployment for zero-shot classification
- Your use case aligns with the training data of existing models
- You have limited computational resources
- You want to leverage models trained on billions of samples (LAION-2B, DataComp-1B)
Train Custom Models When:
- You have domain-specific data that differs significantly from public datasets
- You need to optimize for specific image-text distributions
- You want to experiment with new architectures or training techniques
- You have access to large-scale compute and unique datasets
- You need to comply with specific data governance requirements
Training Requirements
Hardware Requirements
Minimum for Experimentation:
- 1 GPU with 16GB+ VRAM (e.g., V100, A100, RTX 3090)
- Suitable for small models (RN50, ViT-B/32) on small datasets
Recommended for Serious Training:
- Multiple GPUs (4-32+) for efficient distributed training
- A100 or H100 GPUs for best performance
- High-bandwidth interconnect (NVLink, InfiniBand) for multi-node training
- Fast storage (NVMe SSD) for data loading
Large-Scale Multi-Node Training:
- OpenCLIP has been tested up to 1024 A100 GPUs
- Requires SLURM or similar cluster management
- See Multi-Node Training for details
Data Requirements
Data Format:
- CSV format with image paths and captions
- WebDataset format (.tar files) for large-scale training
- See Data Preparation for details
Dataset Size:
- Small-scale experiments: 1M-10M pairs (CC3M, CC12M)
- Medium-scale: 100M-400M pairs (LAION-400M)
- Large-scale: 1B+ pairs (LAION-2B, DataComp-1B)
Storage:
- Plan for 100GB-10TB+ depending on dataset size
- WebDataset format is recommended for datasets over 10M samples
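For the CSV path, a minimal dataset file can be produced with the standard library. The column names below are only a common convention (an assumption, not a requirement): OpenCLIP lets you point at your own columns via `--csv-img-key` and `--csv-caption-key`.

```python
import csv

# Each row pairs an on-disk image path with its caption.
# Column names are illustrative; pass yours to OpenCLIP via
# --csv-img-key / --csv-caption-key if they differ.
rows = [
    ("images/000001.jpg", "a photo of a golden retriever"),
    ("images/000002.jpg", "a red bicycle leaning against a wall"),
]

with open("train.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["filepath", "title"])  # header row
    writer.writerows(rows)
```

OpenCLIP's CSV loader also accepts a custom delimiter via `--csv-separator`, which is worth checking since the default may not be a comma in all versions.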
Software Requirements
OpenCLIP runs on PyTorch. `pip install open_clip_torch` is enough for inference; training additionally needs the training extras (`pip install 'open_clip_torch[training]'`) or an editable install of the repository — see the OpenCLIP README for exact version requirements.
Training Process Overview
1. Data Preparation
Prepare your image-text pairs in either CSV or WebDataset format.

2. Model Selection
Choose an architecture based on your compute budget and accuracy requirements:
- RN50, RN101: ResNet-based models, good for initial experiments
- ViT-B/32, ViT-B/16: Vision Transformer models, better accuracy
- ViT-L/14, ViT-H/14: Large models for maximum performance
- ConvNext variants: Modern ConvNet architectures
3. Configuration
Set up training hyperparameters (learning rate, batch size, warmup, epochs).

4. Training Execution
Choose your training setup:
- Single-Node Training: One machine with multiple GPUs
- Multi-Node Training: Multiple machines via torchrun or SLURM
- Distributed Training: Advanced distributed training techniques
5. Monitoring
Track training progress with:
- TensorBoard: --report-to tensorboard
- Weights & Biases: --report-to wandb
- Zero-shot evaluation: Automatic ImageNet validation during training
6. Evaluation
Evaluate trained models on downstream tasks:
- Zero-shot classification on ImageNet and other datasets
- Retrieval tasks (image-to-text, text-to-image)
- Fine-tuning on specific tasks (see Fine-tuning)
Training Variants
Standard CLIP Training
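The contrastive objective behind standard CLIP training can be sketched in a few lines of numpy. This is a deliberately simplified, single-process version with a fixed temperature; real training learns the temperature and, when distributed, gathers features across GPUs.

```python
import numpy as np

def clip_contrastive_loss(img_feats, txt_feats, temperature=0.07):
    """Symmetric InfoNCE over a batch of features.

    img_feats, txt_feats: (batch, dim) arrays; row i of each is a matched pair.
    """
    # Normalize so dot products become cosine similarities.
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)

    logits = img @ txt.T / temperature   # (batch, batch) similarity matrix
    labels = np.arange(len(logits))      # matched pairs sit on the diagonal

    def xent(l):
        # Row-wise cross-entropy against the diagonal targets.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the image->text and text->image directions.
    return (xent(logits) + xent(logits.T)) / 2
```

With matched pairs the diagonal dominates and the loss is near zero; shuffling the text features raises it, which is the signal the encoders are trained on.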
Contrastive learning between image and text encoders.

CoCa Training
Contrastive Captioner models that combine contrastive and generative objectives.

Fine-tuning
Continue training from pretrained checkpoints.

Key Training Features
Memory Efficiency
- Gradient Checkpointing: --grad-checkpointing reduces memory usage
- Mixed Precision: --precision amp for faster training with lower memory
- Local Loss: --local-loss for linear memory scaling in distributed training
- Patch Dropout: Speed up ViT training by 2-3x without accuracy loss
Scalability
- Distributed Data Parallel (DDP): Automatic multi-GPU training
- Gradient Accumulation: --accum-freq to simulate larger batch sizes
- Efficient Distributed Loss: --gather-with-grad --local-loss for O(n) vs O(n²) scaling
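The batch-size arithmetic behind gradient accumulation is simple: the effective global batch is per-GPU batch × number of GPUs × accumulation frequency. The helper below is purely illustrative (the flag names are OpenCLIP's; the function is not part of its API).

```python
def effective_batch_size(batch_size_per_gpu, num_gpus, accum_freq=1):
    """Global batch size seen by the contrastive loss per optimizer step."""
    return batch_size_per_gpu * num_gpus * accum_freq

# E.g., --batch-size 256 on 8 GPUs with --accum-freq 4
# simulates a global batch of 8192 without the memory cost.
global_batch = effective_batch_size(256, 8, accum_freq=4)
```

Large global batches matter for CLIP because the contrastive loss uses every other sample in the batch as a negative.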
Flexibility
- Multiple Data Sources: Combine datasets with the :: separator
- Data Upsampling: Weight different data sources with --train-data-upsampling-factors
- Custom Architectures: Support for custom model configs
- Remote Checkpointing: Save checkpoints to S3 or other remote storage
Quick Start Example
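A sketch of what a single-node run commonly looks like, assuming 4 GPUs and a webdataset copy of CC12M. All values are illustrative; the flags come from OpenCLIP's training CLI, and the entry point is `open_clip_train.main` in recent releases (`training.main` in older ones).

```shell
torchrun --nproc_per_node 4 -m open_clip_train.main \
  --train-data "/data/cc12m/{0000..2175}.tar" \
  --dataset-type webdataset \
  --train-num-samples 10000000 \
  --model ViT-B-32 \
  --batch-size 320 \
  --lr 1e-3 \
  --wd 0.1 \
  --warmup 2000 \
  --epochs 32 \
  --workers 8 \
  --precision amp \
  --report-to tensorboard
```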
Here’s a complete single-node training example.

Next Steps
Single-Node Training
Train on a single machine with multiple GPUs using torchrun
Data Preparation
Learn how to prepare and format your training data
Configuration
Explore all training parameters and hyperparameters
Distributed Training
Advanced techniques for efficient large-scale training
Common Training Workflows
Workflow 1: Small-Scale Experimentation
- Prepare CSV dataset (1M-10M samples)
- Train on single GPU or small cluster
- Monitor with TensorBoard
- Evaluate on ImageNet zero-shot
Workflow 2: Large-Scale Training
- Convert dataset to WebDataset format with img2dataset
- Set up SLURM cluster with multiple nodes
- Use distributed training flags (--local-loss --gather-with-grad)
- Enable remote checkpoint syncing to S3
- Monitor with Weights & Biases
- Evaluate with CLIP_benchmark
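The WebDataset shards that img2dataset produces are plain tar files whose members share a key per sample (e.g., `000000001.jpg` + `000000001.txt`). A hand-rolled shard can be sketched with the standard library; the naming is the webdataset convention, and the bytes here are fake placeholders, not real JPEGs.

```python
import io
import tarfile

# (sample key, image bytes, caption) — image bytes are dummies for illustration.
samples = [
    ("000000001", b"\xff\xd8fake-jpeg-bytes", "a cat on a sofa"),
    ("000000002", b"\xff\xd8fake-jpeg-bytes", "a mountain at dusk"),
]

with tarfile.open("shard-00000.tar", "w") as tar:
    for key, jpeg_bytes, caption in samples:
        # Members sharing a key are grouped into one sample by webdataset.
        for name, payload in ((f"{key}.jpg", jpeg_bytes),
                              (f"{key}.txt", caption.encode())):
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
```

In practice you would let img2dataset build these shards; this only shows the on-disk layout the training loader expects.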
Workflow 3: Fine-tuning
- Start from pretrained checkpoint
- Reduce learning rate (1e-5 to 1e-4)
- Train for fewer epochs (1-10)
- Optionally use WiSE-FT for robust fine-tuning
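A fine-tuning launch along those lines might look like this. The checkpoint tag and hyperparameter values are illustrative; `--pretrained` accepts either a published tag such as `laion2b_s34b_b79k` or a local checkpoint path.

```shell
python -m open_clip_train.main \
  --train-data "/data/domain/train.csv" \
  --dataset-type csv \
  --model ViT-B-32 \
  --pretrained laion2b_s34b_b79k \
  --lr 1e-5 \
  --epochs 5 \
  --batch-size 128 \
  --warmup 500 \
  --precision amp
```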
Performance Tips
Training Time Estimates
| Model | Dataset | GPUs | Batch Size | Time per Epoch | Total Time (32 epochs) |
|---|---|---|---|---|---|
| RN50 | CC3M | 4x V100 | 1024 | ~30 min | ~16 hours |
| ViT-B/32 | CC12M | 8x A100 | 2560 | ~1 hour | ~32 hours |
| ViT-L/14 | LAION-400M | 64x A100 | 16384 | ~8 hours | ~11 days |
| ViT-H/14 | LAION-2B | 256x A100 | 65536 | ~20 hours | ~27 days |
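The "Total Time" column is just per-epoch time multiplied by 32 epochs; a quick sanity check on the table's numbers:

```python
# Per-epoch hours taken from the table above.
per_epoch_hours = {"RN50": 0.5, "ViT-B/32": 1, "ViT-L/14": 8, "ViT-H/14": 20}
epochs = 32

totals = {model: hours * epochs for model, hours in per_epoch_hours.items()}
# RN50: 16 h; ViT-B/32: 32 h; ViT-L/14: 256 h ≈ 11 days; ViT-H/14: 640 h ≈ 27 days
```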
Additional Resources
- OpenCLIP Paper: Reproducible scaling laws for contrastive language-image learning
- Original CLIP Paper: Learning Transferable Visual Models From Natural Language Supervision
- LAION-5B Paper: An open large-scale dataset for training next generation image-text models
- img2dataset: Tool for downloading and preprocessing image datasets
- CLIP_benchmark: Systematic evaluation on 40+ datasets
