
Overview

Fine-tuning allows you to adapt pretrained CLIP models to specific domains or datasets by continuing training from a pretrained checkpoint. This is often more efficient than training from scratch and can achieve better performance with less data.

When to Fine-tune

✅ Fine-tune when:

  • You have a pretrained model that’s close to your target domain
  • You have limited training data (1M-100M samples)
  • You want to adapt to a specific domain (medical images, satellite imagery, etc.)
  • You need faster convergence than training from scratch
  • You want to improve zero-shot performance on specific tasks

❌ Train from scratch when:

  • Your domain is very different from the pretrained model’s training data
  • You have massive amounts of training data (>1B samples)
  • You need a completely custom architecture
  • You want to experiment with new training objectives

Loading Pretrained Weights

From OpenCLIP Pretrained Models

Use the --pretrained flag with a model tag:
python -m open_clip_train.main \
    --model ViT-B-32 \
    --pretrained laion2b_s34b_b79k \
    --train-data "/data/custom_dataset.tar" \
    --lr 1e-5 \
    --epochs 10
Available pretrained tags:
import open_clip

# List all pretrained models
for model_name, pretrained in open_clip.list_pretrained():
    print(f"{model_name}: {pretrained}")
Common pretrained weights:
  • laion2b_s34b_b79k: ViT-B/32 on LAION-2B
  • laion2b_s32b_b82k: ViT-L/14 on LAION-2B
  • openai: Original OpenAI CLIP weights
  • datacomp_xl_s13b_b90k: DataComp-1B models
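To sanity-check a tag before launching a training run, you can load it in Python using the standard open_clip pattern (the discarded second return value is the training transform):

import open_clip

# Load ViT-B/32 with the LAION-2B weights; downloads and caches on first use
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32', pretrained='laion2b_s34b_b79k'
)
tokenizer = open_clip.get_tokenizer('ViT-B-32')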

From Local Checkpoint

Use a local checkpoint file:
python -m open_clip_train.main \
    --model ViT-B-32 \
    --pretrained /path/to/checkpoint.pt \
    --train-data "/data/custom_dataset.tar" \
    --lr 1e-5 \
    --epochs 10

From Hugging Face Hub

Download the checkpoint from the Hugging Face Hub, then pass the local path to --pretrained:
# Download from HF
wget https://huggingface.co/laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/resolve/main/open_clip_pytorch_model.bin

# Use in training
python -m open_clip_train.main \
    --model ViT-L-14 \
    --pretrained /path/to/open_clip_pytorch_model.bin \
    # ... other arguments
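Alternatively, recent open_clip versions can pull weights directly from the Hub via the hf-hub: model prefix, skipping the manual download (verify your installed version supports this):

import open_clip

# Weights are fetched from the Hugging Face Hub and cached locally
model, _, preprocess = open_clip.create_model_and_transforms(
    'hf-hub:laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K'
)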

Resuming Training from Checkpoint

The --resume flag continues training from a checkpoint, including optimizer state:
python -m open_clip_train.main \
    --train-data "/path/to/train_data.csv" \
    --val-data "/path/to/validation_data.csv" \
    --resume /path/to/checkpoints/epoch_K.pt \
    --model ViT-B-32 \
    # ... other arguments should match original training
Resume vs Pretrained:
| Flag | Use Case | Loads Optimizer | Loads Epoch | Learning Rate |
| --- | --- | --- | --- | --- |
| --resume | Continue interrupted training | ✅ Yes | ✅ Yes | Original schedule continues |
| --pretrained | Fine-tune from pretrained | ❌ No | ❌ No | New schedule from epoch 0 |
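To see the difference concretely, you can inspect a checkpoint's contents (a sketch; the key names follow typical open_clip_train checkpoints and may vary across versions):

import torch

# Peek inside a training checkpoint
ckpt = torch.load('/path/to/checkpoints/epoch_K.pt', map_location='cpu')
print(ckpt.keys())    # e.g. dict_keys(['epoch', 'name', 'state_dict', 'optimizer'])
print(ckpt['epoch'])  # --resume restores this; --pretrained starts fresh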

Resume from Latest Checkpoint

python -m open_clip_train.main \
    --resume latest \
    # ... other arguments
Automatically finds and loads the most recent checkpoint in the logs directory.

Fine-tuning Strategies

1. Full Model Fine-tuning

Fine-tune all parameters with a lower learning rate:
python -m open_clip_train.main \
    --model ViT-B-32 \
    --pretrained laion2b_s34b_b79k \
    --train-data "/data/domain_specific.tar" \
    --train-num-samples 10000000 \
    --dataset-type webdataset \
    --lr 1e-5 \
    --warmup 1000 \
    --epochs 10 \
    --batch-size 256 \
    --precision amp \
    --workers 8
Key changes from pretraining:
  • ⬇️ Lower learning rate: 1e-5 vs 1e-3 for pretraining
  • ⏱️ Fewer epochs: 10 vs 32 for pretraining
  • 🔥 Shorter warmup: 1000 vs 10000 steps

2. Frozen Image Encoder (Text-Only Fine-tuning)

Freeze the image encoder and only fine-tune the text encoder:
python -m open_clip_train.main \
    --model ViT-B-32 \
    --pretrained laion2b_s34b_b79k \
    --lock-image \
    --train-data "/data/domain_specific.tar" \
    --lr 1e-4 \
    --epochs 10
Benefits:
  • 💾 Lower memory usage
  • ⚡ Faster training
  • 🎯 Useful when adapting to new vocabulary/concepts
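Conceptually, --lock-image amounts to freezing the image tower's parameters. A simplified sketch (the real implementation uses the model's locking helpers and handles more edge cases):

import open_clip

model, _, _ = open_clip.create_model_and_transforms(
    'ViT-B-32', pretrained='laion2b_s34b_b79k'
)

# Freeze the whole image tower; only the text tower (and logit scale)
# receive gradient updates
for param in model.visual.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")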

3. Frozen Text Encoder (Image-Only Fine-tuning)

Freeze the text encoder and only fine-tune the image encoder:
python -m open_clip_train.main \
    --model ViT-B-32 \
    --pretrained laion2b_s34b_b79k \
    --lock-text \
    --train-data "/data/domain_specific.tar" \
    --lr 1e-4 \
    --epochs 10
Use cases:
  • Adapting to new image domains (medical, satellite, etc.)
  • Maintaining text understanding while improving visual features

4. Partial Fine-tuning

Freeze early layers and fine-tune later layers:
# Fine-tune last 2 groups of image encoder
python -m open_clip_train.main \
    --model ViT-B-32 \
    --pretrained laion2b_s34b_b79k \
    --lock-image \
    --lock-image-unlocked-groups 2 \
    --train-data "/data/domain_specific.tar" \
    --lr 5e-5 \
    --epochs 10

# Fine-tune last 10 layers of text encoder  
python -m open_clip_train.main \
    --model ViT-B-32 \
    --pretrained laion2b_s34b_b79k \
    --lock-text \
    --lock-text-unlocked-layers 10 \
    --train-data "/data/domain_specific.tar" \
    --lr 5e-5 \
    --epochs 10
Benefits:
  • ⚖️ Balance between adaptation and preservation
  • 💾 Lower memory and compute requirements
  • 🛡️ Less prone to overfitting on small datasets
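The same group- and layer-level locking is exposed on the model itself; a sketch assuming the lock_image_tower and lock_text_tower helpers in current open_clip versions:

import open_clip

model, _, _ = open_clip.create_model_and_transforms(
    'ViT-B-32', pretrained='laion2b_s34b_b79k'
)

# Mirrors --lock-image --lock-image-unlocked-groups 2
model.lock_image_tower(unlocked_groups=2, freeze_bn_stats=False)

# Mirrors --lock-text --lock-text-unlocked-layers 10
model.lock_text_tower(unlocked_layers=10, freeze_layer_norm=True)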

5. LiT (Locked Image Tuning)

Lock the image encoder, initialized from ImageNet-pretrained weights (--pretrained-image), and train the text encoder from scratch:
python -m open_clip_train.main \
    --model ViT-B-32 \
    --pretrained-image \
    --lock-image \
    --lock-image-freeze-bn-stats \
    --train-data "/data/train.tar" \
    --lr 1e-3 \
    --epochs 32
Reference: LiT: Zero-Shot Transfer with Locked-image Text Tuning

Learning Rate Adjustment

Fine-tuning requires careful learning rate selection:
| Strategy | Learning Rate | Warmup Steps | Epochs |
| --- | --- | --- | --- |
| Full fine-tuning | 1e-5 to 1e-4 | 1000-5000 | 5-10 |
| Partial fine-tuning | 5e-5 to 5e-4 | 1000-3000 | 5-10 |
| Frozen encoder | 1e-4 to 1e-3 | 1000-5000 | 10-20 |
| From scratch (reference) | 1e-3 to 5e-3 | 10000 | 32+ |

Learning Rate Schedules

Cosine with warmup (recommended):
--lr 1e-5 \
--warmup 1000 \
--lr-scheduler cosine \
--epochs 10
Constant with warmup:
--lr 1e-5 \
--warmup 1000 \
--lr-scheduler const \
--epochs 10
Constant with cooldown:
--lr 1e-4 \
--warmup 1000 \
--lr-scheduler const-cooldown \
--epochs-cooldown 2 \
--lr-cooldown-end 1e-6 \
--epochs 10
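For intuition, cosine-with-warmup follows roughly this shape (a minimal sketch of the standard formula; open_clip's actual scheduler lives in open_clip_train and may differ in details):

import math

def cosine_with_warmup(step, base_lr, warmup_steps, total_steps):
    """Linear warmup, then cosine decay toward zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))

# e.g. --lr 1e-5 --warmup 1000 over 50,000 total steps (hypothetical numbers)
for step in (0, 999, 10_000, 49_999):
    print(step, cosine_with_warmup(step, 1e-5, 1000, 50_000))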

Fine-tuning Examples

Domain Adaptation: Medical Images

# Fine-tune ViT-B/32 on medical imaging dataset
torchrun --nproc_per_node 4 -m open_clip_train.main \
    --model ViT-B-32 \
    --pretrained laion2b_s34b_b79k \
    --train-data "/data/medical_images/train-{0000..0100}.tar" \
    --train-num-samples 1000000 \
    --dataset-type webdataset \
    --batch-size 256 \
    --precision amp \
    --workers 6 \
    --lr 5e-5 \
    --warmup 2000 \
    --epochs 10 \
    --save-frequency 2 \
    --imagenet-val /data/imagenet/val/ \
    --report-to wandb \
    --name "vit-b32-medical-finetuned"

Small Dataset Fine-tuning

# Fine-tune on small dataset (100k samples)
python -m open_clip_train.main \
    --model ViT-B-32 \
    --pretrained laion2b_s34b_b79k \
    --train-data "/data/small_dataset.csv" \
    --dataset-type csv \
    --csv-img-key filepath \
    --csv-caption-key title \
    --batch-size 128 \
    --precision amp \
    --workers 4 \
    --lr 1e-5 \
    --warmup 500 \
    --epochs 20 \
    --lock-image \
    --lock-image-unlocked-groups 1 \
    --report-to tensorboard

Multilingual Fine-tuning

# Adapt to new language while keeping visual encoder
python -m open_clip_train.main \
    --model ViT-L-14 \
    --pretrained laion2b_s32b_b82k \
    --lock-image \
    --train-data "/data/chinese_captions.tar" \
    --train-num-samples 50000000 \
    --dataset-type webdataset \
    --batch-size 256 \
    --lr 1e-4 \
    --warmup 5000 \
    --epochs 15 \
    --precision amp \
    --workers 8

High-Resolution Fine-tuning

# Fine-tune at higher resolution (336px instead of 224px)
python -m open_clip_train.main \
    --model ViT-L-14 \
    --pretrained laion2b_s32b_b82k \
    --force-image-size 336 \
    --train-data "/data/high_res.tar" \
    --batch-size 128 \
    --precision amp \
    --grad-checkpointing \
    --lr 1e-5 \
    --epochs 5

WiSE-FT: Robust Fine-tuning

For robust fine-tuning that maintains performance under distribution shift, use the WiSE-FT repository. WiSE-FT (Weight-Space Ensembling for Fine-Tuning) averages the weights of:
  1. Zero-shot pretrained model
  2. Fine-tuned model
This preserves robustness while improving accuracy.
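The core idea is a one-line interpolation in weight space. A sketch (not the wise-ft repository's API), assuming both files hold plain state_dicts of the same architecture:

import torch

def wise_ft(zeroshot_sd, finetuned_sd, alpha=0.5):
    """Interpolate two state dicts; alpha=0 is zero-shot, alpha=1 is fine-tuned."""
    return {
        k: (1 - alpha) * zeroshot_sd[k] + alpha * finetuned_sd[k]
        for k in zeroshot_sd
    }

# Hypothetical paths; assumes float tensors with identical keys and shapes
zs = torch.load('zeroshot.pt', map_location='cpu')
ft = torch.load('finetuned.pt', map_location='cpu')
merged = wise_ft(zs, ft, alpha=0.5)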

WiSE-FT Workflow

# 1. Fine-tune on target dataset
python -m open_clip_train.main \
    --model ViT-B-32 \
    --pretrained laion2b_s34b_b79k \
    --train-data "/data/imagenet/train.csv" \
    --lr 1e-5 \
    --epochs 10 \
    --name "imagenet-finetuned"

# 2. Use WiSE-FT to ensemble weights
# See https://github.com/mlfoundations/wise-ft for details
Reference: Robust Fine-tuning of Zero-shot Models

Monitoring Fine-tuning

Zero-shot Evaluation

Track zero-shot performance during fine-tuning:
python -m open_clip_train.main \
    --pretrained laion2b_s34b_b79k \
    --imagenet-val /data/imagenet/val/ \
    --zeroshot-frequency 1 \
    # ... other arguments
Monitor both:
  • Performance on the fine-tuning dataset (should improve)
  • Zero-shot ImageNet accuracy (may degrade if the model overfits)
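Between runs, you can also spot-check zero-shot behavior on your own classes with the standard open_clip inference pattern (the file name and labels below are placeholders):

import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32', pretrained='laion2b_s34b_b79k'
)
tokenizer = open_clip.get_tokenizer('ViT-B-32')

image = preprocess(Image.open('example.jpg')).unsqueeze(0)
text = tokenizer([f"a photo of a {c}" for c in ('cat', 'dog', 'chest x-ray')])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(probs)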

Validation Loss

python -m open_clip_train.main \
    --train-data "/data/train.tar" \
    --val-data "/data/val.tar" \
    --val-frequency 1 \
    # ... other arguments

Weights & Biases Logging

python -m open_clip_train.main \
    --report-to wandb \
    --wandb-project-name "clip-finetuning" \
    --wandb-notes "Fine-tuning ViT-B/32 on medical images" \
    # ... other arguments

Common Fine-tuning Issues

Overfitting

Symptoms:
  • Training loss decreases, validation loss increases
  • Zero-shot performance degrades significantly
Solutions:
  1. Reduce learning rate
  2. Use fewer epochs
  3. Freeze more layers
  4. Add regularization (increase --wd)
  5. Use more data augmentation

Underfitting

Symptoms:
  • Both training and validation loss remain high
  • No improvement over pretrained model
Solutions:
  1. Increase learning rate
  2. Train for more epochs
  3. Unfreeze more layers
  4. Reduce regularization

Catastrophic Forgetting

Symptoms:
  • Good performance on fine-tuning dataset
  • Poor zero-shot performance on general tasks
Solutions:
  1. Use lower learning rate
  2. Freeze early layers
  3. Use WiSE-FT weight ensembling
  4. Mix fine-tuning data with general data

Best Practices

Fine-tuning checklist:
  1. ✅ Start with a pretrained model close to your domain
  2. ✅ Use 10-100× lower learning rate than pretraining
  3. ✅ Fine-tune for 5-20 epochs (much less than pretraining)
  4. ✅ Monitor both task performance and zero-shot performance
  5. ✅ Try partial fine-tuning before full fine-tuning
  6. ✅ Use validation set to prevent overfitting
  7. ✅ Consider WiSE-FT for robust fine-tuning
  8. ✅ Save checkpoints frequently for comparison
Avoid:
  • ❌ Using same learning rate as pretraining
  • ❌ Fine-tuning for too many epochs
  • ❌ Ignoring zero-shot performance degradation
  • ❌ Not using validation data
  • ❌ Forgetting to set --pretrained flag

Fine-tuning Templates

Quick Fine-tuning (Small Dataset)

python -m open_clip_train.main \
    --model ViT-B-32 \
    --pretrained laion2b_s34b_b79k \
    --train-data "/data/small.csv" \
    --dataset-type csv \
    --lr 1e-5 \
    --epochs 10 \
    --batch-size 128

Production Fine-tuning (Large Dataset)

torchrun --nproc_per_node 8 -m open_clip_train.main \
    --model ViT-L-14 \
    --pretrained laion2b_s32b_b82k \
    --train-data "/data/train-{0000..9999}.tar" \
    --train-num-samples 100000000 \
    --val-data "/data/val-{0000..0099}.tar" \
    --dataset-type webdataset \
    --dataset-resampled \
    --batch-size 128 \
    --precision amp \
    --grad-checkpointing \
    --workers 8 \
    --lr 5e-5 \
    --warmup 5000 \
    --epochs 10 \
    --save-frequency 1 \
    --imagenet-val /data/imagenet/val/ \
    --zeroshot-frequency 1 \
    --local-loss \
    --gather-with-grad \
    --report-to wandb \
    --name "production-finetune"

Conservative Fine-tuning (Preserve Generalization)

python -m open_clip_train.main \
    --model ViT-B-32 \
    --pretrained laion2b_s34b_b79k \
    --lock-image \
    --lock-image-unlocked-groups 1 \
    --train-data "/data/domain.tar" \
    --lr 1e-5 \
    --warmup 2000 \
    --epochs 5 \
    --imagenet-val /data/imagenet/val/ \
    --zeroshot-frequency 1

Next Steps

Training Overview

Learn about training CLIP models from scratch

Configuration

Explore all fine-tuning parameters

Pretrained Models

Browse available pretrained models

WiSE-FT

Learn about robust fine-tuning with weight ensembling
