
Overview

Fine-tuning allows you to adapt pretrained CLIP models to specific domains or datasets by continuing training from a pretrained checkpoint. This is often more efficient than training from scratch and can achieve better performance with less data.

When to Fine-tune

✅ Fine-tune when:

  • You have a pretrained model that’s close to your target domain
  • You have limited training data (1M-100M samples)
  • You want to adapt to a specific domain (medical images, satellite imagery, etc.)
  • You need faster convergence than training from scratch
  • You want to improve zero-shot performance on specific tasks

❌ Train from scratch when:

  • Your domain is very different from the pretrained model’s training data
  • You have massive amounts of training data (>1B samples)
  • You need a completely custom architecture
  • You want to experiment with new training objectives

Loading Pretrained Weights

From OpenCLIP Pretrained Models

Use the --pretrained flag with a model tag:
python -m open_clip_train.main \
    --model ViT-B-32 \
    --pretrained laion2b_s34b_b79k \
    --train-data "/data/custom_dataset.tar" \
    --lr 1e-5 \
    --epochs 10
Available pretrained tags:
import open_clip

# List all pretrained models
for model_name, pretrained in open_clip.list_pretrained():
    print(f"{model_name}: {pretrained}")
Common pretrained weights:
  • laion2b_s34b_b79k: ViT-B/32 on LAION-2B
  • laion2b_s32b_b82k: ViT-L/14 on LAION-2B
  • openai: Original OpenAI CLIP weights
  • datacomp_xl_s13b_b90k: DataComp-1B models
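To sanity-check a tag before launching a training run, you can load it in Python using the standard open_clip pattern (the discarded second return value is the training transform):

import open_clip

# Load ViT-B/32 with the LAION-2B weights; downloads and caches on first use
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32', pretrained='laion2b_s34b_b79k'
)
tokenizer = open_clip.get_tokenizer('ViT-B-32')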

From Local Checkpoint

Use a local checkpoint file:
python -m open_clip_train.main \
    --model ViT-B-32 \
    --pretrained /path/to/checkpoint.pt \
    --train-data "/data/custom_dataset.tar" \
    --lr 1e-5 \
    --epochs 10

From Hugging Face Hub

Download the checkpoint from the Hugging Face Hub, then pass the local path to --pretrained:
# Download from HF
wget https://huggingface.co/laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/resolve/main/open_clip_pytorch_model.bin

# Use in training
python -m open_clip_train.main \
    --model ViT-L-14 \
    --pretrained /path/to/open_clip_pytorch_model.bin \
    # ... other arguments
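Alternatively, recent open_clip versions can pull weights directly from the Hub via the hf-hub: model prefix, skipping the manual download (verify your installed version supports this):

import open_clip

# Weights are fetched from the Hugging Face Hub and cached locally
model, _, preprocess = open_clip.create_model_and_transforms(
    'hf-hub:laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K'
)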

Resuming Training from Checkpoint

The --resume flag continues training from a checkpoint, including optimizer state:
python -m open_clip_train.main \
    --train-data "/path/to/train_data.csv" \
    --val-data "/path/to/validation_data.csv" \
    --resume /path/to/checkpoints/epoch_K.pt \
    --model ViT-B-32 \
    # ... other arguments should match original training
Resume vs Pretrained:
| Flag | Use Case | Loads Optimizer | Loads Epoch | Learning Rate |
| --- | --- | --- | --- | --- |
| --resume | Continue interrupted training | ✅ Yes | ✅ Yes | Original schedule continues |
| --pretrained | Fine-tune from pretrained | ❌ No | ❌ No | New schedule from epoch 0 |
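To see the difference concretely, you can inspect a checkpoint's contents (a sketch; the key names follow typical open_clip_train checkpoints and may vary across versions):

import torch

# Peek inside a training checkpoint
ckpt = torch.load('/path/to/checkpoints/epoch_K.pt', map_location='cpu')
print(ckpt.keys())    # e.g. dict_keys(['epoch', 'name', 'state_dict', 'optimizer'])
print(ckpt['epoch'])  # --resume restores this; --pretrained starts fresh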

Resume from Latest Checkpoint

python -m open_clip_train.main \
    --resume latest \
    # ... other arguments
Automatically finds and loads the most recent checkpoint in the logs directory.

Fine-tuning Strategies

1. Full Model Fine-tuning

Fine-tune all parameters with a lower learning rate:
python -m open_clip_train.main \
    --model ViT-B-32 \
    --pretrained laion2b_s34b_b79k \
    --train-data "/data/domain_specific.tar" \
    --train-num-samples 10000000 \
    --dataset-type webdataset \
    --lr 1e-5 \
    --warmup 1000 \
    --epochs 10 \
    --batch-size 256 \
    --precision amp \
    --workers 8
Key changes from pretraining:
  • ⬇️ Lower learning rate: 1e-5 vs 1e-3 for pretraining
  • ⏱️ Fewer epochs: 10 vs 32 for pretraining
  • 🔥 Shorter warmup: 1000 vs 10000 steps

2. Frozen Image Encoder (Text-Only Fine-tuning)

Freeze the image encoder and only fine-tune the text encoder:
python -m open_clip_train.main \
    --model ViT-B-32 \
    --pretrained laion2b_s34b_b79k \
    --lock-image \
    --train-data "/data/domain_specific.tar" \
    --lr 1e-4 \
    --epochs 10
Benefits:
  • 💾 Lower memory usage
  • ⚡ Faster training
  • 🎯 Useful when adapting to new vocabulary/concepts
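Conceptually, --lock-image amounts to freezing the image tower's parameters. A simplified sketch (the real implementation uses the model's locking helpers and handles more edge cases):

import open_clip

model, _, _ = open_clip.create_model_and_transforms(
    'ViT-B-32', pretrained='laion2b_s34b_b79k'
)

# Freeze the whole image tower; only the text tower (and logit scale)
# receive gradient updates
for param in model.visual.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")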

3. Frozen Text Encoder (Image-Only Fine-tuning)

Freeze the text encoder and only fine-tune the image encoder:
python -m open_clip_train.main \
    --model ViT-B-32 \
    --pretrained laion2b_s34b_b79k \
    --lock-text \
    --train-data "/data/domain_specific.tar" \
    --lr 1e-4 \
    --epochs 10
Use cases:
  • Adapting to new image domains (medical, satellite, etc.)
  • Maintaining text understanding while improving visual features

4. Partial Fine-tuning

Freeze early layers and fine-tune later layers:
# Fine-tune last 2 groups of image encoder
python -m open_clip_train.main \
    --model ViT-B-32 \
    --pretrained laion2b_s34b_b79k \
    --lock-image \
    --lock-image-unlocked-groups 2 \
    --train-data "/data/domain_specific.tar" \
    --lr 5e-5 \
    --epochs 10

# Fine-tune last 10 layers of text encoder  
python -m open_clip_train.main \
    --model ViT-B-32 \
    --pretrained laion2b_s34b_b79k \
    --lock-text \
    --lock-text-unlocked-layers 10 \
    --train-data "/data/domain_specific.tar" \
    --lr 5e-5 \
    --epochs 10
Benefits:
  • ⚖️ Balance between adaptation and preservation
  • 💾 Lower memory and compute requirements
  • 🛡️ Less prone to overfitting on small datasets
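The same group- and layer-level locking is exposed on the model itself; a sketch assuming the lock_image_tower and lock_text_tower helpers in current open_clip versions:

import open_clip

model, _, _ = open_clip.create_model_and_transforms(
    'ViT-B-32', pretrained='laion2b_s34b_b79k'
)

# Mirrors --lock-image --lock-image-unlocked-groups 2
model.lock_image_tower(unlocked_groups=2, freeze_bn_stats=False)

# Mirrors --lock-text --lock-text-unlocked-layers 10
model.lock_text_tower(unlocked_layers=10, freeze_layer_norm=True)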

5. LiT (Locked Image Tuning)

Lock the image encoder, initialized from ImageNet-pretrained weights (--pretrained-image), and train the text encoder from scratch:
python -m open_clip_train.main \
    --model ViT-B-32 \
    --pretrained-image \
    --lock-image \
    --lock-image-freeze-bn-stats \
    --train-data "/data/train.tar" \
    --lr 1e-3 \
    --epochs 32
Reference: LiT: Zero-Shot Transfer with Locked-image Text Tuning

Learning Rate Adjustment

Fine-tuning requires careful learning rate selection:
| Strategy | Learning Rate | Warmup Steps | Epochs |
| --- | --- | --- | --- |
| Full fine-tuning | 1e-5 to 1e-4 | 1000-5000 | 5-10 |
| Partial fine-tuning | 5e-5 to 5e-4 | 1000-3000 | 5-10 |
| Frozen encoder | 1e-4 to 1e-3 | 1000-5000 | 10-20 |
| From scratch (reference) | 1e-3 to 5e-3 | 10000 | 32+ |

Learning Rate Schedules

Cosine with warmup (recommended):
--lr 1e-5 \
--warmup 1000 \
--lr-scheduler cosine \
--epochs 10
Constant with warmup:
--lr 1e-5 \
--warmup 1000 \
--lr-scheduler const \
--epochs 10
Constant with cooldown:
--lr 1e-4 \
--warmup 1000 \
--lr-scheduler const-cooldown \
--epochs-cooldown 2 \
--lr-cooldown-end 1e-6 \
--epochs 10
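For intuition, cosine-with-warmup follows roughly this shape (a minimal sketch of the standard formula; open_clip's actual scheduler lives in open_clip_train and may differ in details):

import math

def cosine_with_warmup(step, base_lr, warmup_steps, total_steps):
    """Linear warmup, then cosine decay toward zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))

# e.g. --lr 1e-5 --warmup 1000 over 50,000 total steps (hypothetical numbers)
for step in (0, 999, 10_000, 49_999):
    print(step, cosine_with_warmup(step, 1e-5, 1000, 50_000))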

Fine-tuning Examples

Domain Adaptation: Medical Images

# Fine-tune ViT-B/32 on medical imaging dataset
torchrun --nproc_per_node 4 -m open_clip_train.main \
    --model ViT-B-32 \
    --pretrained laion2b_s34b_b79k \
    --train-data "/data/medical_images/train-{0000..0100}.tar" \
    --train-num-samples 1000000 \
    --dataset-type webdataset \
    --batch-size 256 \
    --precision amp \
    --workers 6 \
    --lr 5e-5 \
    --warmup 2000 \
    --epochs 10 \
    --save-frequency 2 \
    --imagenet-val /data/imagenet/val/ \
    --report-to wandb \
    --name "vit-b32-medical-finetuned"

Small Dataset Fine-tuning

# Fine-tune on small dataset (100k samples)
python -m open_clip_train.main \
    --model ViT-B-32 \
    --pretrained laion2b_s34b_b79k \
    --train-data "/data/small_dataset.csv" \
    --dataset-type csv \
    --csv-img-key filepath \
    --csv-caption-key title \
    --batch-size 128 \
    --precision amp \
    --workers 4 \
    --lr 1e-5 \
    --warmup 500 \
    --epochs 20 \
    --lock-image \
    --lock-image-unlocked-groups 1 \
    --report-to tensorboard

Multilingual Fine-tuning

# Adapt to new language while keeping visual encoder
python -m open_clip_train.main \
    --model ViT-L-14 \
    --pretrained laion2b_s32b_b82k \
    --lock-image \
    --train-data "/data/chinese_captions.tar" \
    --train-num-samples 50000000 \
    --dataset-type webdataset \
    --batch-size 256 \
    --lr 1e-4 \
    --warmup 5000 \
    --epochs 15 \
    --precision amp \
    --workers 8

High-Resolution Fine-tuning

# Fine-tune at higher resolution (336px instead of 224px)
python -m open_clip_train.main \
    --model ViT-L-14 \
    --pretrained laion2b_s32b_b82k \
    --force-image-size 336 \
    --train-data "/data/high_res.tar" \
    --batch-size 128 \
    --precision amp \
    --grad-checkpointing \
    --lr 1e-5 \
    --epochs 5

WiSE-FT: Robust Fine-tuning

For robust fine-tuning that maintains performance under distribution shift, use the WiSE-FT repository. WiSE-FT (Weight-Space Ensembling for Fine-Tuning) averages the weights of:
  1. Zero-shot pretrained model
  2. Fine-tuned model
This preserves robustness while improving accuracy.
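The core idea is a one-line interpolation in weight space. A sketch (not the wise-ft repository's API), assuming both files hold plain state_dicts of the same architecture:

import torch

def wise_ft(zeroshot_sd, finetuned_sd, alpha=0.5):
    """Interpolate two state dicts; alpha=0 is zero-shot, alpha=1 is fine-tuned."""
    return {
        k: (1 - alpha) * zeroshot_sd[k] + alpha * finetuned_sd[k]
        for k in zeroshot_sd
    }

# Hypothetical paths; assumes float tensors with identical keys and shapes
zs = torch.load('zeroshot.pt', map_location='cpu')
ft = torch.load('finetuned.pt', map_location='cpu')
merged = wise_ft(zs, ft, alpha=0.5)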

WiSE-FT Workflow

# 1. Fine-tune on target dataset
python -m open_clip_train.main \
    --model ViT-B-32 \
    --pretrained laion2b_s34b_b79k \
    --train-data "/data/imagenet/train.csv" \
    --lr 1e-5 \
    --epochs 10 \
    --name "imagenet-finetuned"

# 2. Use WiSE-FT to ensemble weights
# See https://github.com/mlfoundations/wise-ft for details
Reference: Robust Fine-tuning of Zero-shot Models

Monitoring Fine-tuning

Zero-shot Evaluation

Track zero-shot performance during fine-tuning:
python -m open_clip_train.main \
    --pretrained laion2b_s34b_b79k \
    --imagenet-val /data/imagenet/val/ \
    --zeroshot-frequency 1 \
    # ... other arguments
Monitor both:
  • Performance on the fine-tuning dataset (should improve)
  • Zero-shot ImageNet accuracy (may degrade if the model overfits)
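Between runs, you can also spot-check zero-shot behavior on your own classes with the standard open_clip inference pattern (the file name and labels below are placeholders):

import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32', pretrained='laion2b_s34b_b79k'
)
tokenizer = open_clip.get_tokenizer('ViT-B-32')

image = preprocess(Image.open('example.jpg')).unsqueeze(0)
text = tokenizer([f"a photo of a {c}" for c in ('cat', 'dog', 'chest x-ray')])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(probs)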

Validation Loss

python -m open_clip_train.main \
    --train-data "/data/train.tar" \
    --val-data "/data/val.tar" \
    --val-frequency 1 \
    # ... other arguments

Weights & Biases Logging

python -m open_clip_train.main \
    --report-to wandb \
    --wandb-project-name "clip-finetuning" \
    --wandb-notes "Fine-tuning ViT-B/32 on medical images" \
    # ... other arguments

Common Fine-tuning Issues

Overfitting

Symptoms:
  • Training loss decreases, validation loss increases
  • Zero-shot performance degrades significantly
Solutions:
  1. Reduce learning rate
  2. Use fewer epochs
  3. Freeze more layers
  4. Add regularization (increase --wd)
  5. Use more data augmentation

Underfitting

Symptoms:
  • Both training and validation loss remain high
  • No improvement over pretrained model
Solutions:
  1. Increase learning rate
  2. Train for more epochs
  3. Unfreeze more layers
  4. Reduce regularization

Catastrophic Forgetting

Symptoms:
  • Good performance on fine-tuning dataset
  • Poor zero-shot performance on general tasks
Solutions:
  1. Use lower learning rate
  2. Freeze early layers
  3. Use WiSE-FT weight ensembling
  4. Mix fine-tuning data with general data

Best Practices

Fine-tuning checklist:
  1. ✅ Start with a pretrained model close to your domain
  2. ✅ Use 10-100× lower learning rate than pretraining
  3. ✅ Fine-tune for 5-20 epochs (much less than pretraining)
  4. ✅ Monitor both task performance and zero-shot performance
  5. ✅ Try partial fine-tuning before full fine-tuning
  6. ✅ Use validation set to prevent overfitting
  7. ✅ Consider WiSE-FT for robust fine-tuning
  8. ✅ Save checkpoints frequently for comparison
Avoid:
  • ❌ Using same learning rate as pretraining
  • ❌ Fine-tuning for too many epochs
  • ❌ Ignoring zero-shot performance degradation
  • ❌ Not using validation data
  • ❌ Forgetting to set --pretrained flag

Fine-tuning Templates

Quick Fine-tuning (Small Dataset)

python -m open_clip_train.main \
    --model ViT-B-32 \
    --pretrained laion2b_s34b_b79k \
    --train-data "/data/small.csv" \
    --dataset-type csv \
    --lr 1e-5 \
    --epochs 10 \
    --batch-size 128

Production Fine-tuning (Large Dataset)

torchrun --nproc_per_node 8 -m open_clip_train.main \
    --model ViT-L-14 \
    --pretrained laion2b_s32b_b82k \
    --train-data "/data/train-{0000..9999}.tar" \
    --train-num-samples 100000000 \
    --val-data "/data/val-{0000..0099}.tar" \
    --dataset-type webdataset \
    --dataset-resampled \
    --batch-size 128 \
    --precision amp \
    --grad-checkpointing \
    --workers 8 \
    --lr 5e-5 \
    --warmup 5000 \
    --epochs 10 \
    --save-frequency 1 \
    --imagenet-val /data/imagenet/val/ \
    --zeroshot-frequency 1 \
    --local-loss \
    --gather-with-grad \
    --report-to wandb \
    --name "production-finetune"

Conservative Fine-tuning (Preserve Generalization)

python -m open_clip_train.main \
    --model ViT-B-32 \
    --pretrained laion2b_s34b_b79k \
    --lock-image \
    --lock-image-unlocked-groups 1 \
    --train-data "/data/domain.tar" \
    --lr 1e-5 \
    --warmup 2000 \
    --epochs 5 \
    --imagenet-val /data/imagenet/val/ \
    --zeroshot-frequency 1

Next Steps

Training Overview

Learn about training CLIP models from scratch

Configuration

Explore all fine-tuning parameters

Pretrained Models

Browse available pretrained models

WiSE-FT

Learn about robust fine-tuning with weight ensembling
