Overview
Distillation in OpenCLIP uses the teacher model’s embeddings as soft targets to guide the training of the student model. The student learns both to mimic the teacher’s predictions and to maintain contrastive alignment between images and text.

Basic Usage
To enable distillation, specify the teacher model using the `--distill-model` and `--distill-pretrained` flags:
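For example (a sketch: the `open_clip_train.main` entry point and the data path are assumptions — in older releases the entry point is `training.main`):

```shell
python -m open_clip_train.main \
    --train-data /path/to/train_data.csv \
    --model ViT-B-32 \
    --distill-model ViT-L-14 \
    --distill-pretrained openai \
    --batch-size 128
```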
Distillation Parameters
Required Parameters
- `--distill-model`: Architecture of the teacher model (e.g., `ViT-L-14`, `ViT-H-14`)
- `--distill-pretrained`: Pre-trained weights for the teacher model (e.g., `openai`, `laion2b_s32b_b82k`)
How It Works
- Teacher Model: A large, pre-trained model is loaded and frozen
- Student Model: Your target model is trained normally
- Distillation Loss: The student learns to match the teacher’s embeddings in addition to the standard contrastive loss
- Combined Training: Both losses are used to guide the student’s learning
Example: Distilling from OpenAI ViT-L/14
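A hedged full command (entry point, paths, and hyperparameters are illustrative, not prescriptive):

```shell
python -m open_clip_train.main \
    --train-data /path/to/train_data.csv \
    --model ViT-B-16 \
    --distill-model ViT-L-14 \
    --distill-pretrained openai \
    --batch-size 256 \
    --lr 1e-4 \
    --epochs 32 \
    --precision amp
```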
This is one of the most common use cases: a strong, widely available teacher guiding a smaller student.

Distillation from Different Teacher Models
From LAION Models
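Only the two teacher flags change; for example (verify the pretrained tag against `open_clip.list_pretrained()` for your version):

```shell
--distill-model ViT-H-14 \
--distill-pretrained laion2b_s32b_b79k
```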
A larger LAION-trained model makes a strong teacher.

From DataComp Models
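For a DataComp-trained teacher, the flags might look like this (the tag below is an assumption — check the pretrained list for your install):

```shell
--distill-model ViT-L-14 \
--distill-pretrained datacomp_xl_s13b_b90k
```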
State-of-the-art DataComp models also work well as teachers.

From SigLIP Models
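A SigLIP teacher can be selected the same way (model and tag names should be verified for your OpenCLIP version):

```shell
--distill-model ViT-B-16-SigLIP \
--distill-pretrained webli
```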
Efficient SigLIP models can also serve as teachers.

Architecture Combinations
You can distill between different architecture families.

ViT to ViT (Different Sizes)
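For example, a small ViT student with a large ViT teacher:

```shell
--model ViT-B-32 \
--distill-model ViT-L-14 \
--distill-pretrained openai
```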
ConvNet to ViT
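For example, a ConvNeXt teacher guiding a ViT student (the ConvNeXt checkpoint tag is an assumption — verify it exists in your install):

```shell
--model ViT-B-32 \
--distill-model convnext_large_d \
--distill-pretrained laion2b_s26b_b102k_augreg
```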
ViT to ConvNet
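For example, a ViT teacher guiding a ConvNeXt student (the student model name is assumed from OpenCLIP’s ConvNeXt configs):

```shell
--model convnext_base_w \
--distill-model ViT-L-14 \
--distill-pretrained openai
```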
Distillation Loss
The distillation loss in OpenCLIP combines:

- Standard Contrastive Loss: Image-text alignment for the student model
- Distillation Loss: Student embeddings match teacher embeddings
`DistillClipLoss` automatically balances these objectives.
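Conceptually, the distillation term is a soft cross-entropy: the teacher’s softmax over similarity logits serves as the target distribution for the student’s log-softmax. Below is a minimal pure-Python sketch of this term for a single row of the logit matrix (OpenCLIP’s actual implementation operates on full batches in PyTorch):

```python
import math

def soft_cross_entropy(teacher_logits, student_logits):
    """Cross-entropy between the teacher's softmax distribution
    (soft targets) and the student's log-softmax, for one row of
    the image-text similarity matrix."""
    # Teacher soft targets: numerically stable softmax over teacher logits
    t_max = max(teacher_logits)
    t_exp = [math.exp(t - t_max) for t in teacher_logits]
    t_prob = [e / sum(t_exp) for e in t_exp]
    # Student log-probabilities: numerically stable log-softmax
    s_max = max(student_logits)
    s_lse = s_max + math.log(sum(math.exp(s - s_max) for s in student_logits))
    s_logprob = [s - s_lse for s in student_logits]
    # Distillation loss for this row
    return -sum(p * lp for p, lp in zip(t_prob, s_logprob))
```

Averaging this quantity over the image-to-text and text-to-image directions of the batch, and adding the student’s own contrastive loss, gives the combined objective.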
Important Constraints
Gradient Accumulation
Distillation currently requires `--accum-freq 1` (no gradient accumulation). To increase the effective batch size, raise `--batch-size` or use more GPUs instead.
Performance Considerations
Memory Usage
- The teacher model is kept in memory (frozen) during training
- Ensure you have enough GPU memory for both student and teacher models
- The teacher does not require gradient storage, which saves memory
- Use `--precision amp` or `--precision fp16` to reduce memory usage
Training Speed
- Distillation adds overhead from teacher forward passes
- Expect ~1.5-2x slower training compared to non-distilled training
- The teacher runs inside a `torch.no_grad()` context to avoid gradient computation
- Use mixed precision training to improve speed
Advanced Configuration
With Distributed Training
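A hedged multi-GPU launch (4 GPUs via `torchrun`; the entry point and data path are assumptions):

```shell
torchrun --nproc_per_node 4 -m open_clip_train.main \
    --train-data /path/to/train_data.csv \
    --model ViT-B-32 \
    --distill-model ViT-L-14 \
    --distill-pretrained openai \
    --local-loss \
    --gather-with-grad
```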
With Mixed Precision
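Mixed precision is a single flag on the training command:

```shell
--precision amp    # or: --precision fp16
```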
With Zero-Shot Evaluation
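To run periodic zero-shot ImageNet evaluation during training (the validation path is a placeholder):

```shell
--imagenet-val /path/to/imagenet/val \
--zeroshot-frequency 1
```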
Monitoring Distillation
Key metrics to track during distillation:

- Student Contrastive Loss: How well the student aligns images and text
- Distillation Loss: How closely student embeddings match teacher embeddings
- Zero-shot Accuracy: Performance on downstream tasks
- Training Speed: Samples per second compared to baseline
Best Practices
- Teacher Selection: Use the best available teacher model for your domain
- Learning Rate: Start with lower learning rates (5e-5 to 5e-4) for distillation
- Batch Size: Use the largest batch size your hardware allows
- Data Quality: Higher quality data leads to better distillation results
- Training Duration: Distillation often benefits from longer training
- Architecture Gap: Smaller gaps between teacher and student typically work better
- Evaluation: Regularly evaluate zero-shot performance during training
Example Workflow
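A hedged end-to-end run combining the pieces above (entry point, paths, and hyperparameters are illustrative):

```shell
python -m open_clip_train.main \
    --train-data /path/to/train_data.csv \
    --val-data /path/to/val_data.csv \
    --imagenet-val /path/to/imagenet/val \
    --model ViT-B-32 \
    --distill-model ViT-L-14 \
    --distill-pretrained openai \
    --batch-size 256 \
    --lr 1e-4 \
    --warmup 2000 \
    --epochs 32 \
    --workers 8 \
    --precision amp \
    --zeroshot-frequency 1 \
    --report-to tensorboard
```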
Troubleshooting
Out of Memory
- Reduce `--batch-size`
- Enable `--precision amp` or `--precision fp16`
- Use gradient checkpointing: `--grad-checkpointing`
- Choose a smaller teacher model
Poor Performance
- Increase training duration (`--epochs`)
- Adjust learning rate (try 1e-4 to 5e-4)
- Ensure teacher model is properly loaded
- Check data quality and preprocessing
- Verify batch size is sufficient for contrastive learning
Slow Training
- Enable mixed precision: `--precision amp`
- Use more workers: `--workers 8`
- Enable distributed features: `--local-loss --gather-with-grad`
- Consider using a smaller teacher model
