Overview
Distillation in OpenCLIP uses the teacher model’s embeddings as soft targets to guide the training of the student model. The student learns both to mimic the teacher’s predictions and to maintain contrastive alignment between images and text.

Basic Usage
To enable distillation, specify the teacher model using the `--distill-model` and `--distill-pretrained` flags:
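For example (a sketch: the `open_clip_train.main` entry point and the data path are assumptions — in older releases the entry point is `training.main`):

```shell
python -m open_clip_train.main \
    --train-data /path/to/train_data.csv \
    --model ViT-B-32 \
    --distill-model ViT-L-14 \
    --distill-pretrained openai \
    --batch-size 128
```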
Distillation Parameters
Required Parameters
- `--distill-model`: Architecture of the teacher model (e.g., `ViT-L-14`, `ViT-H-14`)
- `--distill-pretrained`: Pre-trained weights for the teacher model (e.g., `openai`, `laion2b_s32b_b82k`)
How It Works
- Teacher Model: A large, pre-trained model is loaded and frozen
- Student Model: Your target model is trained normally
- Distillation Loss: The student learns to match the teacher’s embeddings in addition to the standard contrastive loss
- Combined Training: Both losses are used to guide the student’s learning
Example: Distilling from OpenAI ViT-L/14
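A hedged full command (entry point, paths, and hyperparameters are illustrative, not prescriptive):

```shell
python -m open_clip_train.main \
    --train-data /path/to/train_data.csv \
    --model ViT-B-16 \
    --distill-model ViT-L-14 \
    --distill-pretrained openai \
    --batch-size 256 \
    --lr 1e-4 \
    --epochs 32 \
    --precision amp
```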
This is one of the most common use cases: a strong, widely available teacher guiding a smaller student.

Distillation from Different Teacher Models
From LAION Models
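Only the two teacher flags change; for example (verify the pretrained tag against `open_clip.list_pretrained()` for your version):

```shell
--distill-model ViT-H-14 \
--distill-pretrained laion2b_s32b_b79k
```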
A larger LAION-trained model makes a strong teacher.

From DataComp Models
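For a DataComp-trained teacher, the flags might look like this (the tag below is an assumption — check the pretrained list for your install):

```shell
--distill-model ViT-L-14 \
--distill-pretrained datacomp_xl_s13b_b90k
```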
State-of-the-art DataComp models also work well as teachers.

From SigLIP Models
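A SigLIP teacher can be selected the same way (model and tag names should be verified for your OpenCLIP version):

```shell
--distill-model ViT-B-16-SigLIP \
--distill-pretrained webli
```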
Efficient SigLIP models can also serve as teachers.

Architecture Combinations
You can distill between different architecture families.

ViT to ViT (Different Sizes)
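For example, a small ViT student with a large ViT teacher:

```shell
--model ViT-B-32 \
--distill-model ViT-L-14 \
--distill-pretrained openai
```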
ConvNet to ViT
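For example, a ConvNeXt teacher guiding a ViT student (the ConvNeXt checkpoint tag is an assumption — verify it exists in your install):

```shell
--model ViT-B-32 \
--distill-model convnext_large_d \
--distill-pretrained laion2b_s26b_b102k_augreg
```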
ViT to ConvNet
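For example, a ViT teacher guiding a ConvNeXt student (the student model name is assumed from OpenCLIP’s ConvNeXt configs):

```shell
--model convnext_base_w \
--distill-model ViT-L-14 \
--distill-pretrained openai
```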
Distillation Loss
The distillation loss in OpenCLIP combines:

- Standard Contrastive Loss: Image-text alignment for the student model
- Distillation Loss: Student embeddings match teacher embeddings
`DistillClipLoss` automatically balances these objectives.
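Conceptually, the distillation term is a soft cross-entropy: the teacher’s softmax over similarity logits serves as the target distribution for the student’s log-softmax. Below is a minimal pure-Python sketch of this term for a single row of the logit matrix (OpenCLIP’s actual implementation operates on full batches in PyTorch):

```python
import math

def soft_cross_entropy(teacher_logits, student_logits):
    """Cross-entropy between the teacher's softmax distribution
    (soft targets) and the student's log-softmax, for one row of
    the image-text similarity matrix."""
    # Teacher soft targets: numerically stable softmax over teacher logits
    t_max = max(teacher_logits)
    t_exp = [math.exp(t - t_max) for t in teacher_logits]
    t_prob = [e / sum(t_exp) for e in t_exp]
    # Student log-probabilities: numerically stable log-softmax
    s_max = max(student_logits)
    s_lse = s_max + math.log(sum(math.exp(s - s_max) for s in student_logits))
    s_logprob = [s - s_lse for s in student_logits]
    # Distillation loss for this row
    return -sum(p * lp for p, lp in zip(t_prob, s_logprob))
```

Averaging this quantity over the image-to-text and text-to-image directions of the batch, and adding the student’s own contrastive loss, gives the combined objective.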
Important Constraints
Gradient Accumulation
Distillation currently requires `--accum-freq 1` (no gradient accumulation). To increase the effective batch size, raise `--batch-size` or use more GPUs instead.
Performance Considerations
Memory Usage
- The teacher model is kept in memory (frozen) during training
- Ensure you have enough GPU memory for both student and teacher models
- The teacher does not require gradient storage, which saves memory
- Use `--precision amp` or `--precision fp16` to reduce memory usage
Training Speed
- Distillation adds overhead from teacher forward passes
- Expect ~1.5-2x slower training compared to non-distilled training
- The teacher runs inside a `torch.no_grad()` context to avoid gradient computation
- Use mixed precision training to improve speed
Advanced Configuration
With Distributed Training
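A hedged multi-GPU launch (4 GPUs via `torchrun`; the entry point and data path are assumptions):

```shell
torchrun --nproc_per_node 4 -m open_clip_train.main \
    --train-data /path/to/train_data.csv \
    --model ViT-B-32 \
    --distill-model ViT-L-14 \
    --distill-pretrained openai \
    --local-loss \
    --gather-with-grad
```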
With Mixed Precision
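Mixed precision is a single flag on the training command:

```shell
--precision amp    # or: --precision fp16
```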
With Zero-Shot Evaluation
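To run periodic zero-shot ImageNet evaluation during training (the validation path is a placeholder):

```shell
--imagenet-val /path/to/imagenet/val \
--zeroshot-frequency 1
```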
Monitoring Distillation
Key metrics to track during distillation:

- Student Contrastive Loss: How well the student aligns images and text
- Distillation Loss: How closely student embeddings match teacher embeddings
- Zero-shot Accuracy: Performance on downstream tasks
- Training Speed: Samples per second compared to baseline
Best Practices
- Teacher Selection: Use the best available teacher model for your domain
- Learning Rate: Start with lower learning rates (5e-5 to 5e-4) for distillation
- Batch Size: Use the largest batch size your hardware allows
- Data Quality: Higher quality data leads to better distillation results
- Training Duration: Distillation often benefits from longer training
- Architecture Gap: Smaller gaps between teacher and student typically work better
- Evaluation: Regularly evaluate zero-shot performance during training
Example Workflow
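A hedged end-to-end run combining the pieces above (entry point, paths, and hyperparameters are illustrative):

```shell
python -m open_clip_train.main \
    --train-data /path/to/train_data.csv \
    --val-data /path/to/val_data.csv \
    --imagenet-val /path/to/imagenet/val \
    --model ViT-B-32 \
    --distill-model ViT-L-14 \
    --distill-pretrained openai \
    --batch-size 256 \
    --lr 1e-4 \
    --warmup 2000 \
    --epochs 32 \
    --workers 8 \
    --precision amp \
    --zeroshot-frequency 1 \
    --report-to tensorboard
```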
Troubleshooting
Out of Memory
- Reduce `--batch-size`
- Enable `--precision amp` or `--precision fp16`
- Use gradient checkpointing: `--grad-checkpointing`
- Choose a smaller teacher model
Poor Performance
- Increase training duration (`--epochs`)
- Adjust learning rate (try 1e-4 to 5e-4)
- Ensure teacher model is properly loaded
- Check data quality and preprocessing
- Verify batch size is sufficient for contrastive learning
Slow Training
- Enable mixed precision: `--precision amp`
- Use more workers: `--workers 8`
- Enable distributed features: `--local-loss --gather-with-grad`
- Consider using a smaller teacher model
