Model Configuration Files
Model architectures are defined in JSON configuration files located in `src/open_clip/model_configs/`. Each config file specifies the model's architecture parameters.
Basic Model Config Structure
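A config file is a single JSON object. The sketch below mirrors the stock ViT-B-32 config; exact values differ per model:

```json
{
    "embed_dim": 512,
    "vision_cfg": {
        "image_size": 224,
        "layers": 12,
        "width": 768,
        "patch_size": 32
    },
    "text_cfg": {
        "context_length": 77,
        "vocab_size": 49408,
        "width": 512,
        "heads": 8,
        "layers": 12
    }
}
```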
Key Parameters
- `embed_dim`: The dimension of the joint embedding space where image and text features are projected
- `vision_cfg`: Configuration for the vision encoder
  - `image_size`: Input image resolution
  - `layers`: Number of transformer layers
  - `width`: Hidden dimension size
  - `patch_size`: Size of image patches for the Vision Transformer
- `text_cfg`: Configuration for the text encoder
  - `context_length`: Maximum text sequence length
  - `vocab_size`: Size of the vocabulary
  - `width`: Hidden dimension size
  - `heads`: Number of attention heads
  - `layers`: Number of transformer layers
Adding Custom Model Configs
You can add your own model configurations using the `add_model_config()` function:
Using HuggingFace Models as Text Encoders
OpenCLIP allows you to use any HuggingFace transformer model as the text encoder. This is useful for leveraging pre-trained language models or multilingual models.

HuggingFace Text Encoder Config
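A config that pairs a ViT image tower with a HuggingFace text model might look like the following. This is a sketch modeled on the repo's `roberta-ViT-B-32` config; key names such as `proj` and `pooler_type` can vary between OpenCLIP versions, so check the config files shipped with your installed release:

```json
{
    "embed_dim": 512,
    "vision_cfg": {
        "image_size": 224,
        "layers": 12,
        "width": 768,
        "patch_size": 32
    },
    "text_cfg": {
        "hf_model_name": "roberta-base",
        "hf_tokenizer_name": "roberta-base",
        "proj": "mlp",
        "pooler_type": "mean_pooler"
    }
}
```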
Training with HuggingFace Text Encoder
When training with a HuggingFace model as the text encoder, use the `--hf-tokenizer-name` parameter to specify the tokenizer:
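For instance (a hedged sketch: the data path is a placeholder, and the training entry point differs by release — older versions launch via `python -m training.main`, newer ones via `python -m open_clip_train.main`):

```shell
python -m open_clip_train.main \
    --model roberta-ViT-B-32 \
    --hf-tokenizer-name roberta-base \
    --train-data "/path/to/train.csv" \
    --batch-size 128
```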
Freezing and Unfreezing Layers
You can control which layers of the text encoder are trainable:

- `--lock-text`: Freeze the entire text encoder
- `--lock-text-unlocked-layers N`: Leave the last N layer groups unfrozen for fine-tuning
- `--lock-text-freeze-layer-norm`: Freeze LayerNorm running stats in locked layers
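Combined, a fine-tuning run that keeps most of the text tower frozen might add flags like these to the usual training command (illustrative values):

```shell
python -m open_clip_train.main \
    --model roberta-ViT-B-32 \
    --train-data "/path/to/train.csv" \
    --lock-text \
    --lock-text-unlocked-layers 10 \
    --lock-text-freeze-layer-norm
```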
Custom Vision Architectures
OpenCLIP supports various vision encoder architectures:

Vision Transformer (ViT)
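A ViT tower is selected by giving `vision_cfg` explicit transformer fields. A sketch with typical ViT-B/16-style values (`text_cfg` omitted for brevity):

```json
{
    "embed_dim": 512,
    "vision_cfg": {
        "image_size": 224,
        "layers": 12,
        "width": 768,
        "patch_size": 16
    }
}
```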
A standard Vision Transformer configuration sets `layers`, `width`, and `patch_size` directly in `vision_cfg`.

ConvNeXt
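ConvNeXt towers are pulled in through timm. A sketch based on the repo's `convnext_base` config (`text_cfg` omitted; the `timm_*` key names are the timm-bridge fields — verify against your installed version):

```json
{
    "embed_dim": 512,
    "vision_cfg": {
        "timm_model_name": "convnext_base",
        "timm_model_pretrained": false,
        "timm_pool": "",
        "timm_proj": "linear",
        "image_size": 224
    }
}
```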
Using timm models for vision encoding is enabled through the `timm_model_name` field in `vision_cfg`.

Creating Models Programmatically
Custom models can also be created directly in Python rather than from a JSON file on disk.

Available Model Configs
To see all available model configurations, call `open_clip.list_models()`; it returns every registered config name, including any added via `add_model_config()`.

Best Practices
- Embed Dimension: Ensure `embed_dim` is consistent across vision and text towers
- Model Naming: Use descriptive names that indicate architecture (e.g., `roberta-ViT-B-32`)
- Configuration Testing: Test custom configs with small datasets before full training
- Pre-trained Weights: When using HuggingFace models, leverage their pre-trained weights for better initialization
- Layer Freezing: Start with more frozen layers and gradually unfreeze for fine-tuning
Example: Training Custom Model
A complete example of training a custom model with a RoBERTa text encoder:

- Uses RoBERTa as the text encoder
- Keeps the first layers of RoBERTa frozen, unfreezing the last 10 layers
- Trains on data from S3
- Uses automatic mixed precision for efficiency
- Reports metrics to TensorBoard
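Such a run might look like the following. This is a hedged sketch: the S3 bucket, shard pattern, sample count, and entry-point module are placeholders to adapt to your setup, and flag availability depends on your OpenCLIP version:

```shell
python -m open_clip_train.main \
    --model roberta-ViT-B-32 \
    --hf-tokenizer-name roberta-base \
    --lock-text \
    --lock-text-unlocked-layers 10 \
    --train-data "pipe:aws s3 cp s3://my-bucket/shards/{00000..00999}.tar -" \
    --dataset-type webdataset \
    --train-num-samples 1000000 \
    --precision amp \
    --report-to tensorboard \
    --batch-size 256 \
    --epochs 32
```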
