OpenCLIP supports creating custom model architectures through JSON configuration files and flexible model building APIs. You can define custom vision encoders, text encoders, or use pre-trained models from HuggingFace as text encoders.

Model Configuration Files

Model architectures are defined in JSON configuration files located in src/open_clip/model_configs/. Each config file specifies the model’s architecture parameters.

Basic Model Config Structure

{
    "embed_dim": 512,
    "vision_cfg": {
        "image_size": 224,
        "layers": 12,
        "width": 768,
        "patch_size": 32
    },
    "text_cfg": {
        "context_length": 77,
        "vocab_size": 49408,
        "width": 512,
        "heads": 8,
        "layers": 12
    }
}

Key Parameters

  • embed_dim: The dimension of the joint embedding space where image and text features are projected
  • vision_cfg: Configuration for the vision encoder
    • image_size: Input image resolution
    • layers: Number of transformer layers
    • width: Hidden dimension size
    • patch_size: Size of image patches for Vision Transformer
  • text_cfg: Configuration for the text encoder
    • context_length: Maximum text sequence length
    • vocab_size: Size of the vocabulary
    • width: Hidden dimension size
    • heads: Number of attention heads
    • layers: Number of transformer layers
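These parameters must satisfy a couple of structural constraints: the transformer width must be divisible by the number of attention heads (each head gets width // heads dimensions), and the image size must be divisible by the patch size (a ViT tiles the image into non-overlapping patches). The helper below is a hypothetical sanity check, not part of OpenCLIP:

```python
def check_clip_config(cfg):
    """Hypothetical sanity checks for a CLIP-style model config dict."""
    errors = []
    vision, text = cfg["vision_cfg"], cfg["text_cfg"]
    # Each attention head gets width // heads dimensions, so width must divide evenly.
    if text["width"] % text["heads"] != 0:
        errors.append("text width must be divisible by heads")
    # A ViT tiles the image into non-overlapping patches.
    if vision["image_size"] % vision["patch_size"] != 0:
        errors.append("image_size must be divisible by patch_size")
    if "embed_dim" not in cfg:
        errors.append("embed_dim is required")
    return errors

# The basic config from above passes all checks.
cfg = {
    "embed_dim": 512,
    "vision_cfg": {"image_size": 224, "layers": 12, "width": 768, "patch_size": 32},
    "text_cfg": {"context_length": 77, "vocab_size": 49408,
                 "width": 512, "heads": 8, "layers": 12},
}
assert check_clip_config(cfg) == []
```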

Adding Custom Model Configs

You can add your own model configurations using the add_model_config() function:
import open_clip
from pathlib import Path

# Add a directory containing model config JSON files
open_clip.add_model_config(Path('/path/to/model_configs/'))

# Or add a single config file
open_clip.add_model_config(Path('/path/to/my_model.json'))

# Now you can use your custom model
model, _, preprocess = open_clip.create_model_and_transforms(
    'my_model',
    pretrained=None
)

Using HuggingFace Models as Text Encoders

OpenCLIP allows you to use any HuggingFace transformer model as the text encoder. This is useful for leveraging pre-trained language models or multilingual models.

HuggingFace Text Encoder Config

{
    "embed_dim": 512,
    "quick_gelu": true,
    "vision_cfg": {
        "image_size": 224,
        "layers": 12,
        "width": 768,
        "patch_size": 32
    },
    "text_cfg": {
        "hf_model_name": "roberta-base",
        "hf_tokenizer_name": "roberta-base",
        "hf_pooler_type": "mean_pooler"
    }
}
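The same text_cfg keys work with multilingual checkpoints. For example, an illustrative variant that swaps in the multilingual xlm-roberta-base checkpoint:

```json
{
    "embed_dim": 512,
    "vision_cfg": {
        "image_size": 224,
        "layers": 12,
        "width": 768,
        "patch_size": 32
    },
    "text_cfg": {
        "hf_model_name": "xlm-roberta-base",
        "hf_tokenizer_name": "xlm-roberta-base",
        "hf_pooler_type": "mean_pooler"
    }
}
```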

Training with HuggingFace Text Encoder

When training with a HuggingFace model as the text encoder, use the --hf-tokenizer-name parameter to specify the tokenizer:
python -m open_clip_train.main \
    --model "roberta-ViT-B-32" \
    --hf-tokenizer-name "roberta-base" \
    --train-data "/path/to/train_data.tar" \
    --dataset-type webdataset \
    --batch-size 256 \
    --epochs 10 \
    --lr 5e-4

Freezing and Unfreezing Layers

You can control which layers of the text encoder are trainable:
python -m open_clip_train.main \
    --model "roberta-ViT-B-32" \
    --hf-tokenizer-name "roberta-base" \
    --lock-text \
    --lock-text-unlocked-layers 10 \
    --train-data "/path/to/train_data.tar" \
    --batch-size 256 \
    --epochs 10
Parameters:
  • --lock-text: Freeze the entire text encoder
  • --lock-text-unlocked-layers N: Leave the last N layer groups unfrozen for fine-tuning
  • --lock-text-freeze-layer-norm: Also freeze LayerNorm parameters in the locked layers (LayerNorm has no running statistics, so this freezes its affine weights)
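The layer-group semantics of --lock-text-unlocked-layers can be sketched in plain Python. This is a hypothetical illustration of the freezing pattern, not OpenCLIP's actual implementation:

```python
def plan_text_freezing(layer_groups, unlocked_layers):
    """Map each ordered layer-group name to whether it stays trainable.

    Freezes everything, then re-enables the last `unlocked_layers` groups,
    mirroring --lock-text combined with --lock-text-unlocked-layers N.
    """
    n = len(layer_groups)
    return {name: i >= n - unlocked_layers for i, name in enumerate(layer_groups)}

# roberta-base has 12 transformer layers; unlock the last 10.
groups = [f"encoder.layer.{i}" for i in range(12)]
plan = plan_text_freezing(groups, unlocked_layers=10)
assert plan["encoder.layer.0"] is False   # frozen
assert plan["encoder.layer.2"] is True    # trainable
```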

Custom Vision Architectures

OpenCLIP supports various vision encoder architectures:

Vision Transformer (ViT)

Standard Vision Transformer configuration:
{
    "vision_cfg": {
        "image_size": 224,
        "layers": 12,
        "width": 768,
        "patch_size": 16,
        "head_width": 64
    }
}
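head_width sets the per-head dimension of the vision tower; OpenCLIP derives the number of attention heads as width // head_width. The arithmetic for the config above, as a plain illustration:

```python
width, head_width = 768, 64
image_size, patch_size = 224, 16

heads = width // head_width                 # attention heads per layer
patches = (image_size // patch_size) ** 2   # patch tokens per image (plus 1 class token)

assert heads == 12
assert patches == 196
```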

ConvNeXt

Using timm models for vision encoding:
{
    "vision_cfg": {
        "timm_model_name": "convnext_base",
        "timm_model_pretrained": true,
        "timm_pool": "avg",
        "timm_proj": "linear",
        "image_size": 256
    }
}

Creating Models Programmatically

You can also create custom models directly in Python:
import open_clip
import json
from pathlib import Path

# Define custom config
custom_config = {
    "embed_dim": 768,
    "vision_cfg": {
        "image_size": 256,
        "layers": 16,
        "width": 1024,
        "patch_size": 16
    },
    "text_cfg": {
        "context_length": 77,
        "vocab_size": 49408,
        "width": 768,
        "heads": 12,
        "layers": 16
    }
}

# Save config to file
config_path = Path('custom_model.json')
with open(config_path, 'w') as f:
    json.dump(custom_config, f, indent=2)

# Add config and create model
open_clip.add_model_config(config_path)
model, _, preprocess = open_clip.create_model_and_transforms(
    'custom_model',
    pretrained=None
)

Available Model Configs

To see all available model configurations:
import open_clip

# List all available model architectures
models = open_clip.list_models()
print(models)

# List (model_name, pretrained_tag) pairs with downloadable weights
print(open_clip.list_pretrained())

Best Practices

  1. Embed Dimension: embed_dim defines the shared projection space for both towers; if you plan to load pre-trained projection weights, it must match theirs
  2. Model Naming: Use descriptive names that indicate architecture (e.g., roberta-ViT-B-32)
  3. Configuration Testing: Test custom configs with small datasets before full training
  4. Pre-trained Weights: When using HuggingFace models, leverage their pre-trained weights for better initialization
  5. Layer Freezing: Start with more frozen layers and gradually unfreeze for fine-tuning

Example: Training Custom Model

Complete example training a custom model with RoBERTa text encoder:
python -m open_clip_train.main \
    --train-data "pipe:aws s3 cp s3://bucket/data/{00000..00329}.tar -" \
    --train-num-samples 3000000 \
    --dataset-type webdataset \
    --batch-size 256 \
    --warmup 2000 \
    --epochs 10 \
    --lr 5e-4 \
    --precision amp \
    --workers 6 \
    --model "roberta-ViT-B-32" \
    --hf-tokenizer-name "roberta-base" \
    --lock-text \
    --lock-text-unlocked-layers 10 \
    --name "custom-clip" \
    --report-to "tensorboard"
This configuration:
  • Uses RoBERTa as the text encoder
  • Keeps the first layers of RoBERTa frozen, unfreezing the last 10 layers
  • Trains on data from S3
  • Uses automatic mixed precision for efficiency
  • Reports metrics to TensorBoard
