OpenCLIP supports creating custom model architectures through JSON configuration files and flexible model building APIs. You can define custom vision encoders, text encoders, or use pre-trained models from HuggingFace as text encoders.

Model Configuration Files

Model architectures are defined in JSON configuration files located in src/open_clip/model_configs/. Each config file specifies the model’s architecture parameters.

Basic Model Config Structure

{
    "embed_dim": 512,
    "vision_cfg": {
        "image_size": 224,
        "layers": 12,
        "width": 768,
        "patch_size": 32
    },
    "text_cfg": {
        "context_length": 77,
        "vocab_size": 49408,
        "width": 512,
        "heads": 8,
        "layers": 12
    }
}

Key Parameters

  • embed_dim: The dimension of the joint embedding space where image and text features are projected
  • vision_cfg: Configuration for the vision encoder
    • image_size: Input image resolution
    • layers: Number of transformer layers
    • width: Hidden dimension size
    • patch_size: Size of image patches for Vision Transformer
  • text_cfg: Configuration for the text encoder
    • context_length: Maximum text sequence length
    • vocab_size: Size of the vocabulary
    • width: Hidden dimension size
    • heads: Number of attention heads
    • layers: Number of transformer layers
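These parameters must satisfy a couple of structural constraints: the transformer width must be divisible by the number of attention heads (each head gets width // heads dimensions), and the image size must be divisible by the patch size (a ViT tiles the image into non-overlapping patches). The helper below is a hypothetical sanity check, not part of OpenCLIP:

```python
def check_clip_config(cfg):
    """Hypothetical sanity checks for a CLIP-style model config dict."""
    errors = []
    vision, text = cfg["vision_cfg"], cfg["text_cfg"]
    # Each attention head gets width // heads dimensions, so width must divide evenly.
    if text["width"] % text["heads"] != 0:
        errors.append("text width must be divisible by heads")
    # A ViT tiles the image into non-overlapping patches.
    if vision["image_size"] % vision["patch_size"] != 0:
        errors.append("image_size must be divisible by patch_size")
    if "embed_dim" not in cfg:
        errors.append("embed_dim is required")
    return errors

# The basic config from above passes all checks.
cfg = {
    "embed_dim": 512,
    "vision_cfg": {"image_size": 224, "layers": 12, "width": 768, "patch_size": 32},
    "text_cfg": {"context_length": 77, "vocab_size": 49408,
                 "width": 512, "heads": 8, "layers": 12},
}
assert check_clip_config(cfg) == []
```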

Adding Custom Model Configs

You can add your own model configurations using the add_model_config() function:
import open_clip
from pathlib import Path

# Add a directory containing model config JSON files
open_clip.add_model_config(Path('/path/to/model_configs/'))

# Or add a single config file
open_clip.add_model_config(Path('/path/to/my_model.json'))

# Now you can use your custom model
model, _, preprocess = open_clip.create_model_and_transforms(
    'my_model',
    pretrained=None
)

Using HuggingFace Models as Text Encoders

OpenCLIP allows you to use any HuggingFace transformer model as the text encoder. This is useful for leveraging pre-trained language models or multilingual models.

HuggingFace Text Encoder Config

{
    "embed_dim": 512,
    "quick_gelu": true,
    "vision_cfg": {
        "image_size": 224,
        "layers": 12,
        "width": 768,
        "patch_size": 32
    },
    "text_cfg": {
        "hf_model_name": "roberta-base",
        "hf_tokenizer_name": "roberta-base",
        "hf_pooler_type": "mean_pooler"
    }
}
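The same text_cfg keys work with multilingual checkpoints. For example, an illustrative variant that swaps in the multilingual xlm-roberta-base checkpoint:

```json
{
    "embed_dim": 512,
    "vision_cfg": {
        "image_size": 224,
        "layers": 12,
        "width": 768,
        "patch_size": 32
    },
    "text_cfg": {
        "hf_model_name": "xlm-roberta-base",
        "hf_tokenizer_name": "xlm-roberta-base",
        "hf_pooler_type": "mean_pooler"
    }
}
```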

Training with HuggingFace Text Encoder

When training with a HuggingFace model as the text encoder, use the --hf-tokenizer-name parameter to specify the tokenizer:
python -m open_clip_train.main \
    --model "roberta-ViT-B-32" \
    --hf-tokenizer-name "roberta-base" \
    --train-data "/path/to/train_data.tar" \
    --dataset-type webdataset \
    --batch-size 256 \
    --epochs 10 \
    --lr 5e-4

Freezing and Unfreezing Layers

You can control which layers of the text encoder are trainable:
python -m open_clip_train.main \
    --model "roberta-ViT-B-32" \
    --hf-tokenizer-name "roberta-base" \
    --lock-text \
    --lock-text-unlocked-layers 10 \
    --train-data "/path/to/train_data.tar" \
    --batch-size 256 \
    --epochs 10
Parameters:
  • --lock-text: Freeze the entire text encoder
  • --lock-text-unlocked-layers N: Leave the last N layer groups unfrozen for fine-tuning
  • --lock-text-freeze-layer-norm: Also freeze LayerNorm parameters in the locked layers (LayerNorm has no running statistics, so this freezes its affine weights)
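The layer-group semantics of --lock-text-unlocked-layers can be sketched in plain Python. This is a hypothetical illustration of the freezing pattern, not OpenCLIP's actual implementation:

```python
def plan_text_freezing(layer_groups, unlocked_layers):
    """Map each ordered layer-group name to whether it stays trainable.

    Freezes everything, then re-enables the last `unlocked_layers` groups,
    mirroring --lock-text combined with --lock-text-unlocked-layers N.
    """
    n = len(layer_groups)
    return {name: i >= n - unlocked_layers for i, name in enumerate(layer_groups)}

# roberta-base has 12 transformer layers; unlock the last 10.
groups = [f"encoder.layer.{i}" for i in range(12)]
plan = plan_text_freezing(groups, unlocked_layers=10)
assert plan["encoder.layer.0"] is False   # frozen
assert plan["encoder.layer.2"] is True    # trainable
```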

Custom Vision Architectures

OpenCLIP supports various vision encoder architectures:

Vision Transformer (ViT)

Standard Vision Transformer configuration:
{
    "vision_cfg": {
        "image_size": 224,
        "layers": 12,
        "width": 768,
        "patch_size": 16,
        "head_width": 64
    }
}
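head_width sets the per-head dimension of the vision tower; OpenCLIP derives the number of attention heads as width // head_width. The arithmetic for the config above, as a plain illustration:

```python
width, head_width = 768, 64
image_size, patch_size = 224, 16

heads = width // head_width                 # attention heads per layer
patches = (image_size // patch_size) ** 2   # patch tokens per image (plus 1 class token)

assert heads == 12
assert patches == 196
```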

ConvNeXt

Using timm models for vision encoding:
{
    "vision_cfg": {
        "timm_model_name": "convnext_base",
        "timm_model_pretrained": true,
        "timm_pool": "avg",
        "timm_proj": "linear",
        "image_size": 256
    }
}

Creating Models Programmatically

You can also create custom models directly in Python:
import open_clip
import json
from pathlib import Path

# Define custom config
custom_config = {
    "embed_dim": 768,
    "vision_cfg": {
        "image_size": 256,
        "layers": 16,
        "width": 1024,
        "patch_size": 16
    },
    "text_cfg": {
        "context_length": 77,
        "vocab_size": 49408,
        "width": 768,
        "heads": 12,
        "layers": 16
    }
}

# Save config to file
config_path = Path('custom_model.json')
with open(config_path, 'w') as f:
    json.dump(custom_config, f, indent=2)

# Add config and create model
open_clip.add_model_config(config_path)
model, _, preprocess = open_clip.create_model_and_transforms(
    'custom_model',
    pretrained=None
)

Available Model Configs

To see all available model configurations:
import open_clip

# List all available model architectures
models = open_clip.list_models()
print(models)

# List (model_name, pretrained_tag) pairs with downloadable weights
print(open_clip.list_pretrained())

Best Practices

  1. Embed Dimension: embed_dim defines the shared projection space for both towers; if you plan to load pre-trained projection weights, it must match theirs
  2. Model Naming: Use descriptive names that indicate architecture (e.g., roberta-ViT-B-32)
  3. Configuration Testing: Test custom configs with small datasets before full training
  4. Pre-trained Weights: When using HuggingFace models, leverage their pre-trained weights for better initialization
  5. Layer Freezing: Start with more frozen layers and gradually unfreeze for fine-tuning

Example: Training Custom Model

Complete example training a custom model with RoBERTa text encoder:
python -m open_clip_train.main \
    --train-data "pipe:aws s3 cp s3://bucket/data/{00000..00329}.tar -" \
    --train-num-samples 3000000 \
    --dataset-type webdataset \
    --batch-size 256 \
    --warmup 2000 \
    --epochs 10 \
    --lr 5e-4 \
    --precision amp \
    --workers 6 \
    --model "roberta-ViT-B-32" \
    --hf-tokenizer-name "roberta-base" \
    --lock-text \
    --lock-text-unlocked-layers 10 \
    --name "custom-clip" \
    --report-to "tensorboard"
This configuration:
  • Uses RoBERTa as the text encoder
  • Keeps the first layers of RoBERTa frozen, unfreezing the last 10 layers
  • Trains on data from S3
  • Uses automatic mixed precision for efficiency
  • Reports metrics to TensorBoard
