
Overview

Returns a tokenizer instance based on the model name or identifier schema. Automatically selects the appropriate tokenizer type (HFTokenizer, SigLipTokenizer, or SimpleTokenizer) based on the model configuration.

Function Signature

def get_tokenizer(
    model_name: str = '',
    context_length: Optional[int] = None,
    cache_dir: Optional[str] = None,
    **kwargs
) -> Union[SimpleTokenizer, HFTokenizer, SigLipTokenizer]

Parameters

model_name
str
default: ''
Model identifier that determines which tokenizer to use. Supports multiple schemas:
  • 'ViT-B-32': Built-in model name (looks up config)
  • 'hf-hub:org/repo': Load from Hugging Face Hub
  • 'local-dir:/path/to/folder': Load from local directory
context_length
int
default: None
Maximum sequence length for tokenization. If None, uses the value from model config or defaults to 77.
cache_dir
str
default: None
Directory to cache downloaded tokenizer files when loading from Hugging Face Hub.
**kwargs
dict
Additional tokenizer-specific keyword arguments passed to the tokenizer constructor. Overrides config values.
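The override behavior can be pictured as a plain dict merge, where call-site kwargs win over values from the model config. This is a minimal sketch (the helper name and config keys are illustrative, not the actual open_clip internals):

```python
def merge_tokenizer_kwargs(config_kwargs: dict, call_kwargs: dict) -> dict:
    """Hypothetical helper: kwargs passed to get_tokenizer take
    precedence over tokenizer kwargs found in the model config."""
    merged = dict(config_kwargs)  # start from the config values
    merged.update(call_kwargs)    # call-site kwargs win on conflict
    return merged

# A call-site clean= overrides the config's clean=; other config
# values pass through untouched.
merged = merge_tokenizer_kwargs(
    {'clean': 'lowercase', 'strip_sep_token': True},
    {'clean': 'canonicalize'},
)
print(merged)  # {'clean': 'canonicalize', 'strip_sep_token': True}
```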

Returns

tokenizer
Union[SimpleTokenizer, HFTokenizer, SigLipTokenizer]
Tokenizer instance appropriate for the specified model:
  • SimpleTokenizer: Default OpenAI CLIP tokenizer (BPE-based)
  • HFTokenizer: Hugging Face transformers tokenizer wrapper
  • SigLipTokenizer: SigLIP T5-compatible sentencepiece tokenizer

Examples

Get tokenizer for built-in model

import open_clip

# Get default tokenizer for ViT-B-32
tokenizer = open_clip.get_tokenizer('ViT-B-32')

# Tokenize text
texts = ["a photo of a cat", "a photo of a dog"]
tokens = tokenizer(texts)
print(tokens.shape)  # torch.Size([2, 77])

Load from Hugging Face Hub

# Load tokenizer from HF Hub model
tokenizer = open_clip.get_tokenizer(
    'hf-hub:laion/CLIP-ViT-B-32-laion2B-s34B-b79K',
    cache_dir='./hf_cache'
)

Load from local directory

# Load from local model directory
tokenizer = open_clip.get_tokenizer(
    'local-dir:/path/to/model',
    context_length=77
)

Custom context length

# Override default context length
tokenizer = open_clip.get_tokenizer(
    'ViT-L-14',
    context_length=128  # Longer sequences
)

Pass custom tokenizer kwargs

# Pass additional tokenizer arguments
tokenizer = open_clip.get_tokenizer(
    'hf-hub:openai/clip-vit-base-patch32',
    clean='canonicalize',  # Custom text cleaning
    additional_special_tokens=['<mask>']  # Add special tokens
)

Tokenizer Selection Logic

The function selects the tokenizer type in the following order of precedence:
  1. HFTokenizer: used if hf_tokenizer_name is specified in the model's text config
  2. SigLipTokenizer: used for models with 'siglip' in the name (when no HF tokenizer is specified)
  3. SimpleTokenizer: default fallback for OpenAI CLIP-style models
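The selection order above can be sketched as a small dispatch function. This is a simplified illustration returning class names (the real get_tokenizer returns tokenizer instances, and reads the text config from the resolved model config):

```python
def pick_tokenizer_type(model_name: str, text_cfg: dict) -> str:
    """Hypothetical helper mirroring the documented selection order."""
    if text_cfg.get('hf_tokenizer_name'):   # 1. explicit HF tokenizer in config
        return 'HFTokenizer'
    if 'siglip' in model_name.lower():      # 2. SigLIP models
        return 'SigLipTokenizer'
    return 'SimpleTokenizer'                # 3. default fallback

print(pick_tokenizer_type('ViT-B-32', {}))              # SimpleTokenizer
print(pick_tokenizer_type('ViT-B-16-SigLIP', {}))       # SigLipTokenizer
print(pick_tokenizer_type(
    'roberta-ViT-B-32',
    {'hf_tokenizer_name': 'roberta-base'},
))                                                      # HFTokenizer
```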

Schema Support

Built-in

Use model names from OpenCLIP’s model registry

HF Hub

Load from Hugging Face Hub: hf-hub:org/repo

Local

Load from local directory: local-dir:/path

Notes

  • For local-dir schema, an open_clip_config.json file must exist in the directory
  • Context length priority: function argument > model config > default (77)
  • Tokenizer kwargs from the function call override those in the model config
  • If model config cannot be loaded, falls back to SimpleTokenizer with default settings
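The context-length priority described above can be sketched as a simple resolver (a hypothetical helper for illustration, not the library's code):

```python
from typing import Optional

DEFAULT_CONTEXT_LENGTH = 77  # documented fallback

def resolve_context_length(arg: Optional[int],
                           config_value: Optional[int]) -> int:
    """Hypothetical helper: function argument > model config > default."""
    if arg is not None:
        return arg
    if config_value is not None:
        return config_value
    return DEFAULT_CONTEXT_LENGTH

print(resolve_context_length(128, 64))    # 128: explicit argument wins
print(resolve_context_length(None, 64))   # 64: falls back to model config
print(resolve_context_length(None, None)) # 77: documented default
```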
