
Overview

Returns a tokenizer instance based on the model name or identifier schema. Automatically selects the appropriate tokenizer type (HFTokenizer, SigLipTokenizer, or SimpleTokenizer) based on the model configuration.

Function Signature

def get_tokenizer(
    model_name: str = '',
    context_length: Optional[int] = None,
    cache_dir: Optional[str] = None,
    **kwargs
) -> Union[SimpleTokenizer, HFTokenizer, SigLipTokenizer]

Parameters

model_name
str
default: ''
Model identifier that determines which tokenizer to use. Supports multiple schemas:
  • 'ViT-B-32': Built-in model name (looks up config)
  • 'hf-hub:org/repo': Load from Hugging Face Hub
  • 'local-dir:/path/to/folder': Load from local directory
context_length
int
default: None
Maximum sequence length for tokenization. If None, uses the value from model config or defaults to 77.
cache_dir
str
default: None
Directory to cache downloaded tokenizer files when loading from Hugging Face Hub.
**kwargs
dict
Additional tokenizer-specific keyword arguments passed to the tokenizer constructor. Overrides config values.
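The override behavior can be pictured as a plain dict merge, where call-site kwargs win over values from the model config. This is a minimal sketch (the helper name and config keys are illustrative, not the actual open_clip internals):

```python
def merge_tokenizer_kwargs(config_kwargs: dict, call_kwargs: dict) -> dict:
    """Hypothetical helper: kwargs passed to get_tokenizer take
    precedence over tokenizer kwargs found in the model config."""
    merged = dict(config_kwargs)  # start from the config values
    merged.update(call_kwargs)    # call-site kwargs win on conflict
    return merged

# A call-site clean= overrides the config's clean=; other config
# values pass through untouched.
merged = merge_tokenizer_kwargs(
    {'clean': 'lowercase', 'strip_sep_token': True},
    {'clean': 'canonicalize'},
)
print(merged)  # {'clean': 'canonicalize', 'strip_sep_token': True}
```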

Returns

tokenizer
Union[SimpleTokenizer, HFTokenizer, SigLipTokenizer]
Tokenizer instance appropriate for the specified model:
  • SimpleTokenizer: Default OpenAI CLIP tokenizer (BPE-based)
  • HFTokenizer: Hugging Face transformers tokenizer wrapper
  • SigLipTokenizer: SigLIP T5-compatible sentencepiece tokenizer

Examples

Get tokenizer for built-in model

import open_clip

# Get default tokenizer for ViT-B-32
tokenizer = open_clip.get_tokenizer('ViT-B-32')

# Tokenize text
texts = ["a photo of a cat", "a photo of a dog"]
tokens = tokenizer(texts)
print(tokens.shape)  # torch.Size([2, 77])

Load from Hugging Face Hub

# Load tokenizer from HF Hub model
tokenizer = open_clip.get_tokenizer(
    'hf-hub:laion/CLIP-ViT-B-32-laion2B-s34B-b79K',
    cache_dir='./hf_cache'
)

Load from local directory

# Load from local model directory
tokenizer = open_clip.get_tokenizer(
    'local-dir:/path/to/model',
    context_length=77
)

Custom context length

# Override default context length
tokenizer = open_clip.get_tokenizer(
    'ViT-L-14',
    context_length=128  # Longer sequences
)

Pass custom tokenizer kwargs

# Pass additional tokenizer arguments
tokenizer = open_clip.get_tokenizer(
    'hf-hub:openai/clip-vit-base-patch32',
    clean='canonicalize',  # Custom text cleaning
    additional_special_tokens=['<mask>']  # Add special tokens
)

Tokenizer Selection Logic

The function selects the tokenizer type in the following order of precedence:
  1. HFTokenizer: used if hf_tokenizer_name is specified in the model's text config
  2. SigLipTokenizer: used for models with 'siglip' in the name (when no HF tokenizer is specified)
  3. SimpleTokenizer: default fallback for OpenAI CLIP-style models
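The selection order above can be sketched as a small dispatch function. This is a simplified illustration returning class names (the real get_tokenizer returns tokenizer instances, and reads the text config from the resolved model config):

```python
def pick_tokenizer_type(model_name: str, text_cfg: dict) -> str:
    """Hypothetical helper mirroring the documented selection order."""
    if text_cfg.get('hf_tokenizer_name'):   # 1. explicit HF tokenizer in config
        return 'HFTokenizer'
    if 'siglip' in model_name.lower():      # 2. SigLIP models
        return 'SigLipTokenizer'
    return 'SimpleTokenizer'                # 3. default fallback

print(pick_tokenizer_type('ViT-B-32', {}))              # SimpleTokenizer
print(pick_tokenizer_type('ViT-B-16-SigLIP', {}))       # SigLipTokenizer
print(pick_tokenizer_type(
    'roberta-ViT-B-32',
    {'hf_tokenizer_name': 'roberta-base'},
))                                                      # HFTokenizer
```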

Schema Support

Built-in

Use model names from OpenCLIP’s model registry

HF Hub

Load from Hugging Face Hub: hf-hub:org/repo

Local

Load from local directory: local-dir:/path

Notes

  • For local-dir schema, an open_clip_config.json file must exist in the directory
  • Context length priority: function argument > model config > default (77)
  • Tokenizer kwargs from the function call override those in the model config
  • If model config cannot be loaded, falls back to SimpleTokenizer with default settings
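The context-length priority described above can be sketched as a simple resolver (a hypothetical helper for illustration, not the library's code):

```python
from typing import Optional

DEFAULT_CONTEXT_LENGTH = 77  # documented fallback

def resolve_context_length(arg: Optional[int],
                           config_value: Optional[int]) -> int:
    """Hypothetical helper: function argument > model config > default."""
    if arg is not None:
        return arg
    if config_value is not None:
        return config_value
    return DEFAULT_CONTEXT_LENGTH

print(resolve_context_length(128, 64))    # 128: explicit argument wins
print(resolve_context_length(None, 64))   # 64: falls back to model config
print(resolve_context_length(None, None)) # 77: documented default
```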
