Overview
Returns a tokenizer instance based on the model name or identifier schema. Automatically selects the appropriate tokenizer type (HFTokenizer, SigLipTokenizer, or SimpleTokenizer) based on the model configuration.

Function Signature
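The signature code block appears to be missing here; the following is a sketch inferred from the parameters documented below (the exact argument names and defaults are assumptions, not copied from the library source):

```python
from typing import Optional


def get_tokenizer(
    model_name: str = '',
    context_length: Optional[int] = None,
    cache_dir: Optional[str] = None,
    **kwargs,
):
    """Return a tokenizer instance for the given model identifier (sketch)."""
    ...
```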
Parameters
model_name: Model identifier that determines which tokenizer to use. Supports multiple schemas:
- 'ViT-B-32': Built-in model name (looks up config)
- 'hf-hub:org/repo': Load from Hugging Face Hub
- 'local-dir:/path/to/folder': Load from local directory
context_length: Maximum sequence length for tokenization. If None, uses the value from the model config or defaults to 77.
cache_dir: Directory in which to cache downloaded tokenizer files when loading from the Hugging Face Hub.
**kwargs: Additional tokenizer-specific keyword arguments passed to the tokenizer constructor. These override values from the model config.
Returns
Tokenizer instance appropriate for the specified model:
- SimpleTokenizer: Default OpenAI CLIP tokenizer (BPE-based)
- HFTokenizer: Hugging Face transformers tokenizer wrapper
- SigLipTokenizer: SigLIP T5-compatible sentencepiece tokenizer
Examples
Get tokenizer for built-in model
Load from Hugging Face Hub
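A sketch of the Hub usage; 'org/repo' is a placeholder (not a real repository), and the call itself is shown commented because the first use downloads tokenizer files over the network:

```python
# 'org/repo' is a placeholder; substitute a real Hugging Face Hub repo id.
model_id = 'hf-hub:org/repo'

# Downloads tokenizer files into cache_dir on first use (network required):
# tokenizer = open_clip.get_tokenizer(model_id, cache_dir='/tmp/tokenizer-cache')
```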
Load from local directory
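A sketch of the local-directory usage; the path is hypothetical, and per the notes below the directory must contain an open_clip_config.json file:

```python
# Hypothetical directory; it must contain an open_clip_config.json file
# alongside the tokenizer files.
folder = '/path/to/folder'
model_id = 'local-dir:' + folder

# tokenizer = open_clip.get_tokenizer(model_id)
```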
Custom context length
Pass custom tokenizer kwargs
Tokenizer Selection Logic
The function automatically selects the tokenizer type based on:
- HFTokenizer: Used if hf_tokenizer_name is specified in the model's text config
- SigLipTokenizer: Used for models with 'siglip' in the name (when no HF tokenizer specified)
- SimpleTokenizer: Default fallback for OpenAI CLIP models
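The rules above can be sketched as a standalone function (this is a paraphrase of the documented behavior, not the library's source; the hf_tokenizer_name key comes from the model's text config as described above):

```python
def select_tokenizer_type(model_name: str, text_config: dict) -> str:
    """Sketch of the tokenizer selection rules described above."""
    if 'hf_tokenizer_name' in text_config:
        return 'HFTokenizer'      # explicit HF tokenizer named in the text config
    if 'siglip' in model_name.lower():
        return 'SigLipTokenizer'  # SigLIP models use a sentencepiece tokenizer
    return 'SimpleTokenizer'      # default OpenAI CLIP BPE tokenizer
```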
Schema Support
Built-in
Use model names from OpenCLIP’s model registry
HF Hub
Load from Hugging Face Hub: hf-hub:org/repo

Local
Load from local directory: local-dir:/path

Notes
- For the local-dir schema, an open_clip_config.json file must exist in the directory
- Context length priority: function argument > model config > default (77)
- Tokenizer kwargs from the function call override those in the model config
- If model config cannot be loaded, falls back to SimpleTokenizer with default settings
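The context-length priority in the notes can be sketched as a small helper (hypothetical function, not library code):

```python
from typing import Optional


def resolve_context_length(arg: Optional[int], config: dict) -> int:
    """Priority: function argument > model config > default (77)."""
    if arg is not None:
        return arg
    return config.get('context_length', 77)
```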
