Overview

Converts text strings into token tensors suitable for CLIP text encoders. Uses open_clip's default SimpleTokenizer with byte-pair encoding (BPE).

Function Signature

def tokenize(
    texts: Union[str, List[str]], 
    context_length: int = 77
) -> torch.LongTensor

Parameters

texts
Union[str, List[str]]
required
Input text string or list of text strings to tokenize. Text is automatically cleaned and normalized.
context_length
int
default: 77
Maximum sequence length for tokenization. Sequences longer than this are truncated. Default is 77 (standard for CLIP).

Returns

tokens
torch.LongTensor
2D tensor of token IDs with shape [batch_size, context_length]. Each sequence includes:
  • Start-of-text token (position 0)
  • Encoded text tokens
  • End-of-text token
  • Zero padding (if sequence is shorter than context_length)

Examples

Basic tokenization

import open_clip

# Tokenize single text
text = "a photo of a cat"
tokens = open_clip.tokenize(text)
print(tokens.shape)  # torch.Size([1, 77])
print(tokens[0, :10])  # First 10 tokens

Batch tokenization

# Tokenize multiple texts
texts = [
    "a photo of a cat",
    "a photo of a dog",
    "a photo of a bird"
]
tokens = open_clip.tokenize(texts)
print(tokens.shape)  # torch.Size([3, 77])

Custom context length

# Use a longer context (note: the text encoder's positional embedding
# must support this length; standard CLIP models expect 77)
long_text = "a very detailed description with many words"
tokens = open_clip.tokenize(long_text, context_length=128)
print(tokens.shape)  # torch.Size([1, 128])

Complete inference example

import torch
import open_clip
from PIL import Image

# Load model
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k')
model.eval()

# Prepare text
texts = ["a cat", "a dog", "a bird"]
text_tokens = open_clip.tokenize(texts)

# Encode text
with torch.no_grad():
    text_features = model.encode_text(text_tokens)
    text_features /= text_features.norm(dim=-1, keepdim=True)

print(text_features.shape)  # torch.Size([3, 512])

Handle long text with truncation

# Long text is automatically truncated
long_description = " ".join(["word"] * 100)
tokens = open_clip.tokenize(long_description, context_length=77)

# Verify the sequence ends with the EOT token, which is placed at the
# last position when the input is truncated
eot_token_id = 49407
last_token = tokens[0, tokens[0].nonzero()[-1]]
print(f"Last non-zero token is EOT: {last_token.item() == eot_token_id}")

Token Structure

Each tokenized sequence has the following structure:
[SOT] [token_1] [token_2] ... [token_n] [EOT] [PAD] [PAD] ...
  • SOT: Start-of-text token (ID: 49406)
  • EOT: End-of-text token (ID: 49407)
  • PAD: Zero padding (ID: 0)
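The layout above can be sketched in plain Python. This is a simplified illustration of the assembly and truncation logic, not the actual open_clip implementation; `body` stands in for the BPE token IDs produced by encoding the text.

```python
# Simplified sketch of how a tokenized sequence is assembled.
# SOT_ID, EOT_ID, and PAD_ID match the values documented above.
SOT_ID, EOT_ID, PAD_ID = 49406, 49407, 0

def assemble(body, context_length=77):
    seq = [SOT_ID] + body + [EOT_ID]
    if len(seq) > context_length:
        # Truncate and force EOT into the last position
        seq = seq[:context_length]
        seq[-1] = EOT_ID
    return seq + [PAD_ID] * (context_length - len(seq))

print(assemble([320, 1125], context_length=8))
# [49406, 320, 1125, 49407, 0, 0, 0, 0]
print(assemble(list(range(1, 20)), context_length=8)[-1])
# 49407 (EOT kept at the last position after truncation)
```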

Text Preprocessing

The tokenizer automatically applies:
  1. Basic cleaning: Fixes text encoding issues with ftfy
  2. HTML unescaping: Decodes HTML entities
  3. Whitespace normalization: Removes extra whitespace
  4. Lowercasing: Converts text to lowercase (default behavior)
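Steps 2–4 can be approximated with the standard library. This is a sketch of the cleaning behavior only; the real tokenizer also runs ftfy's `fix_text` first (step 1), which is omitted here.

```python
import html
import re

def basic_clean(text):
    # Approximate the tokenizer's HTML unescaping, whitespace
    # normalization, and lowercasing (ftfy repair not included)
    text = html.unescape(text)        # decode HTML entities
    text = re.sub(r"\s+", " ", text)  # collapse whitespace runs
    return text.strip().lower()

print(basic_clean("  A   Photo &amp; a Cat\n"))  # "a photo & a cat"
```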

Notes

  • This function uses a module-level SimpleTokenizer instance
  • For custom tokenizers (HuggingFace, SigLIP), use get_tokenizer() instead
  • Sequences longer than context_length are truncated, with EOT token placed at the last position
  • Empty or very short texts still produce valid token sequences with SOT and EOT tokens
