Overview
Converts text strings into token tensors suitable for CLIP text encoders. Uses the default SimpleTokenizer with BPE encoding.
Function Signature
Parameters
Input text string or list of text strings to tokenize. Text is automatically cleaned and normalized.
Maximum sequence length for tokenization. Sequences longer than this are truncated. Default is 77 (standard for CLIP).
Returns
2D tensor of token IDs with shape [batch_size, context_length]. Each sequence includes:
- Start-of-text token (position 0)
- Encoded text tokens
- End-of-text token
- Zero padding (if sequence is shorter than context_length)
Examples
Basic tokenization
Batch tokenization
Custom context length
Complete inference example
Handle long text with truncation
Token Structure
Each tokenized sequence has the following structure:
- SOT: Start-of-text token (ID: 49406)
- EOT: End-of-text token (ID: 49407)
- PAD: Zero padding (ID: 0)
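The layout above can be illustrated with a minimal, stdlib-only sketch. It is not the library's implementation: `layout_row` is a hypothetical helper, the two middle token IDs are made-up placeholders rather than real BPE output, and plain Python lists stand in for tensors. Only the SOT, EOT, and PAD IDs come from the list above.

```python
SOT_ID = 49406  # start-of-text ID, as listed above
EOT_ID = 49407  # end-of-text ID, as listed above
PAD_ID = 0      # zero padding


def layout_row(encoded, context_length=77):
    """Hypothetical helper: wrap already-encoded IDs in SOT/EOT and zero-pad."""
    row = [SOT_ID] + list(encoded) + [EOT_ID]
    if len(row) > context_length:
        # Truncation is deliberately left out of this sketch.
        raise ValueError("sequence does not fit in context_length")
    return row + [PAD_ID] * (context_length - len(row))


# The IDs 320 and 1125 are placeholders, not real tokenizer output.
row = layout_row([320, 1125], context_length=8)
print(row)  # [49406, 320, 1125, 49407, 0, 0, 0, 0]
```

A small context_length is used here only to keep the printed row readable; the default of 77 behaves the same way.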
Text Preprocessing
The tokenizer automatically applies:
- Basic cleaning: Fixes text encoding issues with ftfy
- HTML unescaping: Decodes HTML entities
- Whitespace normalization: Removes extra whitespace
- Lowercasing: Converts text to lowercase (default behavior)
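As a rough illustration of the steps listed above, the following sketch approximates the cleaning pipeline using only the Python standard library. `clean_text` is a hypothetical name, the order of the steps is an assumption, and the ftfy-based encoding repair is omitted to keep the example dependency-free, so the real tokenizer may behave differently on malformed input.

```python
import html
import re


def clean_text(text: str) -> str:
    """Hypothetical helper approximating the cleaning steps listed above."""
    # The ftfy encoding repair is intentionally skipped in this sketch.
    text = html.unescape(text)        # decode HTML entities such as "&amp;"
    text = re.sub(r"\s+", " ", text)  # collapse runs of whitespace into one space
    return text.strip().lower()       # trim the ends and lowercase


print(clean_text("  A photo of a&amp;b   CAT\n"))  # a photo of a&b cat
```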
Notes
- This function uses a module-level SimpleTokenizer instance
- For custom tokenizers (HuggingFace, SigLIP), use get_tokenizer() instead
- Sequences longer than context_length are truncated, with the EOT token placed at the last position
- Empty or very short texts still produce valid token sequences with SOT and EOT tokens
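The truncation and short-input behavior described in these notes can be sketched as follows. This is an assumption-laden illustration, not the library's code: `fit_to_context` is a hypothetical helper that operates on plain lists, and the input rows are assumed to already start with SOT and end with EOT.

```python
EOT_ID = 49407  # end-of-text ID
PAD_ID = 0      # zero padding


def fit_to_context(row, context_length=77):
    """Hypothetical helper: truncate or zero-pad a SOT...EOT row."""
    if len(row) > context_length:
        row = row[:context_length]  # slicing copies, so the caller's list is untouched
        row[-1] = EOT_ID            # keep EOT at the last position after truncation
    return row + [PAD_ID] * (context_length - len(row))


# Placeholder IDs 1..10 stand in for a long encoded text.
long_row = [49406] + list(range(1, 11)) + [49407]
print(fit_to_context(long_row, context_length=6))        # [49406, 1, 2, 3, 4, 49407]

# An empty text still yields SOT, EOT, then padding.
print(fit_to_context([49406, 49407], context_length=6))  # [49406, 49407, 0, 0, 0, 0]
```

Overwriting the final slot means the last text token that would otherwise fit is dropped in favor of EOT, so at most context_length minus two text tokens survive truncation.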
See Also
- decode(): Convert token IDs back to text
- get_tokenizer(): Get model-specific tokenizers
