Overview
The TokenizationService class manages all tokenization operations in Tokenizador. It integrates with the tiktoken library to provide accurate token IDs and counts, with intelligent fallback mechanisms when tiktoken is unavailable.
This service supports 48 AI models from OpenAI, Anthropic, Google, Meta, and other providers.
Constructor
Creates a new TokenizationService instance and begins initialization.
- encoder: null (set after initialization)
- isInitialized: false (set to true when ready)
- initPromise: Promise for initialization tracking
- isRealTiktoken: indicates whether real tiktoken or the fallback is being used
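Based on the fields listed above, the constructor can be sketched as follows. This is a minimal illustration, not the actual implementation; initializeTokenizer is shown as a stub:

```javascript
class TokenizationService {
  constructor() {
    // Set by initializeTokenizer() once tiktoken is ready
    this.encoder = null;
    // Flipped to true when initialization finishes
    this.isInitialized = false;
    // Callers can await this promise via waitForInitialization()
    this.initPromise = this.initializeTokenizer();
    // true when the real tiktoken library loaded, false in fallback mode
    this.isRealTiktoken = false;
  }

  async initializeTokenizer() {
    // Stub: the real method locates and initializes the tiktoken encoder
    this.isInitialized = true;
  }

  waitForInitialization() {
    return this.initPromise;
  }
}
```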
Methods
initializeTokenizer()
Initializes the tiktoken encoder asynchronously. Resolves when the tokenizer is initialized (or the fallback is ready).
- Waits up to 10 seconds for tiktoken library to load
- Checks multiple locations: global context, window object
- Initializes the cl100k_base encoding (GPT-4 compatible)
- Performs a test tokenization to verify functionality
- Sets the isRealTiktoken flag based on tiktoken availability
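The locate-and-wait logic can be sketched like this. The global property name `tiktoken` and the polling interval are assumptions for illustration; only the 10-second timeout and the global/window lookup locations come from the description above:

```javascript
// Look for a tiktoken implementation in the documented locations.
// The property name "tiktoken" is an assumption for illustration.
function findTiktoken() {
  if (typeof globalThis !== "undefined" && globalThis.tiktoken) return globalThis.tiktoken;
  if (typeof window !== "undefined" && window.tiktoken) return window.tiktoken;
  return null;
}

// Poll for up to `timeoutMs` milliseconds, then give up so the
// service can switch to fallback tokenization.
async function waitForTiktoken(timeoutMs = 10000, intervalMs = 100) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const lib = findTiktoken();
    if (lib) return lib;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return null; // caller switches to fallbackTokenization()
}
```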
waitForInitialization()
Waits for the tokenizer to complete initialization. Resolves when initialization is complete.
tokenizeText()
Tokenizes text using the appropriate method for the specified model.

Parameters:
- The text to tokenize
- Model identifier (e.g., “gpt-4o”, “claude-3.5-sonnet”)

Returns an object containing:
- Array of token objects with text, type, ID, and metadata
- Total number of tokens

Each token object carries:
- The actual text of the token
- Token type: “palabra”, “subword”, “number”, “punctuation”, “special”, “espacio_en_blanco”
- Unique identifier for the token (e.g., “token_0”)
- Numeric token ID from tiktoken (or an approximation)
- Zero-based position in the token sequence
- Whether the token ID is approximate (true in fallback mode)
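Putting those fields together, a result for a short input might look like the object below. The field names are inferred from the descriptions above and may differ in the source; the numeric token ID is a made-up value:

```javascript
// Illustrative shape of a tokenizeText() result.
const result = {
  count: 1,
  tokens: [
    {
      text: "Hola",          // the token's actual text
      type: "palabra",        // one of the documented token types
      id: "token_0",          // unique identifier
      tokenId: 12345,         // numeric ID from tiktoken (illustrative value)
      index: 0,               // zero-based position
      isApproximate: false,   // true when running in fallback mode
    },
  ],
};
```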
createTokensFromEncoding()
Creates visual token objects from a tiktoken encoding.

Parameters:
- Original input text
- Array of token IDs from tiktoken.encode()
- Model identifier

Returns: Array of token objects for visualization
- Iterates through each encoded token ID
- Decodes individual tokens to get exact text
- Determines token type based on content
- Creates token object with metadata
- Marks tokens as approximate if using fallback
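The steps above can be sketched as follows. The decoder and classifier are passed in explicitly here so the sketch stays self-contained; in the real service they come from the tiktoken encoder and the determineTokenType() method documented below:

```javascript
// `decode` maps a single token ID back to its text; `classify` is a
// stand-in for the service's own determineTokenType().
function createTokensFromEncoding(text, tokenIds, decode, classify, isRealTiktoken) {
  return tokenIds.map((tokenId, index) => {
    const tokenText = decode(tokenId);     // exact text for this token
    return {
      text: tokenText,
      type: classify(tokenText),           // type based on content
      id: `token_${index}`,
      tokenId,
      index,
      isApproximate: !isRealTiktoken,      // fallback mode marks IDs approximate
    };
  });
}
```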
fallbackTokenization()
Provides tokenization when tiktoken is unavailable.

Parameters:
- Text to tokenize
- Model identifier

Returns: Object with tokens array and count
- Splits text into words and whitespace segments
- Uses heuristics to approximate token boundaries
- Generates deterministic IDs based on content
- Marks all tokens as isApproximate: true
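A simplified sketch of the fallback path, keeping whitespace runs as their own tokens. Real word splitting is handled by splitWordIntoTokens(), documented below; here every word is treated as a single approximate token to keep the example short:

```javascript
// Split on whitespace boundaries, keeping the whitespace segments
// (the capturing group in the regex preserves them in the result).
function fallbackTokenization(text) {
  const tokens = [];
  for (const segment of text.split(/(\s+)/)) {
    if (segment.length === 0) continue;
    if (/^\s+$/.test(segment)) {
      tokens.push({ text: segment, type: "espacio_en_blanco", isApproximate: true });
    } else {
      // Simplified: one token per word (the real service splits long words)
      tokens.push({ text: segment, type: "palabra", isApproximate: true });
    }
  }
  return { tokens, count: tokens.length };
}
```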
splitWordIntoTokens()
Splits a word into smaller tokens, simulating tiktoken behavior.

Parameters:
- Word to split into tokens
- Starting token index

Returns: Array of token objects
- Words ≤3 characters: single token
- Longer words: split based on ~2.8 characters per token ratio
- First chunk marked as “palabra”, subsequent as “subword”
- Dynamic chunk sizing based on remaining characters
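The rules above can be sketched directly. The chunk-size arithmetic is one plausible reading of “~2.8 characters per token” with dynamic sizing; the exact formula in the source may differ:

```javascript
// Words of up to 3 characters stay whole; longer words are split at
// roughly 2.8 characters per token, with chunk size derived from the
// word length so chunks come out evenly.
function splitWordIntoTokens(word, startIndex = 0) {
  if (word.length <= 3) {
    return [{ text: word, type: "palabra", index: startIndex }];
  }
  const chunkCount = Math.ceil(word.length / 2.8);
  const chunkSize = Math.ceil(word.length / chunkCount);
  const tokens = [];
  for (let i = 0; i < word.length; i += chunkSize) {
    tokens.push({
      text: word.slice(i, i + chunkSize),
      // First chunk keeps the word type; the rest are subwords
      type: i === 0 ? "palabra" : "subword",
      index: startIndex + tokens.length,
    });
  }
  return tokens;
}
```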
determineTokenType()
Determines the type of a token based on its content.

Parameter: Token text to classify

Returns: Token type: “number”, “punctuation”, “special”, or “palabra”
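A minimal classifier consistent with this description. Only the digits regex (^\d+$) is documented; the punctuation and special-character patterns below are assumptions for illustration:

```javascript
function determineTokenType(text) {
  if (/^\d+$/.test(text)) return "number";                          // documented: digits only
  if (/^[.,;:!?¡¿'"()\[\]{}-]+$/.test(text)) return "punctuation";  // assumed pattern
  if (/^[^\w\s]+$/.test(text)) return "special";                    // assumed pattern
  return "palabra";                                                 // default word token
}
```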
- number: the token contains only digits (matches ^\d+$)
- punctuation
- special
- palabra

createDeterministicId()
Creates a deterministic numeric ID for fallback tokens.

Parameters:
- Token text
- Token index

Returns: Deterministic ID in the range 10000–109999
- Generates simple hash from character codes
- Combines with index for uniqueness
- Normalizes to 5-digit range (10000-109999)
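The three steps above can be sketched as follows; the specific hash constants are assumptions, but the range normalization matches the documented 10000–109999 window:

```javascript
// Hash the character codes, mix in the index, and normalize the result
// into the 10000-109999 range. Same input always yields the same ID.
function createDeterministicId(text, index) {
  let hash = 0;
  for (let i = 0; i < text.length; i++) {
    hash = (hash * 31 + text.charCodeAt(i)) >>> 0; // simple rolling hash
  }
  return 10000 + ((hash + index) % 100000);
}
```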
getTokenizerName()
Returns a human-readable name for a tokenizer encoding.

Parameter: Encoding identifier (e.g., “cl100k_base”)

Returns: Display name for the tokenizer
getAlgorithmName()
Returns a description of the tokenization algorithm for a model.

Parameter: Model identifier

Returns: Algorithm description
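Both getters are plausibly simple lookup tables. The entries below are illustrative examples only, not the service's actual display strings:

```javascript
// Hypothetical mapping from encoding identifier to display name.
const TOKENIZER_NAMES = {
  cl100k_base: "OpenAI cl100k_base (GPT-4 family)", // illustrative entry
  o200k_base: "OpenAI o200k_base (GPT-4o family)",  // illustrative entry
};

function getTokenizerName(encoding) {
  // Fall back to the raw identifier for unknown encodings
  return TOKENIZER_NAMES[encoding] || encoding;
}
```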
Token Types
The service classifies tokens into these categories:
- palabra: Standard word token
- subword: Part of a longer word
- palabra_con_espacio: Word with a leading space
- number: Numeric token
- punctuation: Punctuation marks
- special: Special characters
- espacio_en_blanco: Whitespace
- unknown: Decode failure
Usage Examples
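A typical call sequence looks like the demo() function below. A stand-in stub class is included so the example runs on its own; the real TokenizationService exposes the same methods with tiktoken behind them:

```javascript
// Stand-in with the documented interface: waitForInitialization()
// and tokenizeText(text, model). The stub tokenizes approximately.
class TokenizationServiceStub {
  constructor() { this.initPromise = Promise.resolve(); }
  waitForInitialization() { return this.initPromise; }
  tokenizeText(text, model) {
    const tokens = text.split(/(\s+)/).filter((s) => s.length > 0).map((t, index) => ({
      text: t,
      type: /^\s+$/.test(t) ? "espacio_en_blanco" : "palabra",
      id: `token_${index}`,
      index,
      isApproximate: true,
    }));
    return { tokens, count: tokens.length };
  }
}

async function demo() {
  const service = new TokenizationServiceStub();
  // Wait for tiktoken to load (or for fallback mode to engage)
  await service.waitForInitialization();
  const { tokens, count } = service.tokenizeText("Hola mundo", "gpt-4o");
  console.log(`Total tokens: ${count}`);
  for (const token of tokens) {
    console.log(token.index, token.type, JSON.stringify(token.text));
  }
  return { tokens, count };
}
```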
Model Support
The service supports multiple encoding strategies:

cl100k_base Models
- OpenAI: GPT-4, GPT-4 Turbo, GPT-3.5 Turbo
- Anthropic: Claude 3 Opus, Claude 3.5 Sonnet
- Meta: Llama 3.1 (with ratio adjustment)
Uses exact tiktoken encoding with model-specific token ratios.

o200k_base Models
- OpenAI: GPT-4o, GPT-4o Mini
Uses the newer tokenizer with improved efficiency.

Other Encodings
- Google: Gemini (SentencePiece approximation)
- Mistral: Mistral models (ratio-based approximation)
- Cohere: Command models (ratio-based approximation)
Uses fallback with model-specific token ratios.
Error Handling
The service gracefully falls back to approximate tokenization if tiktoken fails to load. Your application continues working with slightly reduced accuracy.
See Also
- TokenAnalyzer: Main application orchestrator
- StatisticsCalculator: Calculate costs and statistics
- Supported Models: View all 48 supported models
- Architecture: Understand the system design