Overview

The TokenizationService class manages all tokenization operations in Tokenizador. It integrates with the tiktoken library to provide accurate token IDs and counts, with intelligent fallback mechanisms when tiktoken is unavailable.
This service supports 48 AI models from OpenAI, Anthropic, Google, Meta, and other providers.

Constructor

Creates a new TokenizationService instance and begins initialization.
const service = new TokenizationService();
Properties initialized:
  • encoder: null (set after initialization)
  • isInitialized: false (set to true when ready)
  • initPromise: Promise for initialization tracking
  • isRealTiktoken: Indicates if real tiktoken or fallback is being used

Methods

initializeTokenizer()

Initializes the tiktoken encoder asynchronously.
async initializeTokenizer()
returns
Promise<void>
Resolves when the tokenizer is initialized (or the fallback is ready)
Initialization process:
  1. Waits up to 10 seconds for tiktoken library to load
  2. Checks multiple locations: global context, window object
  3. Initializes cl100k_base encoding (GPT-4 compatible)
  4. Performs test tokenization to verify functionality
  5. Sets isRealTiktoken flag based on tiktoken availability
const service = new TokenizationService();
await service.waitForInitialization();

if (service.isInitialized) {
  console.log('Tokenizer ready!');
  console.log('Using real tiktoken:', service.isRealTiktoken);
}
If tiktoken fails to load, the service automatically uses fallback tokenization. Token IDs will be marked as approximate.

waitForInitialization()

Waits for the tokenizer to complete initialization.
async waitForInitialization()
returns
Promise<void>
Resolves when initialization is complete
const service = new TokenizationService();

// Wait before tokenizing
await service.waitForInitialization();

// Now safe to tokenize
const result = await service.tokenizeText('Hello world', 'gpt-4o');

tokenizeText()

Tokenizes text using the appropriate method for the specified model.
async tokenizeText(text, modelId)
text
string
required
The text to tokenize
modelId
string
required
Model identifier (e.g., “gpt-4o”, “claude-3.5-sonnet”)
returns
Promise<Object>
Object containing tokens array and count
Return value structure:
tokens
Array<Object>
Array of token objects with text, type, ID, and metadata
count
number
Total number of tokens
Token object structure:
tokens[].text
string
The actual text of the token
tokens[].type
string
Token type: “palabra”, “subword”, “number”, “punctuation”, “special”, “espacio_en_blanco”
tokens[].id
string
Unique identifier for the token (e.g., “token_0”)
tokens[].tokenId
number
Numeric token ID from tiktoken (or approximation)
tokens[].index
number
Zero-based position in the token sequence
tokens[].isApproximate
boolean
True if token ID is approximate (fallback mode)
const service = new TokenizationService();
await service.waitForInitialization();

const result = await service.tokenizeText('Hello world!', 'gpt-4o');

console.log('Token count:', result.count);
console.log('Tokens:', result.tokens);

// Output:
// Token count: 3
// Tokens: [
//   { text: 'Hello', type: 'palabra', tokenId: 9906, ... },
//   { text: ' world', type: 'palabra_con_espacio', tokenId: 1917, ... },
//   { text: '!', type: 'punctuation', tokenId: 0, ... }
// ]
For models using cl100k_base encoding (GPT-4, Claude, etc.), token IDs are exact. For other models, counts are adjusted using model-specific ratios.
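A rough sketch of how such a ratio adjustment could work; the ratio values and helper name here are illustrative assumptions, not the service's actual internals:

```javascript
// Hypothetical model-specific ratios, derived from the percentages
// mentioned in getAlgorithmName() (~20% more for Claude, ~15% fewer for Llama).
const MODEL_RATIOS = {
  'claude-3.5-sonnet': 1.2,
  'llama-3.1-70b': 0.85,
};

function adjustCount(baseCount, modelId) {
  const ratio = MODEL_RATIOS[modelId] ?? 1.0; // default: exact cl100k count
  return Math.round(baseCount * ratio);
}

console.log(adjustCount(100, 'claude-3.5-sonnet')); // 120
console.log(adjustCount(100, 'gpt-4'));             // 100
```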

createTokensFromEncoding()

Creates visual token objects from tiktoken encoding.
createTokensFromEncoding(text, encoded, modelId)
text
string
required
Original input text
encoded
number[]
required
Array of token IDs from tiktoken.encode()
modelId
string
required
Model identifier
returns
Array<Object>
Array of token objects for visualization
Process:
  1. Iterates through each encoded token ID
  2. Decodes individual tokens to get exact text
  3. Determines token type based on content
  4. Creates token object with metadata
  5. Marks tokens as approximate if using fallback
const service = new TokenizationService();
await service.waitForInitialization();

const text = 'Hello world';
const encoded = service.encoder.encode(text);

const tokens = service.createTokensFromEncoding(text, encoded, 'gpt-4o');

console.log(tokens);
// [
//   {
//     text: 'Hello',
//     type: 'palabra',
//     id: 'token_0',
//     tokenId: 9906,
//     index: 0,
//     isApproximate: false
//   },
//   ...
// ]

fallbackTokenization()

Provides tokenization when tiktoken is unavailable.
fallbackTokenization(text, modelId)
text
string
required
Text to tokenize
modelId
string
required
Model identifier
returns
Object
Object with tokens array and count
Fallback strategy:
  • Splits text into words and whitespace segments
  • Uses heuristics to approximate token boundaries
  • Generates deterministic IDs based on content
  • Marks all tokens as isApproximate: true
Fallback tokenization provides approximate results: token IDs will not match actual tiktoken IDs, but counts remain reasonably accurate.
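The strategy above might look roughly like this; the function name and the characters-per-token heuristic are assumptions for illustration, not the service's actual code:

```javascript
// Hedged sketch of the fallback strategy: split into word/whitespace
// segments, chunk long words heuristically, mark everything approximate.
function fallbackTokenize(text) {
  const segments = text.match(/\S+|\s+/g) || [];
  const tokens = [];
  for (const seg of segments) {
    if (/^\s+$/.test(seg)) {
      tokens.push({ text: seg, type: 'espacio_en_blanco', isApproximate: true });
    } else {
      // Heuristic: chunk long words (~4 chars per chunk, an assumption)
      const chunks = seg.match(/.{1,4}/g);
      chunks.forEach((chunk, i) => {
        tokens.push({
          text: chunk,
          type: i === 0 ? 'palabra' : 'subword',
          isApproximate: true,
        });
      });
    }
  }
  return { tokens, count: tokens.length };
}
```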

splitWordIntoTokens()

Splits a word into smaller tokens simulating tiktoken behavior.
splitWordIntoTokens(word, startIndex)
word
string
required
Word to split into tokens
startIndex
number
required
Starting token index
returns
Array<Object>
Array of token objects
Algorithm:
  • Words ≤3 characters: single token
  • Longer words: split based on ~2.8 characters per token ratio
  • First chunk marked as “palabra”, subsequent as “subword”
  • Dynamic chunk sizing based on remaining characters
const service = new TokenizationService();

const tokens = service.splitWordIntoTokens('tokenization', 0);

console.log(tokens);
// [
//   { text: 'token', type: 'palabra', ... },
//   { text: 'iza', type: 'subword', ... },
//   { text: 'tion', type: 'subword', ... }
// ]

determineTokenType()

Determines the type of a token based on its content.
determineTokenType(text)
text
string
required
Token text to classify
returns
string
Token type: “number”, “punctuation”, “special”, or “palabra”
Classification rules:
  • Contains only digits (matches ^\d+$): “number”
determineTokenType('123') // => 'number'
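The digit rule can be sketched as follows; the punctuation and special-character patterns are assumptions, since only the digit regex is documented:

```javascript
// Plausible classification sketch. Only the ^\d+$ rule is documented;
// the other patterns are illustrative guesses.
function determineTokenType(text) {
  if (/^\d+$/.test(text)) return 'number';                 // only digits
  if (/^[.,;:!?¿¡'"()\[\]{}-]+$/.test(text)) return 'punctuation';
  if (/^[^\w\s]+$/.test(text)) return 'special';           // other symbols
  return 'palabra';                                        // default: word
}

console.log(determineTokenType('123')); // 'number'
```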

createDeterministicId()

Creates a deterministic numeric ID for fallback tokens.
createDeterministicId(text, index)
text
string
required
Token text
index
number
required
Token index
returns
number
Deterministic ID in range 10000-109999
Algorithm:
  1. Generates simple hash from character codes
  2. Combines with index for uniqueness
  3. Normalizes the result into the range 10000-109999
const service = new TokenizationService();

const id1 = service.createDeterministicId('hello', 0);
const id2 = service.createDeterministicId('hello', 1);
const id3 = service.createDeterministicId('world', 0);

console.log(id1); // e.g., 45712
console.log(id2); // e.g., 46712 (same text, different index)
console.log(id3); // e.g., 52341 (different text)
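An illustrative implementation of this scheme; the specific hash function and index mixing are assumptions, so the IDs it produces will not match the service's:

```javascript
// Sketch: simple char-code hash, mixed with the index, normalized
// into the documented 10000-109999 range.
function createDeterministicId(text, index) {
  let hash = 0;
  for (let i = 0; i < text.length; i++) {
    hash = (hash * 31 + text.charCodeAt(i)) >>> 0; // keep unsigned 32-bit
  }
  return 10000 + ((hash + index * 1000) % 100000);
}
```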

getTokenizerName()

Returns a human-readable name for a tokenizer encoding.
getTokenizerName(encoding)
encoding
string
required
Encoding identifier (e.g., “cl100k_base”)
returns
string
Display name for the tokenizer
const service = new TokenizationService();

console.log(service.getTokenizerName('o200k_base'));  // "Tokenizador GPT-4o"
console.log(service.getTokenizerName('cl100k_base')); // "Tokenizador GPT-4"
console.log(service.getTokenizerName('p50k_base'));   // "Tokenizador GPT-3"

getAlgorithmName()

Returns a description of the tokenization algorithm for a model.
getAlgorithmName(modelId)
modelId
string
required
Model identifier
returns
string
Algorithm description
const service = new TokenizationService();

console.log(service.getAlgorithmName('gpt-4o'));
// "o200k_base (GPT Más Reciente)"

console.log(service.getAlgorithmName('claude-3.5-sonnet'));
// "Tokenización Claude (~20% más tokens)"

console.log(service.getAlgorithmName('llama-3.1-70b'));
// "Tokenización Llama (~15% menos tokens)"

Token Types

The service classifies tokens into these categories:

  • palabra: Standard word token
  • subword: Part of a longer word
  • palabra_con_espacio: Word with a leading space
  • number: Numeric token
  • punctuation: Punctuation marks
  • special: Special characters
  • espacio_en_blanco: Whitespace
  • unknown: Token text that could not be decoded

Usage Examples

const service = new TokenizationService();
await service.waitForInitialization();

const result = await service.tokenizeText(
  'Hello, world!',
  'gpt-4o'
);

console.log(`Tokenized into ${result.count} tokens`);
result.tokens.forEach(token => {
  console.log(`"${token.text}" [${token.type}] ID: ${token.tokenId}`);
});

Model Support

The service supports multiple encoding strategies:

cl100k_base (exact tiktoken encoding, with model-specific token ratios):
  • OpenAI: GPT-4, GPT-4 Turbo, GPT-3.5 Turbo
  • Anthropic: Claude 3 Opus, Claude 3.5 Sonnet
  • Meta: Llama 3.1 (with ratio adjustment)

o200k_base (newer tokenizer with improved efficiency):
  • OpenAI: GPT-4o, GPT-4o Mini

Fallback with model-specific token ratios:
  • Google: Gemini (SentencePiece approximation)
  • Mistral: Mistral models (ratio-based approximation)
  • Cohere: Command models (ratio-based approximation)

Error Handling

const service = new TokenizationService();
await service.waitForInitialization();

if (!service.isInitialized) {
  console.error('Tokenization service failed to initialize');
  // Service will use fallback mode automatically
}

try {
  const result = await service.tokenizeText('test', 'gpt-4o');
  console.log('Tokenization successful:', result);
} catch (error) {
  console.error('Tokenization error:', error);
}
The service gracefully falls back to approximate tokenization if tiktoken fails to load. Your application continues working with slightly reduced accuracy.

See Also

TokenAnalyzer

Main application orchestrator

StatisticsCalculator

Calculate costs and statistics

Supported Models

View all 48 supported models

Architecture

Understand the system design
