Overview
The TokenizationService class manages all tokenization operations in Tokenizador. It integrates with the tiktoken library to provide accurate token IDs and counts, with intelligent fallback mechanisms when tiktoken is unavailable.
This service supports 48 AI models from OpenAI, Anthropic, Google, Meta, and other providers.
Constructor
Creates a new TokenizationService instance and begins initialization.
- encoder: null (set after initialization)
- isInitialized: false (set to true when ready)
- initPromise: Promise for initialization tracking
- isRealTiktoken: indicates whether real tiktoken or the fallback is being used
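Based on the fields listed above, the constructor can be sketched as follows. This is a minimal illustration, not the actual implementation; initializeTokenizer is shown as a stub:

```javascript
class TokenizationService {
  constructor() {
    // Set by initializeTokenizer() once tiktoken is ready
    this.encoder = null;
    // Flipped to true when initialization finishes
    this.isInitialized = false;
    // Callers can await this promise via waitForInitialization()
    this.initPromise = this.initializeTokenizer();
    // true when the real tiktoken library loaded, false in fallback mode
    this.isRealTiktoken = false;
  }

  async initializeTokenizer() {
    // Stub: the real method locates and initializes the tiktoken encoder
    this.isInitialized = true;
  }

  waitForInitialization() {
    return this.initPromise;
  }
}
```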
Methods
initializeTokenizer()
Initializes the tiktoken encoder asynchronously. Resolves when the tokenizer is initialized (or the fallback is ready).
- Waits up to 10 seconds for tiktoken library to load
- Checks multiple locations: global context, window object
- Initializes the cl100k_base encoding (GPT-4 compatible)
- Performs a test tokenization to verify functionality
- Sets the isRealTiktoken flag based on tiktoken availability
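The locate-and-wait logic can be sketched like this. The global property name `tiktoken` and the polling interval are assumptions for illustration; only the 10-second timeout and the global/window lookup locations come from the description above:

```javascript
// Look for a tiktoken implementation in the documented locations.
// The property name "tiktoken" is an assumption for illustration.
function findTiktoken() {
  if (typeof globalThis !== "undefined" && globalThis.tiktoken) return globalThis.tiktoken;
  if (typeof window !== "undefined" && window.tiktoken) return window.tiktoken;
  return null;
}

// Poll for up to `timeoutMs` milliseconds, then give up so the
// service can switch to fallback tokenization.
async function waitForTiktoken(timeoutMs = 10000, intervalMs = 100) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const lib = findTiktoken();
    if (lib) return lib;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return null; // caller switches to fallbackTokenization()
}
```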
waitForInitialization()
Waits for the tokenizer to complete initialization. Resolves when initialization is complete.
tokenizeText()
Tokenizes text using the appropriate method for the specified model.

Parameters:
- The text to tokenize
- Model identifier (e.g., “gpt-4o”, “claude-3.5-sonnet”)

Returns an object containing:
- Array of token objects with text, type, ID, and metadata
- Total number of tokens

Each token object carries:
- The actual text of the token
- Token type: “palabra”, “subword”, “number”, “punctuation”, “special”, “espacio_en_blanco”
- Unique identifier for the token (e.g., “token_0”)
- Numeric token ID from tiktoken (or an approximation)
- Zero-based position in the token sequence
- Whether the token ID is approximate (true in fallback mode)
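Putting those fields together, a result for a short input might look like the object below. The field names are inferred from the descriptions above and may differ in the source; the numeric token ID is a made-up value:

```javascript
// Illustrative shape of a tokenizeText() result.
const result = {
  count: 1,
  tokens: [
    {
      text: "Hola",          // the token's actual text
      type: "palabra",        // one of the documented token types
      id: "token_0",          // unique identifier
      tokenId: 12345,         // numeric ID from tiktoken (illustrative value)
      index: 0,               // zero-based position
      isApproximate: false,   // true when running in fallback mode
    },
  ],
};
```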
createTokensFromEncoding()
Creates visual token objects from a tiktoken encoding.

Parameters:
- Original input text
- Array of token IDs from tiktoken.encode()
- Model identifier

Returns: Array of token objects for visualization
- Iterates through each encoded token ID
- Decodes individual tokens to get exact text
- Determines token type based on content
- Creates token object with metadata
- Marks tokens as approximate if using fallback
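The steps above can be sketched as follows. The decoder and classifier are passed in explicitly here so the sketch stays self-contained; in the real service they come from the tiktoken encoder and the determineTokenType() method documented below:

```javascript
// `decode` maps a single token ID back to its text; `classify` is a
// stand-in for the service's own determineTokenType().
function createTokensFromEncoding(text, tokenIds, decode, classify, isRealTiktoken) {
  return tokenIds.map((tokenId, index) => {
    const tokenText = decode(tokenId);     // exact text for this token
    return {
      text: tokenText,
      type: classify(tokenText),           // type based on content
      id: `token_${index}`,
      tokenId,
      index,
      isApproximate: !isRealTiktoken,      // fallback mode marks IDs approximate
    };
  });
}
```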
fallbackTokenization()
Provides tokenization when tiktoken is unavailable.

Parameters:
- Text to tokenize
- Model identifier

Returns: Object with tokens array and count
- Splits text into words and whitespace segments
- Uses heuristics to approximate token boundaries
- Generates deterministic IDs based on content
- Marks all tokens as isApproximate: true
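A simplified sketch of the fallback path, keeping whitespace runs as their own tokens. Real word splitting is handled by splitWordIntoTokens(), documented below; here every word is treated as a single approximate token to keep the example short:

```javascript
// Split on whitespace boundaries, keeping the whitespace segments
// (the capturing group in the regex preserves them in the result).
function fallbackTokenization(text) {
  const tokens = [];
  for (const segment of text.split(/(\s+)/)) {
    if (segment.length === 0) continue;
    if (/^\s+$/.test(segment)) {
      tokens.push({ text: segment, type: "espacio_en_blanco", isApproximate: true });
    } else {
      // Simplified: one token per word (the real service splits long words)
      tokens.push({ text: segment, type: "palabra", isApproximate: true });
    }
  }
  return { tokens, count: tokens.length };
}
```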
splitWordIntoTokens()
Splits a word into smaller tokens, simulating tiktoken behavior.

Parameters:
- Word to split into tokens
- Starting token index

Returns: Array of token objects
- Words ≤3 characters: single token
- Longer words: split based on ~2.8 characters per token ratio
- First chunk marked as “palabra”, subsequent as “subword”
- Dynamic chunk sizing based on remaining characters
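The rules above can be sketched directly. The chunk-size arithmetic is one plausible reading of “~2.8 characters per token” with dynamic sizing; the exact formula in the source may differ:

```javascript
// Words of up to 3 characters stay whole; longer words are split at
// roughly 2.8 characters per token, with chunk size derived from the
// word length so chunks come out evenly.
function splitWordIntoTokens(word, startIndex = 0) {
  if (word.length <= 3) {
    return [{ text: word, type: "palabra", index: startIndex }];
  }
  const chunkCount = Math.ceil(word.length / 2.8);
  const chunkSize = Math.ceil(word.length / chunkCount);
  const tokens = [];
  for (let i = 0; i < word.length; i += chunkSize) {
    tokens.push({
      text: word.slice(i, i + chunkSize),
      // First chunk keeps the word type; the rest are subwords
      type: i === 0 ? "palabra" : "subword",
      index: startIndex + tokens.length,
    });
  }
  return tokens;
}
```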
determineTokenType()
Determines the type of a token based on its content.

Parameter: Token text to classify

Returns: Token type: “number”, “punctuation”, “special”, or “palabra”
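A minimal classifier consistent with this description. Only the digits regex (^\d+$) is documented; the punctuation and special-character patterns below are assumptions for illustration:

```javascript
function determineTokenType(text) {
  if (/^\d+$/.test(text)) return "number";                          // documented: digits only
  if (/^[.,;:!?¡¿'"()\[\]{}-]+$/.test(text)) return "punctuation";  // assumed pattern
  if (/^[^\w\s]+$/.test(text)) return "special";                    // assumed pattern
  return "palabra";                                                 // default word token
}
```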
- number: the token contains only digits (matches ^\d+$)
- punctuation
- special
- palabra

createDeterministicId()
Creates a deterministic numeric ID for fallback tokens.

Parameters:
- Token text
- Token index

Returns: Deterministic ID in the range 10000–109999
- Generates simple hash from character codes
- Combines with index for uniqueness
- Normalizes to 5-digit range (10000-109999)
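The three steps above can be sketched as follows; the specific hash constants are assumptions, but the range normalization matches the documented 10000–109999 window:

```javascript
// Hash the character codes, mix in the index, and normalize the result
// into the 10000-109999 range. Same input always yields the same ID.
function createDeterministicId(text, index) {
  let hash = 0;
  for (let i = 0; i < text.length; i++) {
    hash = (hash * 31 + text.charCodeAt(i)) >>> 0; // simple rolling hash
  }
  return 10000 + ((hash + index) % 100000);
}
```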
getTokenizerName()
Returns a human-readable name for a tokenizer encoding.

Parameter: Encoding identifier (e.g., “cl100k_base”)

Returns: Display name for the tokenizer
getAlgorithmName()
Returns a description of the tokenization algorithm for a model.

Parameter: Model identifier

Returns: Algorithm description
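Both getters are plausibly simple lookup tables. The entries below are illustrative examples only, not the service's actual display strings:

```javascript
// Hypothetical mapping from encoding identifier to display name.
const TOKENIZER_NAMES = {
  cl100k_base: "OpenAI cl100k_base (GPT-4 family)", // illustrative entry
  o200k_base: "OpenAI o200k_base (GPT-4o family)",  // illustrative entry
};

function getTokenizerName(encoding) {
  // Fall back to the raw identifier for unknown encodings
  return TOKENIZER_NAMES[encoding] || encoding;
}
```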
Token Types
The service classifies tokens into these categories:
- palabra: Standard word token
- subword: Part of a longer word
- palabra_con_espacio: Word with a leading space
- number: Numeric token
- punctuation: Punctuation marks
- special: Special characters
- espacio_en_blanco: Whitespace
- unknown: Decode failure
Usage Examples
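A typical call sequence looks like the demo() function below. A stand-in stub class is included so the example runs on its own; the real TokenizationService exposes the same methods with tiktoken behind them:

```javascript
// Stand-in with the documented interface: waitForInitialization()
// and tokenizeText(text, model). The stub tokenizes approximately.
class TokenizationServiceStub {
  constructor() { this.initPromise = Promise.resolve(); }
  waitForInitialization() { return this.initPromise; }
  tokenizeText(text, model) {
    const tokens = text.split(/(\s+)/).filter((s) => s.length > 0).map((t, index) => ({
      text: t,
      type: /^\s+$/.test(t) ? "espacio_en_blanco" : "palabra",
      id: `token_${index}`,
      index,
      isApproximate: true,
    }));
    return { tokens, count: tokens.length };
  }
}

async function demo() {
  const service = new TokenizationServiceStub();
  // Wait for tiktoken to load (or for fallback mode to engage)
  await service.waitForInitialization();
  const { tokens, count } = service.tokenizeText("Hola mundo", "gpt-4o");
  console.log(`Total tokens: ${count}`);
  for (const token of tokens) {
    console.log(token.index, token.type, JSON.stringify(token.text));
  }
  return { tokens, count };
}
```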
Model Support
The service supports multiple encoding strategies:

cl100k_base Models
- OpenAI: GPT-4, GPT-4 Turbo, GPT-3.5 Turbo
- Anthropic: Claude 3 Opus, Claude 3.5 Sonnet
- Meta: Llama 3.1 (with ratio adjustment)
Uses exact tiktoken encoding with model-specific token ratios.

o200k_base Models
- OpenAI: GPT-4o, GPT-4o Mini
Uses the newer tokenizer with improved efficiency.

Other Encodings
- Google: Gemini (SentencePiece approximation)
- Mistral: Mistral models (ratio-based approximation)
- Cohere: Command models (ratio-based approximation)
Uses fallback with model-specific token ratios.
Error Handling
The service gracefully falls back to approximate tokenization if tiktoken fails to load. Your application continues working with slightly reduced accuracy.
See Also
- TokenAnalyzer: Main application orchestrator
- StatisticsCalculator: Calculate costs and statistics
- Supported Models: View all 48 supported models
- Architecture: Understand the system design