Tokenizador is a professional, real-time token analyzer for AI language models. This guide will walk you through all the features and how to use them effectively.

Getting Started

Step 1: Select Your Model

Choose from 48 AI models across different providers. The model selector is organized by company:
The following providers appear in the dropdown:
  • OpenAI (GPT-4o, GPT-4o Mini, GPT-4 Turbo, GPT-4, GPT-3.5 Turbo)
  • Anthropic (Claude 3.5 Sonnet, Claude 3 Opus/Sonnet/Haiku)
  • Google (Gemini 1.5 Pro/Flash)
  • Meta (Llama 3.1 405B/70B/8B, Llama 3 70B/8B)
  • And many more...
Each model has different tokenization characteristics, costs, and context limits.
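A per-model entry behind the selector might look like this (a sketch only; the exact field names in models-config.js are assumed from the model information panel shown in this guide):

```javascript
// Hypothetical shape of one entry in models-config.js (field names assumed).
const models = {
  'gpt-4o': {
    provider: 'OpenAI',
    contextLimit: 128000,      // tokens
    encoding: 'o200k_base',    // tiktoken encoding name
    inputCost: 2.50,           // $ per 1M input tokens
    outputCost: 10.00,         // $ per 1M output tokens
  },
};

// Usage: look up the selected model's tokenization characteristics.
const selected = models['gpt-4o'];
```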
Step 2: Enter Your Text

Type or paste your text into the input area. The analyzer processes tokens in real-time as you type:
Tokenization happens automatically with minimal delay, providing instant feedback on token count and costs.
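A minimal sketch of how such near-real-time analysis is typically wired up, assuming a debounce pattern (the delay value and handler names here are illustrative, not taken from the app's source):

```javascript
// Debounce sketch: re-run the analyzer shortly after the user stops typing,
// so every keystroke does not trigger a full tokenization pass.
function debounce(fn, delayMs) {
  let timer = null;
  return (...args) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), delayMs);
  };
}

// Hypothetical wiring to the input element:
// const analyze = debounce(text => tokenizationService.tokenize(text), 150);
// inputEl.addEventListener('input', e => analyze(e.target.value));
```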
Step 3: View Results

The interface displays comprehensive statistics and visualizations:
  • Total token count
  • Character count
  • Word count
  • Estimated cost
  • Token visualization
  • Token list with IDs
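As a rough sketch, the character and word counts above can be derived directly from the input text (the token count itself requires tiktoken, so it is not reproduced here; the function name is illustrative):

```javascript
// Simplified sketch of the dashboard's character and word counts.
function basicStats(text) {
  const trimmed = text.trim();
  const words = trimmed.length > 0 ? trimmed.split(/\s+/) : [];
  return {
    charCount: text.length,   // includes spaces
    wordCount: words.length,  // whitespace-separated words
  };
}
```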

Understanding the Interface

Statistics Dashboard

The stats grid shows four key metrics updated in real-time:

Total Tokens

Precise token count calculated using the tiktoken library, matching the actual tokenization used by AI providers.

Characters

Total character count including spaces, useful for understanding compression ratios.

Words

Word count computed by splitting on whitespace.

Estimated Cost

Real-time cost calculation based on current pricing per 1M tokens.

Model Information Panel

After entering text, you’ll see detailed model information:
```javascript
// Model info displayed from models-config.js
{
  "model": "GPT-4o",
  "contextLimit": "128,000 tokens",
  "encoding": "o200k_base",
  "activeAlgorithm": "o200k_base (GPT Más Reciente)",
  "inputCost": "$2.50/1M tokens",
  "outputCost": "$10.00/1M tokens"
}
```
Click the “Ver en Artificial Analysis” (“View on Artificial Analysis”) link to see detailed benchmarks, performance metrics, and comparisons for the selected model.

Token Visualization Features

Color-Coded Tokens

Tokens are displayed with different visual styles based on their type:
The tokenization service categorizes tokens into different types for better visualization:
  • palabra (word) - Regular word tokens
  • palabra_con_espacio - Words that include leading spaces
  • subword - Parts of longer words (BPE sub-tokens)
  • espacio_en_blanco (whitespace) - Space and tab tokens
  • number - Numeric tokens
  • punctuation - Punctuation marks
  • special - Special characters and symbols
Each type is styled differently to help you understand how the tokenizer splits your text.
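A rough sketch of how token text alone might map to these categories (illustrative only; the real rules live in tokenization-service.js, and subword detection additionally needs BPE merge-position data that text inspection cannot provide):

```javascript
// Illustrative classifier for the token categories listed above.
// Note: 'subword' cannot be inferred from the text alone, so it is omitted.
function classifyToken(text) {
  if (/^\s+$/.test(text)) return 'espacio_en_blanco';           // spaces/tabs
  if (/^\s*\d+$/.test(text)) return 'number';                   // numeric
  if (/^\s*[.,;:!?¡¿"'()\[\]{}-]+$/.test(text)) return 'punctuation';
  if (/^ \S/.test(text)) return 'palabra_con_espacio';          // leading space
  if (/^[\p{L}]+$/u.test(text)) return 'palabra';               // plain word
  return 'special';                                             // everything else
}
```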

Interactive Token List

The token list shows each token with its unique ID:
```javascript
// Example token structure from tokenization-service.js
{
  text: "Hello",
  type: "palabra",
  id: "token_0",
  tokenId: 9906,  // Real tiktoken ID
  index: 0,
  isApproximate: false  // true for fallback tokenization
}
```
Token IDs marked with isApproximate: true indicate that tiktoken wasn’t available and a fallback method was used. The counts remain accurate, but the specific token IDs may differ from production tokenizers.

Advanced Features

Real-Time Cost Estimation

Cost calculation is performed using the pricing data from models-config.js:
```javascript
// Cost calculation from statistics-calculator.js:58-64
calculateCost(tokenCount, modelInfo) {
  if (!modelInfo || !modelInfo.inputCost) {
    return 0;
  }
  // Calculate cost based on input tokens (cost per 1M tokens)
  return (tokenCount / 1000000) * modelInfo.inputCost;
}
```
The displayed cost represents the price for input tokens (text sent to the model). Example for GPT-4o:
  • Input: $2.50 per 1M tokens
  • 1,000 tokens = $0.0025
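Plugging the GPT-4o numbers into the calculateCost formula shown above:

```javascript
// Worked example of the input-cost formula (price per 1M tokens).
const modelInfo = { inputCost: 2.50 }; // GPT-4o input price, $ per 1M tokens
const tokenCount = 1000;
const cost = (tokenCount / 1000000) * modelInfo.inputCost;
// 1,000 tokens at $2.50/1M comes to $0.0025
```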

Context Limit Warnings

The analyzer monitors your context usage:
```javascript
// Context warnings from statistics-calculator.js:93-109
if (utilization >= 100) {
  return `⚠️ Texto excede el límite de contexto del modelo`;
} else if (utilization >= 90) {
  return `⚠️ Cerca del límite de contexto (${utilization.toFixed(1)}% utilizado)`;
} else if (utilization >= 75) {
  return `ℹ️ Alto uso del contexto (${utilization.toFixed(1)}% utilizado)`;
}
```
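The utilization value used in these checks is simply the token count expressed as a percentage of the model's context window (a sketch; the helper name is illustrative):

```javascript
// Percentage of the model's context window consumed by the current text.
// contextLimit comes from the selected model's entry in models-config.js.
function contextUtilization(tokenCount, contextLimit) {
  return (tokenCount / contextLimit) * 100;
}

// e.g. 120,000 tokens against GPT-4o's 128,000-token window:
// contextUtilization(120000, 128000) -> 93.75, which triggers the 90% warning
```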

75-89% Usage

High context usage - consider if you have room for responses

90-99% Usage

Near context limit - very little room remaining

100%+ Usage

Exceeds context limit - text will be truncated

Tips for Effective Use

Different models tokenize text differently. Compare models to find the most cost-effective option:
  • GPT-4o uses o200k_base encoding (newer, more efficient)
  • GPT-4 uses cl100k_base encoding
  • Claude models typically use ~10% more tokens
  • Llama models typically use ~5% fewer tokens
Use the model selector to switch between models and compare token counts and costs.
Monitor the tokens-per-word ratio in your statistics:
```javascript
// From statistics-calculator.js:34
tokensPerWord: wordCount > 0 ? (tokenCount / wordCount) : 0
```
  • English text: typically 1.3-1.5 tokens per word
  • Code: can be 2-3+ tokens per word
  • Special characters: may be individual tokens
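A quick worked example of that ratio (the numbers are illustrative):

```javascript
// Typical English prose: ~1.4 tokens per word.
const tokenCount = 140;
const wordCount = 100;
const tokensPerWord = wordCount > 0 ? tokenCount / wordCount : 0;
// -> 1.4, inside the 1.3-1.5 range quoted above
```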
Click the “Limpiar” (Clear) button to reset the analyzer and start fresh. This is tracked in analytics:
```javascript
// From index.html:473
clearBtn.addEventListener('click', () => {
  trackEvent('text_cleared', 'User Action');
});
```

Keyboard Shortcuts

The text input area supports all standard keyboard shortcuts for text editing:
  • Ctrl/Cmd + A - Select all
  • Ctrl/Cmd + C - Copy
  • Ctrl/Cmd + V - Paste
  • Ctrl/Cmd + Z - Undo

Mobile Usage

Tokenizador is fully responsive and works on mobile devices:
  • Touch-friendly interface
  • Optimized model selector for mobile
  • Responsive token visualization
  • Full feature parity with desktop
```html
<!-- Responsive design from index.html:6 -->
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="apple-mobile-web-app-capable" content="yes">
```

Troubleshooting

If you see isApproximate: true in token data, the tiktoken library couldn’t load. The fallback tokenizer is being used:
```javascript
// Fallback from tokenization-service.js:301-340
fallbackTokenization(text, modelId) {
  console.log('⚠️ Usando tokenización de respaldo - IDs no serán precisos');
  // Creates deterministic token IDs based on content
}
```
Token counts remain accurate, but specific IDs may differ from production.
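One way such deterministic fallback IDs could be produced is a simple content hash (purely illustrative; the actual scheme in tokenization-service.js may differ):

```javascript
// Illustrative deterministic fallback: derive a stable pseudo-ID from the
// token text, so the same token always gets the same ID within a session.
function fallbackTokenId(text) {
  let hash = 0;
  for (const ch of text) {
    hash = (hash * 31 + ch.codePointAt(0)) >>> 0; // 32-bit rolling hash
  }
  return hash;
}
```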
The analyzer calculates input token costs only. Actual API costs include:
  • Output tokens (usually more expensive)
  • Any additional API fees
  • Volume discounts or special pricing
Use this tool for estimation and comparison, not billing.
The tool includes 48 models as of the last update. If you need a newer model:
  1. Use a similar model from the same provider
  2. Models from the same family usually share tokenization
  3. Check the supported models guide for the complete list

Next Steps

Supported Models

Explore all 48 models with detailed pricing and specifications

Understanding Tokenization

Learn how tokenization works and why it matters

Cost Estimation

Deep dive into how costs are calculated

GitHub Repository

View source code and contribute
