Tokenizador is a professional, real-time token analyzer for AI language models. This guide will walk you through all the features and how to use them effectively.

Getting Started

Step 1: Select Your Model

Choose from 48 AI models across different providers. The model selector is organized by company:
The following providers appear in the dropdown:
  • OpenAI (GPT-4o, GPT-4o Mini, GPT-4 Turbo, GPT-4, GPT-3.5 Turbo)
  • Anthropic (Claude 3.5 Sonnet, Claude 3 Opus/Sonnet/Haiku)
  • Google (Gemini 1.5 Pro/Flash)
  • Meta (Llama 3.1 405B/70B/8B, Llama 3 70B/8B)
  • And many more...
Each model has different tokenization characteristics, costs, and context limits.
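A per-model entry behind the selector might look like this (a sketch only; the exact field names in models-config.js are assumed from the model information panel shown in this guide):

```javascript
// Hypothetical shape of one entry in models-config.js (field names assumed).
const models = {
  'gpt-4o': {
    provider: 'OpenAI',
    contextLimit: 128000,      // tokens
    encoding: 'o200k_base',    // tiktoken encoding name
    inputCost: 2.50,           // $ per 1M input tokens
    outputCost: 10.00,         // $ per 1M output tokens
  },
};

// Usage: look up the selected model's tokenization characteristics.
const selected = models['gpt-4o'];
```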
Step 2: Enter Your Text

Type or paste your text into the input area. The analyzer processes tokens in real-time as you type:
Tokenization happens automatically with minimal delay, providing instant feedback on token count and costs.
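A minimal sketch of how such near-real-time analysis is typically wired up, assuming a debounce pattern (the delay value and handler names here are illustrative, not taken from the app's source):

```javascript
// Debounce sketch: re-run the analyzer shortly after the user stops typing,
// so every keystroke does not trigger a full tokenization pass.
function debounce(fn, delayMs) {
  let timer = null;
  return (...args) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), delayMs);
  };
}

// Hypothetical wiring to the input element:
// const analyze = debounce(text => tokenizationService.tokenize(text), 150);
// inputEl.addEventListener('input', e => analyze(e.target.value));
```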
Step 3: View Results

The interface displays comprehensive statistics and visualizations:
  • Total token count
  • Character count
  • Word count
  • Estimated cost
  • Token visualization
  • Token list with IDs
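As a rough sketch, the character and word counts above can be derived directly from the input text (the token count itself requires tiktoken, so it is not reproduced here; the function name is illustrative):

```javascript
// Simplified sketch of the dashboard's character and word counts.
function basicStats(text) {
  const trimmed = text.trim();
  const words = trimmed.length > 0 ? trimmed.split(/\s+/) : [];
  return {
    charCount: text.length,   // includes spaces
    wordCount: words.length,  // whitespace-separated words
  };
}
```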

Understanding the Interface

Statistics Dashboard

The stats grid shows four key metrics updated in real-time:

Total Tokens

Precise token count calculated using the tiktoken library, matching the actual tokenization used by AI providers.

Characters

Total character count including spaces, useful for understanding compression ratios.

Words

Word count computed by splitting on whitespace.

Estimated Cost

Real-time cost calculation based on current pricing per 1M tokens.

Model Information Panel

After entering text, you’ll see detailed model information:
```javascript
// Model info displayed from models-config.js
{
  "model": "GPT-4o",
  "contextLimit": "128,000 tokens",
  "encoding": "o200k_base",
  "activeAlgorithm": "o200k_base (GPT Más Reciente)",
  "inputCost": "$2.50/1M tokens",
  "outputCost": "$10.00/1M tokens"
}
```
Click the “Ver en Artificial Analysis” (“View on Artificial Analysis”) link to see detailed benchmarks, performance metrics, and comparisons for the selected model.

Token Visualization Features

Color-Coded Tokens

Tokens are displayed with different visual styles based on their type:
The tokenization service categorizes tokens into different types for better visualization:
  • palabra (word) - Regular word tokens
  • palabra_con_espacio - Words that include leading spaces
  • subword - Parts of longer words (BPE sub-tokens)
  • espacio_en_blanco (whitespace) - Space and tab tokens
  • number - Numeric tokens
  • punctuation - Punctuation marks
  • special - Special characters and symbols
Each type is styled differently to help you understand how the tokenizer splits your text.
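A rough sketch of how token text alone might map to these categories (illustrative only; the real rules live in tokenization-service.js, and subword detection additionally needs BPE merge-position data that text inspection cannot provide):

```javascript
// Illustrative classifier for the token categories listed above.
// Note: 'subword' cannot be inferred from the text alone, so it is omitted.
function classifyToken(text) {
  if (/^\s+$/.test(text)) return 'espacio_en_blanco';           // spaces/tabs
  if (/^\s*\d+$/.test(text)) return 'number';                   // numeric
  if (/^\s*[.,;:!?¡¿"'()\[\]{}-]+$/.test(text)) return 'punctuation';
  if (/^ \S/.test(text)) return 'palabra_con_espacio';          // leading space
  if (/^[\p{L}]+$/u.test(text)) return 'palabra';               // plain word
  return 'special';                                             // everything else
}
```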

Interactive Token List

The token list shows each token with its unique ID:
```javascript
// Example token structure from tokenization-service.js
{
  text: "Hello",
  type: "palabra",
  id: "token_0",
  tokenId: 9906,  // Real tiktoken ID
  index: 0,
  isApproximate: false  // true for fallback tokenization
}
```
Token IDs marked with isApproximate: true indicate that tiktoken wasn’t available and a fallback method was used. The counts remain accurate, but the specific token IDs may differ from production tokenizers.

Advanced Features

Real-Time Cost Estimation

Cost calculation is performed using the pricing data from models-config.js:
```javascript
// Cost calculation from statistics-calculator.js:58-64
calculateCost(tokenCount, modelInfo) {
  if (!modelInfo || !modelInfo.inputCost) {
    return 0;
  }
  // Calculate cost based on input tokens (cost per 1M tokens)
  return (tokenCount / 1000000) * modelInfo.inputCost;
}
```
The displayed cost represents the price for input tokens (text sent to the model). Example for GPT-4o:
  • Input: $2.50 per 1M tokens
  • 1,000 tokens = $0.0025
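Plugging the GPT-4o numbers into the calculateCost formula shown above:

```javascript
// Worked example of the input-cost formula (price per 1M tokens).
const modelInfo = { inputCost: 2.50 }; // GPT-4o input price, $ per 1M tokens
const tokenCount = 1000;
const cost = (tokenCount / 1000000) * modelInfo.inputCost;
// 1,000 tokens at $2.50/1M comes to $0.0025
```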

Context Limit Warnings

The analyzer monitors your context usage:
```javascript
// Context warnings from statistics-calculator.js:93-109
if (utilization >= 100) {
  return `⚠️ Texto excede el límite de contexto del modelo`;
} else if (utilization >= 90) {
  return `⚠️ Cerca del límite de contexto (${utilization.toFixed(1)}% utilizado)`;
} else if (utilization >= 75) {
  return `ℹ️ Alto uso del contexto (${utilization.toFixed(1)}% utilizado)`;
}
```
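The utilization value used in these checks is simply the token count expressed as a percentage of the model's context window (a sketch; the helper name is illustrative):

```javascript
// Percentage of the model's context window consumed by the current text.
// contextLimit comes from the selected model's entry in models-config.js.
function contextUtilization(tokenCount, contextLimit) {
  return (tokenCount / contextLimit) * 100;
}

// e.g. 120,000 tokens against GPT-4o's 128,000-token window:
// contextUtilization(120000, 128000) -> 93.75, which triggers the 90% warning
```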

75-89% Usage

High context usage - consider if you have room for responses

90-99% Usage

Near context limit - very little room remaining

100%+ Usage

Exceeds context limit - text will be truncated

Tips for Effective Use

Different models tokenize text differently. Compare models to find the most cost-effective option:
  • GPT-4o uses o200k_base encoding (newer, more efficient)
  • GPT-4 uses cl100k_base encoding
  • Claude models typically use ~10% more tokens
  • Llama models typically use ~5% fewer tokens
Use the model selector to switch between models and compare token counts and costs.
Monitor the tokens-per-word ratio in your statistics:
```javascript
// From statistics-calculator.js:34
tokensPerWord: wordCount > 0 ? (tokenCount / wordCount) : 0
```
  • English text: typically 1.3-1.5 tokens per word
  • Code: can be 2-3+ tokens per word
  • Special characters: may be individual tokens
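A quick worked example of that ratio (the numbers are illustrative):

```javascript
// Typical English prose: ~1.4 tokens per word.
const tokenCount = 140;
const wordCount = 100;
const tokensPerWord = wordCount > 0 ? tokenCount / wordCount : 0;
// -> 1.4, inside the 1.3-1.5 range quoted above
```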
Click the “Limpiar” (Clear) button to reset the analyzer and start fresh. This is tracked in analytics:
```javascript
// From index.html:473
clearBtn.addEventListener('click', () => {
  trackEvent('text_cleared', 'User Action');
});
```

Keyboard Shortcuts

The text input area supports all standard keyboard shortcuts for text editing:
  • Ctrl/Cmd + A - Select all
  • Ctrl/Cmd + C - Copy
  • Ctrl/Cmd + V - Paste
  • Ctrl/Cmd + Z - Undo

Mobile Usage

Tokenizador is fully responsive and works on mobile devices:
  • Touch-friendly interface
  • Optimized model selector for mobile
  • Responsive token visualization
  • Full feature parity with desktop
```html
<!-- Responsive design from index.html:6 -->
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="apple-mobile-web-app-capable" content="yes">
```

Troubleshooting

If you see isApproximate: true in token data, the tiktoken library couldn’t load. The fallback tokenizer is being used:
```javascript
// Fallback from tokenization-service.js:301-340
fallbackTokenization(text, modelId) {
  console.log('⚠️ Usando tokenización de respaldo - IDs no serán precisos');
  // Creates deterministic token IDs based on content
}
```
Token counts remain accurate, but specific IDs may differ from production.
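One way such deterministic fallback IDs could be produced is a simple content hash (purely illustrative; the actual scheme in tokenization-service.js may differ):

```javascript
// Illustrative deterministic fallback: derive a stable pseudo-ID from the
// token text, so the same token always gets the same ID within a session.
function fallbackTokenId(text) {
  let hash = 0;
  for (const ch of text) {
    hash = (hash * 31 + ch.codePointAt(0)) >>> 0; // 32-bit rolling hash
  }
  return hash;
}
```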
The analyzer calculates input token costs only. Actual API costs include:
  • Output tokens (usually more expensive)
  • Any additional API fees
  • Volume discounts or special pricing
Use this tool for estimation and comparison, not billing.
The tool includes 48 models as of the last update. If you need a newer model:
  1. Use a similar model from the same provider
  2. Models from the same family usually share tokenization
  3. Check the supported models guide for the complete list

Next Steps

Supported Models

Explore all 48 models with detailed pricing and specifications

Understanding Tokenization

Learn how tokenization works and why it matters

Cost Estimation

Deep dive into how costs are calculated

GitHub Repository

View source code and contribute
