What Are Tokens?
Tokens are the basic units of text that language models process. They can be words, parts of words, or even individual characters.
Token Examples
- English Text
- Code
- Special Characters
Why Tokenization Matters
Cost Calculation
API costs are based on token count, not character count. Understanding tokenization helps estimate and optimize costs.
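Since billing is per token, a cost estimate is just token counts scaled by per-million-token prices. A minimal sketch follows; the prices used here are placeholders, not any provider's real rates.

```javascript
// Sketch: estimating API cost from token counts.
// Prices are expressed per million tokens, as most providers quote them.
function estimateCost(inputTokens, outputTokens, inputPricePerM, outputPricePerM) {
  return (inputTokens / 1_000_000) * inputPricePerM
       + (outputTokens / 1_000_000) * outputPricePerM;
}

// Hypothetical rates: $2.50 per million input tokens, $10.00 per million output.
const cost = estimateCost(1200, 300, 2.5, 10.0);
console.log(cost.toFixed(4)); // "0.0060"
```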
Context Limits
Models have maximum token limits (context windows). Efficient tokenization means fitting more information in the same context.

Example: GPT-4o has a 128,000-token limit:
- Efficient text: ~480,000 characters
- Inefficient text: ~200,000 characters
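The character figures above follow from characters-per-token density: roughly 3.75 chars/token for typical English prose versus about 1.56 for token-dense text. A quick back-of-the-envelope check:

```javascript
// How many characters fit in a context window at a given token density.
// The densities are illustrative averages, not exact per-model values.
function charsThatFit(contextTokens, charsPerToken) {
  return Math.round(contextTokens * charsPerToken);
}

console.log(charsThatFit(128_000, 3.75));   // 480000 (efficient English prose)
console.log(charsThatFit(128_000, 1.5625)); // 200000 (token-dense text)
```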
Model Performance
Tokenization affects how models “understand” text. Better tokenization means better comprehension and generation.
Cross-Model Comparison
Different models tokenize differently. The same text may use different token counts across providers.

Example: “Hello world”
- GPT: 2 tokens
- Claude: ~2.2 tokens (10% more)
- Llama: ~1.9 tokens (5% fewer)
Tokenization Methods
BPE (Byte Pair Encoding)
How BPE Works
BPE is the most common tokenization algorithm used by modern language models.

Process:

- Start with a vocabulary of individual characters
- Find the most frequent pair of adjacent tokens
- Merge that pair into a new token
- Repeat until reaching the desired vocabulary size

This is why common words become single tokens while rare words get split into subwords.
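The merge loop above can be sketched in a few lines. This toy version works on characters rather than bytes and only merges pairs seen at least twice, so it is illustrative only; production tokenizers train on large corpora.

```javascript
// Toy BPE: repeatedly merge the most frequent adjacent pair of tokens.
function bpe(text, numMerges) {
  let tokens = Array.from(text);
  for (let i = 0; i < numMerges; i++) {
    // Count adjacent pairs.
    const counts = new Map();
    for (let j = 0; j < tokens.length - 1; j++) {
      const pair = tokens[j] + "\u0000" + tokens[j + 1];
      counts.set(pair, (counts.get(pair) || 0) + 1);
    }
    // Pick the most frequent pair (must occur at least twice).
    let best = null, bestCount = 1;
    for (const [pair, count] of counts) {
      if (count > bestCount) { best = pair; bestCount = count; }
    }
    if (!best) break;
    // Merge every occurrence of that pair into one token.
    const [a, b] = best.split("\u0000");
    const merged = [];
    for (let j = 0; j < tokens.length; j++) {
      if (j < tokens.length - 1 && tokens[j] === a && tokens[j + 1] === b) {
        merged.push(a + b);
        j++; // skip the second half of the merged pair
      } else {
        merged.push(tokens[j]);
      }
    }
    tokens = merged;
  }
  return tokens;
}

console.log(bpe("low lower lowest", 3));
// → ["low", " low", "e", "r", " low", "e", "s", "t"]
```

Note how the common stem "low" becomes a single token after a few merges, and how a space-prefixed " low" emerges, just as GPT-style tokenizers produce space-prefixed word tokens.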
WordPiece
WordPiece Tokenization
Used by some Google models. Similar to BPE, but uses likelihood instead of frequency.

Key Difference:
- BPE: Merge most frequent pairs
- WordPiece: Merge pairs that maximize likelihood of training data
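One common formulation of the WordPiece selection rule scores a candidate pair as freq(ab) / (freq(a) × freq(b)), so a pair is merged when it occurs more often than its parts' frequencies would predict. This is a sketch of that scoring rule, not Google's exact implementation:

```javascript
// WordPiece-style pair scoring: frequency of the merged pair normalized
// by the frequencies of its two parts.
function wordPieceScore(pairFreq, freqA, freqB) {
  return pairFreq / (freqA * freqB);
}

// A frequent pair of very common units can score LOWER than a rarer
// pair of rare units -- the key difference from plain BPE frequency.
const common = wordPieceScore(100, 1000, 1000); // 0.0001
const rare   = wordPieceScore(10, 20, 25);      // 0.02
console.log(rare > common); // true
```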
SentencePiece
SentencePiece Algorithm
Used by Llama, many multilingual models, and some Google models.

Advantages:
- Language agnostic (doesn’t require pre-tokenization)
- Treats spaces as characters
- Works well for languages without spaces
Encoding Types in Tokenizador
o200k_base (Latest OpenAI)
Used By
- GPT-4o
- GPT-4o Mini
Characteristics
- 200,000 token vocabulary
- More efficient than cl100k_base
- Better handling of code
- Improved multilingual support
cl100k_base (Standard GPT-4)
Used By
- GPT-4 (all variants)
- GPT-3.5 Turbo
- Used as an approximation for:
  - Claude models
  - Gemini models
  - Most other providers
Characteristics
- 100,000 token vocabulary
- Industry standard
- Well-tested and reliable
- Good balance of efficiency
Why Approximation? Non-OpenAI models use proprietary tokenizers, but cl100k_base provides a close approximation. The tokenRatio field adjusts for the differences.

How Tokenizador Handles Encodings
Token Ratios Explained
What is tokenRatio?
The tokenRatio field in models-config.js adjusts token counts for models that don't use OpenAI's tokenization. How it's applied depends on the ratio:

- More Efficient (< 1.0): models that use fewer tokens. Impact: lower costs and more content in the context window.
- Standard (1.0): models counted directly with cl100k_base.
- Less Efficient (> 1.0): models that use more tokens.

Example:
| Model Family | Ratio | Difference |
|---|---|---|
| Qwen | 0.92 | 8% fewer tokens |
| DeepSeek | 0.93 | 7% fewer tokens |
| Jamba | 0.94 | 6% fewer tokens |
| Llama | 0.95 | 5% fewer tokens |
| IBM Granite | 0.96 | 4% fewer tokens |
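Applying a ratio is a simple scaling of the base cl100k_base count. A minimal sketch using the ratios from the table above; the function and lookup-table names are illustrative, not Tokenizador's actual API:

```javascript
// Ratios from the table above; 1.0 means "same as cl100k_base".
const TOKEN_RATIOS = { qwen: 0.92, deepseek: 0.93, jamba: 0.94, llama: 0.95, granite: 0.96 };

// Scale a cl100k_base token count by the family's ratio (hypothetical helper).
function adjustedTokenCount(baseCount, family) {
  const ratio = TOKEN_RATIOS[family] ?? 1.0;
  return Math.round(baseCount * ratio);
}

console.log(adjustedTokenCount(1000, "llama"));   // 950
console.log(adjustedTokenCount(1000, "unknown")); // 1000
```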
Token Types and Visualization
Token Classification
palabra
Regular word tokens
- Complete words
- Word beginnings
- Most common type
subword
Parts of words
- Word endings
- Middle parts
- Rare word components
espacio_en_blanco
Whitespace tokens
- Spaces
- Tabs
- Line breaks
number
Numeric tokens
- Integers
- Digits
- Number parts
punctuation
Punctuation marks
- Periods, commas
- Quotes
- Brackets
special
Special characters
- Symbols
- Emoji components
- Unicode characters
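One plausible way to assign a token's text to the categories above is a cascade of regex checks. This is an illustrative sketch only; Tokenizador's actual classification rules may differ.

```javascript
// Map a token's text to one of the visualization categories.
// startsWord distinguishes word-initial tokens (palabra) from
// continuation pieces (subword); defaults to true for the sketch.
function classifyToken(text, startsWord = true) {
  if (/^\s+$/.test(text)) return "espacio_en_blanco"; // spaces, tabs, newlines
  if (/^\s?\d+$/.test(text)) return "number";         // integers, digit runs
  if (/^[.,;:!?'"()\[\]{}<>-]+$/.test(text)) return "punctuation";
  if (/^\s?[a-zA-Z]+$/.test(text)) return startsWord ? "palabra" : "subword";
  return "special";                                   // symbols, emoji, other Unicode
}

console.log(classifyToken(" hello"));     // "palabra"
console.log(classifyToken("ing", false)); // "subword"
console.log(classifyToken("\n"));         // "espacio_en_blanco"
console.log(classifyToken("42"));         // "number"
console.log(classifyToken("🙂"));         // "special"
```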
Tokenization Best Practices
1. Choose Token-Efficient Models
When cost is a concern, select models with lower token ratios (see the Token Ratios table above).
2. Optimize Your Prompts
Write efficiently:
- Remove unnecessary words
- Use common vocabulary (fewer tokens)
- Avoid excessive formatting
3. Monitor Context Usage
Use Tokenizador to track context utilization.

Guidelines:
- < 75%: Safe zone
- 75-90%: Monitor closely
- 90-100%: Optimize or switch models
- > 100%: Content will be truncated
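The guideline thresholds above translate directly into a small helper. The zone names mirror the guidelines; the function itself is an illustrative sketch:

```javascript
// Classify context utilization into the guideline zones above.
function contextZone(usedTokens, contextWindow) {
  const pct = (usedTokens / contextWindow) * 100;
  if (pct > 100) return "truncated"; // content will be cut off
  if (pct >= 90) return "optimize";  // optimize or switch models
  if (pct >= 75) return "monitor";   // watch closely
  return "safe";
}

console.log(contextZone(60_000, 128_000));  // "safe" (~47%)
console.log(contextZone(110_000, 128_000)); // "monitor" (~86%)
console.log(contextZone(130_000, 128_000)); // "truncated"
```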
4. Test Across Models
The same text can have very different token counts. Test your actual workload to find the best model.
5. Understand Language Differences
Tokenization efficiency varies by language.
Real-World Examples
Example 1: API Documentation
Example 2: Customer Support Message
Example 3: Long-Form Content
Advanced Topics
Tiktoken Library
How Tokenizador Uses Tiktoken
Tokenizador uses the official OpenAI tiktoken library for precise tokenization; when tiktoken can't be loaded, it falls back to the approximation described under Fallback Tokenization below.
Token ID Structure
Understanding Token IDs
Each token has a unique numerical ID in the vocabulary.

Common token IDs:
- 0-255: Single byte tokens
- 256-50,000: Common words and subwords
- 50,000-100,000: Less common combinations
- Models process IDs, not text
- Same ID = same meaning to the model
- The same word can map to different IDs depending on context (e.g., with or without a leading space)
Fallback Tokenization
When Tiktoken Isn't Available
If tiktoken fails to load, Tokenizador uses a sophisticated fallback.

Accuracy:
- Token counts: 95-98% accurate
- Token IDs: Approximate (deterministic but not real)
- Visual split: Close approximation
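A simple heuristic fallback can get token counts into that accuracy range by combining two rules of thumb: ~4 characters per token and ~1.4 tokens per word for English. This sketch is NOT Tokenizador's actual fallback algorithm, just an illustration of the approach:

```javascript
// Heuristic token-count estimate for when no real tokenizer is available.
function fallbackTokenCount(text) {
  const words = (text.match(/\S+/g) || []).length;
  const byChars = Math.ceil(text.length / 4); // ~4 chars per token
  const byWords = Math.ceil(words * 1.4);     // ~1.4 tokens per word
  // Average the two estimates for a slightly more stable result.
  return Math.round((byChars + byWords) / 2);
}

console.log(fallbackTokenCount("Hello world")); // 3
console.log(fallbackTokenCount(""));            // 0
```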
Fallback results are flagged with isApproximate: true for transparency.

Common Questions
Why do different models have different token counts?
Each model uses its own tokenizer trained on its training data:
- OpenAI: Optimized for English and code
- Anthropic: Different vocabulary, more tokens
- Meta (Llama): SentencePiece, slightly more efficient
- Google: Multilingual focus, different tradeoffs
The tokenRatio field adjusts for these differences.

Does tokenization affect model quality?
Yes, tokenization impacts:
- Understanding: Better tokenization = better comprehension
- Generation: Affects output fluency
- Efficiency: More tokens = slower processing
- Cost: Directly determines API costs
Can I customize tokenization?
No, tokenization is fixed per model. However, you can:
- Choose models with better tokenization for your use case
- Optimize prompts to reduce tokens
- Format text efficiently (remove extra spaces, etc.)
- Select language-specific models for non-English text
What's the difference between tokens and words?
Tokens are algorithmic units; words are linguistic units.

Typical ratio: 1.3-1.5 tokens per word in English.
Further Reading
- How to Use: practical guide to using Tokenizador
- Supported Models: complete model list with specifications
- Cost Estimation: understanding and optimizing costs
- Tiktoken Repository: official tokenization library