Bitaboom provides sophisticated LLM token estimation that accounts for the unique characteristics of Arabic text tokenization across different providers.
## Quick start

```ts
import { estimateTokenCount, LLMProvider } from 'bitaboom';

// Default (Generic) estimation
const tokens = estimateTokenCount('بسم الله الرحمن الرحيم');

// Provider-specific estimation
const openaiTokens = estimateTokenCount('بسم الله الرحمن الرحيم', LLMProvider.OpenAI);
const geminiTokens = estimateTokenCount('بسم الله الرحمن الرحيم', LLMProvider.Gemini);
const claudeTokens = estimateTokenCount('بسم الله الرحمن الرحيم', LLMProvider.Claude);
const grokTokens = estimateTokenCount('بسم الله الرحمن الرحيم', LLMProvider.Grok);
```
## Why Arabic needs special handling
Modern LLMs use Byte Pair Encoding (BPE) tokenization, which behaves dramatically differently for Arabic compared to English:
| Aspect | English | Arabic | Impact |
|---|---|---|---|
| Characters per token | ~4 | ~1.3 (OpenAI) | Arabic uses ~3x more tokens |
| UTF-8 bytes per char | 1 | 2 | Double the byte overhead |
| Diacritics | N/A | Merged with base letters | NOT separate tokens |
| Morphology | Simple | Rich (prefixes/suffixes) | More subword splits |
Using generic English-based token estimation for Arabic text can underestimate costs by 200-300%.
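To see the size of that gap, here is a back-of-the-envelope comparison (not bitaboom's implementation): a naive English-style heuristic of ~4 characters per token versus an Arabic-aware fertility rate of ~1.3 characters per token (the OpenAI figure from the table above). The Unicode range check and rounding are illustrative assumptions.

```ts
const text = 'بسم الله الرحمن الرحيم'; // 19 Arabic letters + 3 spaces

// Naive heuristic: assumes English-like tokenization
const naiveEstimate = Math.ceil(text.length / 4); // → 6

// Arabic-aware: count Arabic-block characters, divide by fertility rate
const arabicChars = [...text].filter(ch => ch >= '\u0600' && ch <= '\u06FF').length;
const arabicAware = Math.ceil(arabicChars / 1.3); // → 15
```

The naive estimate comes in at roughly 40% of the Arabic-aware one, consistent with the 200-300% underestimate described above (and with the ~14-token OpenAI figure shown later for this phrase).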
## Provider comparison
Different LLM providers have different tokenization efficiency for Arabic:
| Provider | Arabic efficiency | Notes |
|---|---|---|
| Gemini | Most efficient (~25% better than OpenAI) | Uses SentencePiece, optimized for multilingual |
| OpenAI | Standard baseline | tiktoken BPE implementation |
| Grok | Similar to OpenAI | Standard BPE |
| Claude | Least efficient | Less optimized for Arabic |
| Generic | Balanced middle ground | Safe default for mixed workloads |
### Token count comparison
For the same Arabic text “بسم الله الرحمن الرحيم” (Bismillah):
```ts
import { estimateTokenCount, LLMProvider } from 'bitaboom';

const text = 'بسم الله الرحمن الرحيم';

console.log('Gemini:', estimateTokenCount(text, LLMProvider.Gemini)); // ~11 tokens
console.log('OpenAI:', estimateTokenCount(text, LLMProvider.OpenAI)); // ~14 tokens
console.log('Grok:', estimateTokenCount(text, LLMProvider.Grok));     // ~14 tokens
console.log('Claude:', estimateTokenCount(text, LLMProvider.Claude)); // ~16 tokens
```
For cost optimization in production, choose Gemini for Arabic-heavy workloads to reduce token usage by ~25%.
## Algorithm details
The estimation uses fertility rates (characters per token) rather than simple per-character weights:
```
tokens = arabicChars / arabicCharsPerToken
       + latinChars / latinCharsPerToken
       + numerals / numeralGroupSize
       + diacriticOverhead        (additive per diacritic)
       + latinDiacriticOverhead   (for ā, ī, ū, ḥ, etc.)
```
### Provider configurations
| Provider | Latin chars/token | Arabic chars/token | Diacritic overhead | Latin diacritic overhead | Numeral group size |
|---|---|---|---|---|---|
| OpenAI | 4.0 | 1.3 | +15% | +30% | 2.5 |
| Gemini | 4.0 | 1.6 | +10% | +15% | 2.5 |
| Claude | 3.5 | 1.1 | +20% | +25% | 2.5 |
| Grok | 4.0 | 1.3 | +15% | +30% | 2.5 |
| Generic | 4.0 | 1.5 | +15% | +25% | 3.0 |
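Putting the formula and the configuration table together, a minimal sketch of the arithmetic might look like the following. This is not bitaboom's source; the `ProviderConfig` shape, the final `Math.ceil`, and the reading of "+15%" as ~0.15 extra tokens per diacritic are all assumptions made for illustration. The character counts are taken as inputs here, with classification left to a separate pass.

```ts
interface ProviderConfig {
  latinCharsPerToken: number;
  arabicCharsPerToken: number;
  diacriticOverhead: number;       // extra tokens per Arabic diacritic (assumed)
  latinDiacriticOverhead: number;  // extra tokens per Latin diacritic (assumed)
  numeralGroupSize: number;        // digits per token
}

// OpenAI row from the configuration table
const OPENAI: ProviderConfig = {
  latinCharsPerToken: 4.0,
  arabicCharsPerToken: 1.3,
  diacriticOverhead: 0.15,
  latinDiacriticOverhead: 0.30,
  numeralGroupSize: 2.5,
};

interface CharCounts {
  arabic: number;
  latin: number;
  numerals: number;
  arabicDiacritics: number;
  latinDiacritics: number;
}

function estimate(c: CharCounts, cfg: ProviderConfig): number {
  return Math.ceil(
    c.arabic / cfg.arabicCharsPerToken +
    c.latin / cfg.latinCharsPerToken +
    c.numerals / cfg.numeralGroupSize +
    c.arabicDiacritics * cfg.diacriticOverhead +
    c.latinDiacritics * cfg.latinDiacriticOverhead
  );
}

// "الكتاب": 6 Arabic base characters, nothing else
const result = estimate(
  { arabic: 6, latin: 0, numerals: 0, arabicDiacritics: 0, latinDiacritics: 0 },
  OPENAI
); // → 5  (6 / 1.3 ≈ 4.6, rounded up)
```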
## Character type handling

### Arabic base characters
Counted using the provider-specific fertility rate:
```ts
import { estimateTokenCount, LLMProvider } from 'bitaboom';

// Pure Arabic text
const arabic = 'الكتاب';
estimateTokenCount(arabic, LLMProvider.OpenAI);
// 6 chars / 1.3 ≈ 5 tokens
```
### Arabic diacritics (tashkeel)
BPE tokenizers merge diacritics into adjacent tokens rather than emitting them as separate tokens:

```ts
const withDiacritics = 'بِسْمِ اللَّهِ';
const withoutDiacritics = 'بسم الله';

// Diacritics add overhead but don't create new tokens
estimateTokenCount(withDiacritics, LLMProvider.OpenAI);
// Base tokens + 15% overhead
```
Diacritics increase token count by adding overhead (10-20% depending on provider), not by creating separate tokens.
### Tatweel (kashida)
Decorative elongation character (ـ):
```ts
const withTatweel = 'الرحـــمن';
const withoutTatweel = 'الرحمن';

// Tatweel is counted as an Arabic base character
// Minimal impact on token count
```
### Latin diacritics
Used in Arabic transliteration (ā, ī, ū, ḥ, ṣ, etc.):
```ts
const transliteration = 'Muḥammad ibn ʿAbdullāh';

estimateTokenCount(transliteration, LLMProvider.OpenAI);
// Latin diacritics have higher overhead (30%) than Arabic diacritics (15%)
```
Latin diacritics often break tokens or create separate tokens, hence the higher overhead.
### Numerals
Both Western and Arabic-Indic numerals are grouped:
```ts
const withNumbers = 'الآية 123 والصفحة ٤٥٦';

estimateTokenCount(withNumbers, LLMProvider.OpenAI);
// Numerals grouped by ~2.5 digits per token
// "123" ≈ 1 token, "٤٥٦" ≈ 1 token
```
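The grouping behavior can be sketched as follows. This is a hypothetical illustration, not bitaboom's code: each maximal run of digits contributes `runLength / groupSize` tokens, with `groupSize = 2.5` taken from the OpenAI row of the configuration table. (bitaboom itself avoids regex; a regex is used here only for brevity.)

```ts
// Charge each digit run fractionally: runLength / groupSize tokens
function numeralTokens(text: string, groupSize = 2.5): number {
  let tokens = 0;
  // Runs of Western (0-9) or Arabic-Indic (٠-٩) digits
  for (const run of text.match(/[0-9\u0660-\u0669]+/g) ?? []) {
    tokens += run.length / groupSize;
  }
  return tokens;
}

const n = numeralTokens('الآية 123 والصفحة ٤٥٦');
// → 2.4: each 3-digit run contributes 3 / 2.5 = 1.2 tokens, i.e. roughly one token per group
```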
### Whitespace
Whitespace is typically absorbed into the following token:

```ts
const text = 'الله الرحمن الرحيم';

// Spaces don't create separate tokens;
// they're merged with adjacent words
```
## Advanced usage patterns

### Cost estimation for bilingual content
```ts
import { estimateTokenCount, getArabicScore, LLMProvider } from 'bitaboom';

function estimateCost(text: string, provider: LLMProvider) {
  const tokens = estimateTokenCount(text, provider);
  const arabicScore = getArabicScore(text);

  // Provider pricing (example rates)
  const rates = {
    [LLMProvider.OpenAI]: 0.00003,  // $0.03 per 1K tokens
    [LLMProvider.Gemini]: 0.000025, // $0.025 per 1K tokens
    [LLMProvider.Claude]: 0.000015, // $0.015 per 1K tokens
    [LLMProvider.Grok]: 0.000035,   // $0.035 per 1K tokens
  };

  return {
    tokens,
    arabicRatio: (arabicScore * 100).toFixed(1) + '%',
    estimatedCost: tokens * (rates[provider] ?? 0.00003),
    provider,
  };
}

const text = 'بسم الله الرحمن الرحيم In the name of Allah, the Most Gracious, the Most Merciful';
console.log(estimateCost(text, LLMProvider.Gemini));
```
### Batch processing with provider selection
```ts
import { estimateTokenCount, LLMProvider } from 'bitaboom';

function processBatch(texts: string[]) {
  return texts.map(text => {
    const estimates = {
      openai: estimateTokenCount(text, LLMProvider.OpenAI),
      gemini: estimateTokenCount(text, LLMProvider.Gemini),
      claude: estimateTokenCount(text, LLMProvider.Claude),
      grok: estimateTokenCount(text, LLMProvider.Grok),
    };

    // Find the most efficient provider
    const bestProvider = Object.entries(estimates)
      .sort(([, a], [, b]) => a - b)[0][0];

    return {
      text: text.substring(0, 50) + '...',
      estimates,
      recommendation: bestProvider,
    };
  });
}
```
### Context window management
```ts
import { estimateTokenCount, LLMProvider } from 'bitaboom';

function fitToContextWindow(
  prompt: string,
  documents: string[],
  provider: LLMProvider,
  maxTokens = 8000
): string[] {
  const promptTokens = estimateTokenCount(prompt, provider);
  let remainingTokens = maxTokens - promptTokens;

  const fitted: string[] = [];
  for (const doc of documents) {
    const docTokens = estimateTokenCount(doc, provider);
    if (docTokens <= remainingTokens) {
      fitted.push(doc);
      remainingTokens -= docTokens;
    } else {
      break;
    }
  }
  return fitted;
}
```
### Real-time token monitoring
```ts
import { estimateTokenCount, LLMProvider } from 'bitaboom';

class TokenBudgetMonitor {
  private totalTokens = 0;

  constructor(
    private readonly maxTokens: number,
    private readonly provider: LLMProvider
  ) {}

  addText(text: string): boolean {
    const tokens = estimateTokenCount(text, this.provider);
    if (this.totalTokens + tokens > this.maxTokens) {
      return false; // Budget exceeded
    }
    this.totalTokens += tokens;
    return true;
  }

  getRemainingBudget(): number {
    return this.maxTokens - this.totalTokens;
  }

  getUsagePercentage(): number {
    return (this.totalTokens / this.maxTokens) * 100;
  }
}

// Usage
const monitor = new TokenBudgetMonitor(4000, LLMProvider.Gemini);
monitor.addText('بسم الله الرحمن الرحيم');

console.log(`Remaining: ${monitor.getRemainingBudget()} tokens`);
console.log(`Usage: ${monitor.getUsagePercentage().toFixed(1)}%`);
```
## Accuracy and limitations
- **High accuracy for typical content.** Estimation is within 5-10% of actual token counts for most Arabic and bilingual text.
- **Less accurate for edge cases.** Extreme cases (very short texts, unusual Unicode, heavy code-switching) can show higher variance.
- **Conservative by design.** The algorithm tends to slightly overestimate to avoid overflowing context windows.
- **Provider variations.** Actual tokenizers may change over time; this implementation is based on research as of 2024.
For production cost accounting, always use the provider’s official tokenizer (e.g., tiktoken for OpenAI) for final billing calculations. Use Bitaboom’s estimator for planning and optimization.
## Performance

The token estimation algorithm is highly optimized:
- Single-pass O(N) iteration over code points
- No regex matching (uses character code lookup tables)
- Exclusive classification (no double-counting)
- Run-length encoding for numeral grouping
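To illustrate the single-pass, regex-free approach, here is a sketch of a code-point classifier (not bitaboom's source; the exact Unicode ranges, the bucket names, and the choice of U+064B-U+0652 as the tashkeel range are simplifying assumptions):

```ts
// One pass over code points; each is bucketed with numeric range checks
function classify(text: string) {
  const counts = { arabic: 0, latin: 0, digit: 0, diacritic: 0, other: 0 };
  for (const ch of text) {
    const cp = ch.codePointAt(0)!;
    if (cp >= 0x064b && cp <= 0x0652) {
      counts.diacritic++;                                   // tashkeel marks
    } else if (cp >= 0x0600 && cp <= 0x06ff && !(cp >= 0x0660 && cp <= 0x0669)) {
      counts.arabic++;                                      // Arabic block, minus digits
    } else if ((cp >= 0x30 && cp <= 0x39) || (cp >= 0x0660 && cp <= 0x0669)) {
      counts.digit++;                                       // Western or Arabic-Indic digits
    } else if ((cp | 0x20) >= 0x61 && (cp | 0x20) <= 0x7a) {
      counts.latin++;                                       // ASCII letters, case-folded
    } else {
      counts.other++;
    }
  }
  return counts;
}

console.log(classify('الآية 123'));
// → { arabic: 5, latin: 0, digit: 3, diacritic: 0, other: 1 }
```

Because each code point lands in exactly one branch, the classification is exclusive and nothing is double-counted.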
```ts
import { estimateTokenCount, LLMProvider } from 'bitaboom';

// Fast enough for real-time UI updates
const largeText = arabicBook.join('\n'); // arabicBook: string[], 100KB+ in total

const start = performance.now();
const tokens = estimateTokenCount(largeText, LLMProvider.Gemini);
const duration = performance.now() - start;

console.log(`Estimated ${tokens} tokens in ${duration.toFixed(2)}ms`);
```
## Best practices
```ts
import { getArabicScore, estimateTokenCount, LLMProvider } from 'bitaboom';

function selectProvider(text: string) {
  const arabicScore = getArabicScore(text);

  if (arabicScore > 0.7) {
    // Heavily Arabic: use Gemini for efficiency
    return LLMProvider.Gemini;
  } else if (arabicScore < 0.3) {
    // Mostly English: provider choice is less critical
    return LLMProvider.OpenAI;
  } else {
    // Mixed content: benchmark both, or fall back to Generic
    return LLMProvider.Generic;
  }
}

function estimateWithMargin(text: string, provider: LLMProvider, margin = 0.1) {
  const baseEstimate = estimateTokenCount(text, provider);
  return Math.ceil(baseEstimate * (1 + margin));
}

// Add a 10% safety margin for context window planning
const safeEstimate = estimateWithMargin('بسم الله الرحمن الرحيم', LLMProvider.OpenAI);
```
```ts
import { estimateTokenCount, LLMProvider } from 'bitaboom';

function estimateBatch(texts: string[], provider: LLMProvider) {
  // Sum individual estimates for accurate batch totals
  return texts.reduce(
    (total, text) => total + estimateTokenCount(text, provider),
    0
  );
}
```
```ts
import { estimateTokenCount, LLMProvider } from 'bitaboom';

// Empty or missing input always returns 0
estimateTokenCount('', LLMProvider.OpenAI);          // 0
estimateTokenCount(null as any, LLMProvider.OpenAI); // 0
```
For RAG systems over Arabic documents, use `estimateTokenCount` to choose chunk sizes that respect provider-specific token limits.
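A token-aware chunker might look like the following sketch. The helper `chunkByTokens` is hypothetical (not part of bitaboom); the estimator is injected as a function so any provider-specific `estimateTokenCount` call can be plugged in, and the sentence-splitting regex is a simplification.

```ts
// Greedily pack sentences into chunks under a token budget
function chunkByTokens(
  text: string,
  maxTokensPerChunk: number,
  estimate: (t: string) => number
): string[] {
  const chunks: string[] = [];
  let current = '';
  // Split on whitespace after sentence-final punctuation (incl. Arabic '؟')
  for (const sentence of text.split(/(?<=[.!?؟])\s+/)) {
    const candidate = current ? current + ' ' + sentence : sentence;
    if (estimate(candidate) > maxTokensPerChunk && current) {
      chunks.push(current); // current chunk is full; start a new one
      current = sentence;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// With bitaboom:
// chunkByTokens(doc, 500, t => estimateTokenCount(t, LLMProvider.Gemini));
```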