Bitaboom provides sophisticated LLM token estimation that accounts for the unique characteristics of Arabic text tokenization across different providers.

Quick start

import { estimateTokenCount, LLMProvider } from 'bitaboom';

// Default (Generic) estimation
const tokens = estimateTokenCount('بسم الله الرحمن الرحيم');

// Provider-specific estimation
const openaiTokens = estimateTokenCount('بسم الله الرحمن الرحيم', LLMProvider.OpenAI);
const geminiTokens = estimateTokenCount('بسم الله الرحمن الرحيم', LLMProvider.Gemini);
const claudeTokens = estimateTokenCount('بسم الله الرحمن الرحيم', LLMProvider.Claude);
const grokTokens = estimateTokenCount('بسم الله الرحمن الرحيم', LLMProvider.Grok);

Why Arabic needs special handling

Modern LLMs use Byte Pair Encoding (BPE) tokenization, which behaves dramatically differently for Arabic compared to English:
| Aspect | English | Arabic | Impact |
| --- | --- | --- | --- |
| Characters per token | ~4 | ~1.3 (OpenAI) | Arabic uses ~3x more tokens |
| UTF-8 bytes per char | 1 | 2 | Double the byte overhead |
| Diacritics | N/A | Merged with base letters | NOT separate tokens |
| Morphology | Simple | Rich (prefixes/suffixes) | More subword splits |
Using generic English-based token estimation for Arabic text can underestimate costs by 200-300%.
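To see the gap concretely, compare a short English phrase with its Arabic counterpart (illustrative only; exact counts depend on the text):
import { estimateTokenCount, LLMProvider } from 'bitaboom';

const english = 'In the name of Allah, the Most Gracious, the Most Merciful';
const arabic = 'بسم الله الرحمن الرحيم';

// English: ~4 characters per token; Arabic: ~1.3 (OpenAI),
// so Arabic costs roughly 3x more tokens per character
console.log('English:', estimateTokenCount(english, LLMProvider.OpenAI));
console.log('Arabic:', estimateTokenCount(arabic, LLMProvider.OpenAI));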

Provider comparison

Different LLM providers have different tokenization efficiency for Arabic:
| Provider | Arabic efficiency | Notes |
| --- | --- | --- |
| Gemini | Most efficient (~25% better than OpenAI) | Uses SentencePiece, optimized for multilingual text |
| OpenAI | Standard baseline | tiktoken BPE implementation |
| Grok | Similar to OpenAI | Standard BPE |
| Claude | Least efficient | Less optimized for Arabic |
| Generic | Balanced middle ground | Safe default for mixed workloads |

Token count comparison

For the same Arabic text “بسم الله الرحمن الرحيم” (Bismillah):
import { estimateTokenCount, LLMProvider } from 'bitaboom';

const text = 'بسم الله الرحمن الرحيم';

console.log('Gemini:', estimateTokenCount(text, LLMProvider.Gemini));   // ~11 tokens
console.log('OpenAI:', estimateTokenCount(text, LLMProvider.OpenAI));   // ~14 tokens
console.log('Grok:', estimateTokenCount(text, LLMProvider.Grok));       // ~14 tokens
console.log('Claude:', estimateTokenCount(text, LLMProvider.Claude));   // ~16 tokens
For cost optimization in production, choose Gemini for Arabic-heavy workloads to reduce token usage by ~25%.

Algorithm details

The estimation uses fertility rates (characters per token) rather than simple per-character weights:
tokens = arabicChars / arabicCharsPerToken
       + latinChars / latinCharsPerToken
       + numerals / numeralGroupSize
       + diacriticOverhead (additive per diacritic)
       + latinDiacriticOverhead (for ā, ī, ū, ḥ, ṣ, etc.)
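
In code, the formula maps to something like the following simplified sketch (not the library's internal implementation; it assumes the overheads are token fractions added per diacritic, with constants taken from the provider table in the next section):
// Sketch of the fertility-rate formula above. The character counts
// would come from a single classification pass over the text.
interface ProviderConfig {
  latinCharsPerToken: number;
  arabicCharsPerToken: number;
  diacriticOverhead: number;        // assumed: token fraction per Arabic diacritic
  latinDiacriticOverhead: number;   // assumed: token fraction per Latin diacritic
  numeralGroupSize: number;
}

function estimateFromCounts(
  counts: {
    arabic: number;
    latin: number;
    numerals: number;
    arabicDiacritics: number;
    latinDiacritics: number;
  },
  cfg: ProviderConfig
): number {
  // Base tokens from fertility rates (characters per token)
  const base =
    counts.arabic / cfg.arabicCharsPerToken +
    counts.latin / cfg.latinCharsPerToken +
    counts.numerals / cfg.numeralGroupSize;
  // Diacritics add fractional overhead rather than whole tokens
  const overhead =
    counts.arabicDiacritics * cfg.diacriticOverhead +
    counts.latinDiacritics * cfg.latinDiacriticOverhead;
  return Math.ceil(base + overhead);
}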

Provider configurations

| Provider | Latin chars/token | Arabic chars/token | Diacritic overhead | Latin diacritic overhead | Numeral group size |
| --- | --- | --- | --- | --- | --- |
| OpenAI | 4.0 | 1.3 | +15% | +30% | 2.5 |
| Gemini | 4.0 | 1.6 | +10% | +15% | 2.5 |
| Claude | 3.5 | 1.1 | +20% | +25% | 2.5 |
| Grok | 4.0 | 1.3 | +15% | +30% | 2.5 |
| Generic | 4.0 | 1.5 | +15% | +25% | 3.0 |
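
Expressed as data, the table corresponds to a record roughly like this (a hypothetical shape reusing the ProviderConfig interface sketched above, not bitaboom's exported API):
// Provider constants from the table, with percentage overheads
// written as token fractions per diacritic
const PROVIDER_CONFIGS: Record<string, ProviderConfig> = {
  openai:  { latinCharsPerToken: 4.0, arabicCharsPerToken: 1.3, diacriticOverhead: 0.15, latinDiacriticOverhead: 0.30, numeralGroupSize: 2.5 },
  gemini:  { latinCharsPerToken: 4.0, arabicCharsPerToken: 1.6, diacriticOverhead: 0.10, latinDiacriticOverhead: 0.15, numeralGroupSize: 2.5 },
  claude:  { latinCharsPerToken: 3.5, arabicCharsPerToken: 1.1, diacriticOverhead: 0.20, latinDiacriticOverhead: 0.25, numeralGroupSize: 2.5 },
  grok:    { latinCharsPerToken: 4.0, arabicCharsPerToken: 1.3, diacriticOverhead: 0.15, latinDiacriticOverhead: 0.30, numeralGroupSize: 2.5 },
  generic: { latinCharsPerToken: 4.0, arabicCharsPerToken: 1.5, diacriticOverhead: 0.15, latinDiacriticOverhead: 0.25, numeralGroupSize: 3.0 },
};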

Character type handling

Arabic base characters

Counted using the provider-specific fertility rate:
import { estimateTokenCount, LLMProvider } from 'bitaboom';

// Pure Arabic text
const arabic = 'الكتاب';
estimateTokenCount(arabic, LLMProvider.OpenAI);
// 6 chars / 1.3 ≈ 5 tokens

Arabic diacritics (tashkeel)

BPE tokenizers merge diacritics with their base letters rather than emitting them as separate tokens:
const withDiacritics = 'بِسْمِ اللَّهِ';
const withoutDiacritics = 'بسم الله';

// Diacritics add overhead but don't create new tokens
estimateTokenCount(withoutDiacritics, LLMProvider.OpenAI); // base tokens
estimateTokenCount(withDiacritics, LLMProvider.OpenAI);    // base tokens + 15% diacritic overhead
Diacritics increase token count by adding overhead (10-20% depending on provider), not by creating separate tokens.

Tatweel (kashida)

Decorative elongation character (ـ):
const withTatweel = 'الرحـــمن';
const withoutTatweel = 'الرحمن';

// Tatweel is counted as an Arabic base character,
// so elongation has minimal impact on the token count
estimateTokenCount(withTatweel, LLMProvider.OpenAI);
estimateTokenCount(withoutTatweel, LLMProvider.OpenAI);

Latin diacritics

Used in Arabic transliteration (ā, ī, ū, ḥ, ṣ, etc.):
const transliteration = 'Muḥammad ibn ʿAbdullāh';
estimateTokenCount(transliteration, LLMProvider.OpenAI);
// Latin diacritics have higher overhead (30%) than Arabic diacritics (15%)
Latin diacritics often break tokens or create separate tokens, hence the higher overhead.

Numerals

Both Western and Arabic-Indic numerals are grouped:
const withNumbers = 'الآية 123 والصفحة ٤٥٦';
estimateTokenCount(withNumbers, LLMProvider.OpenAI);
// Numerals grouped by ~2.5 digits per token
// "123" ≈ 1 token, "٤٥٦" ≈ 1 token

Whitespace

Whitespace is typically absorbed into the following token:
const text = 'الله الرحمن الرحيم';
// Spaces don't create separate tokens
// They're merged with adjacent words

Advanced usage patterns

Cost estimation for bilingual content

import { estimateTokenCount, getArabicScore, LLMProvider } from 'bitaboom';

function estimateCost(text: string, provider: LLMProvider) {
  const tokens = estimateTokenCount(text, provider);
  const arabicScore = getArabicScore(text);
  
  // Provider pricing (example rates)
  const rates = {
    [LLMProvider.OpenAI]: 0.00003,    // $0.03 per 1K tokens
    [LLMProvider.Gemini]: 0.000025,   // $0.025 per 1K tokens
    [LLMProvider.Claude]: 0.000015,   // $0.015 per 1K tokens
    [LLMProvider.Grok]: 0.000035,     // $0.035 per 1K tokens
  };
  
  return {
    tokens,
    arabicRatio: (arabicScore * 100).toFixed(1) + '%',
    estimatedCost: tokens * (rates[provider] || 0.00003),
    provider
  };
}

const text = 'بسم الله الرحمن الرحيم In the name of Allah, the Most Gracious, the Most Merciful';
console.log(estimateCost(text, LLMProvider.Gemini));

Batch processing with provider selection

import { estimateTokenCount, LLMProvider } from 'bitaboom';

function processBatch(texts: string[]) {
  return texts.map(text => {
    const estimates = {
      openai: estimateTokenCount(text, LLMProvider.OpenAI),
      gemini: estimateTokenCount(text, LLMProvider.Gemini),
      claude: estimateTokenCount(text, LLMProvider.Claude),
      grok: estimateTokenCount(text, LLMProvider.Grok),
    };
    
    // Find most efficient provider
    const bestProvider = Object.entries(estimates)
      .sort(([, a], [, b]) => a - b)[0][0];
    
    return {
      text: text.substring(0, 50) + '...',
      estimates,
      recommendation: bestProvider
    };
  });
}

Context window management

import { estimateTokenCount, LLMProvider } from 'bitaboom';

function fitToContextWindow(
  prompt: string,
  documents: string[],
  provider: LLMProvider,
  maxTokens = 8000
): string[] {
  const promptTokens = estimateTokenCount(prompt, provider);
  let remainingTokens = maxTokens - promptTokens;
  const fitted: string[] = [];
  
  for (const doc of documents) {
    const docTokens = estimateTokenCount(doc, provider);
    if (docTokens <= remainingTokens) {
      fitted.push(doc);
      remainingTokens -= docTokens;
    } else {
      break;
    }
  }
  
  return fitted;
}
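
A quick usage sketch (arabicDocuments is an assumed array of document strings):
// Hypothetical usage: pack as many documents as fit under an 8K window
const fitted = fitToContextWindow(
  'لخص الوثائق التالية:', // "Summarize the following documents:"
  arabicDocuments,
  LLMProvider.Gemini,
  8000
);
console.log(`Fitted ${fitted.length} of ${arabicDocuments.length} documents`);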

Real-time token monitoring

import { estimateTokenCount, LLMProvider } from 'bitaboom';

class TokenBudgetMonitor {
  private totalTokens = 0;
  
  constructor(
    private readonly maxTokens: number,
    private readonly provider: LLMProvider
  ) {}
  
  addText(text: string): boolean {
    const tokens = estimateTokenCount(text, this.provider);
    
    if (this.totalTokens + tokens > this.maxTokens) {
      return false; // Budget exceeded
    }
    
    this.totalTokens += tokens;
    return true;
  }
  
  getRemainingBudget(): number {
    return this.maxTokens - this.totalTokens;
  }
  
  getUsagePercentage(): number {
    return (this.totalTokens / this.maxTokens) * 100;
  }
}

// Usage
const monitor = new TokenBudgetMonitor(4000, LLMProvider.Gemini);
monitor.addText('بسم الله الرحمن الرحيم');
console.log(`Remaining: ${monitor.getRemainingBudget()} tokens`);
console.log(`Usage: ${monitor.getUsagePercentage().toFixed(1)}%`);

Accuracy and limitations

1. High accuracy for typical content. Estimation is within 5-10% of actual token counts for most Arabic and bilingual text.
2. Less accurate for edge cases. Extreme cases (very short texts, unusual Unicode, heavy code-switching) may show higher variance.
3. Conservative by design. The algorithm tends to slightly overestimate to avoid context window issues.
4. Provider variations. Actual tokenizers may change over time; this implementation is based on research as of 2024.
For production cost accounting, always use the provider’s official tokenizer (e.g., tiktoken for OpenAI) for final billing calculations. Use Bitaboom’s estimator for planning and optimization.

Performance characteristics

The token estimation algorithm is highly optimized:
  • Single-pass O(N) iteration over code points
  • No regex matching (uses character code lookup tables)
  • Exclusive classification (no double-counting)
  • Run-length encoding for numeral grouping
import { estimateTokenCount, LLMProvider } from 'bitaboom';

// Fast enough for real-time UI updates
const largeText = arabicBook.join('\n'); // arabicBook: an assumed array of chapter strings, 100KB+ total
const start = performance.now();
const tokens = estimateTokenCount(largeText, LLMProvider.Gemini);
const duration = performance.now() - start;
console.log(`Estimated ${tokens} tokens in ${duration.toFixed(2)}ms`);
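
A rough sketch of what single-pass, exclusive classification over code points can look like (illustrative only, using simple range checks; not bitaboom's actual internals):
// Each code point lands in exactly one bucket (no double-counting)
function classifyCounts(text: string) {
  let arabic = 0, latin = 0, numerals = 0, diacritics = 0;
  for (const ch of text) {               // for...of iterates code points
    const cp = ch.codePointAt(0)!;
    if (cp >= 0x064b && cp <= 0x0652) diacritics++;   // common tashkeel range
    else if (cp >= 0x0600 && cp <= 0x06ff) {
      if (cp >= 0x0660 && cp <= 0x0669) numerals++;   // Arabic-Indic digits
      else arabic++;                                  // Arabic base letters
    }
    else if (cp >= 0x30 && cp <= 0x39) numerals++;    // Western digits
    else if ((cp >= 0x41 && cp <= 0x5a) || (cp >= 0x61 && cp <= 0x7a)) latin++;
  }
  return { arabic, latin, numerals, diacritics };
}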

Best practices

import { getArabicScore, estimateTokenCount, LLMProvider } from 'bitaboom';

function selectProvider(text: string) {
  const arabicScore = getArabicScore(text);
  
  if (arabicScore > 0.7) {
    // Heavily Arabic - use Gemini for efficiency
    return LLMProvider.Gemini;
  } else if (arabicScore < 0.3) {
    // Mostly English - provider choice less critical
    return LLMProvider.OpenAI;
  } else {
    // Mixed - test both
    return LLMProvider.Generic;
  }
}
For RAG systems with Arabic documents, use estimateTokenCount to determine optimal chunk sizes that respect provider-specific token limits.
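
For example, a chunker that splits on paragraph boundaries and caps each chunk by estimated tokens might look like this (a hypothetical helper, not a bitaboom API):
import { estimateTokenCount, LLMProvider } from 'bitaboom';

// Split a document into paragraph-aligned chunks that each stay
// under a provider-specific token budget
function chunkByTokens(
  document: string,
  provider: LLMProvider,
  maxTokensPerChunk = 512
): string[] {
  const chunks: string[] = [];
  let current = '';

  for (const paragraph of document.split('\n\n')) {
    const candidate = current ? current + '\n\n' + paragraph : paragraph;
    if (estimateTokenCount(candidate, provider) > maxTokensPerChunk && current) {
      chunks.push(current);   // current chunk is full; start a new one
      current = paragraph;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}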
