Bitaboom provides sophisticated LLM token estimation that accounts for the unique characteristics of Arabic text tokenization across different providers.

Quick start

import { estimateTokenCount, LLMProvider } from 'bitaboom';

// Default (Generic) estimation
const tokens = estimateTokenCount('بسم الله الرحمن الرحيم');

// Provider-specific estimation
const openaiTokens = estimateTokenCount('بسم الله الرحمن الرحيم', LLMProvider.OpenAI);
const geminiTokens = estimateTokenCount('بسم الله الرحمن الرحيم', LLMProvider.Gemini);
const claudeTokens = estimateTokenCount('بسم الله الرحمن الرحيم', LLMProvider.Claude);
const grokTokens = estimateTokenCount('بسم الله الرحمن الرحيم', LLMProvider.Grok);

Why Arabic needs special handling

Modern LLMs use Byte Pair Encoding (BPE) tokenization, which behaves dramatically differently for Arabic compared to English:
| Aspect | English | Arabic | Impact |
| --- | --- | --- | --- |
| Characters per token | ~4 | ~1.3 (OpenAI) | Arabic uses ~3x more tokens |
| UTF-8 bytes per char | 1 | 2 | Double the byte overhead |
| Diacritics | N/A | Merged with base letters | NOT separate tokens |
| Morphology | Simple | Rich (prefixes/suffixes) | More subword splits |
Using generic English-based token estimation for Arabic text can underestimate costs by 200-300%.
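To see the gap concretely, compare a short English phrase with its Arabic counterpart (illustrative only; exact counts depend on the text):
import { estimateTokenCount, LLMProvider } from 'bitaboom';

const english = 'In the name of Allah, the Most Gracious, the Most Merciful';
const arabic = 'بسم الله الرحمن الرحيم';

// English: ~4 characters per token; Arabic: ~1.3 (OpenAI),
// so Arabic costs roughly 3x more tokens per character
console.log('English:', estimateTokenCount(english, LLMProvider.OpenAI));
console.log('Arabic:', estimateTokenCount(arabic, LLMProvider.OpenAI));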

Provider comparison

Different LLM providers have different tokenization efficiency for Arabic:
| Provider | Arabic efficiency | Notes |
| --- | --- | --- |
| Gemini | Most efficient (~25% better than OpenAI) | Uses SentencePiece, optimized for multilingual text |
| OpenAI | Standard baseline | tiktoken BPE implementation |
| Grok | Similar to OpenAI | Standard BPE |
| Claude | Least efficient | Less optimized for Arabic |
| Generic | Balanced middle ground | Safe default for mixed workloads |

Token count comparison

For the same Arabic text “بسم الله الرحمن الرحيم” (Bismillah):
import { estimateTokenCount, LLMProvider } from 'bitaboom';

const text = 'بسم الله الرحمن الرحيم';

console.log('Gemini:', estimateTokenCount(text, LLMProvider.Gemini));   // ~11 tokens
console.log('OpenAI:', estimateTokenCount(text, LLMProvider.OpenAI));   // ~14 tokens
console.log('Grok:', estimateTokenCount(text, LLMProvider.Grok));       // ~14 tokens
console.log('Claude:', estimateTokenCount(text, LLMProvider.Claude));   // ~16 tokens
For cost optimization in production, choose Gemini for Arabic-heavy workloads to reduce token usage by ~25%.

Algorithm details

The estimation uses fertility rates (characters per token) rather than simple per-character weights:
tokens = arabicChars / arabicCharsPerToken
       + latinChars / latinCharsPerToken
       + numerals / numeralGroupSize
       + diacriticOverhead (additive per diacritic)
       + latinDiacriticOverhead (for ā, ī, ū, ḥ, ṣ, etc.)
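
In code, the formula maps to something like the following simplified sketch (not the library's internal implementation; it assumes the overheads are token fractions added per diacritic, with constants taken from the provider table in the next section):
// Sketch of the fertility-rate formula above. The character counts
// would come from a single classification pass over the text.
interface ProviderConfig {
  latinCharsPerToken: number;
  arabicCharsPerToken: number;
  diacriticOverhead: number;        // assumed: token fraction per Arabic diacritic
  latinDiacriticOverhead: number;   // assumed: token fraction per Latin diacritic
  numeralGroupSize: number;
}

function estimateFromCounts(
  counts: {
    arabic: number;
    latin: number;
    numerals: number;
    arabicDiacritics: number;
    latinDiacritics: number;
  },
  cfg: ProviderConfig
): number {
  // Base tokens from fertility rates (characters per token)
  const base =
    counts.arabic / cfg.arabicCharsPerToken +
    counts.latin / cfg.latinCharsPerToken +
    counts.numerals / cfg.numeralGroupSize;
  // Diacritics add fractional overhead rather than whole tokens
  const overhead =
    counts.arabicDiacritics * cfg.diacriticOverhead +
    counts.latinDiacritics * cfg.latinDiacriticOverhead;
  return Math.ceil(base + overhead);
}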

Provider configurations

| Provider | Latin chars/token | Arabic chars/token | Diacritic overhead | Latin diacritic overhead | Numeral group size |
| --- | --- | --- | --- | --- | --- |
| OpenAI | 4.0 | 1.3 | +15% | +30% | 2.5 |
| Gemini | 4.0 | 1.6 | +10% | +15% | 2.5 |
| Claude | 3.5 | 1.1 | +20% | +25% | 2.5 |
| Grok | 4.0 | 1.3 | +15% | +30% | 2.5 |
| Generic | 4.0 | 1.5 | +15% | +25% | 3.0 |
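
Expressed as data, the table corresponds to a record roughly like this (a hypothetical shape reusing the ProviderConfig interface sketched above, not bitaboom's exported API):
// Provider constants from the table, with percentage overheads
// written as token fractions per diacritic
const PROVIDER_CONFIGS: Record<string, ProviderConfig> = {
  openai:  { latinCharsPerToken: 4.0, arabicCharsPerToken: 1.3, diacriticOverhead: 0.15, latinDiacriticOverhead: 0.30, numeralGroupSize: 2.5 },
  gemini:  { latinCharsPerToken: 4.0, arabicCharsPerToken: 1.6, diacriticOverhead: 0.10, latinDiacriticOverhead: 0.15, numeralGroupSize: 2.5 },
  claude:  { latinCharsPerToken: 3.5, arabicCharsPerToken: 1.1, diacriticOverhead: 0.20, latinDiacriticOverhead: 0.25, numeralGroupSize: 2.5 },
  grok:    { latinCharsPerToken: 4.0, arabicCharsPerToken: 1.3, diacriticOverhead: 0.15, latinDiacriticOverhead: 0.30, numeralGroupSize: 2.5 },
  generic: { latinCharsPerToken: 4.0, arabicCharsPerToken: 1.5, diacriticOverhead: 0.15, latinDiacriticOverhead: 0.25, numeralGroupSize: 3.0 },
};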

Character type handling

Arabic base characters

Counted using the provider-specific fertility rate:
import { estimateTokenCount, LLMProvider } from 'bitaboom';

// Pure Arabic text
const arabic = 'الكتاب';
estimateTokenCount(arabic, LLMProvider.OpenAI);
// 6 chars / 1.3 ≈ 5 tokens

Arabic diacritics (tashkeel)

BPE tokenizers merge diacritics with their base letters rather than emitting them as separate tokens:
const withDiacritics = 'بِسْمِ اللَّهِ';
const withoutDiacritics = 'بسم الله';

// Diacritics add overhead but don't create new tokens
estimateTokenCount(withoutDiacritics, LLMProvider.OpenAI); // base tokens
estimateTokenCount(withDiacritics, LLMProvider.OpenAI);    // base tokens + 15% diacritic overhead
Diacritics increase token count by adding overhead (10-20% depending on provider), not by creating separate tokens.

Tatweel (kashida)

Decorative elongation character (ـ):
const withTatweel = 'الرحـــمن';
const withoutTatweel = 'الرحمن';

// Tatweel is counted as an Arabic base character,
// so elongation has minimal impact on the token count
estimateTokenCount(withTatweel, LLMProvider.OpenAI);
estimateTokenCount(withoutTatweel, LLMProvider.OpenAI);

Latin diacritics

Used in Arabic transliteration (ā, ī, ū, ḥ, ṣ, etc.):
const transliteration = 'Muḥammad ibn ʿAbdullāh';
estimateTokenCount(transliteration, LLMProvider.OpenAI);
// Latin diacritics have higher overhead (30%) than Arabic diacritics (15%)
Latin diacritics often break tokens or create separate tokens, hence the higher overhead.

Numerals

Both Western and Arabic-Indic numerals are grouped:
const withNumbers = 'الآية 123 والصفحة ٤٥٦';
estimateTokenCount(withNumbers, LLMProvider.OpenAI);
// Numerals grouped by ~2.5 digits per token
// "123" ≈ 1 token, "٤٥٦" ≈ 1 token

Whitespace

Whitespace is typically absorbed into the following token:
const text = 'الله الرحمن الرحيم';
// Spaces don't create separate tokens
// They're merged with adjacent words

Advanced usage patterns

Cost estimation for bilingual content

import { estimateTokenCount, getArabicScore, LLMProvider } from 'bitaboom';

function estimateCost(text: string, provider: LLMProvider) {
  const tokens = estimateTokenCount(text, provider);
  const arabicScore = getArabicScore(text);
  
  // Provider pricing (example rates)
  const rates = {
    [LLMProvider.OpenAI]: 0.00003,    // $0.03 per 1K tokens
    [LLMProvider.Gemini]: 0.000025,   // $0.025 per 1K tokens
    [LLMProvider.Claude]: 0.000015,   // $0.015 per 1K tokens
    [LLMProvider.Grok]: 0.000035,     // $0.035 per 1K tokens
  };
  
  return {
    tokens,
    arabicRatio: (arabicScore * 100).toFixed(1) + '%',
    estimatedCost: tokens * (rates[provider] || 0.00003),
    provider
  };
}

const text = 'بسم الله الرحمن الرحيم In the name of Allah, the Most Gracious, the Most Merciful';
console.log(estimateCost(text, LLMProvider.Gemini));

Batch processing with provider selection

import { estimateTokenCount, LLMProvider } from 'bitaboom';

function processBatch(texts: string[]) {
  return texts.map(text => {
    const estimates = {
      openai: estimateTokenCount(text, LLMProvider.OpenAI),
      gemini: estimateTokenCount(text, LLMProvider.Gemini),
      claude: estimateTokenCount(text, LLMProvider.Claude),
      grok: estimateTokenCount(text, LLMProvider.Grok),
    };
    
    // Find most efficient provider
    const bestProvider = Object.entries(estimates)
      .sort(([, a], [, b]) => a - b)[0][0];
    
    return {
      text: text.substring(0, 50) + '...',
      estimates,
      recommendation: bestProvider
    };
  });
}

Context window management

import { estimateTokenCount, LLMProvider } from 'bitaboom';

function fitToContextWindow(
  prompt: string,
  documents: string[],
  provider: LLMProvider,
  maxTokens = 8000
): string[] {
  const promptTokens = estimateTokenCount(prompt, provider);
  let remainingTokens = maxTokens - promptTokens;
  const fitted: string[] = [];
  
  for (const doc of documents) {
    const docTokens = estimateTokenCount(doc, provider);
    if (docTokens <= remainingTokens) {
      fitted.push(doc);
      remainingTokens -= docTokens;
    } else {
      break;
    }
  }
  
  return fitted;
}
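
A quick usage sketch (arabicDocuments is an assumed array of document strings):
// Hypothetical usage: pack as many documents as fit under an 8K window
const fitted = fitToContextWindow(
  'لخص الوثائق التالية:', // "Summarize the following documents:"
  arabicDocuments,
  LLMProvider.Gemini,
  8000
);
console.log(`Fitted ${fitted.length} of ${arabicDocuments.length} documents`);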

Real-time token monitoring

import { estimateTokenCount, LLMProvider } from 'bitaboom';

class TokenBudgetMonitor {
  private totalTokens = 0;
  
  constructor(
    private readonly maxTokens: number,
    private readonly provider: LLMProvider
  ) {}
  
  addText(text: string): boolean {
    const tokens = estimateTokenCount(text, this.provider);
    
    if (this.totalTokens + tokens > this.maxTokens) {
      return false; // Budget exceeded
    }
    
    this.totalTokens += tokens;
    return true;
  }
  
  getRemainingBudget(): number {
    return this.maxTokens - this.totalTokens;
  }
  
  getUsagePercentage(): number {
    return (this.totalTokens / this.maxTokens) * 100;
  }
}

// Usage
const monitor = new TokenBudgetMonitor(4000, LLMProvider.Gemini);
monitor.addText('بسم الله الرحمن الرحيم');
console.log(`Remaining: ${monitor.getRemainingBudget()} tokens`);
console.log(`Usage: ${monitor.getUsagePercentage().toFixed(1)}%`);

Accuracy and limitations

1. High accuracy for typical content. Estimation is within 5-10% of actual token counts for most Arabic and bilingual text.
2. Less accurate for edge cases. Extreme cases (very short texts, unusual Unicode, heavy code-switching) may show higher variance.
3. Conservative by design. The algorithm tends to slightly overestimate to avoid context window issues.
4. Provider variations. Actual tokenizers may change over time; this implementation is based on research as of 2024.
For production cost accounting, always use the provider’s official tokenizer (e.g., tiktoken for OpenAI) for final billing calculations. Use Bitaboom’s estimator for planning and optimization.

Performance characteristics

The token estimation algorithm is highly optimized:
  • Single-pass O(N) iteration over code points
  • No regex matching (uses character code lookup tables)
  • Exclusive classification (no double-counting)
  • Run-length encoding for numeral grouping
import { estimateTokenCount, LLMProvider } from 'bitaboom';

// Fast enough for real-time UI updates
const largeText = arabicBook.join('\n'); // arabicBook: an assumed array of chapter strings, 100KB+ total
const start = performance.now();
const tokens = estimateTokenCount(largeText, LLMProvider.Gemini);
const duration = performance.now() - start;
console.log(`Estimated ${tokens} tokens in ${duration.toFixed(2)}ms`);
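
A rough sketch of what single-pass, exclusive classification over code points can look like (illustrative only, using simple range checks; not bitaboom's actual internals):
// Each code point lands in exactly one bucket (no double-counting)
function classifyCounts(text: string) {
  let arabic = 0, latin = 0, numerals = 0, diacritics = 0;
  for (const ch of text) {               // for...of iterates code points
    const cp = ch.codePointAt(0)!;
    if (cp >= 0x064b && cp <= 0x0652) diacritics++;   // common tashkeel range
    else if (cp >= 0x0600 && cp <= 0x06ff) {
      if (cp >= 0x0660 && cp <= 0x0669) numerals++;   // Arabic-Indic digits
      else arabic++;                                  // Arabic base letters
    }
    else if (cp >= 0x30 && cp <= 0x39) numerals++;    // Western digits
    else if ((cp >= 0x41 && cp <= 0x5a) || (cp >= 0x61 && cp <= 0x7a)) latin++;
  }
  return { arabic, latin, numerals, diacritics };
}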

Best practices

import { getArabicScore, estimateTokenCount, LLMProvider } from 'bitaboom';

function selectProvider(text: string) {
  const arabicScore = getArabicScore(text);
  
  if (arabicScore > 0.7) {
    // Heavily Arabic - use Gemini for efficiency
    return LLMProvider.Gemini;
  } else if (arabicScore < 0.3) {
    // Mostly English - provider choice less critical
    return LLMProvider.OpenAI;
  } else {
    // Mixed - test both
    return LLMProvider.Generic;
  }
}
For RAG systems with Arabic documents, use estimateTokenCount to determine optimal chunk sizes that respect provider-specific token limits.
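
For example, a chunker that splits on paragraph boundaries and caps each chunk by estimated tokens might look like this (a hypothetical helper, not a bitaboom API):
import { estimateTokenCount, LLMProvider } from 'bitaboom';

// Split a document into paragraph-aligned chunks that each stay
// under a provider-specific token budget
function chunkByTokens(
  document: string,
  provider: LLMProvider,
  maxTokensPerChunk = 512
): string[] {
  const chunks: string[] = [];
  let current = '';

  for (const paragraph of document.split('\n\n')) {
    const candidate = current ? current + '\n\n' + paragraph : paragraph;
    if (estimateTokenCount(candidate, provider) > maxTokensPerChunk && current) {
      chunks.push(current);   // current chunk is full; start a new one
      current = paragraph;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}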
