
arabicNumeralToNumber

Converts Arabic-Indic numerals (٠-٩) to a JavaScript number. This function finds all Arabic-Indic digits in the input string and converts them to their corresponding Arabic (Western) digits, then parses the result as an integer. Arabic-Indic digits mapping:
  • ٠ → 0, ١ → 1, ٢ → 2, ٣ → 3, ٤ → 4
  • ٥ → 5, ٦ → 6, ٧ → 7, ٨ → 8, ٩ → 9
arabic
string
The string containing Arabic-Indic numerals to convert
Returns: number - The parsed integer value of the converted numerals. Returns NaN if no valid Arabic-Indic digits are found.
arabicNumeralToNumber("١٢٣"); // returns 123
arabicNumeralToNumber("٥٠"); // returns 50
arabicNumeralToNumber("abc١٢٣xyz"); // returns 123 (non-digits ignored)
arabicNumeralToNumber(""); // returns NaN
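The description above can be approximated with a short sketch. This is an illustrative re-implementation under the stated behavior, not the library's source; the name `arabicNumeralToNumberSketch` is hypothetical.

```typescript
// Illustrative sketch only; not the library's actual source.
function arabicNumeralToNumberSketch(arabic: string): number {
  // Collect every Arabic-Indic digit (U+0660 to U+0669) and ignore everything else.
  const digits = arabic.match(/[\u0660-\u0669]/g);
  if (digits === null) {
    return NaN; // no Arabic-Indic digits found
  }
  // Each digit's offset from U+0660 is its numeric value.
  const western = digits
    .map((d) => String(d.charCodeAt(0) - 0x0660))
    .join("");
  return parseInt(western, 10);
}
```

Because all digits in the string are collected before parsing, digits separated by other characters are concatenated in this sketch.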

cleanExtremeArabicUnderscores

Removes Arabic tatweel characters (ـ), referred to here as underscores, that appear at the beginning or end of a line or of the text. Does not affect Hijri dates (e.g., 1424هـ) or specific Arabic terms.
text
string
The input text to apply the rule to
Returns: string - The modified text with extreme underscores removed.
cleanExtremeArabicUnderscores("ـThis is a textـ"); // "This is a text"

convertUrduSymbolsToArabic

Converts Urdu symbols to their Arabic equivalents.
text
string
The input text containing Urdu symbols
Returns: string - The modified text with Urdu symbols converted to Arabic symbols.
convertUrduSymbolsToArabic("ھذا"); // "هذا"
convertUrduSymbolsToArabic("ی"); // "ي"
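A minimal sketch of this kind of substitution, covering only the two symbols shown in the examples above. The mapping table and the name `convertUrduSymbolsSketch` are assumptions for illustration; the library may handle additional symbols.

```typescript
// Hypothetical mapping for illustration; the library may cover more symbols.
const URDU_TO_ARABIC: Record<string, string> = {
  "\u06BE": "\u0647", // ھ (heh doachashmee) -> ه (heh)
  "\u06CC": "\u064A", // ی (Farsi yeh) -> ي (Arabic yeh)
};

function convertUrduSymbolsSketch(text: string): string {
  // Replace each mapped character; anything else is left untouched.
  return text.replace(/[\u06BE\u06CC]/g, (ch) => URDU_TO_ARABIC[ch] ?? ch);
}
```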

getArabicScore

Calculates the proportion of Arabic characters in text relative to total non-whitespace, non-digit characters. Digits (ASCII and Arabic-Indic variants) are excluded from both numerator and denominator.
text
string
The input text to analyze
Returns: number - A decimal between 0 and 1 representing the Arabic character ratio (0 = no Arabic, 1 = all Arabic).
getArabicScore("مرحبا"); // 1.0 (100% Arabic)
getArabicScore("Hello مرحبا"); // ~0.5 (mixed)
getArabicScore("Hello"); // 0.0 (no Arabic)
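The ratio described above can be sketched as follows. This is not the library's implementation: it assumes "Arabic character" means the Unicode block U+0600 to U+06FF, treats ASCII, Arabic-Indic, and Extended Arabic-Indic digits as excluded, and returns 0 when nothing countable remains. The name `arabicScoreSketch` is hypothetical.

```typescript
// Illustrative sketch; the character ranges are assumptions.
function arabicScoreSketch(text: string): number {
  let arabic = 0;
  let total = 0;
  for (const ch of text) {
    if (/\s/.test(ch)) continue; // whitespace is not counted
    if (/[0-9\u0660-\u0669\u06F0-\u06F9]/.test(ch)) continue; // digits are not counted
    total++;
    if (/[\u0600-\u06FF]/.test(ch)) arabic++;
  }
  // Avoid dividing by zero for empty or digit-only input.
  return total === 0 ? 0 : arabic / total;
}
```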

findLastPunctuation

Finds the position of the last punctuation character in a string.
text
string
The text to search through
Returns: number - The index of the last punctuation character, or -1 if none found.
const text = "Hello world! How are you?";
const lastPuncIndex = findLastPunctuation(text);
// Result: 24 (position of the last '?')

const noPuncText = "Hello world";
const notFound = findLastPunctuation(noPuncText);
// Result: -1 (no punctuation found)
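One way to implement such a search is a backward scan, sketched below. The set of characters treated as punctuation is an assumption; the library's actual set is not documented here. The name `findLastPunctuationSketch` is hypothetical.

```typescript
// Assumed punctuation set (Latin and Arabic marks); the library's set may differ.
const PUNCTUATION = new Set([".", "!", "?", ",", ";", ":", "\u061F", "\u061B", "\u060C"]);

function findLastPunctuationSketch(text: string): number {
  // Scan from the end so the first hit is the last punctuation mark.
  for (let i = text.length - 1; i >= 0; i--) {
    if (PUNCTUATION.has(text.charAt(i))) {
      return i;
    }
  }
  return -1;
}
```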

fixTrailingWow

Fixes a trailing “و” (waw) that is separated from the word it belongs to, so that “عليكم و رحمة” becomes “عليكم ورحمة”. This function attempts to join a detached “و” to the following word, a pattern that is particularly common in greetings.
text
string
The input text containing the “و” character
Returns: string - The modified text with unnecessary trailing “و” characters corrected.
fixTrailingWow("السلام عليكم و رحمة"); // "السلام عليكم ورحمة"
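A minimal sketch of the transformation shown in the example, assuming the fix amounts to dropping the space after a standalone “و”. The library may apply further conditions; the name `fixTrailingWowSketch` is hypothetical.

```typescript
// Illustrative sketch; assumes only space-delimited "و" needs fixing.
function fixTrailingWowSketch(text: string): string {
  // A standalone "و" (U+0648) between two spaces is attached to the next word.
  return text.replace(/ \u0648 /g, " \u0648");
}
```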

addSpaceBetweenArabicTextAndNumbers

Inserts a space between Arabic text and numbers.
text
string
The input text containing Arabic text followed by numbers
Returns: string - The modified text with spaces inserted between Arabic text and numbers.
addSpaceBetweenArabicTextAndNumbers("الآية37"); // "الآية 37"
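The insertion can be sketched with a single regular expression. This assumes "Arabic text" means a letter in U+0621 to U+064A and "numbers" means ASCII or Arabic-Indic digits; the library's exact ranges are not documented here. The name `addSpaceSketch` is hypothetical.

```typescript
// Illustrative sketch; the character ranges are assumptions.
function addSpaceSketch(text: string): string {
  // Capture an Arabic letter directly followed by a digit and put a space between them.
  return text.replace(/([\u0621-\u064A])([0-9\u0660-\u0669])/g, "$1 $2");
}
```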

removeNonIndexSignatures

Removes single-digit numbers surrounded by Arabic text. Also removes dashes (-) not followed by a number. For example, removes ‘3’ from ‘وهب 3 وقال’ but does not remove ‘121’ from ‘لوحه 121 الجرح’.
text
string
The input text to apply the rule to
Returns: string - The modified text with non-index numbers and dashes removed.

removeSingularCodes

Removes characters enclosed in square brackets [] or parentheses () if they are Arabic letters or Arabic-Indic numerals.
text
string
The input text to apply the rule to
Returns: string - The modified text with singular codes removed.
removeSingularCodes("[س]"); // ""
removeSingularCodes("(س)"); // ""
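A sketch of the rule, assuming "singular" means exactly one Arabic letter or Arabic-Indic digit between matching brackets. This is an interpretation of the description, not the library's source; the name `removeSingularCodesSketch` is hypothetical.

```typescript
// Illustrative sketch; assumes a single enclosed character and matching brackets.
function removeSingularCodesSketch(text: string): string {
  return text.replace(
    /\[[\u0621-\u064A\u0660-\u0669]\]|\([\u0621-\u064A\u0660-\u0669]\)/g,
    "",
  );
}
```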

removeSolitaryArabicLetters

Removes solitary Arabic letters unless they are the ‘ha’ letter, which is used in Hijri years.
text
string
The input text to apply the rule to
Returns: string - The modified text with solitary Arabic letters removed.
removeSolitaryArabicLetters("ب ا الكلمات ت"); // "ا الكلمات"

replaceEnglishPunctuationWithArabic

Replaces English punctuation (question mark and semicolon) with their Arabic equivalents.
text
string
The input text to apply the rule to
Returns: string - The modified text with English punctuation replaced by Arabic punctuation.
replaceEnglishPunctuationWithArabic("What?"); // "What؟"
replaceEnglishPunctuationWithArabic("Item; Another"); // "Item؛ Another"
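The replacement itself is a direct character substitution, sketched below using the Unicode code points for the Arabic question mark (U+061F) and Arabic semicolon (U+061B). The name `replacePunctuationSketch` is hypothetical, and the library may apply additional conditions.

```typescript
// Illustrative sketch of the two substitutions named in the description.
function replacePunctuationSketch(text: string): string {
  return text
    .replace(/\?/g, "\u061F") // ? -> ؟
    .replace(/;/g, "\u061B"); // ; -> ؛
}
```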

countWords

Counts words in text by splitting on whitespace. Works for both Arabic and English text.
text
string
The text to count words in
Returns: number - Number of words in the text.
countWords("مرحبا بك"); // 2
countWords("Hello world"); // 2
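Whitespace-based counting can be sketched in a few lines. The handling of empty input (returning 0) is an assumption, since the description does not cover it; the name `countWordsSketch` is hypothetical.

```typescript
// Illustrative sketch; empty or whitespace-only input is assumed to count as 0 words.
function countWordsSketch(text: string): number {
  // Split on runs of whitespace and drop the empty string produced by empty input.
  return text.trim().split(/\s+/).filter((word) => word.length > 0).length;
}
```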

LLM token estimation

LLMProvider

Supported LLM providers for token estimation. Each provider has different tokenization characteristics, depending on its tokenizer implementation.
enum LLMProvider {
  /** OpenAI GPT models (GPT-3.5, GPT-4, GPT-4o) - uses tiktoken */
  OpenAI = 'openai',
  /** Google Gemini models - uses SentencePiece, 25% more efficient for multilingual */
  Gemini = 'gemini',
  /** Anthropic Claude models - less efficient for Arabic */
  Claude = 'claude',
  /** xAI Grok models - similar to OpenAI */
  Grok = 'grok',
  /** Generic/default estimation - balanced middle ground */
  Generic = 'generic',
}

estimateTokenCount

LLM-aware token estimation with provider-specific configurations. Uses a single-pass O(N) classifier for performance and correctness. Algorithm features:
  • Single pass iteration over code points (avoiding memory spikes from match() arrays)
  • Exclusive classification (preventing double-counting overlaps)
  • Additive overhead application (preventing overhead bleeding into other scripts)
  • Run-length encoding approximation for numerals (better BPE simulation)
Research findings:
  • OpenAI: ~4 chars/token English, ~1.3 chars/token Arabic (roughly 3x more tokens for Arabic than for English text of the same length)
  • Gemini: 25% more efficient than OpenAI for Arabic (SentencePiece-based)
  • Claude: ~3.5 chars/token English, less efficient for Arabic
  • Grok: Similar to OpenAI (standard BPE)
text
string
The input text to estimate tokens for
provider
LLMProvider
Default: LLMProvider.Generic
The LLM provider to use for estimation
Returns: number - Estimated token count.
import { estimateTokenCount, LLMProvider } from 'bitaboom';

// Using default (Generic) provider
const tokens = estimateTokenCount("مرحبا بك في العالم");

// Using specific provider
const openaiTokens = estimateTokenCount(
  "مرحبا بك في العالم",
  LLMProvider.OpenAI
);

const geminiTokens = estimateTokenCount(
  "مرحبا بك في العالم",
  LLMProvider.Gemini
);
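To make the algorithm features listed above more concrete, here is a rough sketch of a single-pass, exclusive classifier with a run-length approximation for digits. It is not the library's implementation: the `RatioConfig` shape, the `SKETCH_CONFIG` values (loosely taken from the OpenAI figures above), the 3-digits-per-token assumption, and the one-token-per-character fallback for other scripts are all hypothetical, and the additive overhead step is omitted.

```typescript
// Hypothetical configuration; field names and values are assumptions.
interface RatioConfig {
  latinCharsPerToken: number;
  arabicCharsPerToken: number;
  digitsPerToken: number;
}

const SKETCH_CONFIG: RatioConfig = {
  latinCharsPerToken: 4,
  arabicCharsPerToken: 1.3,
  digitsPerToken: 3,
};

function estimateTokensSketch(text: string, cfg: RatioConfig = SKETCH_CONFIG): number {
  let latin = 0;
  let arabic = 0;
  let other = 0;
  let digitRun = 0;
  let digitTokens = 0;

  // Convert a finished run of digits into tokens (run-length approximation).
  const flushDigits = (): void => {
    if (digitRun > 0) {
      digitTokens += Math.ceil(digitRun / cfg.digitsPerToken);
      digitRun = 0;
    }
  };

  // Single pass over code points; each one lands in exactly one bucket.
  for (const ch of text) {
    const cp = ch.codePointAt(0) ?? 0;
    const isDigit = (cp >= 0x30 && cp <= 0x39) || (cp >= 0x0660 && cp <= 0x0669);
    if (isDigit) {
      digitRun++; // checked first, so Arabic-Indic digits are never counted as Arabic letters
      continue;
    }
    flushDigits();
    if (cp >= 0x0600 && cp <= 0x06ff) arabic++;
    else if (cp < 0x80) latin++; // ASCII letters, spaces, punctuation
    else other++; // assumed: one token per character for other scripts
  }
  flushDigits();

  return Math.ceil(
    latin / cfg.latinCharsPerToken + arabic / cfg.arabicCharsPerToken + other + digitTokens,
  );
}
```

Checking digits before the Arabic range is what keeps the classification exclusive in this sketch: a character such as ١ contributes to the digit run only, never to the Arabic letter count.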
