arabicNumeralToNumber
Converts Arabic-Indic numerals (٠-٩) to a JavaScript number. This function finds all Arabic-Indic digits in the input string and converts them to their corresponding Arabic (Western) digits, then parses the result as an integer.
Arabic-Indic digits mapping:
- ٠ → 0, ١ → 1, ٢ → 2, ٣ → 3, ٤ → 4
- ٥ → 5, ٦ → 6, ٧ → 7, ٨ → 8, ٩ → 9
The string containing Arabic-Indic numerals to convert
number - The parsed integer value of the converted numerals. Returns NaN if no valid Arabic-Indic digits are found.
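The conversion described above can be sketched as follows. This is a minimal illustration, not the library's actual source; the name `arabicNumeralToNumberSketch` is hypothetical, and it assumes that non-digit characters in the input are simply ignored.

```typescript
// Hypothetical sketch of the documented behaviour; the real implementation may differ.
function arabicNumeralToNumberSketch(input: string): number {
    // Collect only Arabic-Indic digits (U+0660 to U+0669).
    const digits = input.match(/[\u0660-\u0669]/g);
    if (digits === null) {
        // No Arabic-Indic digits found.
        return NaN;
    }
    // Each Arabic-Indic digit sits at a fixed offset from U+0660 (the digit zero).
    const western = digits
        .map((d) => String(d.charCodeAt(0) - 0x0660))
        .join("");
    return parseInt(western, 10);
}
```

Because the mapping is a fixed code-point offset, no lookup table is needed in this sketch.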
cleanExtremeArabicUnderscores
Removes extreme Arabic underscores (ـ) that appear at the beginning or end of a line or in text. Does not affect Hijri dates (e.g., 1424هـ) or specific Arabic terms.
The input text to apply the rule to
string - The modified text with extreme underscores removed.
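One way to approximate this rule is shown below. It is a rough sketch under stated assumptions: the name `cleanExtremeTatweelSketch` is hypothetical, the Hijri exception is modelled only as "keep a trailing ـ that directly follows ه", and the "specific Arabic terms" mentioned above are not handled because the source does not list them.

```typescript
// Hypothetical sketch; the exception handling here is an assumption, not the library's rule set.
function cleanExtremeTatweelSketch(text: string): string {
    return text
        // Drop any run of tatweel (U+0640) at the start of a line.
        .replace(/^\u0640+/gm, "")
        // Drop a run of tatweel at the end of a line unless the character
        // before it is ha (U+0647), as in the Hijri marker "هـ".
        .replace(/([^\u0647\n])\u0640+$/gm, "$1");
}
```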
convertUrduSymbolsToArabic
Converts Urdu symbols to their Arabic equivalents.
The input text containing Urdu symbols
string - The modified text with Urdu symbols converted to Arabic symbols.
getArabicScore
Calculates the proportion of Arabic characters in text relative to total non-whitespace, non-digit characters. Digits (ASCII and Arabic-Indic variants) are excluded from both numerator and denominator.
The input text to analyze
number - A decimal between 0 and 1 representing the Arabic character ratio (0 = no Arabic, 1 = all Arabic).
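The ratio can be illustrated with a short sketch. The name `arabicScoreSketch` is hypothetical, and several details are assumptions: "Arabic" is taken to mean the U+0600 to U+06FF block, Extended Arabic-Indic digits are excluded along with the two digit sets named above, punctuation counts toward the denominator, and text with no countable characters scores 0.

```typescript
// Hypothetical sketch; character ranges and the empty-input result are assumptions.
function arabicScoreSketch(text: string): number {
    let arabic = 0;
    let total = 0;
    // Iterates UTF-16 code units, which is sufficient for the Arabic block (all in the BMP).
    for (let i = 0; i < text.length; i++) {
        const code = text.charCodeAt(i);
        if (/\s/.test(text.charAt(i))) continue; // whitespace is ignored
        const isDigit =
            (code >= 0x30 && code <= 0x39) ||     // ASCII 0-9
            (code >= 0x0660 && code <= 0x0669) || // Arabic-Indic digits
            (code >= 0x06f0 && code <= 0x06f9);   // Extended Arabic-Indic digits
        if (isDigit) continue; // digits count for neither side of the ratio
        total++;
        if (code >= 0x0600 && code <= 0x06ff) arabic++;
    }
    return total === 0 ? 0 : arabic / total;
}
```

With these assumptions, a string of two Arabic letters, two Latin letters and two digits scores 0.5.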
findLastPunctuation
Finds the position of the last punctuation character in a string.
The text to search through
number - The index of the last punctuation character, or -1 if none found.
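A backwards scan is one straightforward way to get this result. The sketch below is illustrative only: `findLastPunctuationSketch` and `PUNCTUATION_SKETCH` are hypothetical names, and the punctuation set is an assumption because the source does not say which characters count.

```typescript
// Assumed punctuation set: common ASCII marks plus Arabic comma, semicolon and question mark.
const PUNCTUATION_SKETCH = ".,;:!?\u060C\u061B\u061F";

// Hypothetical sketch: walk from the end and stop at the first punctuation mark seen.
function findLastPunctuationSketch(text: string): number {
    for (let i = text.length - 1; i >= 0; i--) {
        if (PUNCTUATION_SKETCH.indexOf(text.charAt(i)) !== -1) {
            return i;
        }
    }
    return -1; // no punctuation found
}
```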
fixTrailingWow
Fixes the trailing “و” (wow) in phrases such as “عليكم و رحمة” to “عليكم ورحمة”. This function attempts to correct phrases where “و” appears unnecessarily, particularly in greetings.
The input text containing the “و” character
string - The modified text with unnecessary trailing “و” characters corrected.
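The documented example can be reproduced with a single regular expression. This is a sketch, not the library's code: `fixTrailingWowSketch` is a hypothetical name, and it assumes the rule is "a standalone و followed by whitespace and an Arabic character is joined to the next word".

```typescript
// Hypothetical sketch; the exact matching rule is an assumption.
function fixTrailingWowSketch(text: string): string {
    // Match a standalone waw (U+0648) preceded by start-of-string or whitespace
    // and followed by whitespace plus an Arabic character; drop the whitespace after it.
    return text.replace(/(^|\s)\u0648\s+(?=[\u0600-\u06FF])/g, "$1\u0648");
}
```

Applied to “عليكم و رحمة”, this sketch yields “عليكم ورحمة”.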
addSpaceBetweenArabicTextAndNumbers
Inserts a space between Arabic text and numbers.
The input text containing Arabic text followed by numbers
string - The modified text with spaces inserted between Arabic text and numbers.
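A minimal sketch of this rule follows. The name `addSpaceSketch` is hypothetical, and it assumes that "Arabic text" means the letters U+0621 to U+064A and that both ASCII and Arabic-Indic digits count as numbers.

```typescript
// Hypothetical sketch; the letter and digit ranges are assumptions.
function addSpaceSketch(text: string): string {
    // Insert a space wherever an Arabic letter is immediately followed by a digit.
    return text.replace(/([\u0621-\u064A])([0-9\u0660-\u0669])/g, "$1 $2");
}
```

Text that already has a space between the word and the number is left unchanged, since the pattern only matches a letter directly adjacent to a digit.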
removeNonIndexSignatures
Removes single-digit numbers surrounded by Arabic text. Also removes dashes (-) not followed by a number. For example, removes ‘3’ from ‘وهب 3 وقال’ but does not remove ‘121’ from ‘لوحه 121 الجرح’.
The input text to apply the rule to
string - The modified text with non-index numbers and dashes removed.
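The two documented examples can be reproduced with the sketch below. It is an approximation under assumptions: `removeNonIndexSketch` is a hypothetical name, "surrounded by Arabic text" is read as whitespace-separated Arabic letters on both sides, and "not followed by a number" is read as "not immediately followed by an ASCII digit".

```typescript
// Hypothetical sketch; how the library delimits "surrounded" is an assumption.
function removeNonIndexSketch(text: string): string {
    return text
        // A single ASCII digit between two Arabic letters (with whitespace) is dropped,
        // leaving one space between the words.
        .replace(/([\u0621-\u064A])\s+\d\s+(?=[\u0621-\u064A])/g, "$1 ")
        // A dash that is not immediately followed by a digit is dropped.
        .replace(/-(?!\d)/g, "");
}
```

A multi-digit number such as 121 survives because the pattern requires whitespace right after the first digit.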
removeSingularCodes
Removes characters enclosed in square brackets [] or parentheses () if they are Arabic letters or Arabic-Indic numerals.
The input text to apply the rule to
string - The modified text with singular codes removed.
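As an illustration, the rule might look like the sketch below. The name `removeSingularCodesSketch` is hypothetical, and it assumes that "singular" means exactly one character inside the brackets; longer bracketed content is left alone.

```typescript
// Hypothetical sketch; the single-character restriction is an assumption.
function removeSingularCodesSketch(text: string): string {
    // One Arabic letter (U+0621 to U+064A) or Arabic-Indic digit (U+0660 to U+0669)
    // wrapped in matching [] or () is removed together with its brackets.
    return text.replace(
        /\[[\u0621-\u064A\u0660-\u0669]\]|\([\u0621-\u064A\u0660-\u0669]\)/g,
        ""
    );
}
```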
removeSolitaryArabicLetters
Removes solitary Arabic letters unless they are the ‘ha’ letter, which is used in Hijri years.
The input text to apply the rule to
string - The modified text with solitary Arabic letters removed.
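A rough sketch of this rule is given below. The name `removeSolitarySketch` is hypothetical, and it assumes "solitary" means a single Arabic letter delimited by whitespace or the string boundaries. The ‘ha’ exception is implemented by leaving U+0647 out of the character class.

```typescript
// Hypothetical sketch; the definition of "solitary" is an assumption.
function removeSolitarySketch(text: string): string {
    // Character class covers Arabic letters U+0621 to U+064A except ha (U+0647).
    // The preceding whitespace (or string start) is kept; the letter itself is dropped.
    return text.replace(/(^|\s)[\u0621-\u0646\u0648-\u064A](?=\s|$)/g, "$1");
}
```

Note that this sketch leaves the surrounding whitespace in place, so a removed letter can leave a double space behind.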
replaceEnglishPunctuationWithArabic
Replaces English punctuation (question mark and semicolon) with their Arabic equivalents.
The input text to apply the rule to
string - The modified text with English punctuation replaced by Arabic punctuation.
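The replacement amounts to two character substitutions, sketched here under the hypothetical name `replacePunctuationSketch`. The sketch replaces every occurrence unconditionally; whether the library applies any context checks is not stated in the source.

```typescript
// Hypothetical sketch of the two substitutions named above.
function replacePunctuationSketch(text: string): string {
    return text
        .replace(/\?/g, "\u061F") // ? becomes the Arabic question mark ؟
        .replace(/;/g, "\u061B"); // ; becomes the Arabic semicolon ؛
}
```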
countWords
Counts words in text by splitting on whitespace. Works for both Arabic and English text.
The text to count words in
number - Number of words in the text.
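Whitespace splitting can be sketched as follows. The name `countWordsSketch` is hypothetical, and returning 0 for empty or whitespace-only input is an assumption, since the source does not describe that case.

```typescript
// Hypothetical sketch; the empty-input result is an assumption.
function countWordsSketch(text: string): number {
    const trimmed = text.trim();
    if (trimmed === "") {
        return 0;
    }
    // Runs of whitespace count as a single separator.
    return trimmed.split(/\s+/).length;
}
```

Because the split is script-agnostic, the same code handles Arabic and English input.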
LLM token estimation
LLMProvider
Supported LLM providers for token estimation. Each provider has different tokenization characteristics based on their BPE implementation.
estimateTokenCount
LLM-aware token estimation with provider-specific configurations. Uses a single-pass O(N) classifier for performance and correctness.
Algorithm features:
- Single-pass iteration over code points (avoiding memory spikes from match() arrays)
- Exclusive classification (preventing double-counting overlaps)
- Additive overhead application (preventing overhead bleeding into other scripts)
- Run-length encoding approximation for numerals (better BPE simulation)
Provider characteristics:
- OpenAI: ~4 chars/token English, ~1.3 chars/token Arabic (3x inflation)
- Gemini: 25% more efficient than OpenAI for Arabic (SentencePiece-based)
- Claude: ~3.5 chars/token English, less efficient for Arabic
- Grok: Similar to OpenAI (standard BPE)
The input text to estimate tokens for
The LLM provider to use for estimation
number - Estimated token count.
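The single-pass approach can be illustrated with the sketch below. It is not the library's implementation, and every name in it (`ProviderSketch`, `RATES_SKETCH`, `estimateTokensSketch`) is hypothetical. The OpenAI, Grok and Claude English rates come from the figures above; Gemini's Latin rate, Claude's Arabic rate, the three-digits-per-token rule and treating whitespace as free are assumptions made only to keep the example runnable.

```typescript
// Hypothetical provider identifiers; the library's LLMProvider values may differ.
type ProviderSketch = "openai" | "gemini" | "claude" | "grok";

interface RatesSketch {
    latinCharsPerToken: number;
    arabicCharsPerToken: number;
}

const RATES_SKETCH: Record<ProviderSketch, RatesSketch> = {
    openai: { latinCharsPerToken: 4, arabicCharsPerToken: 1.3 },
    // 25% more efficient than OpenAI for Arabic; the Latin rate is an assumption.
    gemini: { latinCharsPerToken: 4, arabicCharsPerToken: 1.3 * 1.25 },
    // The Arabic rate is an assumed placeholder for "less efficient".
    claude: { latinCharsPerToken: 3.5, arabicCharsPerToken: 1.2 },
    grok: { latinCharsPerToken: 4, arabicCharsPerToken: 1.3 },
};

// Assumption: a run of digits costs roughly one token per three digits.
const DIGITS_PER_TOKEN_SKETCH = 3;

function estimateTokensSketch(text: string, provider: ProviderSketch): number {
    const rates = RATES_SKETCH[provider];
    let arabic = 0;
    let other = 0;
    let digitTokens = 0;
    let digitRun = 0;
    // One pass; each UTF-16 code unit lands in exactly one class
    // (a simplification of code-point iteration).
    for (let i = 0; i < text.length; i++) {
        const code = text.charCodeAt(i);
        const isDigit =
            (code >= 0x30 && code <= 0x39) || (code >= 0x0660 && code <= 0x0669);
        if (isDigit) {
            digitRun++;
            continue;
        }
        if (digitRun > 0) {
            // Close the current digit run and charge it by length.
            digitTokens += Math.ceil(digitRun / DIGITS_PER_TOKEN_SKETCH);
            digitRun = 0;
        }
        if (code === 0x20 || (code >= 0x09 && code <= 0x0d)) continue; // whitespace: assumed free
        if (code >= 0x0600 && code <= 0x06ff) arabic++;
        else other++;
    }
    if (digitRun > 0) {
        digitTokens += Math.ceil(digitRun / DIGITS_PER_TOKEN_SKETCH);
    }
    // Per-script costs are added, so one script's rate never affects another's.
    return Math.ceil(
        arabic / rates.arabicCharsPerToken +
            other / rates.latinCharsPerToken +
            digitTokens
    );
}
```

Keeping the per-script counters separate and summing at the end is what makes the classification exclusive and the costs additive in this sketch.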