Bitaboom provides a comprehensive suite of utilities for processing Arabic text, from basic character conversion to advanced scoring and analysis.

Arabic numeral conversion

Convert Arabic-Indic numerals (٠-٩) to JavaScript numbers:
import { arabicNumeralToNumber } from 'bitaboom';

arabicNumeralToNumber("١٢٣"); // returns 123
arabicNumeralToNumber("٥٠"); // returns 50
arabicNumeralToNumber("abc١٢٣xyz"); // returns 123 (non-digits ignored)
arabicNumeralToNumber(""); // returns NaN
The function maps these digits:
  • ٠ → 0, ١ → 1, ٢ → 2, ٣ → 3, ٤ → 4
  • ٥ → 5, ٦ → 6, ٧ → 7, ٨ → 8, ٩ → 9
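The mapping above can be reproduced in a few lines. This is a minimal sketch, not Bitaboom's actual implementation: `sketchArabicNumeralToNumber` is a hypothetical name, and dropping every character outside U+0660 to U+0669 is an assumption inferred from the examples shown.

```typescript
// Hypothetical stand-in for arabicNumeralToNumber; not the library's source.
function sketchArabicNumeralToNumber(input: string): number {
  const western = input
    // Assumption: anything that is not an Arabic-Indic digit is dropped.
    .replace(/[^\u0660-\u0669]/g, '')
    // U+0660 is the digit zero, so the offset from it is the digit's value.
    .replace(/[\u0660-\u0669]/g, (d) => String(d.charCodeAt(0) - 0x0660));
  // With no digits left there is nothing to parse.
  return western.length > 0 ? parseInt(western, 10) : NaN;
}

console.log(sketchArabicNumeralToNumber('\u0661\u0662\u0663')); // 123
```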

Urdu symbol normalization

Convert Urdu variants to standard Arabic equivalents:
import { convertUrduSymbolsToArabic } from 'bitaboom';

convertUrduSymbolsToArabic('ھذا'); // returns 'هذا'
convertUrduSymbolsToArabic('ی'); // returns 'ي'
This handles:
  • ھ (Urdu heh) → ه (Arabic heh)
  • ی (Urdu yeh) → ي (Arabic yeh)
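A substitution table is enough to illustrate the two replacements listed above. The sketch below is an assumption about how such a conversion could be written; `sketchConvertUrduSymbols` and `URDU_TO_ARABIC` are hypothetical names, and the real function may cover more characters.

```typescript
// Hypothetical stand-in for convertUrduSymbolsToArabic; not the library's source.
const URDU_TO_ARABIC: Record<string, string> = {
  '\u06BE': '\u0647', // heh doachashmee (Urdu heh) to Arabic heh
  '\u06CC': '\u064A', // Farsi yeh (Urdu yeh) to Arabic yeh
};

function sketchConvertUrduSymbols(text: string): string {
  // Only the two code points in the table are touched; everything else passes through.
  return text.replace(/[\u06BE\u06CC]/g, (ch) => URDU_TO_ARABIC[ch] ?? ch);
}
```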

Arabic content scoring

Calculate the proportion of Arabic characters in text:
import { getArabicScore } from 'bitaboom';

getArabicScore('مرحبا'); // returns ~1.0 (all Arabic)
getArabicScore('Hello مرحبا'); // returns ~0.5 (mixed)
getArabicScore('Hello World'); // returns 0 (no Arabic)
getArabicScore(''); // returns 0
The score is calculated as Arabic characters divided by total non-whitespace, non-digit characters. Both Arabic and Western digits are excluded from the calculation.
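The formula described above can be sketched as follows. This is an illustration under stated assumptions, not Bitaboom's code: `sketchArabicScore` is a hypothetical name, "Arabic character" is taken to mean anything in the Unicode block U+0600 to U+06FF, and punctuation is counted in the denominator. The library's exact character classes may differ.

```typescript
// Hypothetical stand-in for getArabicScore; not the library's source.
function sketchArabicScore(text: string): number {
  // Drop whitespace, Western digits, and Arabic-Indic digits before counting.
  const counted = text.replace(/[\s0-9\u0660-\u0669]/g, '');
  if (counted.length === 0) return 0; // avoid dividing by zero on empty input

  // Assumption: the Arabic block U+0600..U+06FF defines "Arabic characters".
  const arabic = counted.match(/[\u0600-\u06FF]/g);
  return (arabic ? arabic.length : 0) / counted.length;
}

console.log(sketchArabicScore('Hello \u0645\u0631\u062D\u0628\u0627')); // 0.5
```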

Use cases

const text = getUserInput();
const score = getArabicScore(text);

if (score > 0.7) {
  processAsArabic(text);
} else if (score > 0.3) {
  processAsBilingual(text);
} else {
  processAsEnglish(text);
}

// Reject input that is not predominantly Arabic.
function validateArabicContent(text: string, minScore: number = 0.8) {
  const score = getArabicScore(text);
  if (score < minScore) {
    throw new Error(`Expected primarily Arabic content, got ${(score * 100).toFixed(1)}%`);
  }
  return true;
}

Word counting

Count words in both Arabic and English text:
import { countWords } from 'bitaboom';

countWords('بسم الله الرحمن الرحيم'); // returns 4
countWords('Hello world'); // returns 2
countWords(''); // returns 0
countWords('  multiple   spaces  '); // returns 2
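The results above are consistent with splitting on runs of whitespace. A minimal sketch under that assumption (`sketchCountWords` is a hypothetical name, and the library may treat punctuation or other separators differently):

```typescript
// Hypothetical stand-in for countWords; not the library's source.
function sketchCountWords(text: string): number {
  const trimmed = text.trim();
  // An empty or whitespace-only string has no words.
  return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
}

console.log(sketchCountWords('  multiple   spaces  ')); // 2
```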

Punctuation handling

Replace English punctuation with Arabic equivalents

import { replaceEnglishPunctuationWithArabic } from 'bitaboom';

replaceEnglishPunctuationWithArabic('كيف حالك?'); // returns 'كيف حالك؟'
replaceEnglishPunctuationWithArabic('نعم; لا'); // returns 'نعم؛ لا'
replaceEnglishPunctuationWithArabic('أولا, ثانيا'); // returns 'أولا، ثانيا'
Conversions:
  • ? → ؟ (Arabic question mark)
  • ; → ؛ (Arabic semicolon)
  • , → ، (Arabic comma)
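The three conversions can be expressed as a lookup table. The sketch below is hypothetical (`sketchReplacePunctuation` is not a Bitaboom export) and replaces every occurrence unconditionally; the real function may be more selective about context.

```typescript
// Hypothetical stand-in for replaceEnglishPunctuationWithArabic; not the library's source.
const ARABIC_PUNCTUATION: Record<string, string> = {
  '?': '\u061F', // Arabic question mark
  ';': '\u061B', // Arabic semicolon
  ',': '\u060C', // Arabic comma
};

function sketchReplacePunctuation(text: string): string {
  // Assumption: every ?, ; and , is replaced, regardless of the surrounding script.
  return text.replace(/[?;,]/g, (ch) => ARABIC_PUNCTUATION[ch] ?? ch);
}
```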

Find last punctuation

import { findLastPunctuation } from 'bitaboom';

const text = "Hello world! How are you?";
const lastPuncIndex = findLastPunctuation(text);
// Result: 24 (position of the last '?')

const noPuncText = "Hello world";
const notFound = findLastPunctuation(noPuncText);
// Result: -1 (no punctuation found)
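A backward scan reproduces both results above. This is a sketch only: `sketchFindLastPunctuation` is a hypothetical name, and the set of characters treated as punctuation is an assumption, since the source does not list them.

```typescript
// Assumption: these marks count as punctuation; the library's set may differ.
const PUNCTUATION_MARKS = '.,!?;:\u061F\u061B\u060C';

// Hypothetical stand-in for findLastPunctuation; not the library's source.
function sketchFindLastPunctuation(text: string): number {
  // Walk from the end so the first hit is the last punctuation mark.
  for (let i = text.length - 1; i >= 0; i--) {
    if (PUNCTUATION_MARKS.indexOf(text.charAt(i)) !== -1) return i;
  }
  return -1; // same "not found" convention as String.prototype.indexOf
}

console.log(sketchFindLastPunctuation('Hello world! How are you?')); // 24
```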

Text cleaning

Clean extreme Arabic underscores

Remove decorative tatweel (ـ) at line edges while preserving Hijri dates:
import { cleanExtremeArabicUnderscores } from 'bitaboom';

cleanExtremeArabicUnderscores('ـThis is a textـ'); // returns 'This is a text'
cleanExtremeArabicUnderscores('1424هـ'); // returns '1424هـ' (preserved)
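One way to get both behaviors is to trim tatweel at each line edge unless the trailing tatweel follows a digit and ه. The sketch below assumes exactly that rule; `sketchCleanEdgeTatweel` is a hypothetical name and the library's actual heuristics are not documented here.

```typescript
// Hypothetical stand-in for cleanExtremeArabicUnderscores; not the library's source.
function sketchCleanEdgeTatweel(text: string): string {
  return text
    .split('\n')
    .map((line) => {
      const withoutLeading = line.replace(/^\u0640+/, '');
      // Assumption: digits, an optional space, then heh + tatweel marks a Hijri year.
      if (/\d\s?\u0647\u0640+$/.test(withoutLeading)) return withoutLeading;
      return withoutLeading.replace(/\u0640+$/, '');
    })
    .join('\n');
}
```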

Fix trailing wow (و)

Correct spacing around the conjunction “و” in common phrases:
import { fixTrailingWow } from 'bitaboom';

fixTrailingWow('السلام عليكم و رحمة'); // returns 'السلام عليكم ورحمة'
fixTrailingWow('عليكم و رحمة'); // returns 'عليكم ورحمة'

Add spacing between Arabic text and numbers

import { addSpaceBetweenArabicTextAndNumbers } from 'bitaboom';

addSpaceBetweenArabicTextAndNumbers('الآية37'); // returns 'الآية 37'
addSpaceBetweenArabicTextAndNumbers('صفحة١٢٣'); // returns 'صفحة ١٢٣'
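Both examples involve an Arabic letter immediately followed by a digit. A minimal sketch of that one case (`sketchAddSpaceBeforeNumbers` is a hypothetical name; whether the library also handles a digit followed by a letter is not stated, so it is left out):

```typescript
// Hypothetical stand-in for addSpaceBetweenArabicTextAndNumbers; not the library's source.
function sketchAddSpaceBeforeNumbers(text: string): string {
  // Group 1: an Arabic letter (U+0621..U+064A).
  // Group 2: a Western or Arabic-Indic digit that directly follows it.
  return text.replace(/([\u0621-\u064A])([0-9\u0660-\u0669])/g, '$1 $2');
}
```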

Advanced cleaning

Remove non-index signatures

Remove single-digit numbers and stray dashes surrounded by Arabic text:
import { removeNonIndexSignatures } from 'bitaboom';

removeNonIndexSignatures('وهب 3 وقال'); // removes the '3'
removeNonIndexSignatures('لوحه 121 الجرح'); // preserves '121' (multi-digit)

Remove singular codes

Strip single Arabic letters or digits enclosed in square brackets or parentheses:
import { removeSingularCodes } from 'bitaboom';

removeSingularCodes('[س]'); // returns ''
removeSingularCodes('(س)'); // returns ''
removeSingularCodes('[سورة]'); // preserved (multiple chars)
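The three cases above can be matched with a pattern that allows exactly one character between the delimiters. This is a sketch under assumptions: `sketchRemoveSingularCodes` is a hypothetical name, and the accepted character set (Arabic letters plus Western and Arabic-Indic digits) is a guess at what "letters or digits" covers.

```typescript
// Hypothetical stand-in for removeSingularCodes; not the library's source.
function sketchRemoveSingularCodes(text: string): string {
  // Exactly one Arabic letter or digit inside [ ] or ( ); longer content does not match.
  const single = /\[[\u0621-\u064A0-9\u0660-\u0669]\]|\([\u0621-\u064A0-9\u0660-\u0669]\)/g;
  return text.replace(single, '');
}
```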

Remove solitary Arabic letters

Clean up isolated Arabic letters while preserving Hijri year markers:
import { removeSolitaryArabicLetters } from 'bitaboom';

removeSolitaryArabicLetters('ب ا الكلمات ت'); // returns 'ا الكلمات'
removeSolitaryArabicLetters('1424 ه'); // preserves 'ه' (Hijri marker)

Diacritic-insensitive matching

Build flexible regular expressions for Arabic text search:
import { makeDiacriticInsensitiveRegex } from 'bitaboom';

const rx = makeDiacriticInsensitiveRegex('أنا إلى الآفاق');
rx.test('انا الي الافاق'); // true - ignores hamza variants
rx.test('اَنا إلى الآفاق'); // true - ignores diacritics

Configuration options

const rx = makeDiacriticInsensitiveRegex('أنا');
// Enables all equivalences by default:
// - alif variants (ا/أ/إ/آ)
// - ta marbuta/ha (ة/ه)
// - alif maqsura/ya (ى/ي)
// - tatweel tolerance
// - diacritic ignoring
The function throws an error if the input exceeds 5000 characters to prevent excessive pattern sizes.
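To show how the equivalences listed above can turn into a pattern, here is a minimal sketch. It is not Bitaboom's implementation: `sketchDiacriticInsensitiveRegex`, `EQUIVALENTS`, and `FILLER` are hypothetical names, the diacritic range (U+064B to U+0652 plus U+0670) is an assumption, and only the three letter groups named above are covered.

```typescript
const DIACRITICS = '\u064B-\u0652\u0670';
// Optional run of diacritics and tatweel allowed after every character.
const FILLER = `[${DIACRITICS}\u0640]*`;

// Letters in the same string are treated as interchangeable.
const EQUIVALENTS: string[] = [
  '\u0627\u0623\u0625\u0622', // alif variants
  '\u0629\u0647',             // ta marbuta / ha
  '\u0649\u064A',             // alif maqsura / ya
];

// Hypothetical stand-in for makeDiacriticInsensitiveRegex; not the library's source.
function sketchDiacriticInsensitiveRegex(needle: string, flags = 'u'): RegExp {
  if (needle.length > 5000) throw new Error('Input exceeds 5000 characters');
  // Strip diacritics and tatweel from the needle; FILLER re-admits them in the haystack.
  const bare = needle.replace(new RegExp(`[${DIACRITICS}\u0640]`, 'g'), '');
  const pattern = Array.from(bare)
    .map((ch) => {
      const group = EQUIVALENTS.find((g) => g.includes(ch));
      // Either a character class for the group or the escaped literal character.
      const core = group ? `[${group}]` : ch.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
      return core + FILLER;
    })
    .join('');
  return new RegExp(pattern, flags);
}
```

Each character becomes either itself or its equivalence class, followed by an optional run of diacritics, so pattern size grows linearly with the input. That growth is one plausible reason for a length cap like the one described above.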

Real-world patterns

Normalize OCR output

import {
  convertUrduSymbolsToArabic,
  replaceEnglishPunctuationWithArabic,
  addSpaceBetweenArabicTextAndNumbers,
  cleanExtremeArabicUnderscores
} from 'bitaboom';

function normalizeOCRText(text: string): string {
  let result = text;
  result = convertUrduSymbolsToArabic(result);
  result = replaceEnglishPunctuationWithArabic(result);
  result = addSpaceBetweenArabicTextAndNumbers(result);
  result = cleanExtremeArabicUnderscores(result);
  return result;
}

Validate bilingual content

import { getArabicScore, countWords } from 'bitaboom';

function validateBilingualBook(chapters: string[]) {
  return chapters.map((chapter, index) => {
    const score = getArabicScore(chapter);
    const words = countWords(chapter);
    
    return {
      chapter: index + 1,
      words,
      arabicScore: (score * 100).toFixed(1) + '%',
      language: score > 0.7 ? 'Arabic' : score > 0.3 ? 'Mixed' : 'English'
    };
  });
}

Search with tolerance

import { makeDiacriticInsensitiveRegex } from 'bitaboom';

function searchArabicText(corpus: string[], query: string) {
  const rx = makeDiacriticInsensitiveRegex(query, { flags: 'gu' });
  
  return corpus
    .map((text, index) => {
      const matches = text.match(rx);
      return matches ? { index, count: matches.length, text } : null;
    })
    .filter(Boolean);
}

Best practices

1. Normalize before processing

Always convert Urdu symbols and replace English punctuation before further processing:
let text = convertUrduSymbolsToArabic(rawText);
text = replaceEnglishPunctuationWithArabic(text);

2. Use scoring for routing

Let getArabicScore determine the appropriate processing pipeline:
const score = getArabicScore(text);
if (score > 0.7) {
  // Apply Arabic-specific processing
}

3. Preserve semantic markers

Be careful with cleaning functions that might remove meaningful content:
// cleanExtremeArabicUnderscores preserves Hijri dates
// removeSolitaryArabicLetters preserves 'ه' in dates

4. Combine with preformatting

For comprehensive text normalization, use the preformatting pipeline (see preformatting guide).

All Arabic text processing functions handle empty strings gracefully. Most return a neutral default (0 for getArabicScore and countWords, an empty string for text functions); as shown earlier, arabicNumeralToNumber('') returns NaN.
