Get started with Bitaboom in minutes, with practical examples for Arabic and bilingual text processing.
Bitaboom is a TypeScript-first string utility toolkit focused on Arabic and bilingual (Arabic ↔ English) publishing workflows. This guide will walk you through the most common use cases with real code examples.
Build regular expressions that ignore Arabic diacritics, tatweel, and whitespace variants for flexible text search:
```typescript
import { makeDiacriticInsensitiveRegex } from 'bitaboom';

// Create a regex that matches regardless of diacritics
const pattern = makeDiacriticInsensitiveRegex('أنا إلى الآفاق');

// Matches text without diacritics
pattern.test('انا الي الافاق'); // true

// Also handles tatweel (Arabic text elongation)
pattern.test('أنــــا إلــــى الآفــــاق'); // true
```
This is perfect for search functionality in Arabic text applications where users might not type diacritics consistently.
The preformatter consolidates 30+ formatting operations into a single optimized pass. It handles spacing, punctuation, brackets, ellipses, references, and more.
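To give a feel for the single-pass approach, here is a minimal sketch of how a rule-table preformatter can work. The rules below are illustrative stand-ins, not Bitaboom's actual rule set, and `preformat` is a hypothetical helper, not the library API:

```typescript
// Illustrative sketch: each rule is a [pattern, replacement] pair,
// applied in order over the input in one pass through the table.
// These four rules only hint at the kinds of fixes a preformatter makes.
const rules: Array<[RegExp, string]> = [
  [/[ \t]+/g, ' '],            // collapse runs of spaces and tabs
  [/ +([،؛؟!.,])/g, '$1'],     // remove space before punctuation
  [/([،؛])(?=\S)/g, '$1 '],    // ensure a space after Arabic comma/semicolon
  [/\.{3,}/g, '…'],            // normalize runs of dots to an ellipsis
];

function preformat(text: string): string {
  return rules.reduce((t, [pattern, repl]) => t.replace(pattern, repl), text);
}
```

Consolidating the rules into one table keeps the operations ordered and lets the whole cleanup run as a single pipeline instead of thirty ad-hoc string passes.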
Accurately estimate token counts for Arabic text across different LLM providers. Arabic text uses roughly 3× as many tokens as equivalent English text because BPE tokenizers are trained predominantly on English:
```typescript
import { estimateTokenCount } from 'bitaboom';

const text = 'بسم الله الرحمن الرحيم';
const tokens = estimateTokenCount(text);
console.log(`Estimated tokens: ${tokens}`);
```
Provider Efficiency Rankings:
Gemini: Most efficient (~25% fewer tokens than OpenAI)
OpenAI/Grok: Standard BPE baseline
Claude: Least efficient for Arabic (uses ~20% more tokens)
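The rankings above can be turned into a back-of-the-envelope estimator. This is a rough heuristic built only from the ratios stated here (not Bitaboom's estimator): English BPE averages about 4 characters per token, so Arabic at ~3× density comes out near 0.75 tokens per character, scaled by a per-provider multiplier:

```typescript
// Rough heuristic only — not Bitaboom's actual estimation logic.
type Provider = 'openai' | 'grok' | 'gemini' | 'claude';

const PROVIDER_MULTIPLIER: Record<Provider, number> = {
  openai: 1.0,  // baseline BPE
  grok: 1.0,    // same baseline as OpenAI
  gemini: 0.75, // ~25% fewer tokens than the baseline
  claude: 1.2,  // ~20% more tokens than the baseline
};

function roughArabicTokenEstimate(text: string, provider: Provider = 'openai'): number {
  // ~4 chars/token for English BPE → ~0.75 tokens/char for Arabic at 3x
  const TOKENS_PER_CHAR = 0.75;
  const base = text.length * TOKENS_PER_CHAR;
  return Math.ceil(base * PROVIDER_MULTIPLIER[provider]);
}
```

A character-count heuristic like this is only good for ballpark budgeting; for billing-accurate numbers you still need the provider's own tokenizer.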
Estimate tokens before sending to APIs to optimize costs and stay within limits:
```typescript
import { preformatArabicText, estimateTokenCount, LLMProvider } from 'bitaboom';

// Clean the text first
const rawArabic = 'نص عربي طويل جدا ...';
const cleaned = preformatArabicText(rawArabic);

// Estimate tokens for your provider
const tokens = estimateTokenCount(cleaned, LLMProvider.OpenAI);

// Check against model limits
const MAX_TOKENS = 4096;
if (tokens > MAX_TOKENS) {
  console.warn(`Text exceeds limit: ${tokens}/${MAX_TOKENS} tokens`);
  // Implement chunking strategy
}
```
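One possible chunking strategy, sketched here as a generic helper (not part of Bitaboom's API): split on sentence-ending punctuation, including the Arabic question mark, then pack sentences greedily until the next one would exceed the budget. The `estimate` parameter stands in for whatever token estimator you use, such as `estimateTokenCount`:

```typescript
// Hypothetical sketch: greedy sentence-packing under a token budget.
function chunkByTokens(
  text: string,
  maxTokens: number,
  estimate: (s: string) => number,
): string[] {
  // Split after ., !, ?, or ؟ followed by whitespace
  const sentences = text.split(/(?<=[.!?؟])\s+/);
  const chunks: string[] = [];
  let current = '';
  for (const sentence of sentences) {
    const candidate = current ? `${current} ${sentence}` : sentence;
    if (current && estimate(candidate) > maxTokens) {
      chunks.push(current);  // budget exceeded: seal the chunk
      current = sentence;    // start a new one with this sentence
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Splitting at sentence boundaries keeps each chunk coherent for the model, at the cost of slightly uneven chunk sizes; a single sentence longer than the budget would still need a finer split.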