Get started with Bitaboom in minutes, with practical examples for Arabic and bilingual text processing.
Bitaboom is a TypeScript-first string utility toolkit focused on Arabic and bilingual (Arabic ↔ English) publishing workflows. This guide will walk you through the most common use cases with real code examples.
Build regular expressions that ignore Arabic diacritics, tatweel, and whitespace variants for flexible text search:
```typescript
import { makeDiacriticInsensitiveRegex } from 'bitaboom';

// Create a regex that matches regardless of diacritics
const pattern = makeDiacriticInsensitiveRegex('أنا إلى الآفاق');

// Matches text without diacritics
pattern.test('انا الي الافاق'); // true

// Also handles tatweel (Arabic text elongation)
pattern.test('أنــــا إلــــى الآفــــاق'); // true
```
This is perfect for search functionality in Arabic text applications where users might not type diacritics consistently.
The preformatter consolidates 30+ formatting operations into a single optimized pass. It handles spacing, punctuation, brackets, ellipses, references, and more.
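To give a feel for the single-pass approach, here is a minimal sketch of how a rule-table preformatter can work. The rules below are illustrative stand-ins, not Bitaboom's actual rule set, and `preformat` is a hypothetical helper, not the library API:

```typescript
// Illustrative sketch: each rule is a [pattern, replacement] pair,
// applied in order over the input in one pass through the table.
// These four rules only hint at the kinds of fixes a preformatter makes.
const rules: Array<[RegExp, string]> = [
  [/[ \t]+/g, ' '],            // collapse runs of spaces and tabs
  [/ +([،؛؟!.,])/g, '$1'],     // remove space before punctuation
  [/([،؛])(?=\S)/g, '$1 '],    // ensure a space after Arabic comma/semicolon
  [/\.{3,}/g, '…'],            // normalize runs of dots to an ellipsis
];

function preformat(text: string): string {
  return rules.reduce((t, [pattern, repl]) => t.replace(pattern, repl), text);
}
```

Consolidating the rules into one table keeps the operations ordered and lets the whole cleanup run as a single pipeline instead of thirty ad-hoc string passes.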
Accurately estimate token counts for Arabic text across different LLM providers. Arabic text uses roughly 3× as many tokens as equivalent English text because BPE tokenizers are trained predominantly on English:
```typescript
import { estimateTokenCount } from 'bitaboom';

const text = 'بسم الله الرحمن الرحيم';
const tokens = estimateTokenCount(text);
console.log(`Estimated tokens: ${tokens}`);
```
Provider Efficiency Rankings:
Gemini: Most efficient (~25% fewer tokens than OpenAI)
OpenAI/Grok: Standard BPE baseline
Claude: Least efficient for Arabic (uses ~20% more tokens)
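The rankings above can be turned into a back-of-the-envelope estimator. This is a rough heuristic built only from the ratios stated here (not Bitaboom's estimator): English BPE averages about 4 characters per token, so Arabic at ~3× density comes out near 0.75 tokens per character, scaled by a per-provider multiplier:

```typescript
// Rough heuristic only — not Bitaboom's actual estimation logic.
type Provider = 'openai' | 'grok' | 'gemini' | 'claude';

const PROVIDER_MULTIPLIER: Record<Provider, number> = {
  openai: 1.0,  // baseline BPE
  grok: 1.0,    // same baseline as OpenAI
  gemini: 0.75, // ~25% fewer tokens than the baseline
  claude: 1.2,  // ~20% more tokens than the baseline
};

function roughArabicTokenEstimate(text: string, provider: Provider = 'openai'): number {
  // ~4 chars/token for English BPE → ~0.75 tokens/char for Arabic at 3x
  const TOKENS_PER_CHAR = 0.75;
  const base = text.length * TOKENS_PER_CHAR;
  return Math.ceil(base * PROVIDER_MULTIPLIER[provider]);
}
```

A character-count heuristic like this is only good for ballpark budgeting; for billing-accurate numbers you still need the provider's own tokenizer.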
Estimate tokens before sending to APIs to optimize costs and stay within limits:
```typescript
import { preformatArabicText, estimateTokenCount, LLMProvider } from 'bitaboom';

// Clean the text first
const rawArabic = 'نص عربي طويل جدا ...';
const cleaned = preformatArabicText(rawArabic);

// Estimate tokens for your provider
const tokens = estimateTokenCount(cleaned, LLMProvider.OpenAI);

// Check against model limits
const MAX_TOKENS = 4096;
if (tokens > MAX_TOKENS) {
  console.warn(`Text exceeds limit: ${tokens}/${MAX_TOKENS} tokens`);
  // Implement chunking strategy
}
```
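One possible chunking strategy, sketched here as a generic helper (not part of Bitaboom's API): split on sentence-ending punctuation, including the Arabic question mark, then pack sentences greedily until the next one would exceed the budget. The `estimate` parameter stands in for whatever token estimator you use, such as `estimateTokenCount`:

```typescript
// Hypothetical sketch: greedy sentence-packing under a token budget.
function chunkByTokens(
  text: string,
  maxTokens: number,
  estimate: (s: string) => number,
): string[] {
  // Split after ., !, ?, or ؟ followed by whitespace
  const sentences = text.split(/(?<=[.!?؟])\s+/);
  const chunks: string[] = [];
  let current = '';
  for (const sentence of sentences) {
    const candidate = current ? `${current} ${sentence}` : sentence;
    if (current && estimate(candidate) > maxTokens) {
      chunks.push(current);  // budget exceeded: seal the chunk
      current = sentence;    // start a new one with this sentence
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Splitting at sentence boundaries keeps each chunk coherent for the model, at the cost of slightly uneven chunk sizes; a single sentence longer than the budget would still need a finer split.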