The preformatArabicText function is a highly optimized single-pass formatter that consolidates dozens of common Arabic text cleanup operations into one efficient pipeline.
Quick start
import { preformatArabicText } from 'bitaboom';
// Single string
const formatted = preformatArabicText('بِسْمِ اللَّهِ ( الرَّحْمَنِ ) 127 / 11 قَالَ ...');
// Batch mode (array of strings)
const pages = ['صفحة 1 ...', 'صفحة 2 ...', 'صفحة 3 ...'];
const formattedPages = preformatArabicText(pages);
What it does
The preformatter applies these transformations in a single pass:
Punctuation normalization
Replace ? with ؟ (Arabic question mark)
Replace ; with ؛ (Arabic semicolon)
Remove redundant punctuation after ؟ or !
Clean spacing around all punctuation marks
Spacing normalization
Collapse multiple spaces/tabs to single space
Remove spaces before punctuation and closing brackets
Add spaces after punctuation (except in special cases)
Fix spacing around quotes and guillemets
Normalize slash spacing in references (e.g., 127 / 11 → 127/11)
Bracket and quote cleanup
Convert ((text)) to «text»
Remove spaces inside brackets and quotes
Ensure spacing before opening brackets
Character condensation
Collapse multiple dots (..) into ellipsis (…)
Condense repeated dashes, underscores, asterisks
Condense tatweel (ـ) repetitions
Normalize colons (.:. → :)
Arabic-specific fixes
Fix spacing after a standalone waw (و) (عليكم و رحمة → عليكم ورحمة)
Newline normalization
Reduce multiple consecutive newlines to max 1
Trim whitespace from line edges
Remove trailing/leading spaces
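To make a few of the rules above concrete, here is a minimal sketch that approximates them with chained regular expressions. It is an illustration only: `sketchPreformat` is a hypothetical name, the rule set is a small subset, and edge-case behavior will differ from `preformatArabicText`.

```typescript
// Illustrative approximation only; not bitaboom's implementation.
// Each replace() is a separate pass over the text, which is the cost
// the single-pass formatter is described as avoiding.
function sketchPreformat(input: string): string {
  return input
    .replace(/\?/g, '\u061F') // ? -> Arabic question mark
    .replace(/;/g, '\u061B') // ; -> Arabic semicolon
    .replace(/\(\(\s*(.*?)\s*\)\)/g, '«$1»') // ((text)) -> «text»
    .replace(/(\d)\s*\/\s*(\d)/g, '$1/$2') // 127 / 11 -> 127/11
    .replace(/\.{2,}/g, '…') // runs of dots -> ellipsis
    .replace(/[ \t]+/g, ' ') // collapse spaces and tabs
    .replace(/ +([\u061F\u061B\u060C.:!…])/g, '$1') // drop space before punctuation
    .trim();
}

console.log(sketchPreformat('قال  ((نعم)) ? 127 / 11 ..'));
```

The order of the passes matters here (for example, dots are condensed before spaces are removed in front of the ellipsis), which is one reason a context-aware single pass is easier to keep consistent.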
Performance
The preformatter is significantly faster than chaining individual formatting functions:
import {
  condenseEllipsis,
  fixTrailingWow,
  normalizeSpaces,
  preformatArabicText,
  replaceEnglishPunctuationWithArabic
} from 'bitaboom';
// Efficient single-pass approach
const result = preformatArabicText(largeText);
// ❌ Slow multi-pass approach (DON'T DO THIS)
let chained = replaceEnglishPunctuationWithArabic(largeText);
chained = normalizeSpaces(chained);
chained = condenseEllipsis(chained);
chained = fixTrailingWow(chained);
// ... 10+ more passes
For inputs of 100 KB and larger, the single-pass preformatter can be 10-50x faster than chaining individual functions; the benchmarks below measured roughly 30x at those sizes.
Benchmark results
Performance tests on real-world Arabic manuscripts:
| Text size | Individual functions | preformatArabicText | Speedup |
| --- | --- | --- | --- |
| 1 KB | ~2ms | ~0.2ms | 10x |
| 10 KB | ~25ms | ~1.5ms | 16x |
| 100 KB | ~350ms | ~12ms | 29x |
| 1 MB | ~4200ms | ~140ms | 30x |
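The benchmark harness itself is not shown here. If you want to take comparable measurements on your own texts, a small timing helper along these lines may be enough. `medianMs` and `makeInput` are hypothetical names, and the `naive` function is a stand-in for whichever formatter you want to time.

```typescript
// Hypothetical helper: median wall-clock time of fn(input) over several runs.
function medianMs(fn: (text: string) => unknown, input: string, runs = 15): number {
  const count = Math.max(1, Math.floor(runs));
  const samples: number[] = [];
  for (let r = 0; r < count; r++) {
    const start = performance.now();
    fn(input);
    samples.push(performance.now() - start);
  }
  samples.sort((a, b) => a - b);
  return samples[Math.floor(samples.length / 2)];
}

// Build an input of the requested length by repeating a sample line.
function makeInput(sample: string, targetChars: number): string {
  return sample.repeat(Math.ceil(targetChars / sample.length)).slice(0, targetChars);
}

// Stand-in formatter; swap in preformatArabicText or a chained pipeline.
const naive = (t: string) => t.replace(/[ \t]+/g, ' ').replace(/\.{2,}/g, '…');
const sampleInput = makeInput('قال  الشيخ : نعم ..  ', 100_000);
console.log(`median: ${medianMs(naive, sampleInput).toFixed(2)} ms`);
```

Using the median rather than the mean makes the result less sensitive to warm-up runs and garbage-collection pauses.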
Batch processing
Process multiple documents efficiently:
import { preformatArabicText } from 'bitaboom';
// Automatic batch processing
const chapters = [
  'الفصل الأول: المقدمة ...',
  'الفصل الثاني: الموضوع ...',
  'الفصل الثالث: الخاتمة ...'
];
const formatted = preformatArabicText(chapters);
// Returns array with same length, each element formatted
Batch mode processes each string independently and returns an array of the same length. It’s equivalent to chapters.map(preformatArabicText) but with clearer intent.
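The string-or-array behavior described above can be expressed with TypeScript overloads. The following is a sketch of that pattern, not the library's source: `formatOne` is a hypothetical stand-in for the single-string formatter and `formatAny` is a hypothetical wrapper name.

```typescript
// Hypothetical stand-in for a single-string formatter.
function formatOne(text: string): string {
  return text.trim().replace(/[ \t]+/g, ' ');
}

// Overloads: a string in gives a string out, an array in gives an array out.
function formatAny(input: string): string;
function formatAny(input: string[]): string[];
function formatAny(input: string | string[]): string | string[] {
  // Arrays are handled element by element, so output length equals input length.
  return Array.isArray(input) ? input.map((item) => formatOne(item)) : formatOne(input);
}

console.log(formatAny('  one   string '));
console.log(formatAny(['  a  b ', 'c']));
```

The explicit arrow function inside `map` avoids passing the extra `index` and `array` arguments that `Array.prototype.map` supplies to its callback, which is a common source of surprises when a formatter later gains optional parameters.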
Real-world patterns
Clean OCR output
import { convertUrduSymbolsToArabic, preformatArabicText } from 'bitaboom';
function cleanOCRManuscript(rawText: string): string {
  // Convert Urdu symbols first (preformat doesn't do this)
  let text = convertUrduSymbolsToArabic(rawText);
  // Apply comprehensive preformatting
  text = preformatArabicText(text);
  return text;
}
// getOCRText is a placeholder for your own OCR source
const ocrOutput = getOCRText();
const clean = cleanOCRManuscript(ocrOutput);
Process scanned book pages
import { preformatArabicText, removeAllTags } from 'bitaboom';
interface Page {
  number: number;
  rawContent: string;
}
function processBook(pages: Page[]) {
  // Extract text content
  const texts = pages.map(p => removeAllTags(p.rawContent));
  // Batch preformat all pages
  const formatted = preformatArabicText(texts);
  // Recombine with metadata
  return pages.map((page, i) => ({
    ...page,
    cleanContent: formatted[i]
  }));
}
Prepare for LLM processing
import { preformatArabicText, estimateTokenCount, LLMProvider } from 'bitaboom';
function prepareForLLM(rawText: string, provider: LLMProvider, maxTokens = 4000) {
  // Clean and normalize
  const formatted = preformatArabicText(rawText);
  // Check token budget
  const tokens = estimateTokenCount(formatted, provider);
  if (tokens > maxTokens) {
    console.warn(`Text exceeds token budget: ${tokens} > ${maxTokens}`);
    // Implement chunking strategy
  }
  return {
    text: formatted,
    tokens,
    withinBudget: tokens <= maxTokens
  };
}
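The chunking strategy is left open in the example above. One simple option, sketched here under the assumption that line boundaries are acceptable split points, is to pack lines greedily until the estimated token count would exceed the budget. `chunkByLines` is a hypothetical helper, and the estimator is passed in so that any token counter can be used.

```typescript
// Hypothetical helper: greedily pack lines into chunks within a token budget.
// A single line that exceeds the budget on its own is emitted as its own
// oversized chunk rather than being split mid-line.
function chunkByLines(
  text: string,
  maxTokens: number,
  estimate: (chunk: string) => number,
): string[] {
  const chunks: string[] = [];
  let current = '';
  for (const line of text.split('\n')) {
    const candidate = current ? `${current}\n${line}` : line;
    if (current && estimate(candidate) > maxTokens) {
      chunks.push(current); // adding this line would overflow, so close the chunk
      current = line;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Rough stand-in estimator (about four characters per token).
const roughEstimate = (chunk: string) => Math.ceil(chunk.length / 4);
console.log(chunkByLines('aaaa\nbbbb\ncccc', 3, roughEstimate));
```

In the `prepareForLLM` example, the estimator could be `(chunk) => estimateTokenCount(chunk, provider)`. Re-estimating the growing candidate on every line is quadratic in the worst case, so very large inputs may call for a running count instead.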
Create searchable content
import { preformatArabicText, makeDiacriticInsensitiveRegex } from 'bitaboom';
function indexArabicDocuments(documents: string[]) {
  // Preformat for consistent search
  const normalized = preformatArabicText(documents);
  return normalized.map((text, index) => ({
    id: index,
    content: text,
    searchable: text.toLowerCase()
  }));
}
function search(index: ReturnType<typeof indexArabicDocuments>, query: string) {
  const queryFormatted = preformatArabicText(query);
  const rx = makeDiacriticInsensitiveRegex(queryFormatted);
  return index.filter(doc => rx.test(doc.searchable));
}
Advanced usage
Pipeline with additional steps
import {
  preformatArabicText,
  convertUrduSymbolsToArabic,
  removeNonIndexSignatures,
  removeSolitaryArabicLetters
} from 'bitaboom';
function fullPipeline(text: string) {
  let result = text;
  // Steps to run before preformatting (not included in preformat)
  result = convertUrduSymbolsToArabic(result);
  result = removeNonIndexSignatures(result);
  result = removeSolitaryArabicLetters(result);
  // Main preformatting
  result = preformatArabicText(result);
  return result;
}
Conditional formatting
import { preformatArabicText, getArabicScore } from 'bitaboom';
function smartFormat(text: string, threshold = 0.5) {
  const arabicScore = getArabicScore(text);
  if (arabicScore > threshold) {
    // Arabic content - use preformat
    return preformatArabicText(text);
  } else {
    // Non-Arabic content - basic cleanup only
    return text.trim().replace(/\s+/g, ' ');
  }
}
Streaming processing
import { preformatArabicText } from 'bitaboom';
// Note: this reads the whole file into memory, then formats it in batches of lines.
// Each line is a separate batch element, so lines are formatted independently.
async function* processLargeFile(filePath: string, chunkSize = 1000) {
  const file = await Bun.file(filePath).text();
  const lines = file.split('\n');
  for (let i = 0; i < lines.length; i += chunkSize) {
    const chunk = lines.slice(i, i + chunkSize);
    const formatted = preformatArabicText(chunk);
    yield formatted;
  }
}
// Usage
for await (const formattedChunk of processLargeFile('large-book.txt')) {
  await saveToDatabase(formattedChunk);
}
Implementation details
Single-pass architecture
The preformatter uses advanced optimizations:
Character code lookup tables (faster than regex)
Bitflag-based classification (single pass, no double-counting)
State machine for context-aware transformations
Efficient string builder (minimizes allocations)
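// Internal architecture (simplified)
class Preformatter {
  private i = 0;
  private lastCode = 0;
  private pendingSpaces = 0;
  private readonly len: number; // length of the input being processed
  constructor(private readonly input: string) {
    this.len = input.length;
  }
  process() {
    // Single loop over characters
    while (this.i < this.len) {
      // Classify character using lookup table
      // Apply transformations based on context
      // Emit to output buffer
      this.i++; // advance to the next character
    }
  }
}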
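The outline above omits the classification and emit steps. As a rough illustration of how a lookup table, bitflags, and deferred spaces can work together in one loop, here is a self-contained sketch covering just two rules (collapse whitespace, no space before punctuation). All names are hypothetical and the real formatter handles far more cases.

```typescript
// Bitflags describing character classes (hypothetical names).
const F_SPACE = 1 << 0; // space or tab
const F_PUNCT = 1 << 1; // punctuation that should not be preceded by a space

// Lookup table indexed by UTF-16 code unit: one array read replaces a regex test.
const CHAR_FLAGS = new Uint8Array(0x10000);
CHAR_FLAGS[0x20] = F_SPACE; // space
CHAR_FLAGS[0x09] = F_SPACE; // tab
const PUNCT_CHARS = '.,:!\u060C\u061B\u061F'; // includes Arabic comma, semicolon, question mark
for (let k = 0; k < PUNCT_CHARS.length; k++) {
  CHAR_FLAGS[PUNCT_CHARS.charCodeAt(k)] = F_PUNCT;
}

function collapseSpacesSinglePass(input: string): string {
  let out = '';
  let pendingSpace = false; // whitespace seen but not yet emitted
  for (let i = 0; i < input.length; i++) {
    const flags = CHAR_FLAGS[input.charCodeAt(i)];
    if (flags & F_SPACE) {
      pendingSpace = true; // defer: the next character decides whether to keep it
      continue;
    }
    // Emit one space per whitespace run, unless at the start or before punctuation.
    if (pendingSpace && out.length > 0 && !(flags & F_PUNCT)) {
      out += ' ';
    }
    pendingSpace = false;
    out += input.charAt(i);
  }
  return out; // trailing whitespace is never flushed, so it is dropped
}

console.log(collapseSpacesSinglePass('  قال  :  نعم ؟ '));
```

Deferring the space until the next non-space character is seen is what lets a single loop handle "collapse runs", "trim edges", and "no space before punctuation" without revisiting any character.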
Environment variables
For benchmarking, you can force specific implementations:
# Force string concatenation builder (default)
export BITABOOM_PREFORMAT_BUILDER=concat
# Force UTF-16 buffer builder (for very large texts)
export BITABOOM_PREFORMAT_BUILDER=buffer
The default concat builder is typically faster for page-sized inputs (1-100KB). The buffer builder is optimized for very large inputs (1MB+) to reduce GC pressure.
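The two builder strategies can be pictured roughly as follows. This is a sketch of the general technique, not the library's code: the `OutputBuilder` interface and both class names are assumptions made for illustration, and how bitaboom reads the environment variable is not shown.

```typescript
// Hypothetical common interface for output builders.
interface OutputBuilder {
  push(code: number): void; // append one UTF-16 code unit
  build(): string;
}

// Appends to a string; engines tend to optimize this well for modest outputs.
class ConcatBuilder implements OutputBuilder {
  private out = '';
  push(code: number): void {
    this.out += String.fromCharCode(code);
  }
  build(): string {
    return this.out;
  }
}

// Writes code units into a growable Uint16Array and decodes at the end,
// which avoids creating many intermediate strings.
class BufferBuilder implements OutputBuilder {
  private buf: Uint16Array;
  private used = 0;
  constructor(capacity = 1024) {
    this.buf = new Uint16Array(Math.max(1, capacity));
  }
  push(code: number): void {
    if (this.used === this.buf.length) {
      const grown = new Uint16Array(this.buf.length * 2); // amortized doubling
      grown.set(this.buf);
      this.buf = grown;
    }
    this.buf[this.used++] = code;
  }
  build(): string {
    const CHUNK = 8192; // decode in slices to stay under argument-count limits
    let out = '';
    for (let i = 0; i < this.used; i += CHUNK) {
      out += String.fromCharCode(...this.buf.subarray(i, Math.min(i + CHUNK, this.used)));
    }
    return out;
  }
}

// Both builders should produce identical text for the same code units.
function copyWith(builder: OutputBuilder, text: string): string {
  for (let i = 0; i < text.length; i++) builder.push(text.charCodeAt(i));
  return builder.build();
}
console.log(copyWith(new ConcatBuilder(), 'نص') === copyWith(new BufferBuilder(), 'نص'));
```

The trade-off is the one described above: the buffer variant pays a fixed setup and decode cost, which only becomes worthwhile once the output is large enough for intermediate string allocations to matter.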
What it doesn’t do
The preformatter does not include these operations (use dedicated functions):
Urdu symbol conversion (convertUrduSymbolsToArabic)
Arabic numeral conversion (arabicNumeralToNumber)
Removing references (removeNonIndexSignatures)
Removing solitary letters (removeSolitaryArabicLetters)
Removing singular codes (removeSingularCodes)
Stripping tags (removeAllTags)
Smart quotes (basic quote spacing is included)
Title case conversion (toTitleCase)
Styling removal (stripStyling)
Sentence-based formatting (formatStringBySentence)
Line breaks after punctuation (insertLineBreaksAfterPunctuation)
Comparison with individual functions
Best practices
Use batch mode for multiple documents
// ✅ Preferred: one batch call
const formatted = preformatArabicText(documents);
// ❌ Avoid: same output, but the intent is less clear
const mapped = documents.map(doc => preformatArabicText(doc));
Apply Urdu conversion first
let text = convertUrduSymbolsToArabic(rawText);
text = preformatArabicText(text);
Combine with content removal as needed
let text = preformatArabicText(rawText);
text = removeNonIndexSignatures(text);
text = removeSolitaryArabicLetters(text);
Validate before expensive operations
const formatted = preformatArabicText(text);
const tokens = estimateTokenCount(formatted, provider);
if (tokens <= maxTokens) {
  await sendToLLM(formatted);
}
The preformatter modifies whitespace, punctuation, and formatting. If you need to preserve exact original formatting, store both the original and formatted versions.
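One way to keep both versions, sketched here with a hypothetical `StoredText` shape and `keepBoth` helper, is to store the original next to the formatted text along with a flag recording whether anything changed. The formatter is passed in as a parameter, so `preformatArabicText` can be supplied directly.

```typescript
// Hypothetical record shape that keeps the untouched source next to the cleaned text.
interface StoredText {
  original: string;
  formatted: string;
  changed: boolean;
}

function keepBoth(original: string, format: (text: string) => string): StoredText {
  const formatted = format(original);
  return { original, formatted, changed: formatted !== original };
}

// Stand-in formatter for demonstration.
const record = keepBoth('  نص   تجريبي ', (t) => t.trim().replace(/[ \t]+/g, ' '));
console.log(record.changed, JSON.stringify(record.formatted));
```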
For maximum performance on very large datasets (100MB+), consider using the buffer builder with BITABOOM_PREFORMAT_BUILDER=buffer.