The preformatArabicText function is a highly optimized single-pass formatter that consolidates dozens of common Arabic text cleanup operations into one efficient pipeline.

Quick start

import { preformatArabicText } from 'bitaboom';

// Single string
const formatted = preformatArabicText('بِسْمِ  اللَّهِ ( الرَّحْمَنِ ) 127 / 11 قَالَ ...');

// Batch mode (array of strings)
const pages = ['صفحة 1 ...', 'صفحة 2 ...', 'صفحة 3 ...'];
const formattedPages = preformatArabicText(pages);

What it does

The preformatter applies these transformations in a single pass:
1. Punctuation normalization

  • Replace ? with ؟ (Arabic question mark)
  • Replace ; with ؛ (Arabic semicolon)
  • Remove redundant punctuation after ؟ or !
  • Clean spacing around all punctuation marks
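As a rough illustration of the punctuation rules above, here is a minimal sketch built from ordinary regex passes. It is not bitaboom's single-pass implementation; the function name normalizePunctuationSketch and the exact set of marks treated as "redundant" are assumptions made for this example.

```typescript
// Illustrative only: sequential regex passes, not bitaboom's single-pass code.
// Which marks count as "redundant" after ؟ or ! is an assumption of this sketch.
function normalizePunctuationSketch(input: string): string {
  return input
    .replace(/\?/g, '\u061F') // ? -> ؟ (Arabic question mark)
    .replace(/;/g, '\u061B') // ; -> ؛ (Arabic semicolon)
    .replace(/([\u061F!])[.,\u060C\u061B]+/g, '$1'); // drop . , ، ؛ right after ؟ or !
}
```

Calling normalizePunctuationSketch('كيف حالك?.') would swap the question mark and drop the dot that follows it.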
2. Spacing normalization

  • Collapse multiple spaces/tabs to single space
  • Remove spaces before punctuation and closing brackets
  • Add spaces after punctuation (except in special cases)
  • Fix spacing around quotes and guillemets
  • Normalize slash spacing in references (e.g., 127 / 11 → 127/11)
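Three of the spacing rules above can be sketched with plain regexes. This is a hedged approximation for illustration only: normalizeSpacingSketch is a hypothetical name, the punctuation set is an assumption, and the real function handles more cases (quotes, guillemets, spacing after punctuation) in a single pass.

```typescript
// Illustrative only: covers space collapsing, space-before-punctuation removal,
// and slash tightening. The punctuation set below is an assumption.
function normalizeSpacingSketch(input: string): string {
  return input
    .replace(/[ \t]+/g, ' ') // collapse runs of spaces/tabs into one space
    .replace(/ +([.,:!\u060C\u061B\u061F)\]])/g, '$1') // no space before punctuation or closers
    .replace(/(\d) ?\/ ?(\d)/g, '$1/$2'); // 127 / 11 -> 127/11
}
```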
3. Bracket and quote cleanup

  • Convert ((text)) to «text»
  • Remove spaces inside brackets and quotes
  • Ensure spacing before opening brackets
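The bracket rules above could look roughly like this when written as separate regex passes. The sketch handles round parentheses only, and cleanBracketsSketch is a hypothetical helper, not part of bitaboom.

```typescript
// Illustrative only: round parentheses and guillemets, nothing else.
function cleanBracketsSketch(input: string): string {
  return input
    .replace(/\(\(\s*([^()]*?)\s*\)\)/g, '\u00AB$1\u00BB') // ((text)) -> «text»
    .replace(/\(\s+/g, '(') // no space right after an opening parenthesis
    .replace(/\s+\)/g, ')') // no space right before a closing parenthesis
    .replace(/([^\s(])\(/g, '$1 ('); // ensure a space before an opening parenthesis
}
```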
4. Character condensation

  • Collapse multiple dots (..) into an ellipsis (…)
  • Condense repeated dashes, underscores, asterisks
  • Condense tatweel (ـ) repetitions
  • Normalize colons (.:. → :)
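A minimal sketch of the condensation rules, assuming that "two or more dots" becomes one ellipsis character and that repeated dashes, underscores, asterisks, and tatweel collapse to a single occurrence. Those thresholds are assumptions, and condenseCharactersSketch is a hypothetical name.

```typescript
// Illustrative only: the repetition thresholds are assumptions of this sketch.
function condenseCharactersSketch(input: string): string {
  return input
    .replace(/\.{2,}/g, '\u2026') // two or more dots -> … (U+2026)
    .replace(/([-_*\u0640])\1+/g, '$1'); // repeated -, _, * or tatweel (U+0640) -> one
}
```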
5. Arabic-specific fixes

  • Fix trailing wow (و) spacing (عليكم و رحمة → عليكم ورحمة)
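The wow fix can be pictured as a single regex: a standalone و followed by whitespace and an Arabic letter is attached to the following word. This is a sketch under that assumption; fixTrailingWowSketch is a hypothetical name and the library's actual rule may differ.

```typescript
// Assumption: a standalone و followed by whitespace and an Arabic letter
// should attach to the next word. The real rule may be narrower or broader.
function fixTrailingWowSketch(input: string): string {
  // \u0648 is و; \u0621-\u064A covers the basic Arabic letters
  return input.replace(/(^|\s)\u0648\s+(?=[\u0621-\u064A])/g, '$1\u0648');
}
```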
6. Newline normalization

  • Reduce multiple consecutive newlines to max 1
  • Trim whitespace from line edges
  • Remove trailing/leading spaces
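Read literally, the newline rules above amount to trimming each line and collapsing runs of line breaks into one. The sketch below follows that reading, which is an assumption; normalizeNewlinesSketch is a hypothetical helper.

```typescript
// Illustrative only: assumes "max 1" means runs of newlines become a single \n.
function normalizeNewlinesSketch(input: string): string {
  return input
    .replace(/\r\n?/g, '\n') // unify Windows/old-Mac line endings
    .replace(/[ \t]*\n[ \t]*/g, '\n') // trim spaces on both sides of each line break
    .replace(/\n{2,}/g, '\n') // collapse consecutive newlines
    .trim(); // drop leading/trailing whitespace of the whole text
}
```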

Performance

The preformatter is significantly faster than chaining individual formatting functions:
import {
  preformatArabicText,
  replaceEnglishPunctuationWithArabic,
  normalizeSpaces,
  condenseEllipsis,
  fixTrailingWow
} from 'bitaboom';

// ✅ Efficient single-pass approach
const result = preformatArabicText(largeText);

// ❌ Slow multi-pass approach (DON'T DO THIS)
let slowResult = replaceEnglishPunctuationWithArabic(largeText);
slowResult = normalizeSpaces(slowResult);
slowResult = condenseEllipsis(slowResult);
slowResult = fixTrailingWow(slowResult);
// ... 10+ more passes
In the benchmark results below, the single-pass preformatter is about 10x faster than chaining individual functions at 1 KB and about 30x faster at 100 KB and above.

Benchmark results

Performance tests on real-world Arabic manuscripts:
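| Text size | Individual functions | preformatArabicText | Speedup |
|-----------|----------------------|---------------------|---------|
| 1 KB      | ~2 ms                | ~0.2 ms             | 10x     |
| 10 KB     | ~25 ms               | ~1.5 ms             | 16x     |
| 100 KB    | ~350 ms              | ~12 ms              | 29x     |
| 1 MB      | ~4200 ms             | ~140 ms             | 30x     |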
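Figures like these depend on hardware and runtime, so it can be worth measuring on your own texts. The helper below is a minimal timing sketch, not the benchmark used for the table: averageMs and TimedFormatter are hypothetical names, and Date.now() only has millisecond resolution, so small inputs need many runs.

```typescript
type TimedFormatter = (text: string) => string;

// Average wall-clock time per call, in milliseconds.
function averageMs(fn: TimedFormatter, input: string, runs = 20): number {
  fn(input); // warm-up call so one-time JIT work is not counted
  const start = Date.now();
  for (let i = 0; i < runs; i++) {
    fn(input);
  }
  return (Date.now() - start) / runs;
}
```

Passing preformatArabicText and a function that chains the individual formatters to averageMs with the same input gives two numbers whose ratio is the speedup for that input.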

Batch processing

Process multiple documents efficiently:
import { preformatArabicText } from 'bitaboom';

// Automatic batch processing
const chapters = [
  'الفصل الأول: المقدمة ...',
  'الفصل الثاني: الموضوع ...',
  'الفصل الثالث: الخاتمة ...'
];

const formatted = preformatArabicText(chapters);
// Returns array with same length, each element formatted
Batch mode processes each string independently and returns an array of the same length. It’s equivalent to chapters.map(preformatArabicText) but with clearer intent.
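One way to picture the string-or-array behaviour is a pair of overloads around a single-string formatter. The wrapper below is a hypothetical sketch of that shape, not bitaboom's actual signature or implementation.

```typescript
type SingleFormatter = (text: string) => string;

// Wraps a single-string formatter so it also accepts an array of strings.
function withBatchSupport(format: SingleFormatter) {
  function run(input: string): string;
  function run(input: string[]): string[];
  function run(input: string | string[]): string | string[] {
    // Each element is formatted independently; order and length are preserved
    return Array.isArray(input) ? input.map((s) => format(s)) : format(input);
  }
  return run;
}
```

With this shape, the return type follows the argument type, so callers never need to narrow a string | string[] result themselves.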

Real-world patterns

Clean OCR output

import { convertUrduSymbolsToArabic, preformatArabicText } from 'bitaboom';

function cleanOCRManuscript(rawText: string): string {
  // Convert Urdu symbols first (preformat doesn't do this)
  let text = convertUrduSymbolsToArabic(rawText);
  
  // Apply comprehensive preformatting
  text = preformatArabicText(text);
  
  return text;
}

// getOCRText() stands in for your own OCR source
const ocrOutput = getOCRText();
const clean = cleanOCRManuscript(ocrOutput);

Process scanned book pages

import { preformatArabicText, removeAllTags } from 'bitaboom';

interface Page {
  number: number;
  rawContent: string;
}

function processBook(pages: Page[]) {
  // Extract text content
  const texts = pages.map(p => removeAllTags(p.rawContent));
  
  // Batch preformat all pages
  const formatted = preformatArabicText(texts);
  
  // Recombine with metadata
  return pages.map((page, i) => ({
    ...page,
    cleanContent: formatted[i]
  }));
}

Prepare for LLM processing

import { preformatArabicText, estimateTokenCount, LLMProvider } from 'bitaboom';

function prepareForLLM(rawText: string, provider: LLMProvider, maxTokens = 4000) {
  // Clean and normalize
  const formatted = preformatArabicText(rawText);
  
  // Check token budget
  const tokens = estimateTokenCount(formatted, provider);
  
  if (tokens > maxTokens) {
    console.warn(`Text exceeds token budget: ${tokens} > ${maxTokens}`);
    // Implement chunking strategy
  }
  
  return {
    text: formatted,
    tokens,
    withinBudget: tokens <= maxTokens
  };
}
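The "Implement chunking strategy" placeholder in prepareForLLM is left open by the example. One possible approach, sketched here under assumptions, is to pack whole lines greedily into chunks that stay within the budget. chunkByBudget is a hypothetical helper, and the token counter is passed in as a parameter so the sketch does not depend on any particular estimator.

```typescript
// Greedy line packing: keeps adding lines to a chunk until the next line
// would push it over the budget. A single line that is itself over budget
// still becomes its own (oversized) chunk; splitting it further is left out.
function chunkByBudget(
  text: string,
  maxTokens: number,
  countTokens: (s: string) => number
): string[] {
  const chunks: string[] = [];
  let current = '';
  for (const line of text.split('\n')) {
    const candidate = current === '' ? line : current + '\n' + line;
    if (current !== '' && countTokens(candidate) > maxTokens) {
      chunks.push(current);
      current = line;
    } else {
      current = candidate;
    }
  }
  if (current !== '') chunks.push(current);
  return chunks;
}
```

In the context above, countTokens could be a small closure around estimateTokenCount with the chosen provider.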

Create searchable content

import { preformatArabicText, makeDiacriticInsensitiveRegex } from 'bitaboom';

function indexArabicDocuments(documents: string[]) {
  // Preformat for consistent search
  const normalized = preformatArabicText(documents);
  
  return normalized.map((text, index) => ({
    id: index,
    content: text,
    searchable: text.toLowerCase()
  }));
}

function search(index: ReturnType<typeof indexArabicDocuments>, query: string) {
  const queryFormatted = preformatArabicText(query);
  const rx = makeDiacriticInsensitiveRegex(queryFormatted);
  
  return index.filter(doc => rx.test(doc.searchable));
}

Advanced usage

Pipeline with additional steps

import {
  preformatArabicText,
  convertUrduSymbolsToArabic,
  removeNonIndexSignatures,
  removeSolitaryArabicLetters
} from 'bitaboom';

function fullPipeline(text: string) {
  let result = text;
  
  // Pre-preformat steps (not included in preformat)
  result = convertUrduSymbolsToArabic(result);
  result = removeNonIndexSignatures(result);
  result = removeSolitaryArabicLetters(result);
  
  // Main preformatting
  result = preformatArabicText(result);
  
  return result;
}

Conditional formatting

import { preformatArabicText, getArabicScore } from 'bitaboom';

function smartFormat(text: string, threshold = 0.5) {
  const arabicScore = getArabicScore(text);
  
  if (arabicScore > threshold) {
    // Arabic content - use preformat
    return preformatArabicText(text);
  } else {
    // Non-Arabic content - basic cleanup only
    return text.trim().replace(/\s+/g, ' ');
  }
}

Streaming processing

import { preformatArabicText } from 'bitaboom';

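async function* processLargeFile(filePath: string, chunkSize = 1000) {
  // Bun-specific API. Note that this reads the whole file into memory;
  // only the formatting work is done chunk by chunk.
  const file = await Bun.file(filePath).text();
  // Each line is formatted independently, so multi-line rules do not span lines
  const lines = file.split('\n');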
  
  for (let i = 0; i < lines.length; i += chunkSize) {
    const chunk = lines.slice(i, i + chunkSize);
    const formatted = preformatArabicText(chunk);
    yield formatted;
  }
}

// Usage (saveToDatabase is a placeholder for your own persistence code)
for await (const formattedChunk of processLargeFile('large-book.txt')) {
  await saveToDatabase(formattedChunk);
}

Implementation details

Single-pass architecture

The preformatter relies on four optimizations:
  • Character code lookup tables (faster than regex)
  • Bitflag-based classification (single pass, no double-counting)
  • State machine for context-aware transformations
  • Efficient string builder (minimizes allocations)
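// Internal architecture (simplified sketch, not the actual source)
class Preformatter {
  private i = 0;
  private lastCode = 0;
  private pendingSpaces = 0;
  private out = '';
  private readonly len: number;

  constructor(private readonly input: string) {
    this.len = input.length;
  }

  process(): string {
    // Single loop over characters
    while (this.i < this.len) {
      const code = this.input.charCodeAt(this.i++);
      // Classify character using lookup table
      // Apply transformations based on context (lastCode, pendingSpaces)
      // Emit to output buffer (placeholder: pass the character through unchanged)
      this.out += String.fromCharCode(code);
      this.lastCode = code;
    }
    return this.out;
  }
}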
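To make the lookup-table and bitflag ideas more concrete, here is a small self-contained sketch that applies two of the spacing rules in one loop. The flag names, the character sets, and collapseAndTighten are all assumptions chosen for illustration; the library's internal tables are not shown in this page.

```typescript
// Bitflags: one character can belong to several classes at once.
const F_SPACE = 1 << 0;
const F_PUNCT = 1 << 1;
const F_CLOSER = 1 << 2;

// Lookup table indexed by UTF-16 code unit (one byte of flags per unit).
const CHAR_FLAGS = new Uint8Array(0x10000);
function mark(chars: string, flag: number): void {
  for (let i = 0; i < chars.length; i++) CHAR_FLAGS[chars.charCodeAt(i)] |= flag;
}
mark(' \t', F_SPACE);
mark('.,:!\u060C\u061B\u061F', F_PUNCT);
mark(')]}\u00BB', F_CLOSER);

// Single pass: collapse whitespace runs and drop spaces before punctuation/closers.
function collapseAndTighten(input: string): string {
  let out = '';
  let pendingSpace = false;
  for (let i = 0; i < input.length; i++) {
    const flags = CHAR_FLAGS[input.charCodeAt(i)];
    if (flags & F_SPACE) {
      pendingSpace = true; // defer the decision until the next real character
      continue;
    }
    // Emit the deferred space only mid-text and not before punctuation or closers
    if (pendingSpace && out !== '' && !(flags & (F_PUNCT | F_CLOSER))) out += ' ';
    pendingSpace = false;
    out += input[i];
  }
  return out; // trailing whitespace is never emitted
}
```

Deferring the space decision is what lets a single loop replace separate "collapse spaces" and "remove space before punctuation" passes.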

Environment variables

For benchmarking, you can force specific implementations:
# Force string concatenation builder (default)
export BITABOOM_PREFORMAT_BUILDER=concat

# Force UTF-16 buffer builder (for very large texts)
export BITABOOM_PREFORMAT_BUILDER=buffer
The default concat builder is typically faster for page-sized inputs (1-100KB). The buffer builder is optimized for very large inputs (1MB+) to reduce GC pressure.

What it doesn’t do

The preformatter does not include these operations (use dedicated functions):
  • Urdu symbol conversion (convertUrduSymbolsToArabic)
  • Arabic numeral conversion (arabicNumeralToNumber)
  • Removing references (removeNonIndexSignatures)
  • Removing solitary letters (removeSolitaryArabicLetters)
  • Removing singular codes (removeSingularCodes)
  • Stripping tags (removeAllTags)
  • Smart quotes (basic quote spacing is included)
  • Title case conversion (toTitleCase)
  • Styling removal (stripStyling)
  • Sentence-based formatting (formatStringBySentence)
  • Line breaks after punctuation (insertLineBreaksAfterPunctuation)

Comparison with individual functions

Use preformatArabicText when:
  • You are processing large volumes of text (>10KB)
  • You are applying multiple formatting operations
  • Performance is critical (batch processing, real-time)
  • You need consistent, comprehensive normalization
// Best for large inputs that need several cleanups at once:
const formatted = preformatArabicText(largeManuscript);

Best practices

1. Use batch mode for multiple documents

// ✅ Preferred: one call, clear intent
const formatted = preformatArabicText(documents);

// ❌ Equivalent result, but less direct
const formattedViaMap = documents.map((doc) => preformatArabicText(doc));
2. Apply Urdu conversion first

let text = convertUrduSymbolsToArabic(rawText);
text = preformatArabicText(text);
3. Combine with content removal as needed

let text = preformatArabicText(rawText);
text = removeNonIndexSignatures(text);
text = removeSolitaryArabicLetters(text);
4. Validate before expensive operations

const formatted = preformatArabicText(text);
const tokens = estimateTokenCount(formatted, provider);
if (tokens <= maxTokens) {
  await sendToLLM(formatted);
}
The preformatter modifies whitespace, punctuation, and formatting. If you need to preserve exact original formatting, store both the original and formatted versions.
For maximum performance on very large datasets (100MB+), consider using the buffer builder with BITABOOM_PREFORMAT_BUILDER=buffer.
