Preformatting pipeline

The preformatArabicText function is a single-pass formatter that consolidates dozens of common Arabic text cleanup operations into one pipeline, so the text is scanned once rather than once per operation.

Quick start

import { preformatArabicText } from 'bitaboom';

// Single string
const formatted = preformatArabicText('بِسْمِ  اللَّهِ ( الرَّحْمَنِ ) 127 / 11 قَالَ ...');

// Batch mode (array of strings)
const pages = ['صفحة 1 ...', 'صفحة 2 ...', 'صفحة 3 ...'];
const formattedPages = preformatArabicText(pages);

What it does

The preformatter applies these transformations in a single pass:

Punctuation normalization

Replace ? with ؟ (Arabic question mark)
Replace ; with ؛ (Arabic semicolon)
Remove redundant punctuation after ؟ or !
Clean spacing around all punctuation marks

Spacing normalization

Collapse multiple spaces/tabs to single space
Remove spaces before punctuation and closing brackets
Add spaces after punctuation (except in special cases)
Fix spacing around quotes and guillemets
Normalize slash spacing in references (e.g., 127 / 11 → 127/11)

Bracket and quote cleanup

Convert ((text)) to «text»
Remove spaces inside brackets and quotes
Ensure spacing before opening brackets

Character condensation

Collapse multiple dots (..) into ellipsis (…)
Condense repeated dashes, underscores, asterisks
Condense tatweel (ـ) repetitions
Normalize colons (.:. → :)

Arabic-specific fixes

Fix spacing after a trailing waw (و) (عليكم و رحمة → عليكم ورحمة)

Newline normalization

Reduce multiple consecutive newlines to at most one
Trim whitespace from line edges
Remove trailing/leading spaces
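
To make a few of the rules above concrete, here is a minimal regex-based sketch of what they do to a string. It is a deliberately naive multi-pass illustration, not the library's single-pass implementation; the name naiveCleanup is hypothetical, and the real preformatter covers many more cases and may behave differently at the edges.

```typescript
// Hypothetical helper: illustrates a handful of the rules listed above.
// Each replace() is a separate pass, which is exactly what the real
// preformatter is designed to avoid.
function naiveCleanup(input: string): string {
  return input
    .replace(/\?/g, '؟') // Latin question mark -> Arabic question mark
    .replace(/;/g, '؛') // Latin semicolon -> Arabic semicolon
    .replace(/[ \t]+/g, ' ') // collapse runs of spaces/tabs
    .replace(/(\d) ?\/ ?(\d)/g, '$1/$2') // 127 / 11 -> 127/11
    .replace(/\(\(([^()]*)\)\)/g, '«$1»') // ((text)) -> «text»
    .replace(/\.{2,}/g, '…') // two or more dots -> ellipsis
    .replace(/ـ{2,}/g, 'ـ') // condense repeated tatweel
    .replace(/\n{2,}/g, '\n') // at most one consecutive newline
    .split('\n')
    .map((line) => line.trim()) // trim whitespace from line edges
    .join('\n');
}
```

Because every rule rescans the whole string, the cost grows with the number of rules, which is the motivation for the single-pass design described under Performance.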

Performance

The preformatter is significantly faster than chaining individual formatting functions:

import {
  preformatArabicText,
  replaceEnglishPunctuationWithArabic,
  normalizeSpaces,
  condenseEllipsis,
  fixTrailingWow
} from 'bitaboom';

// ✅ Efficient single-pass approach
const result = preformatArabicText(largeText);

// ❌ Slow multi-pass approach (DON'T DO THIS)
let chained = replaceEnglishPunctuationWithArabic(largeText);
chained = normalizeSpaces(chained);
chained = condenseEllipsis(chained);
chained = fixTrailingWow(chained);
// ... 10+ more passes

For 100KB+ of text, the benchmarks below show the single-pass preformatter running roughly 30x faster than chaining individual functions.

Benchmark results

Performance tests on real-world Arabic manuscripts:

Text size	Individual functions	`preformatArabicText`	Speedup
1 KB	~2ms	~0.2ms	10x
10 KB	~25ms	~1.5ms	16x
100 KB	~350ms	~12ms	29x
1 MB	~4200ms	~140ms	30x
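
Numbers like these depend on hardware and runtime, so it can be worth measuring on your own texts. The following is a minimal timing sketch, not the harness used for the table above; medianRuntimeMs and speedup are hypothetical helper names, and it assumes a runtime that exposes the global performance.now() (recent Node.js, Bun, or a browser).

```typescript
// Hypothetical helper: times a formatter over several runs and returns the
// median duration in milliseconds (the median is less sensitive to outliers
// such as a GC pause than the mean).
function medianRuntimeMs(
  fn: (text: string) => unknown,
  text: string,
  runs = 5,
): number {
  if (runs < 1) throw new RangeError('runs must be at least 1');
  const samples: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    fn(text);
    samples.push(performance.now() - start);
  }
  samples.sort((a, b) => a - b);
  return samples[Math.floor(samples.length / 2)];
}

// Ratio of baseline time to candidate time, e.g. chained vs. single-pass.
function speedup(baselineMs: number, candidateMs: number): number {
  return baselineMs / candidateMs;
}
```

To compare the two approaches, pass a function that chains the individual formatters as the baseline and preformatArabicText as the candidate, using the same input text for both.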

Batch processing

Process multiple documents efficiently:

import { preformatArabicText } from 'bitaboom';

// Automatic batch processing
const chapters = [
  'الفصل الأول: المقدمة ...',
  'الفصل الثاني: الموضوع ...',
  'الفصل الثالث: الخاتمة ...'
];

const formatted = preformatArabicText(chapters);
// Returns array with same length, each element formatted

Batch mode processes each string independently and returns an array of the same length. It’s equivalent to chapters.map(preformatArabicText) but with clearer intent.
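
For readers curious how a function can accept either a string or an array, here is one way such a dual signature can be typed with overloads. This is a sketch under assumptions, not the library's source: withBatchMode is a hypothetical wrapper, and the formatter passed to it stands in for the real one.

```typescript
// Hypothetical wrapper: lifts any string -> string formatter so that it
// also accepts an array and returns an array of the same length.
function withBatchMode(format: (text: string) => string) {
  function run(input: string): string;
  function run(input: string[]): string[];
  function run(input: string | string[]): string | string[] {
    return Array.isArray(input)
      ? input.map((item) => format(item)) // each string is processed independently
      : format(input);
  }
  return run;
}

// Stand-in formatter for demonstration purposes only.
const trimAll = withBatchMode((s) => s.trim());
const one = trimAll('  نص  '); // typed as string
const many = trimAll(['  أ ', ' ب  ']); // typed as string[]
```

The overloads let the compiler infer the return type from the argument, so callers do not need casts in either mode.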

Real-world patterns

Clean OCR output

import { preformatArabicText, convertUrduSymbolsToArabic } from 'bitaboom';

function cleanOCRManuscript(rawText: string): string {
  // Convert Urdu symbols first (preformat doesn't do this)
  let text = convertUrduSymbolsToArabic(rawText);
  
  // Apply comprehensive preformatting
  text = preformatArabicText(text);
  
  return text;
}

const ocrOutput = getOCRText();
const clean = cleanOCRManuscript(ocrOutput);

Process scanned book pages

import { preformatArabicText, removeAllTags } from 'bitaboom';

interface Page {
  number: number;
  rawContent: string;
}

function processBook(pages: Page[]) {
  // Extract text content
  const texts = pages.map(p => removeAllTags(p.rawContent));
  
  // Batch preformat all pages
  const formatted = preformatArabicText(texts);
  
  // Recombine with metadata
  return pages.map((page, i) => ({
    ...page,
    cleanContent: formatted[i]
  }));
}

Prepare for LLM processing

import { preformatArabicText, estimateTokenCount, LLMProvider } from 'bitaboom';

function prepareForLLM(rawText: string, provider: LLMProvider, maxTokens = 4000) {
  // Clean and normalize
  const formatted = preformatArabicText(rawText);
  
  // Check token budget
  const tokens = estimateTokenCount(formatted, provider);
  
  if (tokens > maxTokens) {
    console.warn(`Text exceeds token budget: ${tokens} > ${maxTokens}`);
    // Implement chunking strategy
  }
  
  return {
    text: formatted,
    tokens,
    withinBudget: tokens <= maxTokens
  };
}
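
The "Implement chunking strategy" placeholder above can be filled in many ways. One simple option, sketched here, is to pack whole lines greedily into chunks that stay under the budget. chunkByTokenBudget is a hypothetical helper, and countTokens is a caller-supplied function; with the library it could plausibly be (chunk) => estimateTokenCount(chunk, provider), following the call shown above.

```typescript
// Hypothetical helper: greedily packs whole lines into chunks whose token
// count (as measured by the supplied countTokens) stays within maxTokens.
// A single line that exceeds the budget on its own becomes its own chunk.
function chunkByTokenBudget(
  text: string,
  maxTokens: number,
  countTokens: (chunk: string) => number,
): string[] {
  const chunks: string[] = [];
  let current = '';
  for (const line of text.split('\n')) {
    const candidate = current === '' ? line : `${current}\n${line}`;
    if (current !== '' && countTokens(candidate) > maxTokens) {
      chunks.push(current); // close the chunk before it overflows
      current = line;
    } else {
      current = candidate;
    }
  }
  if (current !== '') chunks.push(current);
  return chunks;
}
```

Splitting on line boundaries keeps sentences intact in most cases; texts with very long lines would need a finer-grained split.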

Create searchable content

import { preformatArabicText, makeDiacriticInsensitiveRegex } from 'bitaboom';

function indexArabicDocuments(documents: string[]) {
  // Preformat for consistent search
  const normalized = preformatArabicText(documents);
  
  return normalized.map((text, index) => ({
    id: index,
    content: text,
    searchable: text.toLowerCase()
  }));
}

function search(index: ReturnType<typeof indexArabicDocuments>, query: string) {
  const queryFormatted = preformatArabicText(query);
  const rx = makeDiacriticInsensitiveRegex(queryFormatted);
  
  return index.filter(doc => rx.test(doc.searchable));
}

Advanced usage

Pipeline with additional steps

import {
  preformatArabicText,
  convertUrduSymbolsToArabic,
  removeNonIndexSignatures,
  removeSolitaryArabicLetters
} from 'bitaboom';

function fullPipeline(text: string) {
  let result = text;
  
  // Pre-preformat steps (not included in preformat)
  result = convertUrduSymbolsToArabic(result);
  result = removeNonIndexSignatures(result);
  result = removeSolitaryArabicLetters(result);
  
  // Main preformatting
  result = preformatArabicText(result);
  
  return result;
}

Conditional formatting

import { preformatArabicText, getArabicScore } from 'bitaboom';

function smartFormat(text: string, threshold = 0.5) {
  const arabicScore = getArabicScore(text);
  
  if (arabicScore > threshold) {
    // Arabic content - use preformat
    return preformatArabicText(text);
  } else {
    // Non-Arabic content - basic cleanup only
    return text.trim().replace(/\s+/g, ' ');
  }
}

Streaming processing

import { preformatArabicText } from 'bitaboom';

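async function* processLargeFile(filePath: string, chunkSize = 1000) {
  // Bun-specific API. Note that this loads the whole file into memory;
  // only the formatting and downstream work happen in batches of lines.
  const file = await Bun.file(filePath).text();
  const lines = file.split('\n');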
  
  for (let i = 0; i < lines.length; i += chunkSize) {
    const chunk = lines.slice(i, i + chunkSize);
    const formatted = preformatArabicText(chunk);
    yield formatted;
  }
}

// Usage
for await (const formattedChunk of processLargeFile('large-book.txt')) {
  await saveToDatabase(formattedChunk);
}
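
The generator above batches lines after the full text has been loaded. If the input instead arrives in pieces (for example from a file or network stream), lines have to be reassembled across chunk boundaries before they are batched. The following is a minimal sketch of that step; LineBatcher is a hypothetical helper, not part of the library.

```typescript
// Hypothetical helper: turns arbitrary text chunks into batches of complete
// lines, carrying a trailing partial line over to the next chunk.
class LineBatcher {
  private carry = '';
  private lines: string[] = [];
  private readonly batchSize: number;

  constructor(batchSize: number) {
    if (batchSize < 1) throw new RangeError('batchSize must be at least 1');
    this.batchSize = batchSize;
  }

  // Feed one chunk; returns every full batch that is now ready.
  push(chunk: string): string[][] {
    const parts = (this.carry + chunk).split('\n');
    this.carry = parts.pop() ?? ''; // last piece may be an incomplete line
    for (const part of parts) this.lines.push(part);
    const ready: string[][] = [];
    while (this.lines.length >= this.batchSize) {
      ready.push(this.lines.splice(0, this.batchSize));
    }
    return ready;
  }

  // Call once at end of input to get the remaining lines.
  flush(): string[] {
    if (this.carry !== '') this.lines.push(this.carry);
    this.carry = '';
    return this.lines.splice(0);
  }
}
```

Each batch returned by push or flush is a string array, so it can be handed directly to preformatArabicText in batch mode.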

Implementation details

Single-pass architecture

The preformatter relies on the following optimizations:

Character code lookup tables (faster than regex)
Bitflag-based classification (single pass, no double-counting)
State machine for context-aware transformations
Efficient string builder (minimizes allocations)

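// Internal architecture (simplified)
class Preformatter {
  private i = 0;
  private lastCode = 0;
  private pendingSpaces = 0;
  private readonly input: string;
  private readonly len: number;

  constructor(input: string) {
    this.input = input;
    this.len = input.length;
  }

  process() {
    // Single loop over characters
    while (this.i < this.len) {
      const code = this.input.charCodeAt(this.i);
      // Classify character using lookup table
      // Apply transformations based on context
      // Emit to output buffer
      this.lastCode = code;
      this.i++;
    }
  }
}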
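
As an illustration of the lookup-table and bitflag ideas, here is a small self-contained sketch that collapses space runs and drops spaces before punctuation in one loop. It is not the library's code: the flag names, the table contents, and collapseSpaces are all assumptions chosen for the example.

```typescript
// Bit flags for character classes (hypothetical names).
const SPACE = 1 << 0;
const PUNCT = 1 << 1;

// One entry per UTF-16 code unit; built once, then indexed by charCodeAt().
const CLASS_TABLE = new Uint8Array(0x10000);

function mark(chars: string, flag: number): void {
  for (let i = 0; i < chars.length; i++) {
    CLASS_TABLE[chars.charCodeAt(i)] |= flag;
  }
}
mark(' \t', SPACE);
mark('.,:;!?؟،؛', PUNCT);

// Single pass: collapse space runs, drop leading/trailing spaces, and drop
// spaces that directly precede punctuation.
function collapseSpaces(input: string): string {
  let out = '';
  let pendingSpace = false;
  for (let i = 0; i < input.length; i++) {
    const flags = CLASS_TABLE[input.charCodeAt(i)];
    if (flags & SPACE) {
      pendingSpace = out.length > 0; // remember the space, emit it lazily
      continue;
    }
    if (pendingSpace && !(flags & PUNCT)) out += ' ';
    pendingSpace = false;
    out += input[i];
  }
  return out;
}
```

Deferring the space until the next character is known is what makes the decision context-aware without a second pass; the pendingSpaces field in the simplified class above suggests a similar idea.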

Environment variables

For benchmarking, you can force specific implementations:

# Force string concatenation builder (default)
export BITABOOM_PREFORMAT_BUILDER=concat

# Force UTF-16 buffer builder (for very large texts)
export BITABOOM_PREFORMAT_BUILDER=buffer

The default concat builder is typically faster for page-sized inputs (1-100KB). The buffer builder is optimized for very large inputs (1MB+) to reduce GC pressure.
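
To show what the two strategies might look like, here is a sketch of a concatenation builder, a growable UTF-16 buffer builder, and a selector keyed on the variable above. OutputBuilder, ConcatBuilder, BufferBuilder, and createBuilder are hypothetical names; the library's internal builders may be structured differently. The environment is passed in as a plain object so the example does not depend on a specific runtime; in Node.js or Bun you would pass process.env.

```typescript
interface OutputBuilder {
  push(codeUnit: number): void;
  toString(): string;
}

// Appends to a JS string; engines optimize this well for small outputs.
class ConcatBuilder implements OutputBuilder {
  private out = '';
  push(codeUnit: number): void {
    this.out += String.fromCharCode(codeUnit);
  }
  toString(): string {
    return this.out;
  }
}

// Writes code units into a typed array that doubles when full, so large
// outputs create few intermediate strings.
class BufferBuilder implements OutputBuilder {
  private buf: Uint16Array;
  private len = 0;
  constructor(capacity = 1024) {
    this.buf = new Uint16Array(Math.max(1, capacity));
  }
  push(codeUnit: number): void {
    if (this.len === this.buf.length) {
      const grown = new Uint16Array(this.buf.length * 2);
      grown.set(this.buf);
      this.buf = grown;
    }
    this.buf[this.len++] = codeUnit;
  }
  toString(): string {
    // Decode in slices so fromCharCode never gets a huge argument list.
    let out = '';
    for (let i = 0; i < this.len; i += 8192) {
      const end = Math.min(i + 8192, this.len);
      out += String.fromCharCode(...this.buf.subarray(i, end));
    }
    return out;
  }
}

// Picks a builder from an environment-like object; defaults to concat.
function createBuilder(env: Record<string, string | undefined>): OutputBuilder {
  return env['BITABOOM_PREFORMAT_BUILDER'] === 'buffer'
    ? new BufferBuilder()
    : new ConcatBuilder();
}
```

Both builders expose the same interface, so the formatting loop does not need to know which one is active.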

What it doesn’t do

The preformatter does not include these operations (use dedicated functions):

Character conversion

Urdu symbol conversion (convertUrduSymbolsToArabic)
Arabic numeral conversion (arabicNumeralToNumber)

Content removal

Removing references (removeNonIndexSignatures)
Removing solitary letters (removeSolitaryArabicLetters)
Removing singular codes (removeSingularCodes)
Stripping tags (removeAllTags)

Advanced typography

Smart quotes (basic quote spacing is included)
Title case conversion (toTitleCase)
Styling removal (stripStyling)

Sentence segmentation

Sentence-based formatting (formatStringBySentence)
Line breaks after punctuation (insertLineBreaksAfterPunctuation)

Comparison with individual functions

Use preformat when

Processing large volumes of text (>10KB)
Applying multiple formatting operations
Performance is critical (batch processing, real-time)
You need consistent, comprehensive normalization

// Best for:
const formatted = preformatArabicText(largeManuscript);

Use individual functions when

You only need 1-2 specific transformations
Working with very small strings (under 1KB)
You need fine-grained control
You’re excluding certain normalizations

// Best for:
const result = normalizeSpaces(cleanSpacesBeforePeriod(text));

Best practices

Use batch mode for multiple documents

// ✅ Preferred: batch mode states the intent directly
const formatted = preformatArabicText(documents);

// Same result, but the intent is less clear
const mapped = documents.map((doc) => preformatArabicText(doc));

Apply Urdu conversion first

let text = convertUrduSymbolsToArabic(rawText);
text = preformatArabicText(text);

Combine with content removal as needed

let text = preformatArabicText(rawText);
text = removeNonIndexSignatures(text);
text = removeSolitaryArabicLetters(text);

Validate before expensive operations

const formatted = preformatArabicText(text);
const tokens = estimateTokenCount(formatted, provider);
if (tokens <= maxTokens) {
  await sendToLLM(formatted);
}

The preformatter modifies whitespace, punctuation, and formatting. If you need to preserve exact original formatting, store both the original and formatted versions.

For maximum performance on very large datasets (100MB+), consider using the buffer builder with BITABOOM_PREFORMAT_BUILDER=buffer.
