preformatArabicText
High-performance Arabic preformatting pipeline that consolidates common formatting steps into a single-pass formatter. It covers these operations:
- Spacing normalization and insertion
- Punctuation normalization (Arabic/English conversion)
- Reference formatting (slash spacing for numbers)
- Bracket and quote cleanup
- Ellipsis condensation
- Newline normalization
- Redundant character removal
- Smart quote handling
Function signatures
Input: a string or an array of strings to preformat.
Returns: string | string[] - the preformatted string or array of strings (matching the input shape).
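One way to type a function that returns the same shape it receives is with TypeScript overloads. The sketch below is an assumption about how such a signature could look, not the library's actual declaration; preformatSketch and its whitespace-only body are hypothetical stand-ins used so the example runs on its own.

```typescript
// Hypothetical stand-in, NOT the library's implementation.
// Overloads: a string in gives a string out, an array in gives an array out.
function preformatSketch(input: string): string;
function preformatSketch(input: string[]): string[];
function preformatSketch(input: string | string[]): string | string[] {
  // Placeholder formatting step: collapse runs of spaces/tabs and trim the ends.
  const formatOne = (text: string): string =>
    text.replace(/[ \t]+/g, " ").trim();

  return Array.isArray(input) ? input.map(formatOne) : formatOne(input);
}

// Single string in, single string out.
const single: string = preformatSketch("  نص   عربي  ");

// Array in, array out (same length, same order).
const batch: string[] = preformatSketch(["  أ  ب ", "ج   د"]);

console.log(single, batch);
```

With overloads, the caller does not need to narrow the result: the compiler already knows that an array argument yields an array.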
Single string example
Array example
Features
Punctuation normalization:
- Converts ? to ؟ (Arabic question mark)
- Converts ; to ؛ (Arabic semicolon)
- Removes redundant punctuation (e.g., ؟. becomes ؟)
- Normalizes multiple spaces to single space
- Removes spaces before punctuation
- Adds spaces after punctuation
- Handles spaces around brackets and quotes
- Fixes reference formatting (e.g., 1 / 2 becomes 1/2)
- Condenses multiple underscores/tatweel: ـــ → ـ
- Condenses multiple dashes: --- → -
- Condenses multiple asterisks: *** → *
- Converts .. to ellipsis …
- Condenses colons: .: → :
- Converts ((text)) to «text»
- Fixes mismatched brackets and quotes
- Removes spaces inside brackets/quotes
- Normalizes multiple newlines
- Cleans horizontal whitespace from line ends
- Removes trailing/leading whitespace
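Several of the rules above can be approximated with chained regular expressions. This is a rough sketch for illustration only: it runs multiple passes rather than the single-pass approach described on this page, and applyFeatureRules is a hypothetical name, not part of the library.

```typescript
// Hypothetical illustration of a few listed rules using chained regexes.
// It does NOT reproduce the library's single-pass algorithm or its exact edge cases.
function applyFeatureRules(text: string): string {
  return text
    .replace(/\?/g, "؟")                        // ? -> Arabic question mark
    .replace(/;/g, "؛")                         // ; -> Arabic semicolon
    .replace(/؟\./g, "؟")                       // redundant punctuation: ؟. -> ؟
    .replace(/(\d)\s*\/\s*(\d)/g, "$1/$2")      // references: 1 / 2 -> 1/2 (ASCII digits only)
    .replace(/ـ{2,}/g, "ـ")                     // repeated tatweel -> one tatweel
    .replace(/-{2,}/g, "-")                     // repeated dashes -> one dash
    .replace(/\*{2,}/g, "*")                    // repeated asterisks -> one asterisk
    .replace(/\.{2,}/g, "…")                    // two or more dots -> ellipsis
    .replace(/\.:/g, ":")                       // .: -> :
    .replace(/\(\(([^()]*)\)\)/g, "«$1»");      // ((text)) -> «text»
}

console.log(applyFeatureRules("هل 1 / 2 صحيح?")); // هل 1/2 صحيح؟
console.log(applyFeatureRules("((نص))"));         // «نص»
console.log(applyFeatureRules("a---b"));          // a-b
```

The order of the replacements matters in a chained approach, which is one reason a single pass with explicit state tends to give more consistent results.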
Performance characteristics
Single-pass algorithm:
- O(N) time complexity, where N is the input length
- Minimal memory allocations
- Uses lookup table for character classification
- Processes UTF-16 code units directly
Suitable for:
- Page-sized inputs (typical documents)
- Very large inputs (100MB+ strings)
- Batch processing of multiple strings
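The lookup-table idea can be sketched as follows. This is a minimal illustration under assumptions: the class constants, the table layout, and the function collapseSpaces are hypothetical, and only two of the spacing rules (collapsing repeated spaces and dropping spaces before punctuation) are covered.

```typescript
// Character classes stored in a 64K table, one slot per UTF-16 code unit.
const OTHER = 0;
const SPACE = 1;
const PUNCT = 2;

const CLASS_TABLE = new Uint8Array(0x10000); // every slot defaults to OTHER (0)

function mark(chars: string, cls: number): void {
  for (let i = 0; i < chars.length; i++) {
    CLASS_TABLE[chars.charCodeAt(i)] = cls;
  }
}

mark(" \t", SPACE);
mark(".,:!؟؛،", PUNCT);

// Single pass over code units: each one is classified with one table read.
function collapseSpaces(text: string): string {
  let out = "";
  let pendingSpace = false;

  for (let i = 0; i < text.length; i++) {
    const cls = CLASS_TABLE[text.charCodeAt(i)];

    if (cls === SPACE) {
      pendingSpace = true; // defer the decision until the next visible character
      continue;
    }
    // Emit one space for a whole run, but never before punctuation or at the start.
    if (pendingSpace && cls !== PUNCT && out.length > 0) {
      out += " ";
    }
    pendingSpace = false;
    out += text.charAt(i);
  }
  return out; // a trailing run of spaces is never emitted
}

console.log(collapseSpaces("a   b  ,c")); // a b,c
```

Because the table is indexed by code unit, classification costs one array read per character, with no regular expressions or string comparisons in the loop.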
Advanced usage
Environment variable control: You can force a specific internal builder for experiments/benchmarks.
The buffer builder is experimental and primarily useful for extremely large inputs (100MB+) where GC pressure may dominate. For typical use cases, the default concat builder is faster.
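To illustrate the trade-off between the two builder strategies, here is a hedged sketch. The interface, the class names, and the growth policy are assumptions made for illustration; the library's internal builders, and the environment variable that selects between them, are not reproduced here.

```typescript
// Hypothetical output-builder interface; not the library's internal API.
interface OutputBuilder {
  push(codeUnit: number): void;
  finish(): string;
}

// Concat strategy: append to a string as you go.
class ConcatBuilder implements OutputBuilder {
  private out = "";
  push(codeUnit: number): void {
    this.out += String.fromCharCode(codeUnit);
  }
  finish(): string {
    return this.out;
  }
}

// Buffer strategy: write code units into a typed array and decode once at the end.
class BufferBuilder implements OutputBuilder {
  private buf: Uint16Array;
  private len = 0;

  constructor(capacity: number) {
    this.buf = new Uint16Array(Math.max(capacity, 16));
  }

  push(codeUnit: number): void {
    if (this.len === this.buf.length) {
      const grown = new Uint16Array(this.buf.length * 2); // double when full
      grown.set(this.buf);
      this.buf = grown;
    }
    this.buf[this.len++] = codeUnit;
  }

  finish(): string {
    const CHUNK = 8192; // decode in chunks to stay under argument-count limits
    let out = "";
    for (let i = 0; i < this.len; i += CHUNK) {
      const end = Math.min(i + CHUNK, this.len);
      const codes: number[] = [];
      for (let j = i; j < end; j++) codes.push(this.buf[j]);
      out += String.fromCharCode.apply(null, codes);
    }
    return out;
  }
}

// Feeds every code unit of the input through the given builder unchanged.
function copyThrough(text: string, builder: OutputBuilder): string {
  for (let i = 0; i < text.length; i++) builder.push(text.charCodeAt(i));
  return builder.finish();
}

const sample = "السلام عليكم ورحمة الله";
console.log(copyThrough(sample, new ConcatBuilder()) === copyThrough(sample, new BufferBuilder(4)));
```

Both strategies produce the same output; they differ only in how intermediate results are held in memory, which is why the choice mainly matters for benchmarking and for very large inputs.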
Use cases
Document preprocessing
Comparison with individual functions
Benefits of the single pass over running the individual functions in multiple passes:
- 5-10x faster on typical inputs
- Significantly lower memory usage
- Simpler code
- Consistent results