Text sanitization

Bitaboom provides functions for cleaning and sanitizing text. They are particularly useful for processing Arabic documents, hadith collections, and scholarly works.

Removing symbols and references

Cleaning part references

Remove various symbols, part references, and numerical markers:

import { cleanSymbolsAndPartReferences } from 'bitaboom';

cleanSymbolsAndPartReferences('Another example [1] [1/2]');
// Result: 'Another example  '

cleanSymbolsAndPartReferences('Part references 1/2 2/3/4');
// Result: 'Part references    '

Removing trailing page numbers

import { cleanTrailingPageNumbers } from 'bitaboom';

cleanTrailingPageNumbers('This is some -[46]- text');
// Result: 'This is some  text'

Removing single digit references

import { removeSingleDigitReferences } from 'bitaboom';

removeSingleDigitReferences('Ref (1), Ref «2», Ref [3]');
// Result: 'Ref , Ref , Ref '

Text normalization

Replacing line breaks with spaces

import { replaceLineBreaksWithSpaces } from 'bitaboom';

replaceLineBreaksWithSpaces('a\nb');
// Result: 'a b'

Stripping digits

import { stripAllDigits } from 'bitaboom';

stripAllDigits('abcd245');
// Result: 'abcd'

Removing death year references

import { removeDeathYear } from 'bitaboom';

// Double quotes are used because the name contains an apostrophe
removeDeathYear("Sufyān ibn 'Uyaynah (d. 198h) said:");
// Result: "Sufyān ibn 'Uyaynah said:"

removeDeathYear("Sufyān ibn 'Uyaynah [d. 200H] said:");
// Result: "Sufyān ibn 'Uyaynah said:"

The removeDeathYear function only removes the abbreviated forms (d. XXXh) and [d. XXXh]. It leaves the full word “died” untouched to avoid false positives.
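To illustrate why matching only the abbreviated form keeps false positives down, here is a minimal sketch of that kind of pattern. The name removeDeathYearSketch and the regular expression are assumptions made for illustration; they are not taken from Bitaboom's source and the library may behave differently on edge cases.

```typescript
// Hypothetical pattern, not Bitaboom's implementation.
// Matches "(d. 198h)" or "[d. 200H]" together with the whitespace just before it.
const DEATH_YEAR_SKETCH = /\s*[(\[]d\.\s*\d{1,4}\s*[hH][)\]]/g;

function removeDeathYearSketch(text: string): string {
    return text.replace(DEATH_YEAR_SKETCH, '');
}

// Abbreviated forms are removed
console.log(removeDeathYearSketch("Sufyān ibn 'Uyaynah (d. 198h) said:"));
// The full word "died" does not match, so the text is left alone
console.log(removeDeathYearSketch('He (died 198h) said:'));
```

Because the pattern requires the literal "d." right after the opening bracket, ordinary parenthetical remarks that merely mention a year are not touched.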

Markdown formatting removal

Strip Markdown syntax while preserving the content:

import { removeMarkdownFormatting } from 'bitaboom';

removeMarkdownFormatting('This is **bold** text');
// Result: 'This is bold text'

removeMarkdownFormatting('This is *italic* text');
// Result: 'This is italic text'

removeMarkdownFormatting('# Header 1');
// Result: 'Header 1'

removeMarkdownFormatting('- Item 1');
// Result: 'Item 1'

Complex Markdown

const input = `# Title
- Item 1
- Item 2
This is **bold** and *italic*`;

const expected = `Title
Item 1
Item 2
This is bold and italic`;

removeMarkdownFormatting(input);
// Result: expected
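For readers curious how this kind of stripping can work, the sketch below handles only the four constructs shown above (headings, list bullets, bold, italic). stripMarkdownSketch is a hypothetical name and the regular expressions are assumptions; the real removeMarkdownFormatting may cover more syntax and is not reproduced here.

```typescript
// Hypothetical sketch covering only headings, bullets, bold and italic.
function stripMarkdownSketch(text: string): string {
    return text
        .replace(/^#{1,6}[ \t]+/gm, '')      // heading markers at the start of a line
        .replace(/^[ \t]*[-+*][ \t]+/gm, '') // list bullets at the start of a line
        .replace(/\*\*(.+?)\*\*/g, '$1')     // bold, handled before italic
        .replace(/\*(.+?)\*/g, '$1');        // italic
}

const sample = '# Title\n- Item 1\n- Item 2\nThis is **bold** and *italic*';
console.log(stripMarkdownSketch(sample));
```

Bullets are removed before emphasis so that a leading "* " list marker is not mistaken for the start of an italic span.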

Text truncation

Simple truncation

import { truncate } from 'bitaboom';

truncate('123456', 5);
// Result: '1234…'

truncate('test'); // Default max: 150
// Result: 'test'
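The results above suggest that the ellipsis counts toward the maximum length. A minimal sketch of that rule follows; truncateSketch is a hypothetical stand-in written for illustration, and only the default of 150 is taken from the example above.

```typescript
// Hypothetical stand-in for truncate; the ellipsis occupies the last allowed position.
function truncateSketch(text: string, maxLength = 150): string {
    if (text.length <= maxLength) {
        return text; // short enough, nothing to cut
    }
    return text.slice(0, maxLength - 1) + '…';
}

console.log(truncateSketch('123456', 5)); // '1234…'
console.log(truncateSketch('test'));      // 'test'
```

Lengths here are measured in UTF-16 code units, which is an assumption of this sketch rather than a documented property of the library.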

Middle truncation

Preserve both start and end of text:

import { truncateMiddle } from 'bitaboom';

truncateMiddle('The quick brown fox jumps right over the lazy dog', 20);
// Result: 'The quick bro…zy dog'

truncateMiddle('The quick brown fox jumps right over the lazy dog', 25, 8);
// Result: 'The quick brown …lazy dog'

The end length can be left at its default or set explicitly:

// End length defaults to 1/3 of max length (minimum 3)
truncateMiddle('abcdefghijklmnopqrstuvwxyz', 15);
// Result: 'abcdefghi…vwxyz'

// Preserve exactly 3 characters at the end
truncateMiddle('abcdefghijklmnopqrstuvwxyz', 10, 3);
// Result: 'abcdef…xyz'
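The arithmetic behind these results can be sketched as follows: reserve the end length, reserve one position for the ellipsis, and give the rest to the start. truncateMiddleSketch is a hypothetical reimplementation that reproduces the four results shown in this section; it is not the library's code, and the maximum length is a required parameter here because the examples above do not show a default for it.

```typescript
// Hypothetical reimplementation for illustration only.
function truncateMiddleSketch(text: string, maxLength: number, endLength?: number): string {
    if (text.length <= maxLength) {
        return text;
    }
    // Default end length: one third of the maximum, but never fewer than 3 characters
    const end = endLength ?? Math.max(3, Math.floor(maxLength / 3));
    // One position is reserved for the ellipsis
    const start = maxLength - end - 1;
    return text.slice(0, start) + '…' + text.slice(-end);
}

console.log(truncateMiddleSketch('abcdefghijklmnopqrstuvwxyz', 15));    // 'abcdefghi…vwxyz'
console.log(truncateMiddleSketch('abcdefghijklmnopqrstuvwxyz', 10, 3)); // 'abcdef…xyz'
```

With a maximum of 15, the end gets 5 characters, the ellipsis takes 1, and the start gets the remaining 9.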

URL removal

import { removeUrls } from 'bitaboom';

removeUrls('It should remove both https://abc.com and http://google.com from this');
// Result: 'It should remove both  and  from this'
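Note that the surrounding spaces are kept, which is why the result contains double spaces. The sketch below shows one simple way to get the same output; removeUrlsSketch and its pattern are assumptions for illustration and may differ from what removeUrls actually matches.

```typescript
// Hypothetical sketch: drop anything that starts with http:// or https://
// and runs until the next whitespace character.
function removeUrlsSketch(text: string): string {
    return text.replace(/https?:\/\/\S+/g, '');
}

console.log(removeUrlsSketch('It should remove both https://abc.com and http://google.com from this'));
// 'It should remove both  and  from this'
```

A pattern this simple also swallows punctuation that directly follows a URL, such as a closing period, so real text may need a stricter expression.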

Cleaning escaped spaces

Useful for processing file paths:

import { unescapeSpaces } from 'bitaboom';

unescapeSpaces('My\\ Folder\\ Name');
// Result: 'My Folder Name'

unescapeSpaces('  /path/to/My\\ Document.txt  ');
// Result: '/path/to/My Document.txt'
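Both results are consistent with replacing each backslash-space pair with a plain space and then trimming the ends. A minimal sketch under that assumption is shown below; unescapeSpacesSketch is a hypothetical name, not the library function.

```typescript
// Hypothetical sketch: turn "\ " into " " and trim surrounding whitespace.
function unescapeSpacesSketch(text: string): string {
    return text.replace(/\\ /g, ' ').trim();
}

// In source code the backslash itself must be escaped, hence '\\ '
console.log(unescapeSpacesSketch('My\\ Folder\\ Name'));               // 'My Folder Name'
console.log(unescapeSpacesSketch('  /path/to/My\\ Document.txt  '));   // '/path/to/My Document.txt'
```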

Diacritic-insensitive pattern generation

Create regex patterns that match Arabic text regardless of diacritics:

import { makeDiacriticInsensitive } from 'bitaboom';

const pattern = makeDiacriticInsensitive('مرحبا');
const regex = new RegExp(pattern);

regex.test('مرحبا'); // true (original)
regex.test('مرحبأ'); // true (different alif variant)
regex.test('مَرْحَبَا'); // true (with diacritics)

Character equivalences

The pattern handles common Arabic character variants:

// Alif variants: ا, آ, أ, إ
const pattern1 = makeDiacriticInsensitive('ا');
const pattern2 = makeDiacriticInsensitive('آ');
// Both create: '[\u0627\u0622\u0623\u0625][\u064B...]*'

// Ta marbuta ↔ ha: ة ↔ ه
const pattern3 = makeDiacriticInsensitive('ة');
const pattern4 = makeDiacriticInsensitive('ه');
// Both create: '[\u0629\u0647][\u064B...]*'

// Ya variants: ى ↔ ي
const pattern5 = makeDiacriticInsensitive('ى');
const pattern6 = makeDiacriticInsensitive('ي');
// Both create: '[\u0649\u064A][\u064B...]*'
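The idea behind these patterns can be sketched in a few lines: replace each letter with a character class of its variants, then allow any number of diacritics after it. The code below is an illustration only. loosePatternSketch is a hypothetical name, the three groups are the ones listed above, and the diacritic range U+064B to U+0652 (tanwin, short vowels, shadda, sukun) is an assumption, since the exact range used by makeDiacriticInsensitive is elided in the comments above.

```typescript
// Hypothetical equivalence groups, copied from the variants listed above
const EQUIVALENCE_GROUPS = ['\u0627\u0622\u0623\u0625', '\u0629\u0647', '\u0649\u064A'];
// Assumption: U+064B..U+0652 are the marks that may follow any letter
const OPTIONAL_DIACRITICS = '[\\u064B-\\u0652]*';

function loosePatternSketch(text: string): string {
    // Drop diacritics from the input so they do not become required in the pattern
    const bare = text.normalize('NFC').replace(/[\u064B-\u0652]/g, '');
    return Array.from(bare)
        .map((char) => {
            const group = EQUIVALENCE_GROUPS.find((g) => g.includes(char));
            // Letters without a group are escaped so regex metacharacters stay literal
            const base = group ? `[${group}]` : char.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
            return base + OPTIONAL_DIACRITICS;
        })
        .join('');
}

const looseRegex = new RegExp(loosePatternSketch('مرحبا'));
console.log(looseRegex.test('مرحبا'));
console.log(looseRegex.test('مرحبأ'));
```

Because every member of a group maps to the same character class, it does not matter which variant appears in the search term.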

Comprehensive example

Combine multiple sanitization functions:

import {
  cleanSymbolsAndPartReferences,
  removeDeathYear,
  removeMarkdownFormatting,
  replaceSalutationsWithSymbol,
  unescapeSpaces,
  removeUrls
} from 'bitaboom';

function sanitizeScholarlyText(text: string): string {
  let result = text;
  
  // Remove URLs
  result = removeUrls(result);
  
  // Clean escaped spaces
  result = unescapeSpaces(result);
  
  // Remove Markdown formatting
  result = removeMarkdownFormatting(result);
  
  // Replace salutations
  result = replaceSalutationsWithSymbol(result);
  
  // Remove death year references
  result = removeDeathYear(result);
  
  // Clean symbols and references
  result = cleanSymbolsAndPartReferences(result);
  
  return result.trim();
}

const rawText = `
# Scholar Biography

Imām al-Bukhārī (d. 256H) compiled the most authentic hadith collection.
The Prophet **sallallahu alayhi wasallam** said...

Reference: [1] See https://example.com for more.
`;

const cleaned = sanitizeScholarlyText(rawText);
console.log(cleaned);
// Output:
// Scholar Biography
// Imām al-Bukhārī compiled the most authentic hadith collection.
// The Prophet ﷺ said...
// Reference:  See  for more.

Whitespace normalization

Zero-width characters

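makeDiacriticInsensitive ignores zero-width joiners (U+200D) and zero-width non-joiners (U+200C), so text that contains them produces the same pattern as text without them: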

const textWithZWJ = 'مر\u200Dحبا';
const textWithZWNJ = 'مر\u200Cحبا';
const normalText = 'مرحبا';

const result1 = makeDiacriticInsensitive(textWithZWJ);
const result2 = makeDiacriticInsensitive(textWithZWNJ);
const result3 = makeDiacriticInsensitive(normalText);

// All three produce the same pattern
result1 === result3; // true
result2 === result3; // true

Collapsing whitespace

Runs of spaces are collapsed, so they produce the same pattern as a single space:

const result1 = makeDiacriticInsensitive('مرحبا   بكم');
const result2 = makeDiacriticInsensitive('مرحبا بكم');
// Both produce the same pattern (spaces collapsed)
result1 === result2; // true
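Both behaviors in this section amount to normalizing the input before a pattern is built. The sketch below shows one way to do that as a separate pre-processing step. normalizeForMatchingSketch is a hypothetical helper, and the choice to strip only U+200C and U+200D and to trim the ends is an assumption, not a description of the library's internals.

```typescript
// Hypothetical pre-processing step, not part of Bitaboom's API.
function normalizeForMatchingSketch(text: string): string {
    return text
        .normalize('NFC')              // canonical composition
        .replace(/[\u200C\u200D]/g, '') // remove ZWNJ and ZWJ
        .replace(/\s+/g, ' ')           // collapse whitespace runs into one space
        .trim();
}

console.log(normalizeForMatchingSketch('مر\u200Dحبا') === 'مرحبا');
console.log(normalizeForMatchingSketch('مرحبا   بكم') === 'مرحبا بكم');
```

Applying the same normalization to the text being searched, and not only to the search term, keeps both sides of the comparison consistent.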

Special characters handling

Escaping regex metacharacters

const result = makeDiacriticInsensitive('test.+*?');
// Special regex chars are escaped: 'test\\.\\+\\*\\?'
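Escaping matters because an unescaped "." or "+" in a search term would otherwise be interpreted as regex syntax. The helper below is the widely used escaping idiom for JavaScript regular expressions; escapeRegexSketch is a hypothetical name, and whether Bitaboom escapes exactly this set of characters is an assumption.

```typescript
// Common escaping idiom: prefix every regex metacharacter with a backslash.
function escapeRegexSketch(text: string): string {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

const escaped = escapeRegexSketch('test.+*?');
console.log(escaped); // test\.\+\*\?

// The escaped pattern only matches the literal characters
console.log(new RegExp(escaped).test('test.+*?')); // true
console.log(new RegExp(escaped).test('testing'));  // false
```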

Mixed content

const result = makeDiacriticInsensitive('hello مرحبا');
// Pattern handles both Latin and Arabic characters
const regex = new RegExp(result);

regex.test('hello مرحبا'); // true
regex.test('hello مَرْحَبَا'); // true

All sanitization functions handle edge cases such as empty strings and null values, and apply Unicode normalization (NFC) to ensure consistent results.

Performance tips

Chain operations efficiently: Apply the most selective filters first
Use specialized functions: stripAllDigits is faster than a generic regex for removing numbers
Cache regex patterns: When using makeDiacriticInsensitive, cache the resulting regex for reuse
Consider input size: Functions like truncate and truncateMiddle are useful for limiting processing overhead

// Good: Cache the regex
const searchRegex = new RegExp(makeDiacriticInsensitive('search term'));
largeArray.filter(item => searchRegex.test(item));

// Less efficient: Recreate regex each time
largeArray.filter(item => 
  new RegExp(makeDiacriticInsensitive('search term')).test(item)
);
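When many different search terms are in play, the caching advice can be taken one step further with a small lookup table. The sketch below is an assumption-laden illustration: getCachedRegexSketch and regexCache are hypothetical names, and the pattern builder is passed in as a parameter so the block runs on its own. In real code that parameter would be makeDiacriticInsensitive.

```typescript
// Hypothetical cache keyed by search term; assumes one pattern builder is used throughout.
const regexCache = new Map<string, RegExp>();

function getCachedRegexSketch(term: string, buildPattern: (term: string) => string): RegExp {
    let regex = regexCache.get(term);
    if (!regex) {
        // No "g" flag: test() on a global regex is stateful and unsafe to share
        regex = new RegExp(buildPattern(term));
        regexCache.set(term, regex);
    }
    return regex;
}

// Stand-in builder that counts how often a pattern is actually generated
let buildCount = 0;
const countingBuilder = (term: string): string => {
    buildCount += 1;
    return term.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
};

const items = ['search term one', 'other', 'a search term'];
const hits = items.filter((item) => getCachedRegexSketch('search term', countingBuilder).test(item));
console.log(hits.length, buildCount); // 2 1
```

The pattern is built once even though the lookup runs for every item in the array.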
