Bitaboom provides comprehensive functions for cleaning and sanitizing text, particularly useful for processing Arabic documents, hadith collections, and scholarly works.
Removing symbols and references
Cleaning part references
Remove various symbols, part references, and numerical markers:
import { cleanSymbolsAndPartReferences } from 'bitaboom';
cleanSymbolsAndPartReferences('Another example [1] [1/2]');
// Result: 'Another example '
cleanSymbolsAndPartReferences('Part references 1/2 2/3/4');
// Result: 'Part references '
Removing trailing page numbers
import { cleanTrailingPageNumbers } from 'bitaboom';
cleanTrailingPageNumbers('This is some -[46]- text');
// Result: 'This is some text'
Removing single digit references
import { removeSingleDigitReferences } from 'bitaboom';
removeSingleDigitReferences('Ref (1), Ref «2», Ref [3]');
// Result: 'Ref , Ref , Ref '
Text normalization
Replacing line breaks with spaces
import { replaceLineBreaksWithSpaces } from 'bitaboom';
replaceLineBreaksWithSpaces('a\nb');
// Result: 'a b'
Stripping digits
import { stripAllDigits } from 'bitaboom';
stripAllDigits('abcd245');
// Result: 'abcd'
Removing death year references
import { removeDeathYear } from 'bitaboom';
removeDeathYear("Sufyān ibn 'Uyaynah (d. 198h) said:");
// Result: "Sufyān ibn 'Uyaynah said:"
removeDeathYear("Sufyān ibn 'Uyaynah [d. 200H] said:");
// Result: "Sufyān ibn 'Uyaynah said:"
The removeDeathYear function removes only the abbreviated forms (d. XXXh) and [d. XXXh]. It leaves the full word “died” untouched to avoid false positives.
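To make the abbreviated-only matching concrete, here is a minimal sketch of a pattern that behaves the same way on the examples above. The regex and the name removeDeathYearSketch are assumptions for illustration, not bitaboom's actual implementation.

```typescript
// Hypothetical sketch, not bitaboom's implementation.
// Matches "(d. 198h)" or "[d. 200H]" together with one optional leading space.
// The literal "d." is required, so "(died 198h)" is never matched.
const DEATH_YEAR_SKETCH = /\s?[([]d\.\s*\d{1,4}[hH]\s*[)\]]/g;

function removeDeathYearSketch(text: string): string {
    return text.replace(DEATH_YEAR_SKETCH, '');
}

const abbreviatedForm = removeDeathYearSketch("Sufyān ibn 'Uyaynah (d. 198h) said:");
// "Sufyān ibn 'Uyaynah said:"

const fullWordForm = removeDeathYearSketch('He (died 198h) said:');
// 'He (died 198h) said:' (unchanged)
```

Anchoring on the literal "d." keeps ordinary parenthesized prose intact, which is the false-positive concern the note describes.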
Removing Markdown formatting
Strip Markdown syntax while preserving the content:
import { removeMarkdownFormatting } from 'bitaboom';
removeMarkdownFormatting('This is **bold** text');
// Result: 'This is bold text'
removeMarkdownFormatting('This is *italic* text');
// Result: 'This is italic text'
removeMarkdownFormatting('# Header 1');
// Result: 'Header 1'
removeMarkdownFormatting('- Item 1');
// Result: 'Item 1'
Complex Markdown
const input = `# Title
- Item 1
- Item 2
This is **bold** and *italic*`;
const expected = `Title
Item 1
Item 2
This is bold and italic`;
removeMarkdownFormatting(input);
// Result: expected
Text truncation
Simple truncation
import { truncate } from 'bitaboom';
truncate('123456', 5);
// Result: '1234…'
truncate('test'); // Default max: 150
// Result: 'test'
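The results above suggest that the ellipsis counts toward the maximum length. A minimal sketch of that rule follows. truncateSketch is a hypothetical name, the code assumes maxLength is at least 1, and it counts UTF-16 code units rather than user-perceived characters.

```typescript
// Hypothetical sketch, not bitaboom's implementation.
function truncateSketch(text: string, maxLength = 150): string {
    if (text.length <= maxLength) {
        return text; // already short enough
    }
    // Reserve one position for the ellipsis so the result is exactly maxLength long.
    return text.slice(0, maxLength - 1) + '…';
}

const shortened = truncateSketch('123456', 5); // '1234…'
const untouched = truncateSketch('test'); // 'test'
```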
Middle truncation
Preserve both start and end of text:
import { truncateMiddle } from 'bitaboom';
truncateMiddle('The quick brown fox jumps right over the lazy dog', 20);
// Result: 'The quick bro…zy dog'
truncateMiddle('The quick brown fox jumps right over the lazy dog', 25, 8);
// Result: 'The quick brown …lazy dog'
Default behavior
// End length defaults to 1/3 of max length (minimum 3)
truncateMiddle('abcdefghijklmnopqrstuvwxyz', 15);
// Result: 'abcdefghi…vwxyz'
Custom end length
// Preserve exactly 3 characters at the end
truncateMiddle('abcdefghijklmnopqrstuvwxyz', 10, 3);
// Result: 'abcdef…xyz'
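The end-length rule (one third of the maximum, at least 3) can be sketched as below. This reproduces the four results shown in this section, but truncateMiddleSketch is a hypothetical name and the arithmetic is inferred from those results, not taken from the library source.

```typescript
// Hypothetical sketch, not bitaboom's implementation.
function truncateMiddleSketch(text: string, maxLength: number, endLength?: number): string {
    if (text.length <= maxLength) {
        return text;
    }
    // Default: keep a third of the budget for the tail, but never fewer than 3 characters.
    const end = endLength ?? Math.max(3, Math.floor(maxLength / 3));
    // One position is reserved for the ellipsis; the rest goes to the head.
    const start = Math.max(0, maxLength - end - 1);
    return text.slice(0, start) + '…' + text.slice(text.length - end);
}

const alphabet = 'abcdefghijklmnopqrstuvwxyz';
const defaultTail = truncateMiddleSketch(alphabet, 15); // 'abcdefghi…vwxyz'
const customTail = truncateMiddleSketch(alphabet, 10, 3); // 'abcdef…xyz'
```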
URL removal
import { removeUrls } from 'bitaboom';
removeUrls('It should remove both https://abc.com and http://google.com from this');
// Result: 'It should remove both and from this'
Cleaning escaped spaces
Useful for processing file paths:
import { unescapeSpaces } from 'bitaboom';
unescapeSpaces('My\\ Folder\\ Name');
// Result: 'My Folder Name'
unescapeSpaces(' /path/to/My\\ Document.txt ');
// Result: '/path/to/My Document.txt'
Diacritic-insensitive pattern generation
Create regex patterns that match Arabic text regardless of diacritics:
import { makeDiacriticInsensitive } from 'bitaboom';
const pattern = makeDiacriticInsensitive('مرحبا');
const regex = new RegExp(pattern);
regex.test('مرحبا'); // true (original)
regex.test('مرحبأ'); // true (different alif variant)
regex.test('مَرْحَبَا'); // true (with diacritics)
Character equivalences
The pattern handles common Arabic character variants:
// Alif variants: ا, آ, أ, إ
const pattern1 = makeDiacriticInsensitive('ا');
const pattern2 = makeDiacriticInsensitive('آ');
// Both create: '[\u0627\u0622\u0623\u0625][\u064B...]*'
// Ta marbuta ↔ ha: ة ↔ ه
const pattern3 = makeDiacriticInsensitive('ة');
const pattern4 = makeDiacriticInsensitive('ه');
// Both create: '[\u0629\u0647][\u064B...]*'
// Ya variants: ى ↔ ي
const pattern5 = makeDiacriticInsensitive('ى');
const pattern6 = makeDiacriticInsensitive('ي');
// Both create: '[\u0649\u064A][\u064B...]*'
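One way to picture how such a pattern could be assembled is sketched below: each character becomes either its equivalence class or an escaped literal, followed by an optional run of diacritics. Everything here is an assumption for illustration. The function name, the group table, and the diacritic range U+064B to U+0652 are not taken from bitaboom, and the library's real output differs (for example, in how it treats Latin text).

```typescript
// Hypothetical sketch: names and the diacritic range are assumptions,
// not bitaboom internals.
const EQUIVALENCE_GROUPS_SKETCH: string[] = [
    '\u0627\u0622\u0623\u0625', // alif variants
    '\u0629\u0647', // ta marbuta and ha
    '\u0649\u064A', // alif maqsura and ya
];

// Tanwin, short vowels, shadda and sukun (U+064B to U+0652), zero or more times.
const OPTIONAL_DIACRITICS_SKETCH = '[\u064B-\u0652]*';

function diacriticInsensitiveSketch(text: string): string {
    return text
        .replace(/[\u064B-\u0652]/g, '') // drop diacritics already present in the input
        .split('')
        .map((ch) => {
            const group = EQUIVALENCE_GROUPS_SKETCH.find((g) => g.indexOf(ch) !== -1);
            // Use the whole equivalence class, or the regex-escaped literal character.
            const base = group ? '[' + group + ']' : ch.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
            return base + OPTIONAL_DIACRITICS_SKETCH;
        })
        .join('');
}

// Pattern built from the unvocalized word (meem, ra, ha, ba, alif).
const greetingRegexSketch = new RegExp(diacriticInsensitiveSketch('\u0645\u0631\u062D\u0628\u0627'));
// Same word with fatha and sukun marks added.
const matchesVocalized = greetingRegexSketch.test('\u0645\u064E\u0631\u0652\u062D\u064E\u0628\u064E\u0627'); // true
// Same word ending in alif with hamza above instead of plain alif.
const matchesHamzaAlif = greetingRegexSketch.test('\u0645\u0631\u062D\u0628\u0623'); // true
```

Because every member of a group maps to the same class, any two variants of a letter yield identical patterns, which is the property the comments above describe.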
Comprehensive example
Combine multiple sanitization functions:
import {
    cleanSymbolsAndPartReferences,
    removeDeathYear,
    removeMarkdownFormatting,
    replaceSalutationsWithSymbol,
    unescapeSpaces,
    removeUrls
} from 'bitaboom';
function sanitizeScholarlyText(text: string): string {
    let result = text;
    // Remove URLs
    result = removeUrls(result);
    // Clean escaped spaces
    result = unescapeSpaces(result);
    // Remove Markdown formatting
    result = removeMarkdownFormatting(result);
    // Replace salutations
    result = replaceSalutationsWithSymbol(result);
    // Remove death year references
    result = removeDeathYear(result);
    // Clean symbols and references
    result = cleanSymbolsAndPartReferences(result);
    return result.trim();
}
const rawText = `
# Scholar Biography
Imām al-Bukhārī (d. 256H) compiled the most authentic hadith collection.
The Prophet **sallallahu alayhi wasallam** said...
Reference: [1] See https://example.com for more.
`;
const cleaned = sanitizeScholarlyText(rawText);
console.log(cleaned);
// Output:
// Scholar Biography
// Imām al-Bukhārī compiled the most authentic hadith collection.
// The Prophet ﷺ said...
// Reference: See for more.
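When the list of steps grows, the same step-by-step reassignment can be expressed as a small composition helper. The sketch below is self-contained and uses stand-in sanitizers in place of the bitaboom functions. composeSanitizers and the stand-ins are hypothetical names, and any function of type (text: string) => string could be passed in their place.

```typescript
// Hypothetical helper, not part of bitaboom.
type Sanitizer = (text: string) => string;

// Runs each step in order, feeding the output of one into the next.
function composeSanitizers(...steps: Sanitizer[]): Sanitizer {
    return (text: string): string => steps.reduce((acc, step) => step(acc), text);
}

// Stand-ins so the sketch runs without the library installed.
const stripUrlsStandIn: Sanitizer = (text) => text.replace(/https?:\/\/\S+/g, '');
const collapseSpacesStandIn: Sanitizer = (text) => text.replace(/ {2,}/g, ' ');
const trimStandIn: Sanitizer = (text) => text.trim();

const sanitizeSketch = composeSanitizers(stripUrlsStandIn, collapseSpacesStandIn, trimStandIn);
const cleanedSketch = sanitizeSketch(' See https://example.com for more. ');
// 'See for more.'
```

Order still matters in a composed pipeline: here the URL is removed before spaces are collapsed, so the gap it leaves behind is cleaned up by the following step.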
Whitespace normalization
Zero-width characters
makeDiacriticInsensitive automatically ignores zero-width joiners (U+200D) and zero-width non-joiners (U+200C):
const textWithZWJ = 'مر\u200Dحبا';
const textWithZWNJ = 'مر\u200Cحبا';
const normalText = 'مرحبا';
const result1 = makeDiacriticInsensitive(textWithZWJ);
const result2 = makeDiacriticInsensitive(textWithZWNJ);
const result3 = makeDiacriticInsensitive(normalText);
// All three produce the same pattern
result1 === result3; // true
result2 === result3; // true
Collapsing whitespace
Runs of multiple spaces are collapsed automatically:
const result1 = makeDiacriticInsensitive('مرحبا بكم');
const result2 = makeDiacriticInsensitive('مرحبا   بكم');
// Both produce the same pattern (spaces collapsed)
result1 === result2; // true
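The zero-width and whitespace behavior can be thought of as a normalization pass that runs before the pattern is built. The following is a minimal sketch of such a pass under that assumption. normalizeForPatternSketch is a hypothetical name, and whether the library performs these steps in this form or order is not confirmed by the documentation.

```typescript
// Hypothetical sketch, not bitaboom's implementation.
function normalizeForPatternSketch(text: string): string {
    return text
        .normalize('NFC') // canonical composition, so equivalent sequences compare equal
        .replace(/[\u200C\u200D]/g, '') // drop zero-width non-joiner and joiner
        .replace(/\s+/g, ' '); // collapse any whitespace run to a single space
}

// The same word (meem, ra, ha, ba, alif) with and without a zero-width joiner.
const withJoiner = normalizeForPatternSketch('\u0645\u0631\u200D\u062D\u0628\u0627');
const withoutJoiner = normalizeForPatternSketch('\u0645\u0631\u062D\u0628\u0627');
const joinerIgnored = withJoiner === withoutJoiner; // true

const spacesCollapsed = normalizeForPatternSketch('a   b') === normalizeForPatternSketch('a b'); // true
```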
Special characters handling
const result = makeDiacriticInsensitive('test.+*?');
// Special regex chars are escaped: 'test\\.\\+\\*\\?'
Mixed content
const result = makeDiacriticInsensitive('hello مرحبا');
// Pattern handles both Latin and Arabic characters
const regex = new RegExp(result);
regex.test('hello مرحبا'); // true
regex.test('hello مَرْحَبَا'); // true
All sanitization functions account for edge cases such as empty strings, null values, and Unicode normalization (NFC), so results stay consistent.
- Chain operations efficiently: apply the most selective filters first.
- Use specialized functions: stripAllDigits is faster than a generic regex for removing numbers.
- Cache regex patterns: when using makeDiacriticInsensitive, cache the resulting regex for reuse.
- Consider input size: functions like truncate and truncateMiddle are useful for limiting processing overhead.
// Good: Cache the regex
const searchRegex = new RegExp(makeDiacriticInsensitive('search term'));
largeArray.filter(item => searchRegex.test(item));
// Less efficient: Recreate regex each time
largeArray.filter(item =>
    new RegExp(makeDiacriticInsensitive('search term')).test(item)
);
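When search terms vary at runtime, the caching advice can be extended to a small lookup keyed by term. The sketch below is one possible shape. cachedRegexSketch and the stand-in builder are hypothetical, and in real use the builder argument would be makeDiacriticInsensitive.

```typescript
// Hypothetical helper, not part of bitaboom.
const patternCacheSketch = new Map<string, RegExp>();

function cachedRegexSketch(term: string, buildPattern: (term: string) => string): RegExp {
    let regex = patternCacheSketch.get(term);
    if (regex === undefined) {
        // Build the pattern once per distinct term, then reuse the compiled regex.
        regex = new RegExp(buildPattern(term));
        patternCacheSketch.set(term, regex);
    }
    return regex;
}

// Stand-in builder that only escapes regex metacharacters and counts its calls.
let buildCountSketch = 0;
const escapeOnlyBuilder = (term: string): string => {
    buildCountSketch += 1;
    return term.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
};

const firstLookup = cachedRegexSketch('search term', escapeOnlyBuilder);
const secondLookup = cachedRegexSketch('search term', escapeOnlyBuilder);
// firstLookup === secondLookup, and the builder ran only once
```

The cache is keyed by term alone, so this sketch assumes a single builder is used throughout. The cached regexes carry no global flag, which keeps repeated test calls free of lastIndex state.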