wordTokenizeSubset
Tokenize text into words with support for common English contractions.
The text to tokenize
Array of word tokens with contractions split appropriately
Contraction Handling
The tokenizer splits common contractions:
- n't → separate token (e.g., “can’t” → “ca”, “n’t”)
- 's, 'm, 'd, 're, 've, 'll → separate tokens (e.g., “it’s” → “it”, “’s”)
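The splitting rule above can be sketched as follows. This is a minimal illustration, not the library's implementation: `splitContraction` and `tokenizeWithContractions` are hypothetical names, and the sketch assumes whitespace-separated input with no punctuation handling.

```typescript
// Hypothetical helper: splits one word into [stem, clitic] when it ends
// in a common English contraction; otherwise returns the word unchanged.
function splitContraction(word: string): string[] {
  // Accept both the straight (') and the typographic (’) apostrophe.
  const negation = /^(.+)(n['’]t)$/i;
  const clitic = /^(.+)(['’](?:s|m|d|re|ve|ll))$/i;

  const match = negation.exec(word) || clitic.exec(word);
  return match ? [match[1], match[2]] : [word];
}

// Hypothetical wrapper: whitespace split followed by contraction splitting.
// Punctuation is left attached to words in this sketch.
function tokenizeWithContractions(text: string): string[] {
  const tokens: string[] = [];
  for (const word of text.split(/\s+/)) {
    if (word.length === 0) continue;
    tokens.push(...splitContraction(word));
  }
  return tokens;
}

// tokenizeWithContractions("I can't say it's done")
// returns ["I", "ca", "n't", "say", "it", "'s", "done"]
```

The negation pattern is tried first so that “can't” yields “ca” + “n't” rather than being left whole.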
tweetTokenizeSubset
Tokenize social media text with support for URLs, hashtags, mentions, and emojis.
The social media text to tokenize
Optional configuration for tweet tokenization
Remove @mentions from the output
Reduce runs of a repeated character to a maximum of 3 (e.g., “coooool” → “coool”)
Detect and tokenize phone numbers as single tokens
Array of tokens including URLs, hashtags, mentions, and emojis
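As a rough illustration of the mention-removal and repeated-character options, the two behaviors could look like the sketch below. The option names are not shown in this section, so `removeMentions` and `reduceRepeats` are hypothetical stand-ins, and the library's actual logic may differ.

```typescript
// Hypothetical helper: drop tokens that consist solely of an @mention.
function removeMentions(tokens: string[]): string[] {
  return tokens.filter((t) => !/^@\w+$/.test(t));
}

// Hypothetical helper: cap any run of the same character at three occurrences.
// A run of 3 or more identical characters is replaced by exactly 3.
function reduceRepeats(text: string): string {
  return text.replace(/(.)\1{2,}/g, "$1$1$1");
}

// reduceRepeats("coooool") returns "coool"
// removeMentions(["@alice", "hi"]) returns ["hi"]
```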
Detected Patterns
- URLs: http:// and https:// links
- Hashtags: #tag format
- Mentions: @username format
- Emojis: Unicode emoji sequences
- Phone numbers: Various formats (optional)
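One way to picture the pattern list above is a single alternation in which the more specific patterns come first. The sketch below is an assumption-laden illustration: `TWEET_PATTERN` and `sketchTweetTokenize` are hypothetical names, the expressions are not the library's, emojis are matched one code point at a time rather than as full sequences, and phone numbers are omitted.

```typescript
// Illustrative patterns only; the library's actual expressions may differ.
// Order matters: earlier alternatives win, so URLs are tried before words.
const TWEET_PATTERN = new RegExp(
  [
    "https?://\\S+",              // URLs (http:// and https://)
    "#\\w+",                      // hashtags
    "@\\w+",                      // mentions
    "\\p{Extended_Pictographic}", // a single emoji code point
    "\\w+(?:['’]\\w+)*",          // words, keeping inner apostrophes
    "[^\\s\\w]",                  // any other single non-space symbol
  ].join("|"),
  "gu"
);

// Hypothetical function: returns every match of the combined pattern in order.
function sketchTweetTokenize(text: string): string[] {
  return text.match(TWEET_PATTERN) || [];
}

// sketchTweetTokenize("Loving #TypeScript with @alice 😀 https://example.com/x?y=1 !")
// returns ["Loving", "#TypeScript", "with", "@alice", "😀", "https://example.com/x?y=1", "!"]
```

Because the URL pattern consumes everything up to the next whitespace, trailing punctuation directly attached to a link would be absorbed into the URL token in this sketch.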