Tokenization is the process of breaking text into individual tokens (words, numbers, or symbols). bun_nltk provides several tokenizers optimized for different use cases.
Available Tokenizers
ASCII Tokenizer
The fastest tokenizer for ASCII text. It automatically lowercases tokens.
import { tokenizeAsciiNative } from "bun_nltk";
const text = "Hello World! This is a test.";
const tokens = tokenizeAsciiNative(text);
// ["hello", "world", "this", "is", "a", "test"]
Key Features:
- High-performance SIMD implementation
- Matches pattern:
[A-Za-z0-9']+
- Automatically converts to lowercase
- Returns
string[]
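To make the documented behavior concrete, here is a plain-TypeScript approximation of what the ASCII tokenizer does: match runs of the documented pattern and lowercase each match. This is an illustrative sketch only; the real tokenizeAsciiNative uses a native SIMD implementation and will be much faster.

```typescript
// Illustrative sketch of the documented ASCII tokenizer behavior:
// match [A-Za-z0-9']+ and lowercase each token. Not the real
// SIMD implementation behind tokenizeAsciiNative.
function asciiTokenizeSketch(text: string): string[] {
  const matches = text.match(/[A-Za-z0-9']+/g) ?? [];
  return matches.map((token) => token.toLowerCase());
}

console.log(asciiTokenizeSketch("Hello World! This is a test."));
// ["hello", "world", "this", "is", "a", "test"]
```

Note that because the pattern includes the apostrophe, a contraction like "don't" stays a single (lowercased) token here; use the word tokenizer below when contractions need to be split.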
Word Tokenizer
Handles contractions and clitics like "n't", "'s", "'ll", etc.
import { wordTokenizeSubset } from "bun_nltk";
const text = "I can't believe it's working!";
const tokens = wordTokenizeSubset(text);
// ["I", "ca", "n't", "believe", "it", "'s", "working", "!"]
Signature:
function wordTokenizeSubset(text: string): string[]
Contraction Handling:
n't → Splits as a separate token (e.g., "can't" → "ca", "n't")
's, 'm, 'd, 're, 've, 'll → Splits and lowercases (e.g., "it's" → "it", "'s")
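The two rules above can be sketched with a single regex. This is a simplified approximation of the documented splitting rules, not the library's actual algorithm; the helper name splitContraction is hypothetical.

```typescript
// Simplified sketch of the documented contraction rules: peel a trailing
// "n't" or clitic ('s, 'm, 'd, 're, 've, 'll) off into its own token,
// lowercasing the clitic. Not the library's actual implementation.
function splitContraction(word: string): string[] {
  const m = word.match(/^(.+?)(n't|'(?:s|m|d|re|ve|ll))$/i);
  if (!m) return [word]; // no contraction suffix: leave the word intact
  return [m[1], m[2].toLowerCase()];
}

console.log(splitContraction("can't")); // ["ca", "n't"]
console.log(splitContraction("It's")); // ["It", "'s"]
console.log(splitContraction("word")); // ["word"]
```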
Basic Usage
import { wordTokenizeSubset } from "bun_nltk";
const text = "They'll be there soon.";
const tokens = wordTokenizeSubset(text);
console.log(tokens);
// ["They", "'ll", "be", "there", "soon", "."]
Handling Contractions
const examples = [
"I'm happy", // ["I", "'m", "happy"]
"don't worry", // ["do", "n't", "worry"]
"we've done it", // ["we", "'ve", "done", "it"]
];
for (const text of examples) {
console.log(wordTokenizeSubset(text));
}
Tweet Tokenizer
Specialized tokenizer for social media text with support for hashtags, mentions, URLs, and emojis.
import { tweetTokenizeSubset } from "bun_nltk";
const tweet = "Check out @bun_nltk! #NLP https://example.com 🚀";
const tokens = tweetTokenizeSubset(tweet);
// ["Check", "out", "@bun_nltk", "!", "#NLP", "https://example.com", "🚀"]
Signature:
type TweetTokenizerOptions = {
stripHandles?: boolean; // Remove @mentions (default: false)
reduceLen?: boolean; // Reduce repeated characters (default: false)
matchPhoneNumbers?: boolean; // Match phone numbers (default: true)
};
function tweetTokenizeSubset(
text: string,
options?: TweetTokenizerOptions
): string[]
Default Tweet Tokenization
import { tweetTokenizeSubset } from "bun_nltk";
const tweet = "@user Check this out! #awesome https://example.com";
const tokens = tweetTokenizeSubset(tweet);
console.log(tokens);
// ["@user", "Check", "this", "out", "!", "#awesome", "https://example.com"]
Strip Mentions
const tweet = "@alice @bob Hello everyone!";
const tokens = tweetTokenizeSubset(tweet, {
stripHandles: true
});
console.log(tokens);
// ["Hello", "everyone", "!"]
Reduce Character Repetition
const tweet = "Sooooo cooool!!!";
const tokens = tweetTokenizeSubset(tweet, {
reduceLen: true
});
console.log(tokens);
// ["Sooo", "coool", "!!!"]
// Note: Reduces to maximum 3 repetitions for alphabetic characters
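The reduceLen behavior described above can be approximated with a backreference regex that caps any run of a repeated alphabetic character at three. This is a sketch of the documented behavior, not the tokenizer's actual code.

```typescript
// Sketch of the documented reduceLen option: cap runs of the same
// alphabetic character at 3 repetitions. Non-alphabetic runs (e.g. "!!!")
// are left alone, matching the note above.
function reduceLengthening(text: string): string {
  return text.replace(/([A-Za-z])\1{2,}/g, "$1$1$1");
}

console.log(reduceLengthening("Sooooo cooool")); // "Sooo coool"
```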
Match Phone Numbers
const text = "Call me at 555-123-4567";
const tokens = tweetTokenizeSubset(text, {
matchPhoneNumbers: true
});
console.log(tokens);
// ["Call", "me", "at", "555-123-4567"]
Emoji Support
const tweet = "Great news! 🎉🎊 #celebration";
const tokens = tweetTokenizeSubset(tweet);
console.log(tokens);
// ["Great", "news", "!", "🎉", "🎊", "#celebration"]
Pattern Matching
Tweet Tokenizer Patterns:
- URLs:
https?://\S+
- Hashtags:
#[\w_]+
- Mentions:
@[\w_]+
- Words:
[A-Za-z0-9]+(?:'[A-Za-z0-9]+)?
- Phone Numbers: Various formats (when enabled)
- Emojis: Full Unicode emoji sequences
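As a rough illustration of how these patterns combine, the first four can be joined into a single alternation (URL first, so "https" is not consumed as a word). This sketch adds a catch-all for punctuation, which is an assumption on my part; the real tokenizer also handles phone numbers and full Unicode emoji sequences, which this regex does not.

```typescript
// Hypothetical sketch combining the documented patterns in priority order:
// URLs, hashtags, mentions, words-with-optional-clitic, then any single
// non-space punctuation character (the last alternative is an assumption).
const TWEET_PATTERN =
  /https?:\/\/\S+|#[\w_]+|@[\w_]+|[A-Za-z0-9]+(?:'[A-Za-z0-9]+)?|[^\s\w]/g;

function tweetTokenizeSketch(text: string): string[] {
  return text.match(TWEET_PATTERN) ?? [];
}

console.log(tweetTokenizeSketch("Check out @bun_nltk! #NLP https://example.com"));
// ["Check", "out", "@bun_nltk", "!", "#NLP", "https://example.com"]
```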
Common Use Cases
import { tweetTokenizeSubset } from "bun_nltk";
const tweets = [
"Loving #bunjs! So fast ⚡",
"@team Check out this amazing library",
"Download now: https://bun.sh"
];
const processedTweets = tweets.map(tweet =>
tweetTokenizeSubset(tweet, { stripHandles: true })
);
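A common follow-up to batch tokenization is building token frequency counts, e.g. to find trending hashtags. The sketch below uses hard-coded token lists to stay self-contained; in practice the input would come from tweetTokenizeSubset as shown above.

```typescript
// Count token frequencies across a batch of tokenized texts.
// Input is hard-coded here; in practice it would be the output of
// tweetTokenizeSubset mapped over a list of tweets.
function countTokens(tokenLists: string[][]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const tokens of tokenLists) {
    for (const token of tokens) {
      counts.set(token, (counts.get(token) ?? 0) + 1);
    }
  }
  return counts;
}

const counts = countTokens([
  ["Loving", "#bunjs", "!", "So", "fast"],
  ["#bunjs", "is", "fast"],
]);
console.log(counts.get("#bunjs")); // 2
console.log(counts.get("fast")); // 2
```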
Handling Contractions
import { wordTokenizeSubset } from "bun_nltk";
const sentences = [
"I'm going to the store.",
"They've been waiting for hours.",
"It's a beautiful day."
];
const tokenized = sentences.map(wordTokenizeSubset);
Batch Processing
import { tokenizeAsciiNative } from "bun_nltk";
const documents = [
"First document text",
"Second document text",
// ... thousands more
];
// Fastest option for ASCII text
const allTokens = documents.map(tokenizeAsciiNative);
Performance Comparison
- tokenizeAsciiNative: Fastest, uses SIMD optimizations
- wordTokenizeSubset: Moderate speed, handles contractions
- tweetTokenizeSubset: Slower but feature-rich for social media
Use tokenizeAsciiNative when you need maximum speed and don't require special handling of contractions or social media features.
All tokenizers automatically handle edge cases like multiple spaces, punctuation, and empty strings.