Introduction to bun_nltk
bun_nltk is a high-performance NLP library that brings native Zig code to JavaScript runtimes. It provides essential natural language processing primitives that outperform Python’s NLTK in the benchmarks below while keeping a simple, modern API.
Why bun_nltk?
Traditional JavaScript NLP libraries struggle with performance on large-scale text processing tasks. bun_nltk addresses this with:
- Native Zig Performance: Core operations run in compiled Zig code, measured at roughly 3.6x to 643x faster than Python NLTK in the benchmarks below
- WASM Fallback: Runs on a WebAssembly runtime when native binaries aren’t available
- Zero Dependencies: No complex installation or build steps - just install and use
- Modern API: Clean TypeScript interfaces with full type safety
Key features
Tokenization
Word and sentence tokenization with PTB-style contractions, tweet tokenization, and trainable Punkt models
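
To make "PTB-style contractions" concrete, here is a minimal standalone sketch in plain TypeScript. It is not bun_nltk's API or implementation; the `tokenizeWords` function and its regex are assumptions written for illustration, and the sketch only handles ASCII apostrophes and a handful of English clitics.

```typescript
// Illustrative only: a simplified Penn Treebank-style word tokenizer.
// `tokenizeWords` is a hypothetical helper, not part of bun_nltk.
function tokenizeWords(text: string): string[] {
  const tokens: string[] = [];
  // Words (optionally with one internal apostrophe), numbers, or single punctuation marks.
  const pattern = /[A-Za-z]+(?:'[A-Za-z]+)?|\d+(?:\.\d+)?|[^\sA-Za-z\d]/g;
  const clitics = ["'s", "'re", "'ve", "'ll", "'d", "'m"];

  for (const match of text.match(pattern) ?? []) {
    const lower = match.toLowerCase();
    // Negative contractions split before "n't": "don't" -> "do", "n't".
    if (lower.length > 3 && lower.endsWith("n't")) {
      tokens.push(match.slice(0, -3), match.slice(-3));
      continue;
    }
    // Other clitics split at the apostrophe: "it's" -> "it", "'s".
    const clitic = clitics.find((suffix) => lower.endsWith(suffix));
    if (clitic !== undefined && lower.length > clitic.length) {
      tokens.push(match.slice(0, -clitic.length), match.slice(-clitic.length));
      continue;
    }
    tokens.push(match);
  }
  return tokens;
}

console.log(tokenizeWords("Don't panic, it's fine."));
// [ "Do", "n't", "panic", ",", "it", "'s", "fine", "." ]
```
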
Text analysis
Token counting, n-gram generation, frequency distributions, PMI collocation scoring, and SIMD-accelerated operations
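
The ideas behind n-gram generation, frequency counting, and PMI scoring can be sketched in a few lines of plain TypeScript. This is not bun_nltk's API; `ngrams`, `countBy`, and `bigramPmi` are hypothetical helpers, and the PMI here uses simple maximum-likelihood estimates, PMI(x, y) = log2(P(x, y) / (P(x) P(y))).

```typescript
// Illustrative only: these helpers are hypothetical and not part of bun_nltk.

// All contiguous windows of n tokens.
function ngrams(tokens: string[], n: number): string[][] {
  const out: string[][] = [];
  for (let i = 0; i + n <= tokens.length; i++) {
    out.push(tokens.slice(i, i + n));
  }
  return out;
}

// Frequency distribution: item -> number of occurrences.
function countBy(items: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const item of items) {
    counts.set(item, (counts.get(item) ?? 0) + 1);
  }
  return counts;
}

// Pointwise mutual information of the bigram (x, y), in bits.
// Bigrams are keyed by joining with a space, so tokens must not contain spaces.
// Returns -Infinity for an unseen bigram and NaN if x or y never occurs.
function bigramPmi(tokens: string[], x: string, y: string): number {
  const unigrams = countBy(tokens);
  const bigrams = countBy(ngrams(tokens, 2).map((gram) => gram.join(" ")));
  const pXY = (bigrams.get(`${x} ${y}`) ?? 0) / (tokens.length - 1);
  const pX = (unigrams.get(x) ?? 0) / tokens.length;
  const pY = (unigrams.get(y) ?? 0) / tokens.length;
  return Math.log2(pXY / (pX * pY));
}

const sample = ["new", "york", "is", "new", "york"];
console.log(ngrams(sample, 2).length);          // 4
console.log(bigramPmi(sample, "new", "york"));  // log2(3.125), about 1.64
```
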
POS tagging
Part-of-speech tagging with perceptron models and regex-based heuristic taggers
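
A regex-based heuristic tagger generally works by trying an ordered list of patterns and assigning the tag of the first one that matches. The sketch below shows that idea in plain TypeScript; the rule list and the `regexTag` function are assumptions for illustration and do not reflect bun_nltk's actual rules or API. Tags follow the Penn Treebank tag set.

```typescript
// Illustrative only: a hypothetical first-match-wins regex tagger, not bun_nltk's.
const tagRules: Array<[RegExp, string]> = [
  [/^\d+(\.\d+)?$/, "CD"], // cardinal numbers
  [/ing$/, "VBG"],         // gerunds / present participles
  [/ed$/, "VBD"],          // past-tense verbs
  [/ly$/, "RB"],           // adverbs
  [/s$/, "NNS"],           // plural nouns
];

function regexTag(tokens: string[]): Array<[string, string]> {
  return tokens.map((token): [string, string] => {
    // Rules are checked in order; anything unmatched defaults to a singular noun.
    const rule = tagRules.find(([pattern]) => pattern.test(token));
    return [token, rule ? rule[1] : "NN"];
  });
}

console.log(regexTag(["running", "quickly", "42", "cats", "dog"]));
// [ ["running","VBG"], ["quickly","RB"], ["42","CD"], ["cats","NNS"], ["dog","NN"] ]
```
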
Text classification
Naive Bayes, decision trees, logistic regression, and linear SVM classifiers with sparse feature vectorization
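
As a rough picture of what a Naive Bayes text classifier does, here is a minimal multinomial Naive Bayes with add-one smoothing in plain TypeScript. It is a sketch under stated assumptions, not bun_nltk's classifier or API: the `LabeledDoc` type and `trainNaiveBayes` function are hypothetical names, and per-label token counts stored in maps stand in for sparse feature vectors.

```typescript
// Illustrative only: hypothetical names, not part of bun_nltk.
type LabeledDoc = { tokens: string[]; label: string };

function trainNaiveBayes(docs: LabeledDoc[]): (tokens: string[]) => string {
  const docsPerLabel = new Map<string, number>();
  const tokenCounts = new Map<string, Map<string, number>>(); // label -> token -> count
  const tokensPerLabel = new Map<string, number>();
  const vocab = new Set<string>();

  for (const { tokens, label } of docs) {
    docsPerLabel.set(label, (docsPerLabel.get(label) ?? 0) + 1);
    const counts = tokenCounts.get(label) ?? new Map<string, number>();
    tokenCounts.set(label, counts);
    for (const token of tokens) {
      counts.set(token, (counts.get(token) ?? 0) + 1);
      tokensPerLabel.set(label, (tokensPerLabel.get(label) ?? 0) + 1);
      vocab.add(token);
    }
  }

  // Returns the label with the highest log posterior; "" if there is no training data.
  return (tokens: string[]): string => {
    let best = "";
    let bestScore = -Infinity;
    docsPerLabel.forEach((docCount, label) => {
      const counts = tokenCounts.get(label) ?? new Map<string, number>();
      const total = tokensPerLabel.get(label) ?? 0;
      // log prior + sum of add-one-smoothed log likelihoods
      let score = Math.log(docCount / docs.length);
      for (const token of tokens) {
        score += Math.log(((counts.get(token) ?? 0) + 1) / (total + vocab.size));
      }
      if (score > bestScore) {
        bestScore = score;
        best = label;
      }
    });
    return best;
  };
}

const classify = trainNaiveBayes([
  { tokens: ["great", "fun"], label: "pos" },
  { tokens: ["awful", "boring"], label: "neg" },
]);
console.log(classify(["great"])); // "pos"
```
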
Parsing
CFG chart parser, PCFG probabilistic parser, Earley parser, and dependency parser with grammar support
Language models
N-gram language models with MLE, Lidstone, and Kneser-Ney interpolation smoothing
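
Lidstone smoothing adds a constant γ to every count so that unseen n-grams get non-zero probability: P(w | h) = (c(h, w) + γ) / (c(h) + γ·V), where V is the vocabulary size. With γ = 0 this reduces to the MLE estimate. The sketch below computes it for bigrams in plain TypeScript; `lidstoneBigramProb` is a hypothetical helper for illustration, not bun_nltk's API.

```typescript
// Illustrative only: a hypothetical helper, not part of bun_nltk.
// Estimates P(word | prev) from a token sequence with Lidstone smoothing.
function lidstoneBigramProb(
  tokens: string[],
  prev: string,
  word: string,
  gamma: number,
): number {
  const vocabSize = new Set(tokens).size;
  let contextCount = 0; // times `prev` occurs with a following token
  let pairCount = 0;    // times `prev` is immediately followed by `word`
  for (let i = 0; i + 1 < tokens.length; i++) {
    if (tokens[i] === prev) {
      contextCount++;
      if (tokens[i + 1] === word) pairCount++;
    }
  }
  // With gamma = 0 and an unseen context this is 0 / 0, i.e. NaN.
  return (pairCount + gamma) / (contextCount + gamma * vocabSize);
}

const corpus = ["the", "cat", "sat", "on", "the", "mat"];
console.log(lidstoneBigramProb(corpus, "the", "cat", 0)); // 0.5 (MLE)
console.log(lidstoneBigramProb(corpus, "the", "cat", 1)); // 2/7
console.log(lidstoneBigramProb(corpus, "the", "sat", 1)); // 1/7 (unseen bigram)
```
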
WordNet integration
Synset lookup, relation traversal, and morphy-style inflection recovery with packed binary format
Corpus utilities
Corpus reader framework with support for Brown, CoNLL formats, and external corpus bundles
Performance benchmarks
bun_nltk is faster than Python NLTK on every benchmarked operation:
| Operation | bun_nltk | Python NLTK | Speedup |
|---|---|---|---|
| Token + n-gram counting | 2.77s | 10.07s | 3.64x |
| PMI collocations | 2.09s | 23.95s | 11.46x |
| Porter stemming | 11.94s | 120.10s | 10.06x |
| Punkt tokenizer | 0.08s | 1.35s | 15.87x |
| Chunk parser | 0.002s | 1.55s | 643x |
| WordNet lookup | 0.001s | 0.08s | 91.55x |
| Earley parser | 0.11s | 4.65s | 40.47x |
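
The Speedup column can be read as the Python NLTK time divided by the bun_nltk time. The short sketch below (the `speedup` helper is hypothetical, written only for illustration) recomputes that ratio for the first three rows, where the listed timings reproduce the listed speedup to two decimals. Recomputing from the displayed timings does not reproduce the remaining rows exactly, presumably because those very small timings are shown rounded.

```typescript
// Illustrative only: speedup = baseline time / measured time.
function speedup(baselineSeconds: number, measuredSeconds: number): number {
  return baselineSeconds / measuredSeconds;
}

console.log(speedup(10.07, 2.77).toFixed(2));   // "3.64"  (token + n-gram counting)
console.log(speedup(23.95, 2.09).toFixed(2));   // "11.46" (PMI collocations)
console.log(speedup(120.10, 11.94).toFixed(2)); // "10.06" (Porter stemming)
```
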
Supported platforms
bun_nltk ships with prebuilt native binaries for:
- Linux x64
- Windows x64
- macOS (arm64, x64)
Through the WASM fallback, it also runs in:
- Browser environments
- Any JavaScript runtime with WASM support
Get started
Installation
Install bun_nltk with npm, bun, or pnpm
Quickstart
Run your first NLP operations in minutes
API reference
Explore the complete API documentation
Community and support
- GitHub: Seyamalam/bun_nltk
- Issues: Report bugs or request features
- License: Apache 2.0