Skip to main content

Introduction to bun_nltk

bun_nltk is a high-performance NLP library that brings the power of native Zig code to JavaScript runtimes. Built for speed and efficiency, it provides essential natural language processing primitives with performance that rivals Python’s NLTK while maintaining a simple, modern API.

Why bun_nltk?

Traditional JavaScript NLP libraries struggle with performance on large-scale text processing tasks. bun_nltk solves this by leveraging:
  • Native Zig Performance: Core operations run in compiled Zig code, delivering 3-643x faster performance than Python NLTK
  • WASM Fallback: Works everywhere with WebAssembly runtime when native binaries aren’t available
  • Zero Dependencies: No complex installation or build steps - just install and use
  • Modern API: Clean TypeScript interfaces with full type safety

Key features

Tokenization

Word and sentence tokenization with PTB-style contractions, tweet tokenization, and trainable Punkt models

Text analysis

Token counting, n-gram generation, frequency distributions, PMI collocation scoring, and SIMD-accelerated operations

POS tagging

Part-of-speech tagging with perceptron models and regex-based heuristic taggers

Text classification

Naive Bayes, decision trees, logistic regression, and linear SVM classifiers with sparse feature vectorization

Parsing

CFG chart parser, PCFG probabilistic parser, Earley parser, and dependency parser with grammar support

Language models

N-gram language models with MLE, Lidstone, and Kneser-Ney interpolation smoothing

WordNet integration

Synset lookup, relation traversal, and morphy-style inflection recovery with packed binary format

Corpus utilities

Corpus reader framework with support for Brown, CoNLL formats, and external corpus bundles

Performance benchmarks

bun_nltk delivers exceptional performance across all operations:
Operationbun_nltkPython NLTKSpeedup
Token + n-gram counting2.77s10.07s3.64x
PMI collocations2.09s23.95s11.46x
Porter stemming11.94s120.10s10.06x
Punkt tokenizer0.08s1.35s15.87x
Chunk parser0.002s1.55s643x
WordNet lookup0.001s0.08s91.55x
Earley parser0.11s4.65s40.47x
Benchmarks run on 8-64MB synthetic datasets. See README for full methodology.

Supported platforms

bun_nltk ships with prebuilt native binaries for:
  • Linux x64
  • Windows x64
All platforms also support the WASM runtime as a fallback, including:
  • macOS (arm64, x64)
  • Browser environments
  • Any JavaScript runtime with WASM support

Get started

Installation

Install bun_nltk with npm, bun, or pnpm

Quickstart

Run your first NLP operations in minutes

API reference

Explore the complete API documentation

Community and support

Build docs developers (and LLMs) love