Overview
bun_nltk is built on a dual-runtime architecture that combines the performance of native code with the portability of WebAssembly. The core NLP algorithms are implemented in Zig, a modern systems programming language, and exposed to JavaScript/TypeScript through two distinct paths:
- Native FFI bindings (via Bun’s FFI) for maximum performance
- WebAssembly runtime for cross-platform compatibility and browser support
Architecture Diagram
┌─────────────────────────────────────────────────────────────┐
│                  TypeScript/JavaScript API                  │
│                         (index.ts)                          │
└────────────────┬───────────────────────┬────────────────────┘
                 │                       │
                 │                       │
      ┌──────────▼──────────┐  ┌─────────▼──────────┐
      │   Native Runtime    │  │    WASM Runtime    │
      │   (src/native.ts)   │  │   (src/wasm.ts)    │
      └──────────┬──────────┘  └─────────┬──────────┘
                 │                       │
                 │                       │
      ┌──────────▼──────────┐  ┌─────────▼──────────┐
      │  Bun FFI (dlopen)   │  │    WebAssembly.    │
      │                     │  │    instantiate     │
      └──────────┬──────────┘  └─────────┬──────────┘
                 │                       │
                 │                       │
      ┌──────────▼──────────┐  ┌─────────▼──────────┐
      │    Native Binary    │  │    WASM Binary     │
      │  (.so/.dll/.dylib)  │  │      (.wasm)       │
      └──────────┬──────────┘  └─────────┬──────────┘
                 │                       │
                 └───────────┬───────────┘
                             │
                 ┌───────────▼────────────┐
                 │   Zig Core Libraries   │
                 │    (zig/src/core/)     │
                 │                        │
                 │  • ascii.zig           │
                 │  • freqdist.zig        │
                 │  • punkt.zig           │
                 │  • porter.zig          │
                 │  • perceptron.zig      │
                 │  • lm.zig              │
                 │  • chunk.zig           │
                 │  • ...and more         │
                 └────────────────────────┘
Core Components
Zig Core Library
The foundation of bun_nltk is a single Zig codebase (zig/src/core/) that implements all NLP algorithms:
- Token processing: ASCII tokenization with SIMD acceleration (ascii.zig)
- Statistical analysis: Frequency distributions, n-grams, collocations (freqdist.zig, ngrams.zig, collocations.zig)
- Text normalization: Stopword removal, stemming (normalize.zig, porter.zig)
- Sentence segmentation: Punkt sentence tokenizer (punkt.zig)
- POS tagging: Perceptron-based tagger (tagger.zig, perceptron.zig)
- Language modeling: MLE, Lidstone, Kneser-Ney models (lm.zig)
- Parsing: Chunk parser, CFG parser, CYK recognition (chunk.zig, cyk.zig)
- Classification: Naive Bayes, linear models (naive_bayes.zig, linear.zig)
- WordNet integration: Morphy stemmer (morphy.zig)
FFI Exports (ffi_exports.zig)
Exports native C-compatible functions for Bun’s FFI layer:
export fn bunnltk_count_tokens_ascii(input_ptr: [*]const u8, input_len: usize) u64
export fn bunnltk_fill_token_offsets_ascii(...) u64
export fn bunnltk_porter_stem_ascii(...) u32
// ... 50+ more exports
These functions are loaded via dlopen in src/native.ts:
import { dlopen } from "bun:ffi";
const lib = dlopen(nativeLibPath, {
  bunnltk_count_tokens_ascii: {
    args: ["ptr", "usize"],
    returns: "u64",
  },
  // ... more function signatures
});
WASM Exports (wasm_exports.zig)
Exports the same core algorithms for WebAssembly with a different memory model:
export fn bunnltk_wasm_count_tokens_ascii(input_len: u32) u64
export fn bunnltk_wasm_alloc(size: usize) usize
export fn bunnltk_wasm_free(ptr: usize, size: usize) void
The WASM runtime manages its own memory pool for efficient buffer reuse (WasmNltk class in src/wasm.ts).
Binary Distribution
Prebuilt Native Binaries
The npm package ships with prebuilt native binaries for common platforms:
- Linux x64: native/prebuilt/linux-x64/bun_nltk.so
- Windows x64: native/prebuilt/win32-x64/bun_nltk.dll
These are loaded automatically based on process.platform and process.arch:
const ext = process.platform === "win32" ? "dll"
  : process.platform === "darwin" ? "dylib"
  : "so";
const prebuiltLibPath = resolve(
  import.meta.dir,
  "..",
  "native",
  "prebuilt",
  `${process.platform}-${process.arch}`,
  `bun_nltk.${ext}`,
);
The native runtime requires a prebuilt binary; nothing is compiled at install time. On platforms without a prebuilt binary, set BUN_NLTK_NATIVE_LIB to the path of a custom build.
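The lookup described above can be sketched as a pure function, which makes the platform-to-path mapping easy to test without touching the filesystem. This is a minimal sketch, not the library's loader: `resolveNativeLibPath` and `sharedLibExtension` are hypothetical names, the package root is passed in rather than derived from `import.meta.dir`, and the assumption that BUN_NLTK_NATIVE_LIB takes precedence over the prebuilt path is not confirmed by this page.

```typescript
// Hypothetical helpers for illustration; not part of bun_nltk's API.
type EnvLike = Record<string, string | undefined>;

// Same extension mapping as the snippet above.
function sharedLibExtension(platform: string): string {
  if (platform === "win32") return "dll";
  if (platform === "darwin") return "dylib";
  return "so";
}

function resolveNativeLibPath(
  packageRoot: string,
  platform: string,
  arch: string,
  env: EnvLike = {},
): string {
  // Assumption: an explicit override wins over the prebuilt location.
  const override = env["BUN_NLTK_NATIVE_LIB"];
  if (override !== undefined && override.length > 0) return override;

  const ext = sharedLibExtension(platform);
  return [
    packageRoot,
    "native",
    "prebuilt",
    `${platform}-${arch}`,
    `bun_nltk.${ext}`,
  ].join("/");
}

console.log(resolveNativeLibPath("/app/node_modules/bun_nltk", "linux", "x64"));
```

Passing `platform`, `arch`, and `env` as arguments keeps the function independent of `process`, so the same logic can be exercised for platforms other than the one running the tests.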
WASM Binary
A single universal WASM binary is included:
- Location: native/bun_nltk.wasm
- Size: Optimized for browser/runtime usage (size-gated in CI)
- Compatibility: Works in Node.js, Bun, Deno, browsers
Build Process
Native Build
Compiles Zig source to a native shared library using:
zig build-lib -dynamic -O ReleaseFast \
zig/src/lib.zig \
-femit-bin=native/bun_nltk.so
WASM Build
Compiles Zig to WebAssembly:
zig build-lib -target wasm32-freestanding -O ReleaseSmall \
zig/src/wasm_exports.zig \
-femit-bin=native/bun_nltk.wasm
WASM uses -O ReleaseSmall for size optimization, while native uses -O ReleaseFast for maximum performance.
Why Zig?
bun_nltk uses Zig as its implementation language for several key reasons:
1. Dual-Target Compilation
Zig can compile the same source code to both native binaries and WebAssembly without platform-specific code:
// Same code works for both FFI and WASM targets
pub fn tokenCountAscii(input: []const u8) u64 {
    // ... implementation
}
2. Manual Memory Management
Zig gives explicit control over memory allocation, crucial for:
- Zero-copy string processing
- Arena allocators for temporary data
- Predictable performance characteristics
3. SIMD Support
Zig’s @Vector builtin enables portable SIMD code:
if (builtin.cpu.arch == .x86_64) {
    return tokenCountAsciiSimd16(input);
}
return tokenCountAsciiScalar(input);
On x86_64, the SIMD path provides a 1.22x speedup for token counting.
4. C ABI Compatibility
Zig’s export keyword generates C-compatible functions for FFI:
export fn bunnltk_count_tokens_ascii(
    input_ptr: [*]const u8,
    input_len: usize,
) u64 {
    return ascii.tokenCountAscii(input_ptr[0..input_len]);
}
5. Performance
Zig compiles to efficient machine code with:
- No garbage collection overhead
- Inline function calls
- Loop unrolling and vectorization
- Direct memory access
Results: 3.64x to 840x faster than Python NLTK (see Performance).
6. Safety
Zig provides:
- Bounds-checked array access (in Debug and ReleaseSafe modes)
- Explicit error handling
- Runtime detection of many forms of undefined behavior (in safety-checked build modes)
- Clear distinction between pointers and slices
Error Handling
Both runtimes use a thread-local error code pattern:
threadlocal var last_error: u32 = 0;
pub fn setError(code: u32) void {
    last_error = code;
}
TypeScript wrappers check for errors after FFI/WASM calls:
function assertNoNativeError(context: string): void {
  const code = lastError();
  if (code !== 0) {
    throw new Error(`native error code ${code} in ${context}`);
  }
}
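One way to apply this check consistently is to route every call through a small wrapper, so no call site can forget it. The sketch below is illustrative only: `callChecked`, `NativeCallError`, and `FakeLib` are hypothetical names, and the fake library and its error code 1 are invented so the example runs without a native binary.

```typescript
// Anything that can report the last error code (native or WASM wrapper).
interface ErrorCodeSource {
  lastError(): number;
}

// Carries the numeric code so callers can branch on it.
class NativeCallError extends Error {
  readonly code: number;
  readonly context: string;

  constructor(code: number, context: string) {
    super(`native error code ${code} in ${context}`);
    this.name = "NativeCallError";
    this.code = code;
    this.context = context;
  }
}

// Runs the call, then checks the error code before returning the result.
function callChecked<T>(source: ErrorCodeSource, context: string, call: () => T): T {
  const result = call();
  const code = source.lastError();
  if (code !== 0) throw new NativeCallError(code, context);
  return result;
}

// Stand-in library used only to exercise the wrapper; code 1 is invented.
class FakeLib implements ErrorCodeSource {
  private code = 0;

  lastError(): number {
    return this.code;
  }

  countTokens(text: string): number {
    if (text.length === 0) {
      this.code = 1;
      return 0;
    }
    this.code = 0;
    return text.split(/\s+/).filter(Boolean).length;
  }
}

const demoLib = new FakeLib();
console.log(callChecked(demoLib, "countTokens", () => demoLib.countTokens("a b c")));
```

The check runs after the call and before the result is returned, so a stale or partially written result never escapes when the error code is non-zero.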
Memory Model Differences
Native FFI
- JavaScript owns input buffers: TypeScript creates a Uint8Array and passes a pointer via ptr()
- JavaScript pre-allocates output buffers: Zig writes results into typed arrays allocated on the JavaScript side
- No copying on input: Zig reads directly from the JavaScript engine's memory (JavaScriptCore in Bun)
const bytes = toBuffer(text); // TextEncoder
const value = lib.symbols.bunnltk_count_tokens_ascii(
  ptr(bytes), // Direct pointer to bytes.buffer
  bytes.length
);
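When the caller pre-allocates output arrays, a common shape is a two-pass call: count first, allocate exactly that many slots, then fill. The sketch below shows that pattern with pure TypeScript stand-ins so it runs without Bun or a native binary. The stand-in functions are hypothetical; the real signature of bunnltk_fill_token_offsets_ascii is not shown on this page. The tokenization rule (a token is a maximal run of ASCII letters or digits) is also an assumption made for the example.

```typescript
// Assumption for this sketch: tokens are runs of ASCII letters or digits.
function isAsciiAlnum(byte: number): boolean {
  return (
    (byte >= 0x30 && byte <= 0x39) || // 0-9
    (byte >= 0x41 && byte <= 0x5a) || // A-Z
    (byte >= 0x61 && byte <= 0x7a) // a-z
  );
}

// Stand-in for a native "count" export.
function countTokensStandIn(input: Uint8Array): number {
  let count = 0;
  let inToken = false;
  for (let i = 0; i < input.length; i++) {
    const alnum = isAsciiAlnum(input[i]);
    if (alnum && !inToken) count++;
    inToken = alnum;
  }
  return count;
}

// Stand-in for a native "fill" export: writes into caller-owned arrays and
// returns how many tokens it wrote, never exceeding their capacity.
function fillTokenOffsetsStandIn(
  input: Uint8Array,
  outOffsets: Uint32Array,
  outLengths: Uint32Array,
): number {
  const capacity = Math.min(outOffsets.length, outLengths.length);
  let written = 0;
  let start = -1;
  // Iterate one past the end so a token that ends the input is flushed.
  for (let i = 0; i <= input.length; i++) {
    const alnum = i < input.length && isAsciiAlnum(input[i]);
    if (alnum && start < 0) start = i;
    if (!alnum && start >= 0) {
      if (written === capacity) break;
      outOffsets[written] = start;
      outLengths[written] = i - start;
      written++;
      start = -1;
    }
  }
  return written;
}

// Two-pass pattern: count, allocate on the JavaScript side, then fill.
function tokenSpans(text: string): Array<[number, number]> {
  const bytes = new TextEncoder().encode(text);
  const total = countTokensStandIn(bytes);
  const offsets = new Uint32Array(total);
  const lengths = new Uint32Array(total);
  const written = fillTokenOffsetsStandIn(bytes, offsets, lengths);
  const spans: Array<[number, number]> = [];
  for (let i = 0; i < written; i++) spans.push([offsets[i], lengths[i]]);
  return spans;
}

console.log(tokenSpans("Hello, world 42"));
```

The offsets are byte offsets into the UTF-8 encoded input, which coincide with string indices only for ASCII text. Returning the written count from the fill step lets the caller detect a capacity mismatch instead of reading uninitialized slots.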
WASM Runtime
- WASM owns linear memory: All data lives in WebAssembly.Memory
- Copy input to WASM: JavaScript writes to WASM memory buffer
- Memory pool reuse: WasmNltk class maintains allocated blocks
private writeInput(text: string): number {
  const encoded = this.encoder.encode(text);
  const mem = new Uint8Array(this.exports.memory.buffer);
  mem.set(encoded, this.inputPtr); // Copy to WASM memory
  return encoded.length;
}
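The buffer reuse mentioned above could look roughly like the following: keep one input block and reallocate only when a longer text arrives. This is a sketch under stated assumptions, not the WasmNltk implementation. `ReusableInputBuffer`, its doubling growth policy, and `FakeWasmExports` (a bump allocator over a plain ArrayBuffer) are hypothetical; only the alloc/free shape mirrors the bunnltk_wasm_alloc and bunnltk_wasm_free signatures shown earlier. Allocation failure is not handled here.

```typescript
// Minimal export surface the sketch needs.
interface AllocatorExports {
  memory: { buffer: ArrayBuffer };
  alloc(size: number): number;
  free(ptr: number, size: number): void;
}

class ReusableInputBuffer {
  private ptr = 0;
  private capacity = 0;
  private readonly encoder = new TextEncoder();
  private readonly exports: AllocatorExports;

  constructor(exports: AllocatorExports) {
    this.exports = exports;
  }

  // Copies text into linear memory, reallocating only when the block is too small.
  write(text: string): { ptr: number; len: number } {
    const encoded = this.encoder.encode(text);
    if (encoded.length > this.capacity) {
      if (this.capacity > 0) this.exports.free(this.ptr, this.capacity);
      // Hypothetical growth policy: at least double, with a 64-byte floor.
      this.capacity = Math.max(encoded.length, this.capacity * 2, 64);
      this.ptr = this.exports.alloc(this.capacity);
    }
    // Re-create the view on each write: growing a WebAssembly.Memory
    // detaches previously obtained buffers.
    new Uint8Array(this.exports.memory.buffer).set(encoded, this.ptr);
    return { ptr: this.ptr, len: encoded.length };
  }

  dispose(): void {
    if (this.capacity > 0) this.exports.free(this.ptr, this.capacity);
    this.ptr = 0;
    this.capacity = 0;
  }
}

// Stand-in for a WASM instance so the sketch runs without a .wasm file.
class FakeWasmExports implements AllocatorExports {
  memory = { buffer: new ArrayBuffer(1024) };
  allocCalls = 0;
  private next = 8;

  alloc(size: number): number {
    this.allocCalls++;
    const ptr = this.next;
    this.next += size;
    return ptr;
  }

  free(_ptr: number, _size: number): void {
    // A bump allocator never reclaims; enough for a demonstration.
  }
}

const demoExports = new FakeWasmExports();
const demoBuffer = new ReusableInputBuffer(demoExports);
demoBuffer.write("first sentence");
demoBuffer.write("second");
console.log(`alloc calls: ${demoExports.allocCalls}`);
```

Reusing the block trades a little retained memory for fewer allocator round trips, which matters when many short texts are processed in a loop.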
Package Structure
bun_nltk/
├── index.ts # Main entry point
├── src/
│ ├── native.ts # FFI runtime wrapper
│ ├── wasm.ts # WASM runtime wrapper
│ └── ... # High-level TypeScript APIs
├── zig/
│ └── src/
│ ├── lib.zig # Native entry point
│ ├── wasm_exports.zig # WASM entry point
│ ├── ffi_exports.zig # FFI exports
│ └── core/ # Shared algorithm implementations
├── native/
│ ├── prebuilt/
│ │ ├── linux-x64/bun_nltk.so
│ │ └── win32-x64/bun_nltk.dll
│ └── bun_nltk.wasm
├── models/ # Trained model files
├── corpora/ # Bundled corpora
└── package.json
Runtime Selection
Applications explicitly choose their runtime:
// Native runtime (Bun only; loads the shared library through bun:ffi)
import { countTokensAscii } from 'bun_nltk/src/native';
// WASM runtime (for browsers or cross-platform)
import { WasmNltk } from 'bun_nltk/src/wasm';
const wasm = await WasmNltk.init();
const count = wasm.countTokensAscii(text);
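An application that wants a fallback can wrap this choice in a small decision function and then import the matching module. The sketch below is one possible policy, not something the library provides: `chooseRuntime`, `RuntimeProbe`, `detectBun`, and the `forceWasm` switch are all hypothetical names.

```typescript
type RuntimeKind = "native" | "wasm";

// Facts the application gathers before deciding.
interface RuntimeProbe {
  hasBunFfi: boolean; // true when running under Bun, where bun:ffi exists
  nativeLibExists: boolean; // true when a prebuilt or custom library was found
  forceWasm?: boolean; // hypothetical application-level switch
}

// Prefer native when it can work; otherwise fall back to WASM.
function chooseRuntime(probe: RuntimeProbe): RuntimeKind {
  if (probe.forceWasm) return "wasm";
  return probe.hasBunFfi && probe.nativeLibExists ? "native" : "wasm";
}

// Detects Bun without importing anything Bun-specific.
function detectBun(): boolean {
  return "Bun" in globalThis;
}

const demoChoice = chooseRuntime({
  hasBunFfi: detectBun(),
  nativeLibExists: false, // replace with a real file-existence check
});
console.log(`selected runtime: ${demoChoice}`);
```

Keeping the decision separate from the imports means the policy can be unit tested, while the actual `import` of `bun_nltk/src/native` or `bun_nltk/src/wasm` stays at the call site.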
See Native vs WASM for detailed comparison and selection guidance.