Architecture

Overview

bun_nltk is built on a dual-runtime architecture that combines the performance of native code with the portability of WebAssembly. The core NLP algorithms are implemented in Zig, a modern systems programming language, and exposed to JavaScript/TypeScript through two distinct paths:

Native FFI bindings (via Bun’s FFI) for maximum performance
WebAssembly runtime for cross-platform compatibility and browser support

Architecture Diagram

┌─────────────────────────────────────────────────────────────┐
│                  TypeScript/JavaScript API                  │
│                      (index.ts)                             │
└────────────────┬───────────────────────┬────────────────────┘
                 │                       │
                 │                       │
      ┌──────────▼──────────┐  ┌─────────▼──────────┐
      │   Native Runtime    │  │   WASM Runtime     │
      │   (src/native.ts)   │  │   (src/wasm.ts)    │
      └──────────┬──────────┘  └─────────┬──────────┘
                 │                       │
                 │                       │
      ┌──────────▼──────────┐  ┌─────────▼──────────┐
      │  Bun FFI (dlopen)   │  │  WebAssembly.      │
      │                     │  │  instantiate       │
      └──────────┬──────────┘  └─────────┬──────────┘
                 │                       │
                 │                       │
      ┌──────────▼──────────┐  ┌─────────▼──────────┐
      │  Native Binary      │  │   WASM Binary      │
      │  (.so/.dll/.dylib)  │  │   (.wasm)          │
      └──────────┬──────────┘  └─────────┬──────────┘
                 │                       │
                 └───────────┬───────────┘
                             │
                 ┌───────────▼────────────┐
                 │   Zig Core Libraries   │
                 │   (zig/src/core/)      │
                 │                        │
                 │  • ascii.zig           │
                 │  • freqdist.zig        │
                 │  • punkt.zig           │
                 │  • porter.zig          │
                 │  • perceptron.zig      │
                 │  • lm.zig              │
                 │  • chunk.zig           │
                 │  • ...and more         │
                 └────────────────────────┘

Core Components

Zig Core Library

The foundation of bun_nltk is a single Zig codebase (zig/src/core/) that implements all NLP algorithms:

Token processing: ASCII tokenization with SIMD acceleration (ascii.zig)
Statistical analysis: Frequency distributions, n-grams, collocations (freqdist.zig, ngrams.zig, collocations.zig)
Text normalization: Stopword removal, stemming (normalize.zig, porter.zig)
Sentence segmentation: Punkt sentence tokenizer (punkt.zig)
POS tagging: Perceptron-based tagger (tagger.zig, perceptron.zig)
Language modeling: MLE, Lidstone, Kneser-Ney models (lm.zig)
Parsing: Chunk parser, CFG parser, CYK recognition (chunk.zig, cyk.zig)
Classification: Naive Bayes, linear models (naive_bayes.zig, linear.zig)
WordNet integration: Morphy lemmatizer (morphy.zig)
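
To make the language-modeling entry above more concrete, here is a minimal TypeScript sketch of the MLE and Lidstone estimators over a unigram frequency distribution. It shows the textbook formulas only. The function names (`freqDist`, `lidstone`) are hypothetical and are not taken from lm.zig or from the bun_nltk TypeScript API.

```typescript
// Hypothetical helper names for illustration; not the bun_nltk API.

/** Count how often each token occurs. */
function freqDist(tokens: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const t of tokens) counts.set(t, (counts.get(t) ?? 0) + 1);
  return counts;
}

/**
 * Lidstone estimate: (count + gamma) / (total + gamma * bins).
 * gamma = 0 reduces to the maximum-likelihood estimate (MLE).
 */
function lidstone(
  counts: Map<string, number>,
  token: string,
  gamma: number,
  bins: number,
): number {
  let total = 0;
  for (const c of counts.values()) total += c;
  const denom = total + gamma * bins;
  if (denom === 0) return 0; // empty distribution with no smoothing
  return ((counts.get(token) ?? 0) + gamma) / denom;
}

const counts = freqDist(["the", "cat", "sat", "on", "the", "mat"]);
// MLE: "the" occurs 2 times out of 6 tokens.
const mle = lidstone(counts, "the", 0, counts.size);
// Add-one smoothing with one extra bin reserved for unseen words.
const smoothed = lidstone(counts, "dog", 1, counts.size + 1);
```

With `gamma = 0` an unseen word gets probability zero, which is why smoothed estimators such as Lidstone reserve some probability mass for words outside the training data.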

FFI Exports (`ffi_exports.zig`)

Exports native C-compatible functions for Bun’s FFI layer:

export fn bunnltk_count_tokens_ascii(input_ptr: [*]const u8, input_len: usize) u64
export fn bunnltk_fill_token_offsets_ascii(...) u64
export fn bunnltk_porter_stem_ascii(...) u32
// ... 50+ more exports

These functions are loaded via dlopen in src/native.ts:

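import { dlopen } from "bun:ffi";

// nativeLibPath is the resolved path to the shared library
const lib = dlopen(nativeLibPath, {
  bunnltk_count_tokens_ascii: {
    args: ["ptr", "usize"], // (input_ptr, input_len)
    returns: "u64",
  },
  // ... more function signatures
});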

WASM Exports (`wasm_exports.zig`)

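Exports the same core algorithms for WebAssembly, using a different memory model. Input is first copied into WASM linear memory, so an export such as bunnltk_wasm_count_tokens_ascii takes only a length rather than a pointer and a length: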

export fn bunnltk_wasm_count_tokens_ascii(input_len: u32) u64
export fn bunnltk_wasm_alloc(size: usize) usize
export fn bunnltk_wasm_free(ptr: usize, size: usize) void

The WASM runtime manages its own memory pool for efficient buffer reuse (WasmNltk class in src/wasm.ts).
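
The buffer-reuse idea can be sketched in a few lines of TypeScript. This is an illustration under assumptions, not the actual WasmNltk implementation: the `ReusableBuffer` class is hypothetical, `alloc` and `free` stand in for the `bunnltk_wasm_alloc` and `bunnltk_wasm_free` exports, and treating a zero pointer as allocation failure is an assumption.

```typescript
// Hypothetical sketch of buffer reuse; the real WasmNltk class may differ.
type Alloc = (size: number) => number;            // stands in for bunnltk_wasm_alloc
type Free = (ptr: number, size: number) => void;  // stands in for bunnltk_wasm_free

class ReusableBuffer {
  private ptr = 0;
  private capacity = 0;
  private readonly alloc: Alloc;
  private readonly free: Free;

  constructor(alloc: Alloc, free: Free) {
    this.alloc = alloc;
    this.free = free;
  }

  /** Return a block of at least `size` bytes, reallocating only when it must grow. */
  ensure(size: number): number {
    if (size <= this.capacity) return this.ptr; // reuse the existing block
    if (this.capacity > 0) this.free(this.ptr, this.capacity);
    this.ptr = this.alloc(size);
    if (this.ptr === 0) {
      // Assumption: a zero pointer signals allocation failure.
      this.capacity = 0;
      throw new Error(`allocation of ${size} bytes failed`);
    }
    this.capacity = size;
    return this.ptr;
  }
}

// Fake bump allocator used only to exercise the class.
let next = 1024;
const calls: string[] = [];
const pool = new ReusableBuffer(
  (size) => { calls.push(`alloc ${size}`); const p = next; next += size; return p; },
  (ptr, size) => { calls.push(`free ${ptr} ${size}`); },
);

const a = pool.ensure(100); // allocates
const b = pool.ensure(50);  // fits in the existing block, no allocator call
const c = pool.ensure(200); // frees the 100-byte block, allocates 200 bytes
```

Reusing one block across calls avoids an alloc/free round trip per request, at the cost of holding on to the largest block seen so far.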

Binary Distribution

Prebuilt Native Binaries

The npm package ships with prebuilt native binaries for common platforms:

Linux x64: native/prebuilt/linux-x64/bun_nltk.so
Windows x64: native/prebuilt/win32-x64/bun_nltk.dll

These are loaded automatically based on process.platform and process.arch:

const ext = process.platform === "win32" ? "dll" 
  : process.platform === "darwin" ? "dylib" 
  : "so";

const prebuiltLibPath = resolve(
  import.meta.dir,
  "..",
  "native",
  "prebuilt",
  `${process.platform}-${process.arch}`,
  `bun_nltk.${ext}`,
);

The native runtime requires a prebuilt binary; nothing is compiled at install time. On platforms without a prebuilt binary, build the library yourself and set BUN_NLTK_NATIVE_LIB to its path.
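
The lookup described above can be written as a small pure function, which makes it easy to see what path a given platform would resolve to. This is a sketch: the `HostInfo` type and `resolveNativeLibPath` name are hypothetical, paths are joined with `/` for simplicity, and giving the `BUN_NLTK_NATIVE_LIB` override precedence over the prebuilt path is an assumption about the loader's behavior.

```typescript
// Hypothetical helper mirroring the snippet above; names are illustrative.
interface HostInfo {
  platform: string;                         // e.g. process.platform
  arch: string;                             // e.g. process.arch
  env: Record<string, string | undefined>;  // e.g. process.env
  packageRoot: string;                      // directory that contains native/
}

function resolveNativeLibPath(host: HostInfo): string {
  // Assumption: an explicit override takes precedence over the prebuilt binary.
  const override = host.env["BUN_NLTK_NATIVE_LIB"];
  if (override) return override;

  const ext = host.platform === "win32" ? "dll"
    : host.platform === "darwin" ? "dylib"
    : "so";

  return [
    host.packageRoot,
    "native",
    "prebuilt",
    `${host.platform}-${host.arch}`,
    `bun_nltk.${ext}`,
  ].join("/");
}

const linuxPath = resolveNativeLibPath({
  platform: "linux", arch: "x64", env: {}, packageRoot: "/pkg",
});
const customPath = resolveNativeLibPath({
  platform: "freebsd", arch: "arm64",
  env: { BUN_NLTK_NATIVE_LIB: "/opt/custom/bun_nltk.so" },
  packageRoot: "/pkg",
});
```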

WASM Binary

A single universal WASM binary is included:

Location: native/bun_nltk.wasm
Size: Optimized for browser/runtime usage (size-gated in CI)
Compatibility: Works in Node.js, Bun, Deno, browsers

Build Process

Native Build

bun run build:zig

Compiles Zig source to a native shared library using:

zig build-lib -dynamic -O ReleaseFast \
  zig/src/lib.zig \
  -femit-bin=native/bun_nltk.so

WASM Build

bun run build:wasm

Compiles Zig to WebAssembly:

zig build-lib -target wasm32-freestanding -O ReleaseSmall \
  zig/src/wasm_exports.zig \
  -femit-bin=native/bun_nltk.wasm

WASM uses -O ReleaseSmall for size optimization, while native uses -O ReleaseFast for maximum performance.

Why Zig?

bun_nltk uses Zig as its implementation language for several key reasons:

1. Dual-Target Compilation

Zig compiles the same source code to both native binaries and WebAssembly, with target-specific fast paths (such as SIMD) selected at compile time rather than maintained as separate codebases:

// Same code works for both FFI and WASM targets
pub fn tokenCountAscii(input: []const u8) u64 {
    // ... implementation
}

2. Manual Memory Management

Zig gives explicit control over memory allocation, crucial for:

Zero-copy string processing
Arena allocators for temporary data
Predictable performance characteristics

3. SIMD Support

Zig’s @Vector builtin enables portable SIMD code:

if (builtin.cpu.arch == .x86_64) {
    return tokenCountAsciiSimd16(input);
}
return tokenCountAsciiScalar(input);

The SIMD path provides a 1.22x speedup for token counting on x86_64.

4. C ABI Compatibility

Zig’s export keyword generates C-compatible functions for FFI:

export fn bunnltk_count_tokens_ascii(
    input_ptr: [*]const u8,
    input_len: usize
) u64 {
    return ascii.tokenCountAscii(input_ptr[0..input_len]);
}

5. Performance

Zig compiles to efficient machine code with:

No garbage collection overhead
Inline function calls
Loop unrolling and vectorization
Direct memory access

Results: 3.64x to 840x faster than Python NLTK (see Performance).

6. Safety

Zig provides:

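Bounds-checked array access (in Debug and ReleaseSafe modes)
Explicit error handling
Safety-checked undefined behavior in safe build modes (the checks are disabled in ReleaseFast and ReleaseSmall, which the shipped binaries use)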
Clear distinction between pointers and slices

Error Handling

Both runtimes use a thread-local error code pattern:

threadlocal var last_error: u32 = 0;

pub fn setError(code: u32) void {
    last_error = code;
}

TypeScript wrappers check for errors after FFI/WASM calls:

function assertNoNativeError(context: string): void {
  const code = lastError();
  if (code !== 0) {
    throw new Error(`native error code ${code} in ${context}`);
  }
}
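
A generic wrapper makes the call-then-check sequence harder to forget. The sketch below is illustrative: `checked` is a hypothetical helper, and the `fake` object only emulates a binding that sets an error code, so it can run without the native library.

```typescript
// Hypothetical wrapper: `call` stands in for an FFI/WASM export and
// `lastError` for the export that reads the thread-local error code.
function checked<T>(
  context: string,
  call: () => T,
  lastError: () => number,
): T {
  const result = call();
  // Read the code immediately, before any other call can overwrite it.
  const code = lastError();
  if (code !== 0) {
    throw new Error(`native error code ${code} in ${context}`);
  }
  return result;
}

// Fake runtime used only to exercise the wrapper.
let fakeError = 0;
const fake = {
  countTokens(text: string): number {
    if (text.length === 0) {
      fakeError = 2; // pretend that empty input is an error
      return 0;
    }
    fakeError = 0;
    return text.split(" ").length;
  },
  lastError: (): number => fakeError,
};

const ok = checked("countTokens", () => fake.countTokens("a b c"), fake.lastError);

let message = "";
try {
  checked("countTokens", () => fake.countTokens(""), fake.lastError);
} catch (e) {
  message = (e as Error).message;
}
```

Because the error code is shared state, the check has to happen right after the call it belongs to; interleaving two calls before checking would report only the second call's status.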

Memory Model Differences

Native FFI

JavaScript owns input buffers: TypeScript creates a Uint8Array and passes its pointer via ptr()
JavaScript pre-allocates output buffers: typed arrays for results are created in JavaScript and filled by Zig
No copying on input: Zig reads the bytes directly from the JavaScript engine's memory (JavaScriptCore in Bun)

const bytes = toBuffer(text); // TextEncoder
const value = lib.symbols.bunnltk_count_tokens_ascii(
  ptr(bytes),  // Direct pointer to bytes.buffer
  bytes.length
);
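
Caller-owned result buffers usually imply a two-pass pattern: count first, allocate in JavaScript, then let the other side fill. The sketch below emulates both passes in plain TypeScript so it runs without the native library. The functions are stand-ins for exports such as `bunnltk_count_tokens_ascii` and `bunnltk_fill_token_offsets_ascii`, their signatures are guesses, and the token rule (runs of ASCII letters and digits) is an assumption made for illustration, not the library's actual tokenization.

```typescript
// Stand-ins for native exports. The token rule (runs of ASCII letters and
// digits) is an assumption for illustration only.
function isTokenByte(b: number): boolean {
  return (b >= 48 && b <= 57) || (b >= 65 && b <= 90) || (b >= 97 && b <= 122);
}

function countTokens(input: Uint8Array): number {
  let count = 0;
  let inToken = false;
  for (const b of input) {
    const tok = isTokenByte(b);
    if (tok && !inToken) count++;
    inToken = tok;
  }
  return count;
}

/** Writes start offsets and lengths into caller-owned arrays; returns tokens written. */
function fillTokenOffsets(
  input: Uint8Array,
  offsets: Uint32Array,
  lengths: Uint32Array,
): number {
  let n = 0;
  let start = -1;
  for (let i = 0; i <= input.length; i++) {
    const tok = i < input.length && isTokenByte(input[i]);
    if (tok && start < 0) start = i;
    if (!tok && start >= 0) {
      // Never write past the buffers the caller provided.
      if (n >= offsets.length || n >= lengths.length) break;
      offsets[n] = start;
      lengths[n] = i - start;
      n++;
      start = -1;
    }
  }
  return n;
}

const bytes = new TextEncoder().encode("Hello, world 42");
const total = countTokens(bytes);        // pass 1: size the output
const offsets = new Uint32Array(total);  // JavaScript owns the result buffers
const lengths = new Uint32Array(total);
const written = fillTokenOffsets(bytes, offsets, lengths); // pass 2: fill
```

Keeping allocation on the JavaScript side means the callee never hands back memory that someone must later free, which keeps ownership across the FFI boundary simple.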

WASM Runtime

WASM owns linear memory: All data lives in WebAssembly.Memory
Copy input to WASM: JavaScript writes to WASM memory buffer
Memory pool reuse: WasmNltk class maintains allocated blocks

private writeInput(text: string): number {
  const encoded = this.encoder.encode(text);
  const mem = new Uint8Array(this.exports.memory.buffer);
  mem.set(encoded, this.inputPtr); // Copy to WASM memory
  return encoded.length;
}
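
One detail worth seeing in isolation is that a typed-array view must be created after any memory growth, because `WebAssembly.Memory.grow()` detaches the previous `ArrayBuffer`. The sketch below uses a standalone `WebAssembly.Memory` and a hypothetical fixed `inputPtr`. In the real runtime the memory comes from the module's exports and the input region would be obtained through the module's allocator, so growing the memory directly here only serves to demonstrate the view handling.

```typescript
// Sketch with a standalone WebAssembly.Memory; `inputPtr` is a hypothetical offset.
const PAGE_SIZE = 65536; // WebAssembly page size in bytes

function writeInput(memory: WebAssembly.Memory, inputPtr: number, text: string): number {
  const encoded = new TextEncoder().encode(text);
  const needed = inputPtr + encoded.length;
  if (needed > memory.buffer.byteLength) {
    const extraPages = Math.ceil((needed - memory.buffer.byteLength) / PAGE_SIZE);
    memory.grow(extraPages); // detaches the old ArrayBuffer
  }
  // Create the view after any grow(): views over the old buffer are unusable.
  new Uint8Array(memory.buffer).set(encoded, inputPtr);
  return encoded.length; // length in bytes, not in UTF-16 code units
}

function readBack(memory: WebAssembly.Memory, ptr: number, len: number): string {
  return new TextDecoder().decode(new Uint8Array(memory.buffer, ptr, len));
}

const memory = new WebAssembly.Memory({ initial: 1 }); // one 64 KiB page
const len = writeInput(memory, 16, "héllo");           // "é" takes two bytes in UTF-8
const roundTrip = readBack(memory, 16, len);
const bigLen = writeInput(memory, 16, "x".repeat(70000)); // forces a grow by one page
```

The returned length is a byte count, which is what a length-only export needs; it differs from `text.length` whenever the input contains non-ASCII characters.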

Package Structure

bun_nltk/
├── index.ts              # Main entry point
├── src/
│   ├── native.ts         # FFI runtime wrapper
│   ├── wasm.ts           # WASM runtime wrapper
│   └── ...               # High-level TypeScript APIs
├── zig/
│   └── src/
│       ├── lib.zig       # Native entry point
│       ├── wasm_exports.zig  # WASM entry point
│       ├── ffi_exports.zig   # FFI exports
│       └── core/         # Shared algorithm implementations
├── native/
│   ├── prebuilt/
│   │   ├── linux-x64/bun_nltk.so
│   │   └── win32-x64/bun_nltk.dll
│   └── bun_nltk.wasm
├── models/               # Trained model files
├── corpora/              # Bundled corpora
└── package.json

Runtime Selection

Applications explicitly choose their runtime:

// Native runtime (requires Bun, since the library is loaded through Bun's FFI)
import { countTokensAscii } from 'bun_nltk/src/native';

// WASM runtime (for browsers or cross-platform)
import { WasmNltk } from 'bun_nltk/src/wasm';
const wasm = await WasmNltk.init();
const count = wasm.countTokensAscii(text);
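
An application that targets several environments may want to make this choice once at startup. The helper below is a hypothetical sketch, not part of bun_nltk, which leaves the decision to the application. It assumes the native path is viable only when a `Bun` global is present.

```typescript
// Hypothetical helper; bun_nltk itself leaves the choice to the application.
type RuntimeKind = "native" | "wasm";

function chooseRuntime(
  globals: { Bun?: unknown },
  preferWasm: boolean = false,
): RuntimeKind {
  // The native path depends on Bun's FFI, so it is only viable under Bun.
  const hasBun = typeof globals.Bun !== "undefined";
  return hasBun && !preferWasm ? "native" : "wasm";
}

const outsideBun = chooseRuntime({});            // no Bun global: fall back to WASM
const insideBun = chooseRuntime({ Bun: {} });    // Bun global present: native
const forced = chooseRuntime({ Bun: {} }, true); // caller insists on WASM

// In an application: chooseRuntime(globalThis as { Bun?: unknown })
```

The result could then drive a dynamic `import()` of the `bun_nltk/src/native` or `bun_nltk/src/wasm` entry point shown above, so that the unused runtime is never loaded.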

See Native vs WASM for a detailed comparison and selection guidance.
