Overview

bun_nltk compiles to WebAssembly for browser and edge runtime environments. The WASM build retains a substantial speedup over Python NLTK, though it runs slower than the native FFI build (see the gap analysis below).

WASM vs Native vs Python

Three-way comparison on the 64MB synthetic dataset:
bun run bench:compare:wasm
Runtime                    Token/N-gram Operations (sec)   Speedup vs Python
Zig Native (via Bun FFI)   1.719                           7.70x
Zig WASM                   4.150                           3.19x
Python NLTK                13.241                          1.00x (baseline)

Key Insights

WASM Performance: 3.19x faster than Python
  • Still significantly faster than Python baseline
  • Overhead from WASM runtime is manageable
  • Good choice for browser/edge deployments
Native Performance: 7.70x faster than Python
  • Best performance for server-side workloads
  • Direct memory access via Bun FFI
  • SIMD optimizations enabled
WASM vs Native Gap: ~2.4x slower than native
  • WASM overhead from sandboxing and linear memory
  • No SIMD in WASM build (uses scalar fallback)
  • Still provides excellent absolute performance
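The speedup figures above are just the Python baseline time divided by each runtime's time. A quick sketch using the numbers from the comparison table:

```typescript
// Wall-clock times (seconds) from the 64MB benchmark table above.
const pythonSec = 13.241;
const nativeSec = 1.719;
const wasmSec = 4.15;

// Speedup vs Python = baseline time / runtime time.
const speedup = (sec: number) => pythonSec / sec;

console.log(speedup(nativeSec).toFixed(2)); // "7.70"
console.log(speedup(wasmSec).toFixed(2));   // "3.19"

// WASM vs native gap:
console.log((wasmSec / nativeSec).toFixed(1) + "x slower"); // "2.4x slower"
```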

Browser WASM Benchmarks

bun_nltk includes automated browser benchmarks in CI:
bun run bench:browser:wasm

Test Environment

Browsers Tested:
  • Chromium (headless)
  • Firefox (headless)
Workloads:
  • Token counting and n-gram operations
  • Punkt sentence tokenization
  • Language model evaluation
  • Chunk parsing (IOB)
  • WordNet morphology

Browser Performance

Browser WASM benchmarks run in CI with strict mode enforcement. Each workload has per-browser thresholds to catch performance regressions.

Memory Management:
  • WASM memory pool reuse via WasmNltk wrapper
  • Reduced allocation overhead for repeated operations
  • Explicit disposal for memory cleanup
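One way to make the explicit-disposal rule hard to forget is a try/finally wrapper. A minimal sketch, where only `dispose()` mirrors the documented API; the helper, interface name, and stand-in instance are illustrative:

```typescript
// Illustrative helper: run work against a disposable WASM instance and
// guarantee cleanup even if the work throws. Only dispose() mirrors the
// documented WasmNltk API; everything else here is a generic sketch.
interface DisposableLike {
  dispose(): void;
}

function withInstance<T extends DisposableLike, R>(instance: T, work: (i: T) => R): R {
  try {
    return work(instance);
  } finally {
    instance.dispose(); // always runs, on success or failure
  }
}

// Usage with a stand-in instance (not the real WasmNltk):
let disposed = false;
const fake = {
  dispose: () => { disposed = true; },
  countTokensAscii: (text: string) => text.split(/\s+/).filter(Boolean).length,
};

const count = withInstance(fake, (i) => i.countTokensAscii("hello wasm world"));
console.log(count, disposed); // 3 true
```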

WASM API Usage

Initialization

import { WasmNltk } from 'bun_nltk';

// Initialize the WASM runtime with the bundled binary
const wasm = await WasmNltk.init();

// Or provide custom WASM bytes (e.g. when serving the binary yourself)
const wasmCustom = await WasmNltk.init({
  wasmBytes: await fetch('/path/to/bun_nltk.wasm').then(r => r.arrayBuffer()),
});

Token Operations

// Count tokens
const count = wasm.countTokensAscii(text);

// Count n-grams
const bigramCount = wasm.countNgramsAscii(text, 2);

// Batch metrics
const metrics = wasm.computeAsciiMetrics(text, 3);
console.log(metrics.tokens, metrics.uniqueTokens);

Text Processing

// Tokenize
const tokens = wasm.tokenizeAscii(text);

// Normalize with stopword removal
const normalized = wasm.normalizeTokensAscii(text, true);

// Sentence tokenization (Punkt)
const sentences = wasm.sentenceTokenizePunktAscii(text);

WordNet Morphology

// Get base form
const lemma = wasm.wordnetMorphyAscii('running', 'v');
console.log(lemma); // 'run'

Advanced Operations

// POS tagging (Perceptron)
const tagIds = wasm.perceptronPredictBatch(
  featureIds,
  tokenOffsets,
  weights,
  modelFeatureCount,
  tagCount
);

// Language model evaluation
const result = wasm.evaluateLanguageModelIds({
  tokenIds,
  sentenceOffsets,
  order: 3,
  model: 2, // Kneser-Ney
  discount: 0.75,
  vocabSize,
  probeContextFlat,
  probeContextLens,
  probeWords,
  perplexityTokens,
});

// Chunk parsing (IOB)
const chunks = wasm.chunkIobIds({
  tokenTagIds,
  atomAllowedOffsets,
  atomAllowedLengths,
  atomAllowedFlat,
  atomMins,
  atomMaxs,
  ruleAtomOffsets,
  ruleAtomCounts,
  ruleLabelIds,
});

Cleanup

// Dispose WASM instance when done
wasm.dispose();

WASM Binary Size

The WASM build is optimized for browser delivery:
bun run build:wasm          # Build WASM
bun run wasm:size:check     # Check size budget
Build Configuration:
  • ReleaseSmall optimization mode
  • Stripped debug symbols
  • Minimal runtime overhead
Size Budget: CI enforces WASM binary size limits to ensure fast browser loading.
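A size-budget gate can be as simple as comparing the binary's byte length against a limit. A sketch; the 500 KB budget and the suggested file-size calls are placeholders, not bun_nltk's actual configuration:

```typescript
// Hypothetical size-budget check: fail CI when the WASM binary exceeds
// the budget. The 500 KB figure is a placeholder, not bun_nltk's real limit.
const BUDGET_BYTES = 500 * 1024;

function checkSizeBudget(actualBytes: number, budgetBytes: number = BUDGET_BYTES): boolean {
  const pct = ((actualBytes / budgetBytes) * 100).toFixed(1);
  console.log(`WASM binary: ${actualBytes} bytes (${pct}% of budget)`);
  return actualBytes <= budgetBytes;
}

// In a real script you would read the file size first, e.g. with
// Bun.file('native/bun_nltk.wasm').size or fs.statSync(...).size,
// then exit non-zero on failure:
if (!checkSizeBudget(420 * 1024)) {
  process.exitCode = 1;
}
```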

Browser Performance Tips

1. Reuse WASM Instance

// Good: Single instance, multiple operations
const wasm = await WasmNltk.init();
for (const text of texts) {
  const count = wasm.countTokensAscii(text);
}
wasm.dispose();

// Bad: Reinitializing for each operation
for (const text of texts) {
  const wasm = await WasmNltk.init();
  const count = wasm.countTokensAscii(text);
  wasm.dispose();
}

2. Batch Operations

// Use batch APIs when available
const metrics = wasm.computeAsciiMetrics(text, 3);
// Returns: { tokens, uniqueTokens, ngrams, uniqueNgrams }

// Instead of multiple calls
const tokens = wasm.countTokensAscii(text);
const ngrams = wasm.countNgramsAscii(text, 3);

3. Lazy Initialization

let wasmInstance: WasmNltk | null = null;

async function getWasm(): Promise<WasmNltk> {
  if (!wasmInstance) {
    wasmInstance = await WasmNltk.init();
  }
  return wasmInstance;
}

4. Preload WASM Module

<!-- Add to HTML head (rel="modulepreload" is for JS modules; use rel="preload" for .wasm) -->
<link rel="preload" href="/node_modules/bun_nltk/native/bun_nltk.wasm" as="fetch" type="application/wasm" crossorigin>

WASM vs Native Trade-offs

When to Use WASM

Browser/Edge Runtimes
  • Client-side text processing
  • Edge computing (Cloudflare Workers, Deno Deploy)
  • Offline-capable web applications
Portability
  • Platform-agnostic deployment
  • No native binary dependencies
  • Consistent behavior across environments
Security Sandboxing
  • Sandboxed execution environment
  • Memory safety guarantees
  • Limited system access

When to Use Native

Server-Side Workloads
  • Maximum throughput required
  • Bun/Node.js backend services
  • Batch processing pipelines
SIMD Benefits
  • Large text corpora
  • Token-heavy operations
  • High-frequency operations
Memory Efficiency
  • Lower memory overhead
  • Direct memory management
  • Better cache utilization

WASM Feature Parity

The following operations have WASM equivalents:
Feature                Native API                          WASM API
Token counting         countTokensAscii                    wasm.countTokensAscii
N-gram counting        countNgramsAscii                    wasm.countNgramsAscii
Tokenization           tokenizeAsciiNative                 wasm.tokenizeAscii
Normalization          normalizeTokensAsciiNative          wasm.normalizeTokensAscii
Punkt sentence split   sentenceTokenizePunktAsciiNative    wasm.sentenceTokenizePunktAscii
WordNet morphy         wordnetMorphyAsciiNative            wasm.wordnetMorphyAscii
Perceptron inference   perceptronPredictBatchNative        wasm.perceptronPredictBatch
LM evaluation          evaluateLanguageModelIdsNative      wasm.evaluateLanguageModelIds
Chunk IOB parsing      chunkIobIdsNative                   wasm.chunkIobIds

Performance Regression Testing

Browser WASM benchmarks run in CI for every PR:
# .github/workflows/ci.yml
- name: Browser WASM Benchmark
  run: bun run bench:browser:wasm
Validation:
  • Per-workload performance thresholds
  • Cross-browser consistency checks
  • WASM size budget enforcement
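A per-workload threshold gate like the one CI enforces can be sketched as a comparison of measured times against per-browser limits. All workload names and numbers below are illustrative, not the project's actual thresholds:

```typescript
// Illustrative regression gate: compare measured benchmark times against
// per-browser thresholds. Every number here is made up for the sketch.
type BrowserName = 'chromium' | 'firefox';

const thresholdsMs: Record<string, Record<BrowserName, number>> = {
  tokenCount: { chromium: 120, firefox: 150 },
  punktSentences: { chromium: 300, firefox: 360 },
};

function findRegressions(
  results: Record<string, Record<BrowserName, number>>,
): string[] {
  const failures: string[] = [];
  for (const [workload, perBrowser] of Object.entries(results)) {
    for (const [browser, ms] of Object.entries(perBrowser) as [BrowserName, number][]) {
      const limit = thresholdsMs[workload]?.[browser];
      if (limit !== undefined && ms > limit) {
        failures.push(`${workload} on ${browser}: ${ms}ms > ${limit}ms`);
      }
    }
  }
  return failures;
}

// Example run: punktSentences regressed on firefox only.
const failures = findRegressions({
  tokenCount: { chromium: 95, firefox: 140 },
  punktSentences: { chromium: 280, firefox: 400 },
});
console.log(failures); // ["punktSentences on firefox: 400ms > 360ms"]
```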

Next Steps

Native Benchmarks

See detailed native vs Python comparison

API Reference

Explore WASM API documentation