Runtime Comparison

bun_nltk provides two runtime options for the same NLP algorithms:
  1. Native runtime: FFI bindings to platform-specific shared libraries
  2. WASM runtime: Universal WebAssembly binary
Both expose identical functionality from the same Zig source code, but with different performance, compatibility, and deployment characteristics.

Quick Comparison

| Aspect | Native Runtime | WASM Runtime |
| --- | --- | --- |
| Performance | Fastest (direct machine code) | Fast (near-native with JIT) |
| Platform support | Linux x64, Windows x64 only | Universal (all platforms) |
| Browser support | ❌ No | ✅ Yes |
| Binary size | ~150-300 KB per platform | ~200 KB (universal) |
| Memory model | Zero-copy input | Input copying required |
| Startup time | Instant (dlopen) | Fast (~5-10 ms init) |
| Thread safety | Thread-local state | Single-threaded |
| Deployment | Prebuilt binaries required | Single WASM file |

Performance Characteristics

Native Runtime Performance

The native runtime offers maximum performance through direct machine code execution.

64MB Synthetic Dataset Benchmarks:

| Workload | Native (sec) | Python (sec) | Speedup |
| --- | --- | --- | --- |
| Token + unique + ngram + unique ngram | 2.767 | 10.071 | 3.64x |
| Top-K PMI collocations | 2.090 | 23.945 | 11.46x |
| Porter stemming | 11.942 | 120.101 | 10.06x |
| POS tagger | 19.880 | 82.849 | 4.17x |
| Streaming FreqDist | 3.206 | 20.971 | 6.54x |
8MB Gate Dataset Benchmarks:

| Workload | Native (sec) | Python (sec) | Speedup |
| --- | --- | --- | --- |
| Punkt tokenizer | 0.0848 | 1.3463 | 15.87x |
| N-gram LM (Kneser-Ney) | 0.1324 | 2.8661 | 21.64x |
| Regexp chunk parser | 0.0024 | 1.5511 | 643x |
| WordNet lookup + morphy | 0.0009 | 0.0835 | 91.55x |
| Sparse linear logits | 0.0024 | 2.0001 | 840x |
| Earley parser | 0.1149 | 4.6483 | 40.47x |

WASM Runtime Performance

WASM stays within a small constant factor of native performance on compute-heavy tasks.

Native vs WASM (64MB dataset):

| Workload | Native (sec) | WASM (sec) | WASM Overhead |
| --- | --- | --- | --- |
| Token/ngram counting | 1.719 | 4.150 | 2.4x slower |

WASM vs Python (64MB dataset):

| Workload | WASM (sec) | Python (sec) | Speedup |
| --- | --- | --- | --- |
| Token/ngram counting | 4.150 | 13.241 | 3.19x |
WASM is 2-3x slower than native due to input copying and general WebAssembly overhead, but still roughly 3x faster than Python NLTK.

SIMD Performance

The native runtime includes SIMD-accelerated paths (x86_64 only).

SIMD Benchmark Results:

| Operation | SIMD Speedup |
| --- | --- |
| countTokensAscii | 1.22x vs scalar |
| Normalization (no stopwords) | 2.73x vs scalar |

The WASM runtime uses a scalar fallback (WASM SIMD is not yet implemented).

API Usage Comparison

Native Runtime API

import { 
  countTokensAscii,
  tokenizeAsciiNative,
  sentenceTokenizePunktAsciiNative,
  porterStemAscii
} from 'bun_nltk/src/native';

// Direct function calls - no initialization needed
const text = "This is a sample text. It has sentences.";

const tokenCount = countTokensAscii(text);
// => 8

const tokens = tokenizeAsciiNative(text);
// => ['this', 'is', 'a', 'sample', 'text', 'it', 'has', 'sentences']

const sentences = sentenceTokenizePunktAsciiNative(text);
// => ['This is a sample text.', 'It has sentences.']

const stem = porterStemAscii('running');
// => 'run'

WASM Runtime API

import { WasmNltk } from 'bun_nltk/src/wasm';

// Initialization required
const wasm = await WasmNltk.init();

const text = "This is a sample text. It has sentences.";

// Same functionality, class-based API
const tokenCount = wasm.countTokensAscii(text);
// => 8

const tokens = wasm.tokenizeAscii(text);
// => ['this', 'is', 'a', 'sample', 'text', 'it', 'has', 'sentences']

const sentences = wasm.sentenceTokenizePunktAscii(text);
// => ['This is a sample text.', 'It has sentences.']

// Cleanup when done
wasm.dispose();
The WASM API uses a memory pool that persists across calls. Call dispose() to free resources when finished.
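If an analysis call throws before dispose() runs, the pool is leaked for the life of the page or process. A minimal guard, sketched under the assumption that only the dispose() method shown above exists; withDisposal is an illustrative helper, not a bun_nltk export:

```typescript
// Runs `work` against a disposable resource and guarantees cleanup,
// even when `work` throws.
function withDisposal<T extends { dispose(): void }, R>(
  resource: T,
  work: (r: T) => R,
): R {
  try {
    return work(resource);
  } finally {
    resource.dispose(); // Always frees the memory pool
  }
}
```

Usage might look like `withDisposal(await WasmNltk.init(), w => w.tokenizeAscii(text))`.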

When to Use Native Runtime

✅ Use Native When:

  1. Maximum performance is critical
    • Processing large corpora (>100MB)
    • Real-time NLP pipelines
    • Batch processing workloads
  2. Running on supported platforms
    • Linux x64 servers
    • Windows x64 development machines
    • CI/CD environments with prebuilt binary support
  3. Node.js/Bun server applications
    • Backend APIs
    • ETL pipelines
    • Data processing scripts
  4. SIMD acceleration available
    • x86_64 processors (Intel/AMD)
    • Workloads that benefit from vectorization

Example: High-Throughput Server

import { 
  tokenizeAsciiNative,
  posTagAsciiNative,
  sentenceTokenizePunktAsciiNative
} from 'bun_nltk/src/native';

// Express/Fastify/Bun route handler
app.post('/analyze', async (req, res) => {
  const { text } = req.body;
  
  // Native runtime for maximum throughput
  const sentences = sentenceTokenizePunktAsciiNative(text);
  const tokens = tokenizeAsciiNative(text);
  const posTags = posTagAsciiNative(text);
  
  res.json({ sentences, tokens, posTags });
});

When to Use WASM Runtime

✅ Use WASM When:

  1. Browser/edge deployment
    • Client-side NLP processing
    • Cloudflare Workers
    • Deno Deploy
    • Browser extensions
  2. Cross-platform compatibility required
    • Unsupported architectures (ARM, macOS ARM64)
    • Multi-platform distribution
    • Environments without native binary support
  3. Sandboxed environments
    • Security-critical applications
    • Untrusted code execution
    • Serverless functions with limited FFI support
  4. Small to medium datasets
    • Less than 10MB text processing
    • Interactive user input analysis
    • Real-time UI features

Example: Browser-Based Text Analysis

import { WasmNltk } from 'bun_nltk/src/wasm';

let nltkWasm: WasmNltk | null = null;

// Initialize on page load
async function initNLP() {
  nltkWasm = await WasmNltk.init();
  console.log('NLP ready!');
}

// Analyze user input
function analyzeText(text: string) {
  if (!nltkWasm) return;
  
  const sentences = nltkWasm.sentenceTokenizePunktAscii(text);
  const tokens = nltkWasm.tokenizeAscii(text);
  const metrics = nltkWasm.computeAsciiMetrics(text, 2);
  
  return { sentences, tokens, metrics };
}

// Cleanup on page unload
window.addEventListener('beforeunload', () => {
  nltkWasm?.dispose();
});

Example: Cloudflare Worker

import { WasmNltk } from 'bun_nltk/src/wasm';

export default {
  async fetch(request: Request): Promise<Response> {
    // Note: for high-traffic Workers, consider caching the instance
    // across requests rather than re-initializing on every fetch
    const wasm = await WasmNltk.init();
    
    const { text } = (await request.json()) as { text: string };
    const tokens = wasm.tokenizeAscii(text);
    const count = wasm.countTokensAscii(text);
    
    wasm.dispose();
    
    return new Response(JSON.stringify({ tokens, count }));
  },
};

Memory Usage Patterns

Native Runtime Memory Model

Zero-copy input processing:
const text = "large corpus text...";
const bytes = new TextEncoder().encode(text);

// Zig receives pointer to bytes.buffer - no copy
const count = lib.symbols.bunnltk_count_tokens_ascii(
  ptr(bytes),
  bytes.length
);
Pre-allocated output buffers:
// JavaScript allocates output arrays
const capacity = 10000;
const offsets = new Uint32Array(capacity);
const lengths = new Uint32Array(capacity);

// Zig fills pre-allocated buffers
lib.symbols.bunnltk_fill_token_offsets_ascii(
  ptr(bytes),
  bytes.length,
  ptr(offsets),
  ptr(lengths),
  capacity
);
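On the JavaScript side, the filled offsets/lengths buffers can then be decoded back into token strings. A hedged sketch: decodeTokens is an illustrative helper (not a bun_nltk export), and `filled` stands for the number of tokens actually written, assumed here to be reported by the fill call:

```typescript
// Decodes token substrings out of the original byte buffer using the
// offset/length pairs the native call filled in.
function decodeTokens(
  bytes: Uint8Array,
  offsets: Uint32Array,
  lengths: Uint32Array,
  filled: number,
): string[] {
  const decoder = new TextDecoder();
  const tokens: string[] = [];
  for (let i = 0; i < filled; i++) {
    // subarray() creates a view, not a copy, so decoding stays cheap
    tokens.push(decoder.decode(bytes.subarray(offsets[i], offsets[i] + lengths[i])));
  }
  return tokens;
}
```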

WASM Runtime Memory Model

Memory pool reuse:
private ensureBlock(key: string, bytes: number): PoolBlock {
  const existing = this.blocks.get(key);
  if (existing && existing.bytes >= bytes) return existing;
  
  // Allocate new block if needed
  const ptr = this.exports.bunnltk_wasm_alloc(bytes);
  const block = { ptr, bytes };
  this.blocks.set(key, block);
  return block;
}
The WASM runtime reuses memory blocks across calls to minimize allocation overhead:
// First call allocates
wasm.tokenizeAscii("short text");

// Second call reuses same memory
wasm.tokenizeAscii("another short text");

// Larger input triggers reallocation
wasm.tokenizeAscii("much longer text...");

Platform Support Matrix

Native Runtime Support

| Platform | Architecture | Status | Binary Location |
| --- | --- | --- | --- |
| Linux | x64 | ✅ Supported | native/prebuilt/linux-x64/bun_nltk.so |
| Windows | x64 | ✅ Supported | native/prebuilt/win32-x64/bun_nltk.dll |
| macOS | x64 | ❌ Build manually | Set BUN_NLTK_NATIVE_LIB |
| macOS | ARM64 | ❌ Build manually | Set BUN_NLTK_NATIVE_LIB |
| Linux | ARM64 | ❌ Build manually | Set BUN_NLTK_NATIVE_LIB |
Custom build for unsupported platforms:
# Build native library
bun run build:zig

# Set environment variable
export BUN_NLTK_NATIVE_LIB=/path/to/custom/bun_nltk.so

# Use normally
node your-script.js

WASM Runtime Support

| Environment | Status | Notes |
| --- | --- | --- |
| Node.js 16+ | ✅ Supported | Native WebAssembly API |
| Bun 1.0+ | ✅ Supported | Native WebAssembly API |
| Deno 1.0+ | ✅ Supported | Native WebAssembly API |
| Chrome/Edge | ✅ Supported | Tested in CI |
| Firefox | ✅ Supported | Tested in CI |
| Safari | ⚠️ Should work | Not CI-tested |
| Cloudflare Workers | ✅ Supported | WASM runtime available |

Browser WASM Example

Complete browser integration:
<!DOCTYPE html>
<html>
<head>
  <title>NLP in Browser</title>
</head>
<body>
  <textarea id="input" rows="10" cols="50"></textarea>
  <button id="analyze">Analyze</button>
  <div id="output"></div>

  <script type="module">
    // NOTE: for production, serve a bundled build; browsers cannot
    // import TypeScript sources directly
    import { WasmNltk } from './node_modules/bun_nltk/src/wasm.ts';
    
    let nltk;
    
    async function init() {
      nltk = await WasmNltk.init();
      console.log('NLTK loaded!');
    }
    
    document.getElementById('analyze').addEventListener('click', () => {
      if (!nltk) return; // init() may not have resolved yet
      const text = document.getElementById('input').value;
      
      const sentences = nltk.sentenceTokenizePunktAscii(text);
      const tokens = nltk.tokenizeAscii(text);
      const metrics = nltk.computeAsciiMetrics(text, 2);
      
      document.getElementById('output').innerHTML = `
        <h3>Results:</h3>
        <p>Sentences: ${sentences.length}</p>
        <p>Tokens: ${metrics.tokens}</p>
        <p>Unique tokens: ${metrics.uniqueTokens}</p>
        <p>Bigrams: ${metrics.ngrams}</p>
        <p>Unique bigrams: ${metrics.uniqueNgrams}</p>
      `;
    });
    
    init();
  </script>
</body>
</html>

Migration Guide

Switching from Native to WASM

// Before (Native)
import { countTokensAscii } from 'bun_nltk/src/native';
const count = countTokensAscii(text);

// After (WASM)
import { WasmNltk } from 'bun_nltk/src/wasm';
const wasm = await WasmNltk.init();
const count = wasm.countTokensAscii(text);
wasm.dispose(); // Remember cleanup

Switching from WASM to Native

// Before (WASM)
import { WasmNltk } from 'bun_nltk/src/wasm';
const wasm = await WasmNltk.init();
const tokens = wasm.tokenizeAscii(text);
wasm.dispose();

// After (Native)
import { tokenizeAsciiNative } from 'bun_nltk/src/native';
const tokens = tokenizeAsciiNative(text); // No init/cleanup
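To make a future switch in either direction cheap, application code can depend on a small interface rather than a concrete runtime. A hedged sketch: AsciiTokenizer and uniqueTokenCount are illustrative names, not bun_nltk exports:

```typescript
// Runtime-agnostic surface: either runtime satisfies it with a
// one-line adapter.
interface AsciiTokenizer {
  tokenize(text: string): string[];
}

// Application code written against the interface works unchanged with
// either adapter.
function uniqueTokenCount(tok: AsciiTokenizer, text: string): number {
  return new Set(tok.tokenize(text)).size;
}
```

Adapters might look like `{ tokenize: tokenizeAsciiNative }` for native and `{ tokenize: t => wasm.tokenizeAscii(t) }` for WASM.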

Performance Tuning Tips

Native Runtime Optimization

  1. Batch operations: Avoid repeated FFI calls
    // ❌ Slow: Multiple FFI calls
    const stems = tokens.map(t => porterStemAscii(t));
    
    // ✅ Fast: Single batch call
    const stems = porterStemAsciiTokens(tokens);
    
  2. Reuse allocations: Pre-size output buffers
    const capacity = countTokensAscii(text);
    const offsets = new Uint32Array(capacity); // Exact size
    
  3. Use native metrics: Get multiple counts in one call
    // ✅ Single FFI call for 4 metrics
    const metrics = computeAsciiMetrics(text, 2);
    

WASM Runtime Optimization

  1. Initialize once: Reuse WasmNltk instance
    const wasm = await WasmNltk.init();
    // Use for multiple operations
    wasm.dispose(); // Only at end
    
  2. Smaller batches: WASM has memory limits
    // Process in chunks if text > 10MB
    const chunkSize = 1024 * 1024; // 1MB
    for (let i = 0; i < text.length; i += chunkSize) {
      // NOTE: fixed-size slicing can split a token at a chunk boundary,
      // which skews counts; prefer cutting at whitespace when exactness matters
      const chunk = text.slice(i, i + chunkSize);
      const result = wasm.tokenizeAscii(chunk);
    }
    
  3. Custom WASM path: Optimize loading
    const wasm = await WasmNltk.init({
      wasmPath: '/static/bun_nltk.wasm' // CDN or local
    });
    
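Fixed-size slicing as in tip 2 can cut a token in half at a chunk boundary. A boundary-aware splitter can back up to the last space in each chunk; a minimal sketch (chunkText is illustrative, not part of bun_nltk):

```typescript
// Splits text into chunks of roughly `target` characters, preferring to
// cut at a space so no token straddles a boundary.
function chunkText(text: string, target = 1024 * 1024): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + target, text.length);
    if (end < text.length) {
      const ws = text.lastIndexOf(' ', end);
      if (ws > start) end = ws; // Back up to the last space in the chunk
    }
    chunks.push(text.slice(start, end));
    start = end;
    if (text[start] === ' ') start++; // Skip the separator itself
  }
  return chunks;
}
```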

Recommendation Summary

Default recommendation: Use native runtime for Node.js/Bun server applications, and WASM runtime for browsers and edge deployments.
Choose Native if:
  • You need maximum performance (2-3x faster than WASM)
  • You’re on Linux x64 or Windows x64
  • You’re processing large datasets (>10MB)
  • You want SIMD acceleration
Choose WASM if:
  • You need browser support
  • You need cross-platform compatibility
  • You’re on an unsupported architecture
  • You’re in a sandboxed environment (Cloudflare Workers, etc.)
  • Your datasets are small (less than 10MB)
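The decision list above can be encoded as a startup check. A hedged sketch mirroring the support matrix in this document; preferNativeRuntime is an illustrative helper, and the strings follow Node's process.platform / process.arch conventions:

```typescript
// True when the prebuilt native binaries apply (Linux x64, Windows x64
// per the platform support matrix); otherwise fall back to WASM.
function preferNativeRuntime(platform: string, arch: string): boolean {
  return (platform === 'linux' || platform === 'win32') && arch === 'x64';
}

// e.g. const useNative = preferNativeRuntime(process.platform, process.arch);
```

Environments with a custom build exposed via BUN_NLTK_NATIVE_LIB could extend the check accordingly.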
