Overview

The bun_nltk library provides full WebAssembly support, allowing you to run NLP operations directly in the browser with near-native performance. The WASM module is self-contained and requires no external dependencies.

Installation

First, install the package:
npm install bun_nltk
Make sure your bundler is configured to handle .wasm files. Most modern bundlers like Vite, Webpack 5+, and Parcel support this out of the box.
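For example, with Vite you can mark `.wasm` files as static assets so they are emitted to the build output and fetchable at runtime. This is a minimal sketch, not the only way to configure it; depending on your Vite version and how you import the module, this option may not be needed at all:

```javascript
// vite.config.js — minimal sketch; adjust paths and options to your setup
export default {
  // Treat .wasm files as static assets so they are copied to the output
  // directory and can be fetched by URL at runtime
  assetsInclude: ['**/*.wasm'],
};
```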

Basic Browser Example

<!DOCTYPE html>
<html>
<head>
  <title>bun_nltk Browser Example</title>
</head>
<body>
  <h1>NLP in the Browser</h1>
  <textarea id="input" rows="10" cols="50">
Natural language processing is amazing! It allows computers to understand human text.
This is a second sentence. And here's a third one.
  </textarea>
  <button onclick="processText()">Process</button>
  <div id="output"></div>

  <script type="module">
    import { WasmNltk } from 'bun_nltk/wasm';

    // Initialize the WASM module
    let nltk = null;

    async function initNltk() {
      if (!nltk) {
        // Load WASM bytes (adjust path to your bundler's output)
        const response = await fetch('/path/to/bun_nltk.wasm');
        const wasmBytes = new Uint8Array(await response.arrayBuffer());
        
        nltk = await WasmNltk.init({ wasmBytes });
        console.log('bun_nltk initialized!');
      }
      return nltk;
    }

    window.processText = async function() {
      const nltkInstance = await initNltk();
      const text = document.getElementById('input').value;
      
      // Tokenize
      const tokens = nltkInstance.tokenizeAscii(text);
      
      // Sentence tokenization
      const sentences = nltkInstance.sentenceTokenizePunktAscii(text);
      
      // Compute metrics
      const metrics = nltkInstance.computeAsciiMetrics(text, 2);
      
      // Display results
      const output = document.getElementById('output');
      output.innerHTML = `
        <h2>Results</h2>
        <h3>Tokens (${tokens.length}):</h3>
        <p>${tokens.join(', ')}</p>
        
        <h3>Sentences (${sentences.length}):</h3>
        <ol>
          ${sentences.map(s => `<li>${s}</li>`).join('')}
        </ol>
        
        <h3>Metrics:</h3>
        <ul>
          <li>Total tokens: ${metrics.tokens}</li>
          <li>Unique tokens: ${metrics.uniqueTokens}</li>
          <li>N-grams (2): ${metrics.ngrams}</li>
          <li>Unique n-grams: ${metrics.uniqueNgrams}</li>
        </ul>
      `;
    };
  </script>
</body>
</html>

Initialization Patterns

From Fetched Bytes (Browser)

import { WasmNltk } from 'bun_nltk/wasm';

// Fetch WASM module from CDN or your server
const response = await fetch('https://cdn.example.com/bun_nltk.wasm');
const wasmBytes = new Uint8Array(await response.arrayBuffer());

const nltk = await WasmNltk.init({ wasmBytes });

From Local Path (Node.js or Bun)

import { WasmNltk } from 'bun_nltk/wasm';

// Automatically loads from default path
const nltk = await WasmNltk.init();

// Or specify a custom path
const nltkCustom = await WasmNltk.init({
  wasmPath: './custom/path/bun_nltk.wasm'
});
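In Node.js you can also read the module bytes yourself and pass them as `wasmBytes`, which is useful when the default path resolution does not match your deployment layout. A minimal sketch — the `loadWasmBytes` helper below is illustrative, not part of the library:

```javascript
import { readFile } from 'node:fs/promises';

// Illustrative helper: read a .wasm file from disk into the Uint8Array
// shape that WasmNltk.init({ wasmBytes }) accepts
async function loadWasmBytes(path) {
  const buffer = await readFile(path);
  return new Uint8Array(buffer);
}

// Usage sketch:
// const nltk = await WasmNltk.init({ wasmBytes: await loadWasmBytes('./bun_nltk.wasm') });
```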

Singleton Pattern

For applications that need a single shared instance:
let nltkInstance = null;

export async function getNltk() {
  if (!nltkInstance) {
    const response = await fetch('/bun_nltk.wasm');
    const wasmBytes = new Uint8Array(await response.arrayBuffer());
    nltkInstance = await WasmNltk.init({ wasmBytes });
  }
  return nltkInstance;
}

// Usage
const nltk = await getNltk();
const tokens = nltk.tokenizeAscii('Hello world!');

Memory Management

Memory Pool Reuse

The WasmNltk class uses an internal memory pool to avoid repeated allocations. Memory blocks are automatically reused across operations:
const nltk = await WasmNltk.init({ wasmBytes });

// First call allocates memory for "offsets" block
const tokens1 = nltk.tokenizeAscii('First text');

// Second call reuses the same memory block if large enough
const tokens2 = nltk.tokenizeAscii('Second text');

// Larger input may trigger reallocation
const tokens3 = nltk.tokenizeAscii('Much longer text...'.repeat(100));

Memory Blocks

The following memory blocks are managed internally:
  • offsets / lengths - Token offset arrays
  • norm_offsets / norm_lengths - Normalized token arrays
  • sent_offsets / sent_lengths - Sentence offset arrays
  • metrics - Metric computation output
  • perceptron_* - POS tagging arrays
  • lm_* - Language model arrays
  • chunk_* - Chunking arrays
  • cyk_* - Parser arrays
  • nb_* - Naive Bayes arrays

Manual Cleanup

When you’re done with the WASM instance, call dispose() to free all allocated memory:
const nltk = await WasmNltk.init({ wasmBytes });

// Use the instance
const tokens = nltk.tokenizeAscii('Hello world');

// Clean up when done
nltk.dispose();

// Don't use nltk after dispose() - create a new instance if needed

Best Practices

  1. Reuse instances: Create one instance and reuse it across operations
  2. Dispose when done: Call dispose() when the instance is no longer needed
  3. Avoid repeated initialization: Initialize once and cache the instance
  4. Input size limits: The input buffer has a fixed capacity; inputs that exceed it throw an error (see Input Buffer Limits)
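A small helper can enforce the first three practices at once: memoize the initialization promise so that concurrent callers share a single in-flight initialization instead of racing to create several instances. The `once` helper below is an illustrative sketch, not part of the library:

```javascript
// Illustrative helper: cache the first call's promise so the async
// initializer runs at most once, even under concurrent calls
function once(initFn) {
  let promise = null;
  return () => (promise ??= initFn());
}

// Usage sketch:
// const getNltk = once(async () => {
//   const response = await fetch('/bun_nltk.wasm');
//   const wasmBytes = new Uint8Array(await response.arrayBuffer());
//   return WasmNltk.init({ wasmBytes });
// });
```

Because the promise itself is cached (rather than the resolved instance), two components calling `getNltk()` in the same tick still trigger only one initialization.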

Error Handling

The WASM module tracks error codes internally:
try {
  const nltk = await WasmNltk.init({ wasmBytes });
  const tokens = nltk.tokenizeAscii('Some text');
} catch (error) {
  // Errors include context information
  console.error('WASM error:', error.message);
  // Example: "wasm error code 1 in countNgramsAscii"
}

Input Buffer Limits

The WASM module uses a pre-allocated input buffer:
const nltk = await WasmNltk.init({ wasmBytes });

// This will throw if text is too large
try {
  const veryLongText = 'word '.repeat(1000000);
  nltk.tokenizeAscii(veryLongText);
} catch (error) {
  console.error(error.message);
  // "input too large for wasm input buffer: 5000000 > 1048576"
}
For large texts, chunk them before processing:
function chunkText(text, maxSize = 100000) {
  const chunks = [];
  for (let i = 0; i < text.length; i += maxSize) {
    chunks.push(text.slice(i, i + maxSize));
  }
  return chunks;
}

const chunks = chunkText(veryLongText);
const allTokens = [];

for (const chunk of chunks) {
  allTokens.push(...nltk.tokenizeAscii(chunk));
}
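Note that slicing at fixed offsets can cut a word in half at a chunk boundary, yielding two partial tokens. If that matters for your use case, back up to the last whitespace before each cut — a sketch:

```javascript
// Split text into chunks of at most maxSize characters, backing up to the
// last space so no word is cut in half (falls back to a hard cut when a
// single run of non-space characters exceeds maxSize)
function chunkTextAtWhitespace(text, maxSize = 100000) {
  const chunks = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + maxSize, text.length);
    if (end < text.length) {
      const lastSpace = text.lastIndexOf(' ', end - 1);
      if (lastSpace > start) end = lastSpace + 1; // keep the space in the left chunk
    }
    chunks.push(text.slice(start, end));
    start = end;
  }
  return chunks;
}
```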

Complete React Example

import React, { useState, useEffect } from 'react';
import { WasmNltk } from 'bun_nltk/wasm';

function NlpProcessor() {
  const [nltk, setNltk] = useState(null);
  const [text, setText] = useState('');
  const [results, setResults] = useState(null);
  const [loading, setLoading] = useState(true);

  useEffect(() => {
    // Track the instance locally: the `nltk` state variable is stale (null)
    // inside this effect's cleanup closure, so dispose the local reference
    let instance = null;
    let cancelled = false;

    async function initWasm() {
      try {
        const response = await fetch('/bun_nltk.wasm');
        const wasmBytes = new Uint8Array(await response.arrayBuffer());
        instance = await WasmNltk.init({ wasmBytes });
        if (cancelled) {
          // Component unmounted while init was in flight
          instance.dispose();
          return;
        }
        setNltk(instance);
        setLoading(false);
      } catch (error) {
        console.error('Failed to initialize WASM:', error);
      }
    }
    initWasm();

    // Cleanup on unmount
    return () => {
      cancelled = true;
      if (instance) {
        instance.dispose();
      }
    };
  }, []);

  const processText = () => {
    if (!nltk || !text) return;

    const tokens = nltk.tokenizeAscii(text);
    const sentences = nltk.sentenceTokenizePunktAscii(text);
    const metrics = nltk.computeAsciiMetrics(text, 2);
    const normalized = nltk.normalizeTokensAscii(text, true);

    setResults({ tokens, sentences, metrics, normalized });
  };

  if (loading) {
    return <div>Loading WASM module...</div>;
  }

  return (
    <div>
      <h1>NLP Processor</h1>
      <textarea
        value={text}
        onChange={(e) => setText(e.target.value)}
        rows={10}
        cols={50}
        placeholder="Enter text to process..."
      />
      <button onClick={processText}>Process</button>
      
      {results && (
        <div>
          <h2>Results</h2>
          <p><strong>Tokens:</strong> {results.tokens.join(', ')}</p>
          <p><strong>Sentences:</strong> {results.sentences.length}</p>
          <p><strong>Normalized:</strong> {results.normalized.join(', ')}</p>
          <pre>{JSON.stringify(results.metrics, null, 2)}</pre>
        </div>
      )}
    </div>
  );
}

export default NlpProcessor;

Browser Compatibility

  • Chrome/Edge: Full support (v57+)
  • Firefox: Full support (v52+)
  • Safari: Full support (v11+)
  • Node.js: v12+ (with --experimental-wasm-modules flag in older versions)
  • Bun: Full native support
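Before initializing, you can feature-detect WebAssembly and show a fallback on the rare engines that lack it — a sketch:

```javascript
// Returns true when the runtime can compile and instantiate WebAssembly
function supportsWasm() {
  return typeof WebAssembly === 'object' &&
    typeof WebAssembly.instantiate === 'function';
}

// if (!supportsWasm()) { /* render a fallback UI instead of calling WasmNltk.init */ }
```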
