Part-of-speech (POS) tagging assigns grammatical categories (noun, verb, adjective, etc.) to each word in text. bun_nltk provides both rule-based and machine learning approaches.
Quick Start
import { posTagAsciiNative } from "bun_nltk";
const text = "The quick brown fox jumps over the lazy dog";
const tags = posTagAsciiNative(text);
// [
// { token: "The", tag: "DT", tagId: 6, start: 0, length: 3 },
// { token: "quick", tag: "NN", tagId: 0, start: 4, length: 5 },
// { token: "brown", tag: "NN", tagId: 0, start: 10, length: 5 },
// { token: "fox", tag: "NN", tagId: 0, start: 16, length: 3 },
// { token: "jumps", tag: "NN", tagId: 0, start: 20, length: 5 },
// { token: "over", tag: "NN", tagId: 0, start: 26, length: 4 },
// { token: "the", tag: "DT", tagId: 6, start: 31, length: 3 },
// { token: "lazy", tag: "NN", tagId: 0, start: 35, length: 4 },
// { token: "dog", tag: "NN", tagId: 0, start: 40, length: 3 }
// ]
Tag Set
bun_nltk uses a simplified Penn Treebank tag set:
| Tag | Description | Examples |
|---|---|---|
| NN | Noun (common) | cat, tree, idea |
| NNP | Noun (proper) | London, John, Microsoft |
| CD | Cardinal number | 1, 42, thousand |
| VBG | Verb (gerund/present participle) | running, eating |
| VBD | Verb (past tense) | walked, jumped |
| RB | Adverb | quickly, very |
| DT | Determiner | the, a, this |
| CC | Coordinating conjunction | and, or, but |
| PRP | Personal pronoun | I, you, he, she |
| VB | Verb (base form) | is, have, do |
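The tagId values in this page's sample outputs are consistent with the row order of this table (tagId = row index). This is an observation from the examples, not a documented guarantee:

```typescript
// Tag vocabulary in the order implied by the sample outputs (tagId = index).
const TAGS = ["NN", "NNP", "CD", "VBG", "VBD", "RB", "DT", "CC", "PRP", "VB"];

console.log(TAGS.indexOf("DT"));  // 6, matching { tag: "DT", tagId: 6 }
console.log(TAGS.indexOf("PRP")); // 8, matching { tag: "PRP", tagId: 8 }
```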
Fast Rule-Based Tagger
A lightweight heuristic tagger that returns results instantly, with no model to load.
Basic Usage
import { posTagAscii } from "bun_nltk";
const text = "She is running quickly";
const tags = posTagAscii(text);
console.log(tags);
// [
// { token: "She", tag: "PRP", tagId: 8, start: 0, length: 3 },
// { token: "is", tag: "VB", tagId: 9, start: 4, length: 2 },
// { token: "running", tag: "VBG", tagId: 3, start: 7, length: 7 },
// { token: "quickly", tag: "RB", tagId: 5, start: 15, length: 7 }
// ]
Return Type:
type PosTag = {
token: string; // Original token
tag: string; // POS tag (e.g., "NN", "VB")
tagId: number; // Numeric tag ID
start: number; // Character offset in original text
length: number; // Token length in characters
};
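Because each result carries start and length offsets, tokens can be mapped back to spans of the original string. A small self-contained illustration (the PosTag values here are hand-written for demonstration; in practice they come from the tagger):

```typescript
type PosTag = {
  token: string;
  tag: string;
  tagId: number;
  start: number;
  length: number;
};

// Hand-written tags for demonstration; in practice these come from posTagAscii.
const text = "She is running";
const tags: PosTag[] = [
  { token: "She", tag: "PRP", tagId: 8, start: 0, length: 3 },
  { token: "is", tag: "VB", tagId: 9, start: 4, length: 2 },
  { token: "running", tag: "VBG", tagId: 3, start: 7, length: 7 },
];

// Offsets slice back to the original tokens.
const spans = tags.map(t => text.slice(t.start, t.start + t.length));
console.log(spans); // ["She", "is", "running"]
```

This is useful for highlighting or annotating the original text without re-tokenizing it.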
Heuristic Rules
The rule-based tagger uses these patterns:
Numbers
// Pattern: /^\d+$/
"42" → CD
"2024" → CD
Pronouns
// Closed list
"I", "you", "he", "she", "it" → PRP
"me", "him", "her", "us", "them" → PRP
Determiners
// Closed list
"a", "an", "the" → DT
"this", "that", "these", "those" → DT
Conjunctions
// Closed list
"and", "or", "but", "yet", "nor" → CC
Verb Forms
// Closed list + patterns
"is", "am", "are", "was", "were" → VB
"do", "does", "did", "have", "has", "had" → VB
/ing$/ → VBG
/ed$/ → VBD
Adverbs
// Pattern
/ly$/ → RB
"quickly", "slowly", "carefully" → RB
Proper Nouns
// Capitalization
/^[A-Z]/ && length > 1 → NNP
"London", "Microsoft" → NNP
Common Nouns
// Default fallback
Everything else → NN
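The cascade above can be sketched in plain TypeScript. This is a simplified illustration of the listed rules, not the library's actual implementation:

```typescript
// Simplified rule cascade mirroring the heuristics listed above.
const PRONOUNS = new Set(["i", "you", "he", "she", "it", "me", "him", "her", "us", "them"]);
const DETERMINERS = new Set(["a", "an", "the", "this", "that", "these", "those"]);
const CONJUNCTIONS = new Set(["and", "or", "but", "yet", "nor"]);
const VERBS = new Set(["is", "am", "are", "was", "were", "do", "does", "did", "have", "has", "had"]);

function tagToken(token: string): string {
  const lower = token.toLowerCase();
  if (/^\d+$/.test(token)) return "CD";                       // numbers
  if (PRONOUNS.has(lower)) return "PRP";                      // closed lists
  if (DETERMINERS.has(lower)) return "DT";
  if (CONJUNCTIONS.has(lower)) return "CC";
  if (VERBS.has(lower)) return "VB";
  if (/ing$/.test(lower)) return "VBG";                       // suffix patterns
  if (/ed$/.test(lower)) return "VBD";
  if (/ly$/.test(lower)) return "RB";
  if (/^[A-Z]/.test(token) && token.length > 1) return "NNP"; // capitalization
  return "NN";                                                // default fallback
}

console.log("She is running quickly".split(" ").map(tagToken));
// ["PRP", "VB", "VBG", "RB"]
```

Note that the closed lists are checked before the suffix patterns, so "red" stays in the closed-list path only if it matches one; otherwise /ed$/ would (incorrectly) tag it VBD, which is the kind of error the perceptron tagger avoids.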
Perceptron Tagger
A machine learning-based tagger that is more accurate than the rule-based heuristics.
Using a Pre-trained Model
Load Model
import { loadPerceptronTaggerModel } from "bun_nltk";
// Load default bundled model
const model = loadPerceptronTaggerModel();
// Or load custom model
const customModel = loadPerceptronTaggerModel(
"./models/custom_tagger.json"
);
Tag Text
import { posTagPerceptronAscii } from "bun_nltk";
const text = "The cat sat on the mat";
const tags = posTagPerceptronAscii(text, { model });
console.log(tags);
// [
// { token: "The", tag: "DT", tagId: 6, start: 0, length: 3 },
// { token: "cat", tag: "NN", tagId: 0, start: 4, length: 3 },
// { token: "sat", tag: "VBD", tagId: 4, start: 8, length: 3 },
// ...
// ]
Performance Options
import { posTagPerceptronAscii } from "bun_nltk";
type PerceptronTaggerOptions = {
model?: PerceptronTaggerModel; // Pre-loaded model
wasm?: WasmNltk; // WASM runtime
useWasm?: boolean; // Prefer WASM (default: false)
useNative?: boolean; // Prefer native (default: true)
};
// Use native implementation (fastest)
const tags1 = posTagPerceptronAscii(text, {
model,
useNative: true
});
// Use WASM implementation
const tags2 = posTagPerceptronAscii(text, {
model,
useWasm: true,
wasm: wasmInstance
});
// Use pure JavaScript fallback
const tags3 = posTagPerceptronAscii(text, {
model,
useNative: false
});
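One plausible way to read these flags is as a backend-selection order: WASM when explicitly requested, native by default, pure JavaScript when both accelerated paths are disabled. This is an illustration consistent with the three examples above, not the library's documented resolution logic:

```typescript
type Opts = { useWasm?: boolean; useNative?: boolean };

// Illustrative backend choice: wasm when requested, native by default,
// pure JS when the native path is explicitly disabled.
function pickBackend(opts: Opts): "native" | "wasm" | "js" {
  if (opts.useWasm) return "wasm";
  if (opts.useNative ?? true) return "native";
  return "js";
}

console.log(pickBackend({}));                   // "native" (the default)
console.log(pickBackend({ useWasm: true }));    // "wasm"
console.log(pickBackend({ useNative: false })); // "js"
```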
Model Structure
type PerceptronTaggerModel = {
version: number; // Model version
tags: string[]; // Tag vocabulary
featureCount: number; // Number of features
tagCount: number; // Number of tags
featureIndex: Record<string, number>; // Feature name → ID
weights: Float32Array; // Model weights (featureCount × tagCount)
metadata?: Record<string, unknown>; // Optional metadata
};
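Assuming the weights array is laid out row-major, one row of tagCount weights per feature (as the comment above suggests), scoring a token reduces to summing the rows of its active features and taking the argmax. A hedged sketch with a toy model; the library's internal layout may differ:

```typescript
// Tiny toy model: 3 features, 2 tags, row-major weights (featureCount x tagCount).
const tags = ["NN", "VB"];
const featureIndex: Record<string, number> = { "bias": 0, "w=run": 1, "s3=ing": 2 };
const weights = new Float32Array([
  0.1, 0.0,   // bias
  -0.5, 0.8,  // w=run
  -0.2, 0.6,  // s3=ing
]);

function scoreTags(features: string[]): string {
  const scores = new Float32Array(tags.length);
  for (const f of features) {
    const fid = featureIndex[f];
    if (fid === undefined) continue; // unseen features contribute nothing
    for (let t = 0; t < tags.length; t++) {
      scores[t] += weights[fid * tags.length + t];
    }
  }
  // Pick the highest-scoring tag.
  let best = 0;
  for (let t = 1; t < tags.length; t++) if (scores[t] > scores[best]) best = t;
  return tags[best];
}

console.log(scoreTags(["bias", "w=run", "s3=ing"])); // "VB"
```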
Features Used
The perceptron tagger uses these features for each token:
// Position-based features
"bias" // Always present
"w=<token>" // Current word (lowercase)
"p1=<prefix>" // First character
"p2=<prefix>" // First 2 characters
"p3=<prefix>" // First 3 characters
"s1=<suffix>" // Last character
"s2=<suffix>" // Last 2 characters
"s3=<suffix>" // Last 3 characters
// Context features
"prev=<token>" // Previous token
"next=<token>" // Next token
// Token properties
"is_upper=True" // All uppercase
"is_title=True" // Starts with capital
"has_digit=True" // Contains digit
"has_hyphen=True" // Contains hyphen
Example with Features
import { posTagPerceptronAscii, loadPerceptronTaggerModel } from "bun_nltk";
const model = loadPerceptronTaggerModel();
const text = "John's running quickly";
const tags = posTagPerceptronAscii(text, { model });
// For "running":
// Features include:
// w=running
// p1=r, p2=ru, p3=run
// s1=g, s2=ng, s3=ing
// prev=john, next=quickly
// is_upper=False, is_title=False
// has_digit=False, has_hyphen=False
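The feature template can be reproduced in plain TypeScript. A sketch under the assumption that words are lowercased before prefix/suffix extraction (as the "running" example above suggests); the library's exact normalization may differ:

```typescript
// Extract the documented feature set for one token and its neighbors.
function extractFeatures(token: string, prev?: string, next?: string): string[] {
  const w = token.toLowerCase();
  const feats = ["bias", `w=${w}`];
  // Prefixes and suffixes up to 3 characters.
  for (let n = 1; n <= 3 && n <= w.length; n++) {
    feats.push(`p${n}=${w.slice(0, n)}`);
    feats.push(`s${n}=${w.slice(-n)}`);
  }
  // Context features.
  if (prev !== undefined) feats.push(`prev=${prev.toLowerCase()}`);
  if (next !== undefined) feats.push(`next=${next.toLowerCase()}`);
  // Token shape properties.
  feats.push(`is_upper=${token === token.toUpperCase() ? "True" : "False"}`);
  feats.push(`is_title=${/^[A-Z]/.test(token) ? "True" : "False"}`);
  feats.push(`has_digit=${/\d/.test(token) ? "True" : "False"}`);
  feats.push(`has_hyphen=${token.includes("-") ? "True" : "False"}`);
  return feats;
}

console.log(extractFeatures("running", "she", "quickly"));
// includes "w=running", "p3=run", "s3=ing", "prev=she", "next=quickly", ...
```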
Native Performance
import { posTagAsciiNative } from "bun_nltk";
const text = "The quick brown fox jumps";
const tags = posTagAsciiNative(text);
// Uses an optimized SIMD implementation,
// typically 10-100x faster than the JavaScript version
Use posTagAsciiNative for maximum performance with the built-in rule-based tagger.
Common Use Cases
Extract Nouns
import { posTagAsciiNative } from "bun_nltk";
const text = "The cat and dog played in the garden";
const tags = posTagAsciiNative(text);
const nouns = tags
.filter(t => t.tag === "NN" || t.tag === "NNP")
.map(t => t.token);
console.log(nouns);
// ["cat", "dog", "garden"]
Extract Proper Nouns
import { posTagAsciiNative } from "bun_nltk";
const text = "John visited Paris and met Mary at the Eiffel Tower";
const tags = posTagAsciiNative(text);
const properNouns = tags
.filter(t => t.tag === "NNP")
.map(t => t.token);
console.log(properNouns);
// ["John", "Paris", "Mary", "Eiffel", "Tower"]
Extract Verbs
import { posTagPerceptronAscii, loadPerceptronTaggerModel } from "bun_nltk";
const model = loadPerceptronTaggerModel();
const text = "She was running quickly and jumping high";
const tags = posTagPerceptronAscii(text, { model });
const verbs = tags
.filter(t => t.tag.startsWith("VB"))
.map(t => t.token);
console.log(verbs);
// ["was", "running", "jumping"]
Batch Processing
import { posTagAsciiNative } from "bun_nltk";
const sentences = [
"The cat sleeps",
"Dogs bark loudly",
"Birds fly high"
];
const allTags = sentences.map(s => posTagAsciiNative(s)); // arrow avoids passing map's extra (index, array) arguments
// Process results
for (const [i, tags] of allTags.entries()) {
console.log(`Sentence ${i}:`);
for (const tag of tags) {
console.log(` ${tag.token} → ${tag.tag}`);
}
}
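Batch results are easy to aggregate; for example, counting tag frequencies across all sentences. This sketch runs over hand-written results shaped like the documented output, since it does not call the library itself:

```typescript
// Count how often each tag appears across a batch of tagged sentences.
type Tagged = { token: string; tag: string };

// Hand-written results for demonstration; in practice these come from the tagger.
const batch: Tagged[][] = [
  [{ token: "The", tag: "DT" }, { token: "cat", tag: "NN" }, { token: "sleeps", tag: "NN" }],
  [{ token: "Dogs", tag: "NNP" }, { token: "bark", tag: "NN" }, { token: "loudly", tag: "RB" }],
];

const counts = new Map<string, number>();
for (const sentence of batch) {
  for (const { tag } of sentence) {
    counts.set(tag, (counts.get(tag) ?? 0) + 1);
  }
}
console.log([...counts.entries()]);
// [["DT", 1], ["NN", 3], ["NNP", 1], ["RB", 1]]
```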
Performance Comparison
- posTagAsciiNative: Fastest; simple heuristics
- posTagPerceptronAscii (native): Fast; more accurate
- posTagPerceptronAscii (wasm): Moderate speed; portable
- posTagAscii: Fast; JavaScript heuristics
The rule-based tagger is less accurate than the perceptron tagger, especially on ambiguous words. Use the perceptron tagger for production applications.
Preparing Custom Models
If you have a trained model in JSON format:
import { preparePerceptronTaggerModel } from "bun_nltk";
const modelJson = await Bun.file("custom_tagger.json").json();
const model = preparePerceptronTaggerModel(modelJson);
// Use model
const tags = posTagPerceptronAscii(text, { model });
Model JSON Format:
{
"version": 1,
"type": "perceptron",
"tags": ["NN", "VB", "DT", ...],
"feature_count": 50000,
"tag_count": 10,
"feature_index": {
"bias": 0,
"w=the": 1,
"w=is": 2,
...
},
"weights": [0.5, -0.3, 0.8, ...]
}
The weights array must have exactly feature_count × tag_count elements.
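A sketch of the validation and conversion that preparation presumably performs; the real preparePerceptronTaggerModel may differ, and this only illustrates the size check described above:

```typescript
type RawModelJson = {
  version: number;
  tags: string[];
  feature_count: number;
  tag_count: number;
  feature_index: Record<string, number>;
  weights: number[];
};

// Convert the on-disk JSON shape to the runtime structure, checking sizes.
function prepareModel(json: RawModelJson) {
  const expected = json.feature_count * json.tag_count;
  if (json.weights.length !== expected) {
    throw new Error(
      `weights has ${json.weights.length} elements, expected ${expected}`
    );
  }
  return {
    version: json.version,
    tags: json.tags,
    featureCount: json.feature_count,
    tagCount: json.tag_count,
    featureIndex: json.feature_index,
    weights: Float32Array.from(json.weights), // compact typed storage
  };
}

const toy: RawModelJson = {
  version: 1,
  tags: ["NN", "VB"],
  feature_count: 2,
  tag_count: 2,
  feature_index: { bias: 0, "w=the": 1 },
  weights: [0.1, -0.1, 0.4, 0.2],
};
const model = prepareModel(toy); // passes the size check
```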