Chunking

regexpChunkParse

Parse POS-tagged tokens into chunks using regular expression patterns.

function regexpChunkParse(
  tokens: TaggedToken[],
  grammar: string
): ChunkElement[]

Parameters

tokens

TaggedToken[]

required

Array of tokens with POS tags. Each token has:

token: string - The word/token
tag: string - POS tag (e.g., "NN", "VB", "JJ")

grammar

string

required

Chunking grammar rules. Each rule follows the format:

Label: {<TagPattern><Quantifier> ...}

Tag Patterns:

<NN.*> - Matches any tag starting with NN (nouns)
<JJ> - Matches exactly JJ (adjectives)
<VB|MD> - Matches VB or MD (verbs or modals)

Quantifiers:

? - Zero or one occurrence
* - Zero or more occurrences
+ - One or more occurrences
(none) - Exactly one occurrence
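One way to read these tag patterns and quantifiers is as ordinary regular expressions applied to the tag sequence. The sketch below illustrates that reading only; it is not bun_nltk's implementation, and `compileTagPattern` and `encodeTags` are hypothetical helpers introduced for this example. It assumes tags never contain `<` or `>`.

```typescript
// Hypothetical helper (not part of bun_nltk): turns a rule body such as
// "<DT>?<JJ>*<NN.*>+" into a RegExp that runs over an encoded tag string
// like "<DT><JJ><NN>".
function compileTagPattern(body: string): RegExp {
  const source = body.replace(
    /<([^<>]+)>([?*+]?)/g,
    (_match, tag: string, quantifier: string) => {
      // Stop "." from matching across tag boundaries.
      const safeTag = tag.replace(/\./g, "[^<>]");
      // Wrap each tag in a group so the quantifier applies to the whole tag.
      return `(?:<(?:${safeTag})>)${quantifier}`;
    }
  );
  return new RegExp(source, "g");
}

// Hypothetical helper: encodes a tag list as "<TAG><TAG>...".
function encodeTags(tags: string[]): string {
  return tags.map((t) => `<${t}>`).join("");
}

const npPattern = compileTagPattern("<DT>?<JJ>*<NN.*>+");
const encoded = encodeTags(["DT", "JJ", "JJ", "NN", "VBZ"]);
const npMatches = encoded.match(npPattern) ?? [];
console.log(npMatches); // ["<DT><JJ><JJ><NN>"]

// <VB|MD> matches VB or MD exactly, so VBZ is rejected.
console.log(compileTagPattern("<VB|MD>").test("<VBZ>")); // false
```

Encoding the tags into one string lets the quantifiers `?`, `*`, and `+` keep their usual regular-expression meaning, applied to whole tags rather than single characters.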

Returns

Array of chunk elements, where each element is either:

TaggedToken - Unchunked token with token and tag
ChunkNode - Chunked phrase with:
- kind: "chunk"
- label: string - Chunk type (e.g., "NP", "VP")
- tokens: TaggedToken[] - Tokens in the chunk
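Because the result mixes plain tokens and chunk nodes, callers usually need to tell the two apart. The sketch below redeclares the types locally from the shapes described above (an assumption; the library's own type exports may differ in detail) and uses the `kind` field as a discriminator. `isChunkNode` and `phrasesWithLabel` are hypothetical helpers, not part of the documented API.

```typescript
// Local declarations mirroring the documented shapes (assumed, redeclared
// here so the snippet stands alone).
interface TaggedToken {
  token: string;
  tag: string;
}
interface ChunkNode {
  kind: "chunk";
  label: string;
  tokens: TaggedToken[];
}
type ChunkElement = TaggedToken | ChunkNode;

// Hypothetical type guard: only chunk nodes carry kind === "chunk".
function isChunkNode(element: ChunkElement): element is ChunkNode {
  return "kind" in element && element.kind === "chunk";
}

// Hypothetical helper: the text of every chunk with the given label.
function phrasesWithLabel(elements: ChunkElement[], label: string): string[] {
  return elements
    .filter(isChunkNode)
    .filter((chunk) => chunk.label === label)
    .map((chunk) => chunk.tokens.map((t) => t.token).join(" "));
}

const sampleResult: ChunkElement[] = [
  {
    kind: "chunk",
    label: "NP",
    tokens: [
      { token: "The", tag: "DT" },
      { token: "dog", tag: "NN" },
    ],
  },
  { token: "runs", tag: "VBZ" },
];

const nounPhrases = phrasesWithLabel(sampleResult, "NP");
console.log(nounPhrases); // ["The dog"]
```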

Example

import { regexpChunkParse } from "bun_nltk";

const tokens = [
  { token: "The", tag: "DT" },
  { token: "quick", tag: "JJ" },
  { token: "brown", tag: "JJ" },
  { token: "fox", tag: "NN" },
  { token: "jumps", tag: "VBZ" },
];

const grammar = `
NP: {<DT>?<JJ>*<NN.*>+}
VP: {<VB.*>}
`;

const chunks = regexpChunkParse(tokens, grammar);
// [
//   {
//     kind: "chunk",
//     label: "NP",
//     tokens: [
//       { token: "The", tag: "DT" },
//       { token: "quick", tag: "JJ" },
//       { token: "brown", tag: "JJ" },
//       { token: "fox", tag: "NN" }
//     ]
//   },
//   { kind: "chunk", label: "VP", tokens: [{ token: "jumps", tag: "VBZ" }] }
// ]

Grammar Rules

Define chunk patterns with labels and tag sequences:

# Noun phrases
NP: {<DT>?<JJ>*<NN.*>+}

# Verb phrases
VP: {<VB.*><RB>?}

# Prepositional phrases
PP: {<IN><DT>?<NN.*>+}

Comments start with #. Rules can span multiple lines.
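To make the grammar format concrete, here is a minimal sketch of how such a grammar string could be split into labeled rules. It is an illustration under assumptions, not the library's parser: `parseGrammar` and `ChunkRule` are hypothetical names, and the sketch assumes `#` never appears inside a rule body.

```typescript
// Hypothetical shape for one parsed rule.
interface ChunkRule {
  label: string;
  pattern: string;
}

// Hypothetical helper (not part of bun_nltk): extracts "Label: {...}" rules.
function parseGrammar(grammar: string): ChunkRule[] {
  // Drop everything from "#" to the end of each line.
  const withoutComments = grammar.replace(/#.*$/gm, "");
  const rules: ChunkRule[] = [];
  // [^}]* also matches newlines, so a rule body may span several lines.
  const ruleRegex = /([A-Za-z_]\w*)\s*:\s*\{([^}]*)\}/g;
  let match: RegExpExecArray | null;
  while ((match = ruleRegex.exec(withoutComments)) !== null) {
    // Remove whitespace inside the braces so multi-line bodies are normalized.
    rules.push({ label: match[1], pattern: match[2].replace(/\s+/g, "") });
  }
  return rules;
}

const parsedRules = parseGrammar(`
# Noun phrases
NP: {<DT>?<JJ>*<NN.*>+}

# Verb phrases
VP: {<VB.*>
     <RB>?}
`);
console.log(parsedRules);
// [
//   { label: "NP", pattern: "<DT>?<JJ>*<NN.*>+" },
//   { label: "VP", pattern: "<VB.*><RB>?" }
// ]
```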

chunkTreeToIob

Convert a chunk tree to IOB (Inside-Outside-Beginning) format.

function chunkTreeToIob(tree: ChunkElement[]): IobRow[]

Parameters

tree

ChunkElement[]

required

Chunk tree returned by regexpChunkParse.

Returns

Array of IOB-tagged rows with:

token: string - The word/token
tag: string - POS tag
iob: string - IOB tag:
- "O" - Outside any chunk
- "B-Label" - Beginning of chunk with label
- "I-Label" - Inside chunk with label
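The tagging scheme above can be written out by hand in a few lines, which helps when checking what to expect from `chunkTreeToIob`. The sketch below is a hypothetical re-implementation for illustration (`toIobSketch` is not a library function), with the types redeclared locally from the documented shapes.

```typescript
// Local declarations mirroring the documented shapes (assumed).
interface TaggedToken {
  token: string;
  tag: string;
}
interface ChunkNode {
  kind: "chunk";
  label: string;
  tokens: TaggedToken[];
}
type ChunkElement = TaggedToken | ChunkNode;
interface IobRow {
  token: string;
  tag: string;
  iob: string;
}

// Hypothetical re-implementation of the IOB scheme, for illustration only.
function toIobSketch(tree: ChunkElement[]): IobRow[] {
  const rows: IobRow[] = [];
  for (const element of tree) {
    if ("kind" in element) {
      // First token of a chunk gets "B-", the remaining tokens get "I-".
      for (let i = 0; i < element.tokens.length; i++) {
        const t = element.tokens[i];
        const prefix = i === 0 ? "B" : "I";
        rows.push({ token: t.token, tag: t.tag, iob: `${prefix}-${element.label}` });
      }
    } else {
      // Tokens outside any chunk are tagged "O".
      rows.push({ token: element.token, tag: element.tag, iob: "O" });
    }
  }
  return rows;
}

const iobRows = toIobSketch([
  {
    kind: "chunk",
    label: "NP",
    tokens: [
      { token: "The", tag: "DT" },
      { token: "dog", tag: "NN" },
    ],
  },
  { token: "runs", tag: "VBZ" },
]);
console.log(iobRows.map((r) => r.iob)); // ["B-NP", "I-NP", "O"]
```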

Example

import { regexpChunkParse, chunkTreeToIob } from "bun_nltk";

const tokens = [
  { token: "The", tag: "DT" },
  { token: "dog", tag: "NN" },
  { token: "runs", tag: "VBZ" },
];

const grammar = "NP: {<DT><NN>}";
const chunks = regexpChunkParse(tokens, grammar);
const iob = chunkTreeToIob(chunks);

console.log(iob);
// [
//   { token: "The", tag: "DT", iob: "B-NP" },
//   { token: "dog", tag: "NN", iob: "I-NP" },
//   { token: "runs", tag: "VBZ", iob: "O" }
// ]

Use Cases

Training sequence labeling models
Named entity recognition data preparation
Chunk boundary detection
Converting between chunk representations
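For converting between chunk representations, the reverse direction (IOB rows back to a chunk tree) follows the same rules. The page does not document a library function for this, so `iobToTreeSketch` below is a hypothetical helper. It assumes tags have the form "O", "B-Label", or "I-Label", and it treats an "I-" tag with no matching open chunk as the start of a new chunk.

```typescript
// Local declarations mirroring the documented shapes (assumed).
interface TaggedToken {
  token: string;
  tag: string;
}
interface ChunkNode {
  kind: "chunk";
  label: string;
  tokens: TaggedToken[];
}
type ChunkElement = TaggedToken | ChunkNode;
interface IobRow {
  token: string;
  tag: string;
  iob: string;
}

// Hypothetical helper (not part of bun_nltk): rebuilds a chunk tree from IOB rows.
function iobToTreeSketch(rows: IobRow[]): ChunkElement[] {
  const tree: ChunkElement[] = [];
  let current: ChunkNode | null = null;
  for (const row of rows) {
    const word: TaggedToken = { token: row.token, tag: row.tag };
    if (row.iob === "O") {
      // Outside any chunk: close the open chunk and emit a plain token.
      current = null;
      tree.push(word);
      continue;
    }
    const prefix = row.iob.slice(0, 1); // "B" or "I"
    const label = row.iob.slice(2); // text after "B-" or "I-"
    if (prefix === "B" || current === null || current.label !== label) {
      // Start a new chunk (also for a stray "I-" tag, by assumption).
      current = { kind: "chunk", label, tokens: [word] };
      tree.push(current);
    } else {
      current.tokens.push(word);
    }
  }
  return tree;
}

const rebuilt = iobToTreeSketch([
  { token: "The", tag: "DT", iob: "B-NP" },
  { token: "dog", tag: "NN", iob: "I-NP" },
  { token: "runs", tag: "VBZ", iob: "O" },
]);
console.log(JSON.stringify(rebuilt));
```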
