Chunking

Overview

Chunking groups part-of-speech-tagged tokens into non-overlapping spans such as noun phrases, verb phrases, or named entities.

Regexp Chunk Parser

Basic Usage

import { regexpChunkParse } from "bun_nltk";

const tagged = [
  { token: "the", tag: "DT" },
  { token: "big", tag: "JJ" },
  { token: "dog", tag: "NN" },
  { token: "barked", tag: "VBD" }
];

const grammar = `
NP: {<DT>?<JJ>*<NN>}
VP: {<VBD|VBZ>}
`;

const tree = regexpChunkParse(tagged, grammar);
console.log(tree);

Output:

[
  {
    kind: "chunk",
    label: "NP",
    tokens: [
      { token: "the", tag: "DT" },
      { token: "big", tag: "JJ" },
      { token: "dog", tag: "NN" }
    ]
  },
  {
    kind: "chunk",
    label: "VP",
    tokens: [{ token: "barked", tag: "VBD" }]
  }
]
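To inspect a result like the one above at a glance, it can help to print it as bracketed text. The following is a minimal sketch, not part of bun_nltk: `renderTree` is a hypothetical helper, and the types are redeclared locally (mirroring the Type Definitions section) so the snippet runs on its own.

```typescript
type TaggedToken = { token: string; tag: string };
type ChunkNode = { kind: "chunk"; label: string; tokens: TaggedToken[] };
type ChunkElement = TaggedToken | ChunkNode;

// Hypothetical helper (not a bun_nltk export): renders a chunk tree as
// bracketed text, one "[LABEL token/TAG ...]" group per chunk.
function renderTree(tree: ChunkElement[]): string {
  return tree
    .map((node) => {
      if ("kind" in node) {
        const inner = node.tokens.map((t) => `${t.token}/${t.tag}`).join(" ");
        return `[${node.label} ${inner}]`;
      }
      // Tokens outside any chunk are printed without brackets
      return `${node.token}/${node.tag}`;
    })
    .join(" ");
}

// Sample data copied from the output shown above
const sampleTree: ChunkElement[] = [
  {
    kind: "chunk",
    label: "NP",
    tokens: [
      { token: "the", tag: "DT" },
      { token: "big", tag: "JJ" },
      { token: "dog", tag: "NN" },
    ],
  },
  { kind: "chunk", label: "VP", tokens: [{ token: "barked", tag: "VBD" }] },
];

console.log(renderTree(sampleTree));
// [NP the/DT big/JJ dog/NN] [VP barked/VBD]
```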

Grammar Syntax

Rule Format

LABEL: {<TAG_PATTERN><TAG_PATTERN>...}

LABEL: Chunk label (alphanumeric + underscore)
TAG_PATTERN: Regular expression in angle brackets <...>
Quantifiers: ? (0-1), * (0+), + (1+)
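The rule format can be made concrete by splitting a rule line into its label and its quantified tag patterns. This is an illustrative sketch only: `parseRule`, `RuleItem`, and `ParsedRule` are hypothetical names, and the library's own parser may accept or reject inputs differently.

```typescript
type RuleItem = { tagPattern: string; quantifier: "" | "?" | "*" | "+" };
type ParsedRule = { label: string; items: RuleItem[] };

// Hypothetical helper: splits "LABEL: {<TAG>...}" into label and items.
// Returns null for lines that are not rules (blank lines, "#" comments).
function parseRule(line: string): ParsedRule | null {
  const m = /^\s*(\w+)\s*:\s*\{(.*)\}\s*$/.exec(line);
  if (m === null) return null;

  const items: RuleItem[] = [];
  // Each item is <BODY> followed by an optional quantifier
  const itemRe = /<([^>]+)>([?*+]?)/g;
  let item: RegExpExecArray | null;
  while ((item = itemRe.exec(m[2])) !== null) {
    items.push({
      tagPattern: item[1],
      quantifier: item[2] as RuleItem["quantifier"],
    });
  }
  return { label: m[1], items };
}

const parsed = parseRule("NP: {<DT>?<JJ>*<NN>}");
console.log(JSON.stringify(parsed));
// {"label":"NP","items":[{"tagPattern":"DT","quantifier":"?"},{"tagPattern":"JJ","quantifier":"*"},{"tagPattern":"NN","quantifier":""}]}
```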

Tag Patterns

const grammar = `
# Noun phrases
NP: {<DT>?<JJ>*<NN.*>+}

# Verb phrases  
VP: {<VB.*><RB>*}

# Prepositional phrases
PP: {<IN><DT>?<NN.*>+}
`;

Pattern Examples

Pattern	Matches
`<NN>`	Exactly “NN”
`<NN.*>`	“NN”, “NNS”, “NNP”, “NNPS”
`<VB|VBD|VBZ>`	“VB” or “VBD” or “VBZ”
`<DT>?`	Optional determiner
`<JJ>*`	Zero or more adjectives
`<NN>+`	One or more nouns
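The table reads each pattern as a regular expression that must match the whole tag. A small sketch of that reading follows; it assumes full-match semantics (as in NLTK's tag patterns), and `tagMatches` is a hypothetical helper rather than a bun_nltk export.

```typescript
// Hypothetical helper: tests one POS tag against the body of a <...> pattern.
// Assumption: the pattern must match the entire tag, so the regex is anchored.
function tagMatches(patternBody: string, tag: string): boolean {
  return new RegExp(`^(?:${patternBody})$`).test(tag);
}

console.log(tagMatches("NN", "NN"));         // true
console.log(tagMatches("NN", "NNS"));        // false: "NN" alone is exact
console.log(tagMatches("NN.*", "NNS"));      // true
console.log(tagMatches("VB|VBD|VBZ", "VBD")); // true
console.log(tagMatches("VB|VBD|VBZ", "VBG")); // false
```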

Advanced Chunking

Named Entity Recognition

import { posTag, wordTokenizeSubset, regexpChunkParse } from "bun_nltk";

const text = "Barack Obama visited New York City";
const tokens = wordTokenizeSubset(text);
const tagged = posTag(tokens);

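const nerGrammar = `
# List the longer pattern first; otherwise the two-tag PERSON rule
# would likely claim "New York" before LOCATION can match
LOCATION: {<NNP><NNP><NNP>}
PERSON: {<NNP><NNP>}
`;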

const entities = regexpChunkParse(tagged, nerGrammar);

Multi-Rule Grammars

const grammar = `
# Base noun phrase
NP: {<DT|PRP\$>?<JJ>*<NN.*>+}

# Verb phrase
VP: {<MD>?<VB.*><RB>*}

# Prepositional phrase
PP: {<IN><NP>}

# Clause
CLAUSE: {<NP><VP><NP|PP>*}
`;

IOB Format Conversion

Chunk Tree to IOB

Convert chunk trees to Inside-Outside-Begin (IOB) format:

import { regexpChunkParse, chunkTreeToIob } from "bun_nltk";

// `tagged` and `grammar` are the values defined in Basic Usage
const tree = regexpChunkParse(tagged, grammar);
const iob = chunkTreeToIob(tree);

console.log(iob);

Output:

[
  { token: "the", tag: "DT", iob: "B-NP" },
  { token: "big", tag: "JJ", iob: "I-NP" },
  { token: "dog", tag: "NN", iob: "I-NP" },
  { token: "barked", tag: "VBD", iob: "B-VP" }
]

IOB Tags

B-LABEL: Beginning of chunk
I-LABEL: Inside chunk (continuation)
O: Outside any chunk
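The tagging scheme above can be sketched in a few lines. This is an illustration of the encoding only, not the library's native implementation: `toIobSketch` is a hypothetical name, and the types are redeclared locally so the snippet is self-contained.

```typescript
type TaggedToken = { token: string; tag: string };
type ChunkNode = { kind: "chunk"; label: string; tokens: TaggedToken[] };
type ChunkElement = TaggedToken | ChunkNode;
type IobRow = { token: string; tag: string; iob: string };

// Hypothetical re-implementation for illustration: the first token of a
// chunk gets B-LABEL, later tokens get I-LABEL, and loose tokens get O.
function toIobSketch(tree: ChunkElement[]): IobRow[] {
  const rows: IobRow[] = [];
  for (const node of tree) {
    if ("kind" in node) {
      const label = node.label;
      node.tokens.forEach((t, i) => {
        const prefix = i === 0 ? "B" : "I";
        rows.push({ token: t.token, tag: t.tag, iob: `${prefix}-${label}` });
      });
    } else {
      rows.push({ token: node.token, tag: node.tag, iob: "O" });
    }
  }
  return rows;
}

const iobDemo = toIobSketch([
  {
    kind: "chunk",
    label: "NP",
    tokens: [
      { token: "the", tag: "DT" },
      { token: "dog", tag: "NN" },
    ],
  },
  { token: "barked", tag: "VBD" }, // not inside any chunk
]);
console.log(iobDemo.map((r) => r.iob)); // [ "B-NP", "I-NP", "O" ]
```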

Type Definitions

export type TaggedToken = {
  token: string;
  tag: string;
};

export type ChunkNode = {
  kind: "chunk";
  label: string;
  tokens: TaggedToken[];
};

export type ChunkElement = TaggedToken | ChunkNode;

export type IobRow = {
  token: string;
  tag: string;
  iob: string;
};
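Because `ChunkElement` is a union, code that reads `label` or `tokens` has to narrow it first. One way to do that is a user-defined type guard; the sketch below redeclares the types locally, and `isChunkNode` is a hypothetical helper, not something the source says bun_nltk exports.

```typescript
type TaggedToken = { token: string; tag: string };
type ChunkNode = { kind: "chunk"; label: string; tokens: TaggedToken[] };
type ChunkElement = TaggedToken | ChunkNode;

// Hypothetical type guard: only ChunkNode carries a `kind` field.
function isChunkNode(el: ChunkElement): el is ChunkNode {
  return "kind" in el && el.kind === "chunk";
}

const mixed: ChunkElement[] = [
  { kind: "chunk", label: "NP", tokens: [{ token: "dog", tag: "NN" }] },
  { token: "barked", tag: "VBD" },
];

// Passing the guard to filter() narrows the array type to ChunkNode[]
const onlyChunks: ChunkNode[] = mixed.filter(isChunkNode);
console.log(onlyChunks.map((c) => c.label)); // [ "NP" ]
```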

Practical Examples

Extract Noun Phrases

import { posTag, wordTokenizeSubset, regexpChunkParse } from "bun_nltk";

function extractNounPhrases(text: string): string[] {
  const tokens = wordTokenizeSubset(text);
  const tagged = posTag(tokens);
  
  const grammar = `NP: {<DT>?<JJ>*<NN.*>+}`;
  const chunks = regexpChunkParse(tagged, grammar);
  
  // Keep only chunk nodes; plain tokens have no `kind` field
  return chunks.flatMap(node =>
    "kind" in node ? [node.tokens.map(t => t.token).join(" ")] : []
  );
}

const text = "The quick brown fox jumped over the lazy dog";
const nps = extractNounPhrases(text);
console.log(nps); // ["The quick brown fox", "the lazy dog"]

Extract Action Phrases

function extractActions(text: string) {
  const tokens = wordTokenizeSubset(text);
  const tagged = posTag(tokens);
  
  const grammar = `
ACTION: {<VB.*><DT>?<JJ>*<NN.*>+}
  `;
  
  const chunks = regexpChunkParse(tagged, grammar);
  
  // Keep only chunk nodes; plain tokens have no `kind` field
  return chunks.flatMap(node =>
    "kind" in node
      ? [{ action: node.tokens.map(t => t.token).join(" "), tokens: node.tokens }]
      : []
  );
}

Custom Entity Types

const domainGrammar = `
# Product names (adjective + noun)
PRODUCT: {<JJ><NN>}

# Monetary amounts  
MONEY: {<\$><CD>}

# Dates
DATE: {<NNP><CD><,>?<CD>?}
`;

const tagged = posTag(wordTokenizeSubset("iPhone 15 costs $799 on September 22, 2023"));
const entities = regexpChunkParse(tagged, domainGrammar);

Performance Notes

The chunker relies on native code and caching to reduce per-call overhead:

Pattern compilation is cached
Tag matching uses precompiled regex
IOB encoding uses efficient native implementation

// Native optimization automatically enabled
const chunks = regexpChunkParse(tagged, grammar);
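The idea behind cached pattern compilation can be shown in plain TypeScript. This is only a sketch of the general technique (memoizing compiled regexes by pattern text); it says nothing about how the native implementation is actually structured, and `compileTagPattern` is a hypothetical name.

```typescript
// Hypothetical memoized compiler: each distinct pattern body is compiled
// to a RegExp once and then reused on later calls.
const compiledCache = new Map<string, RegExp>();
let compileCount = 0; // counts real compilations, for demonstration only

function compileTagPattern(body: string): RegExp {
  const cached = compiledCache.get(body);
  if (cached !== undefined) return cached;

  compileCount += 1;
  const re = new RegExp(`^(?:${body})$`); // anchored: must match the whole tag
  compiledCache.set(body, re);
  return re;
}

const firstCall = compileTagPattern("NN.*");
const secondCall = compileTagPattern("NN.*");
console.log(firstCall === secondCall); // true: same object reused
console.log(compileCount);             // 1
```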

Working with Chunk Trees

Filter Chunks by Label

function getChunksByLabel(tree: ChunkElement[], label: string): ChunkNode[] {
  // The type predicate narrows the result to chunk nodes only
  return tree.filter(
    (node): node is ChunkNode => "kind" in node && node.label === label
  );
}

const nounPhrases = getChunksByLabel(tree, "NP");

Extract Chunk Text

function chunkToText(chunk: ChunkElement): string {
  if (typeof chunk === "object" && "kind" in chunk) {
    return chunk.tokens.map(t => t.token).join(" ");
  }
  return chunk.token;
}
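Filtering by label and extracting chunk text are often combined. The sketch below groups the text of every chunk under its label in one pass; `groupTextByLabel` is a hypothetical helper, and the types are redeclared locally so it runs without the library.

```typescript
type TaggedToken = { token: string; tag: string };
type ChunkNode = { kind: "chunk"; label: string; tokens: TaggedToken[] };
type ChunkElement = TaggedToken | ChunkNode;

// Hypothetical helper: maps each chunk label to the texts of its chunks.
function groupTextByLabel(tree: ChunkElement[]): Record<string, string[]> {
  const groups: Record<string, string[]> = {};
  for (const node of tree) {
    if ("kind" in node) {
      const text = node.tokens.map((t) => t.token).join(" ");
      const list = groups[node.label] ?? [];
      list.push(text);
      groups[node.label] = list;
    }
    // Tokens outside any chunk are skipped
  }
  return groups;
}

const grouped = groupTextByLabel([
  {
    kind: "chunk",
    label: "NP",
    tokens: [
      { token: "the", tag: "DT" },
      { token: "dog", tag: "NN" },
    ],
  },
  { token: "chased", tag: "VBD" },
  {
    kind: "chunk",
    label: "NP",
    tokens: [
      { token: "a", tag: "DT" },
      { token: "cat", tag: "NN" },
    ],
  },
]);
console.log(grouped); // { NP: [ "the dog", "a cat" ] }
```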

API Reference

`regexpChunkParse(tokens, grammar)`

Parses POS-tagged tokens into chunks using regular expression patterns.

Parameters:

tokens: TaggedToken[] - POS-tagged tokens
grammar: string - Chunk grammar rules

Returns: ChunkElement[] - Mixed array of chunks and tokens
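To show how a mixed array of chunks and tokens can come out of a regular expression, here is a minimal single-rule chunker. It uses the approach NLTK popularized (encode the tag sequence as a string such as `<DT><JJ><NN>` and run one regex over it); whether bun_nltk works this way internally is an assumption. `chunkOneRule` is a hypothetical name, and the sketch assumes tags never contain `<` or `>`.

```typescript
type TaggedToken = { token: string; tag: string };
type ChunkNode = { kind: "chunk"; label: string; tokens: TaggedToken[] };
type ChunkElement = TaggedToken | ChunkNode;

// Hypothetical single-rule chunker, for illustration only.
function chunkOneRule(
  tagged: TaggedToken[],
  label: string,
  pattern: string, // e.g. "<DT>?<JJ>*<NN.*>+"
): ChunkElement[] {
  // Rewrite each <BODY> so it matches exactly one encoded tag. An unescaped
  // "." becomes [^<>] so wildcards cannot run across tag boundaries.
  const source = pattern.replace(/<([^>]+)>/g, (_m, body: string) => {
    const safe = body.replace(/\\.|\./g, (s) => (s === "." ? "[^<>]" : s));
    return `(?:<(?:${safe})>)`;
  });
  const re = new RegExp(source, "g");

  // Encode the tag sequence and remember where each tag starts
  const offsets: number[] = [];
  let encoded = "";
  for (const t of tagged) {
    offsets.push(encoded.length);
    encoded += `<${t.tag}>`;
  }

  const result: ChunkElement[] = [];
  let next = 0; // index of the first token not yet emitted
  let m: RegExpExecArray | null;
  while ((m = re.exec(encoded)) !== null) {
    if (m[0].length === 0) {
      re.lastIndex += 1; // skip empty matches to avoid an infinite loop
      continue;
    }
    const start = offsets.indexOf(m.index);
    const endOffset = m.index + m[0].length;
    const end =
      endOffset === encoded.length ? tagged.length : offsets.indexOf(endOffset);
    for (; next < start; next++) result.push(tagged[next]); // loose tokens
    result.push({ kind: "chunk", label, tokens: tagged.slice(start, end) });
    next = end;
  }
  for (; next < tagged.length; next++) result.push(tagged[next]);
  return result;
}

const oneRuleDemo = chunkOneRule(
  [
    { token: "the", tag: "DT" },
    { token: "big", tag: "JJ" },
    { token: "dog", tag: "NN" },
    { token: "barked", tag: "VBD" },
  ],
  "NP",
  "<DT>?<JJ>*<NN.*>+",
);
console.log(oneRuleDemo.length); // 2: one NP chunk, then the loose token "barked"
```

A full grammar would repeat this per rule and decide how later rules treat tokens already claimed by earlier ones; that policy is left out here.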

`chunkTreeToIob(tree)`

Converts chunk tree to IOB format.

Parameters:

tree: ChunkElement[] - Chunk tree

Returns: IobRow[] - IOB-tagged tokens
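Since each IOB row keeps its token and tag, the rows carry enough information to rebuild the chunk tree. The source does not describe a built-in inverse, so the following is a hypothetical sketch (`iobToTree` is an invented name) with locally declared types. As a design choice, an `I-` tag with no matching open chunk is emitted as a plain token here; starting a new chunk instead would be an equally valid policy.

```typescript
type TaggedToken = { token: string; tag: string };
type ChunkNode = { kind: "chunk"; label: string; tokens: TaggedToken[] };
type ChunkElement = TaggedToken | ChunkNode;
type IobRow = { token: string; tag: string; iob: string };

// Hypothetical inverse of the IOB conversion, for illustration only.
function iobToTree(rows: IobRow[]): ChunkElement[] {
  const tree: ChunkElement[] = [];
  let open: ChunkNode | null = null; // chunk currently being extended
  for (const row of rows) {
    const tagged: TaggedToken = { token: row.token, tag: row.tag };
    const prefix = row.iob.slice(0, 2);
    const label = row.iob.slice(2);
    if (prefix === "B-") {
      open = { kind: "chunk", label, tokens: [tagged] };
      tree.push(open);
    } else if (prefix === "I-" && open !== null && open.label === label) {
      open.tokens.push(tagged);
    } else {
      // "O", or a stray I- tag: close any open chunk, emit a plain token
      open = null;
      tree.push(tagged);
    }
  }
  return tree;
}

const rebuilt = iobToTree([
  { token: "the", tag: "DT", iob: "B-NP" },
  { token: "big", tag: "JJ", iob: "I-NP" },
  { token: "dog", tag: "NN", iob: "I-NP" },
  { token: "barked", tag: "VBD", iob: "B-VP" },
]);
console.log(rebuilt.length); // 2: an NP chunk and a VP chunk
```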
