topPmiBigramsAscii
Finds the top-K bigrams with highest Pointwise Mutual Information (PMI) scores.
Input text to analyze for collocations
Number of top bigrams to return (must be positive integer)
Context window size for co-occurrence (must be integer >= 2)
Array of bigrams sorted by PMI score (highest first)
PMI score (higher = stronger association)
import { topPmiBigramsAscii, hashTokenAscii } from 'bun_nltk';
const text = `
Machine learning and deep learning are subfields of artificial intelligence.
Machine learning uses statistical learning to enable machines to learn.
Deep learning uses neural networks for learning patterns.
`;
const topBigrams = topPmiBigramsAscii(text, 5);
// Returns top 5 bigrams by PMI score:
// [
//   { leftHash: 123456n, rightHash: 789012n, score: 4.32 },
//   { leftHash: 345678n, rightHash: 901234n, score: 3.87 },
//   ...
// ]
// To decode hashes back to words, compare with token hashes:
const mlHash = hashTokenAscii("machine");
const learningHash = hashTokenAscii("learning");
// Check if a bigram matches "machine learning"
const hasMachineLearning = topBigrams.some(
  bg => bg.leftHash === mlHash && bg.rightHash === learningHash
);
PMI measures how much more often two words appear together than would be expected by chance. Higher scores indicate stronger collocations.
bigramWindowStatsAscii
Computes comprehensive statistics for all bigrams within a sliding window.
Size of context window for co-occurrence (must be integer >= 2)
Array of bigram statistics with decoded token strings
BigramWindowStatToken.left
Left token string
BigramWindowStatToken.right
Right token string
BigramWindowStatToken.leftId
Vocabulary ID of left token
BigramWindowStatToken.rightId
Vocabulary ID of right token
BigramWindowStatToken.count
Number of times this bigram co-occurs within the window
BigramWindowStatToken.pmi
Pointwise Mutual Information score
import { bigramWindowStatsAscii } from 'bun_nltk';
const text = "natural language processing and natural language understanding";
const stats = bigramWindowStatsAscii(text, 3);
// Returns all bigram pairs within window of 3:
// [
//   { left: "natural", right: "language", leftId: 0, rightId: 1, count: 2, pmi: 2.17 },
//   { left: "language", right: "processing", leftId: 1, rightId: 2, count: 1, pmi: 1.58 },
//   { left: "natural", right: "processing", leftId: 0, rightId: 2, count: 1, pmi: 1.32 },
//   ...
// ]
// Filter for high PMI collocations
const strongCollocations = stats.filter(s => s.pmi > 2.0 && s.count >= 2);
bigramWindowStatsAsciiIds
Low-level version returning token IDs instead of strings (for performance).
Size of context window (must be integer >= 2)
Array of bigram statistics using vocabulary IDs
BigramWindowStatId.leftId
Vocabulary ID of left token
BigramWindowStatId.rightId
Vocabulary ID of right token
import { bigramWindowStatsAsciiIds, tokenFreqDistIdsAscii } from 'bun_nltk';
const text = "machine learning and deep learning";
// Get token vocabulary
const vocab = tokenFreqDistIdsAscii(text);
// vocab.tokens: ["machine", "learning", "and", "deep"]
// vocab.tokenToId: Map { "machine" => 0, "learning" => 1, ... }
// Get bigram stats with IDs
const stats = bigramWindowStatsAsciiIds(text, 3);
// [
//   { leftId: 0, rightId: 1, count: 1, pmi: 1.58 }, // machine -> learning
//   { leftId: 1, rightId: 3, count: 1, pmi: 1.58 }, // learning -> deep
//   ...
// ]
// Decode IDs to tokens
for (const stat of stats) {
  const leftToken = vocab.tokens[stat.leftId];
  const rightToken = vocab.tokens[stat.rightId];
  console.log(`${leftToken} + ${rightToken}: PMI=${stat.pmi.toFixed(2)}`);
}
Use the ID version when processing large texts or when you need to perform multiple analyses with the same vocabulary.
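That reuse pattern can be sketched with plain data. The stats and vocabulary below are hypothetical values shaped like this API's return types, not real library output: because every run shares one vocabulary, bigrams from different analyses can be matched on integer IDs alone, and strings are decoded only for the final report.

```typescript
// Hypothetical id-based stats shaped like bigramWindowStatsAsciiIds output.
type IdStat = { leftId: number; rightId: number; count: number; pmi: number };

// One vocabulary shared by every analysis run (illustrative values).
const vocabTokens = ["machine", "learning", "and", "deep"];
const runA: IdStat[] = [
  { leftId: 0, rightId: 1, count: 1, pmi: 1.58 },
];
const runB: IdStat[] = [
  { leftId: 0, rightId: 1, count: 2, pmi: 2.0 },
  { leftId: 3, rightId: 1, count: 1, pmi: 1.58 },
];

// Pack (leftId, rightId) into a single integer key: cheap numeric
// comparisons instead of string comparisons or re-tokenization.
const key = (s: IdStat) => s.leftId * vocabTokens.length + s.rightId;

// Bigrams present in both runs, matched purely by id.
const inBoth = runB.filter(b => runA.some(a => key(a) === key(b)));

// Decode ids to strings only for the bigrams we actually report.
const names = inBoth.map(s => `${vocabTokens[s.leftId]} ${vocabTokens[s.rightId]}`);
```

Deferring the ID-to-string decoding to the reporting step is what makes the ID version cheaper on large corpora.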
PMI is calculated as:
PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) )
Where:
  P(w1, w2) = probability of w1 and w2 co-occurring within the window
  P(w1) = probability of w1 appearing
  P(w2) = probability of w2 appearing
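As a sanity check, the formula can be evaluated by hand. The counts below are made-up illustration values, not the output of any function above:

```typescript
// Illustrative counts: within all context windows, suppose w1 appears
// 2 times, w2 appears 2 times, and the (w1, w2) pair co-occurs 2 times,
// out of N = 8 total window positions (hypothetical numbers).
const N = 8;
const pJoint = 2 / N; // P(w1, w2) = 0.25
const pLeft = 2 / N;  // P(w1)     = 0.25
const pRight = 2 / N; // P(w2)     = 0.25

// PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) )
const pmi = Math.log2(pJoint / (pLeft * pRight));
console.log(pmi); // log2(0.25 / 0.0625) = log2(4) = 2
```

Note how high marginal probabilities P(w1) and P(w2) drive the score down: words that appear everywhere rarely score as strong collocations, which is why PMI surfaces content phrases rather than function words.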
Example: Finding Domain-Specific Phrases
import { bigramWindowStatsAscii } from 'bun_nltk';
const corpus = `
Convolutional neural networks excel at image recognition.
Recurrent neural networks process sequential data.
Neural networks require large training datasets.
Deep neural networks have multiple hidden layers.
`;
const stats = bigramWindowStatsAscii(corpus, 4);
// Find phrases that appear multiple times with high PMI
const phrases = stats
  .filter(s => s.count >= 2 && s.pmi > 1.5)
  .sort((a, b) => b.pmi - a.pmi)
  .slice(0, 10);
phrases.forEach(p => {
  console.log(`"${p.left} ${p.right}": count=${p.count}, PMI=${p.pmi.toFixed(2)}`);
});
// Output:
// "neural networks": count=4, PMI=3.17
// "training datasets": count=2, PMI=2.58
// ...