
NativeFreqDistStream

A streaming frequency distribution collector that efficiently tracks token, bigram, and conditional frequencies without storing all text in memory.
class NativeFreqDistStream {
  constructor()
  update(text: string): void
  flush(): void
  tokenUniqueCount(): number
  bigramUniqueCount(): number
  conditionalUniqueCount(): number
  tokenFreqDistHash(): Map<bigint, number>
  bigramFreqDistHash(): StreamBigramFreq[]
  conditionalFreqDistHash(): StreamConditionalFreq[]
  toJson(): string
  dispose(): void
}

Constructor

Creates a new frequency distribution stream.
const stream = new NativeFreqDistStream();

Methods

update

Processes text and updates frequency distributions.
update(text: string): void
text (string, required): Text to process and add to the distributions.
Example:
const stream = new NativeFreqDistStream();
stream.update("The quick brown fox");
stream.update("The lazy dog");

flush

Flushes internal buffers. Call this after the last update() before reading results.
flush(): void
Example:
stream.update("Final text");
stream.flush();
const counts = stream.tokenUniqueCount();
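
The update-then-flush ordering described above can be wrapped in a small helper. This is a minimal sketch, not part of bun_nltk: `FreqStreamLike` and `feedAll` are hypothetical names, and the interface only assumes the `update()` and `flush()` signatures documented on this page.

```typescript
// Hypothetical structural type: anything with the documented update()/flush() methods.
interface FreqStreamLike {
  update(text: string): void;
  flush(): void;
}

// Feed every document to the stream, then flush exactly once so that
// subsequent reads see all buffered data. Returns the number of documents fed.
function feedAll(stream: FreqStreamLike, documents: Iterable<string>): number {
  let fed = 0;
  for (const doc of documents) {
    stream.update(doc);
    fed += 1;
  }
  stream.flush();
  return fed;
}
```

Because `documents` is any iterable, the texts can come from a generator that reads them lazily, so only one document needs to be held in memory at a time.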

tokenUniqueCount

Returns the number of unique tokens observed.
tokenUniqueCount(): number
Returns: Count of unique tokens.
Example:
const uniqueTokens = stream.tokenUniqueCount();
console.log(`Found ${uniqueTokens} unique tokens`);

bigramUniqueCount

Returns the number of unique bigrams (token pairs) observed.
bigramUniqueCount(): number
Returns: Count of unique bigrams.

conditionalUniqueCount

Returns the number of unique conditional frequencies (POS tag + token combinations).
conditionalUniqueCount(): number
Returns: Count of unique conditional pairs.

tokenFreqDistHash

Returns token frequencies as a map of token hashes to counts.
tokenFreqDistHash(): Map<bigint, number>
Returns: Map where keys are token hashes (bigint) and values are occurrence counts.
Example:
const freqs = stream.tokenFreqDistHash();
for (const [hash, count] of freqs) {
  console.log(`Hash ${hash}: ${count} occurrences`);
}
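
Since the map is keyed by hash, a common follow-up is ranking entries by count. The sketch below works on any `Map<bigint, number>` of the shape returned here; `topKByCount` and `totalCount` are hypothetical helper names, not library functions.

```typescript
// Return the k most frequent entries, highest count first.
function topKByCount(freqs: Map<bigint, number>, k: number): Array<[bigint, number]> {
  return Array.from(freqs.entries())
    .sort((a, b) => b[1] - a[1])
    .slice(0, k);
}

// Sum of all counts, i.e. the total number of observations in the map.
function totalCount(freqs: Map<bigint, number>): number {
  let total = 0;
  freqs.forEach((count) => {
    total += count;
  });
  return total;
}
```

Usage would look like `topKByCount(stream.tokenFreqDistHash(), 10)`, assuming the stream has been flushed first.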

bigramFreqDistHash

Returns bigram frequencies.
bigramFreqDistHash(): StreamBigramFreq[]
Returns: Array of StreamBigramFreq objects, each with leftHash, rightHash, and count fields.
Example:
const bigrams = stream.bigramFreqDistHash();
for (const bg of bigrams) {
  console.log(`Bigram (${bg.leftHash}, ${bg.rightHash}): ${bg.count}`);
}
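
Bigram counts can be turned into relative next-token frequencies by normalizing over all entries that share a left hash. The following is a sketch under assumptions: `BigramEntry` mirrors the `leftHash`, `rightHash`, and `count` fields used in the example above, but the `bigint` field types are a guess, and `nextTokenEstimates` is a hypothetical helper.

```typescript
// Assumed shape of one bigram entry; field names come from the example above,
// the bigint types are an assumption.
interface BigramEntry {
  leftHash: bigint;
  rightHash: bigint;
  count: number;
}

// For a given left token hash, estimate P(right | left) as count / sum of counts,
// sorted from most to least likely. Returns [] if the left hash was never seen.
function nextTokenEstimates(
  bigrams: BigramEntry[],
  leftHash: bigint,
): Array<{ rightHash: bigint; probability: number }> {
  const followers = bigrams.filter((bg) => bg.leftHash === leftHash);
  const total = followers.reduce((sum, bg) => sum + bg.count, 0);
  if (total === 0) return [];
  return followers
    .map((bg) => ({ rightHash: bg.rightHash, probability: bg.count / total }))
    .sort((a, b) => b.probability - a.probability);
}
```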

conditionalFreqDistHash

Returns conditional frequencies (POS tags with associated tokens).
conditionalFreqDistHash(): StreamConditionalFreq[]
Returns: Array of StreamConditionalFreq objects, each with tagId, tokenHash, and count fields.
Example:
const conditional = stream.conditionalFreqDistHash();
for (const cf of conditional) {
  console.log(`Tag ${cf.tagId} + Token ${cf.tokenHash}: ${cf.count}`);
}

toJson

Serializes all distributions to JSON format.
toJson(): string
Returns: JSON string containing tokens, bigrams, and conditional tags.
Example:
const json = stream.toJson();
const data = JSON.parse(json);
console.log(data.tokens);
console.log(data.bigrams);
console.log(data.conditional_tags);

dispose

Frees native resources. Call this when done with the stream to prevent memory leaks.
dispose(): void
Example:
const stream = new NativeFreqDistStream();
try {
  stream.update("Some text");
  stream.flush();
  const counts = stream.tokenFreqDistHash();
} finally {
  stream.dispose();
}
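
The try/finally pattern above can be factored into a reusable wrapper so that callers cannot forget `dispose()`. This is a sketch only: `NativeHandle` and `withHandle` are hypothetical names, and the wrapper assumes nothing beyond the documented `dispose(): void` method.

```typescript
// Hypothetical structural type: anything exposing the documented dispose() method.
interface NativeHandle {
  dispose(): void;
}

// Create a handle, run a synchronous callback with it, and always dispose it,
// whether the callback returns normally or throws.
function withHandle<T extends NativeHandle, R>(create: () => T, use: (handle: T) => R): R {
  const handle = create();
  try {
    return use(handle);
  } finally {
    handle.dispose();
  }
}
```

A call such as `withHandle(() => new NativeFreqDistStream(), (s) => { s.update("Some text"); s.flush(); return s.tokenUniqueCount(); })` would then release the stream automatically. The callback must be synchronous: if it returned a promise, `dispose()` would run before the promise settled.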

Complete Example

import { NativeFreqDistStream } from 'bun_nltk';

const stream = new NativeFreqDistStream();

try {
  // Process multiple documents
  stream.update("The quick brown fox jumps over the lazy dog");
  stream.update("The lazy dog sleeps all day");
  stream.update("The quick fox runs fast");
  
  // Flush before reading
  stream.flush();
  
  // Get statistics
  console.log(`Unique tokens: ${stream.tokenUniqueCount()}`);
  console.log(`Unique bigrams: ${stream.bigramUniqueCount()}`);
  
  // Get frequency distributions
  const tokenFreqs = stream.tokenFreqDistHash();
  console.log(`Total distinct tokens tracked: ${tokenFreqs.size}`);
  
  const bigrams = stream.bigramFreqDistHash();
  console.log(`Distinct bigrams: ${bigrams.length}`);
  
  // Export to JSON
  const json = stream.toJson();
  console.log("Serialized data:", json);
  
} finally {
  // Always dispose to free native memory
  stream.dispose();
}

tokenFreqDistIdsAscii

Computes a token frequency distribution with a token-to-ID mapping for ASCII text.
function tokenFreqDistIdsAscii(text: string): TokenFreqDistIds

Parameters

text (string, required): Input text to analyze. Must be ASCII-compatible.

Returns

TokenFreqDistIds (object): Token frequency data with ID mappings.

Example

import { tokenFreqDistIdsAscii } from 'bun_nltk';

const text = "the quick brown fox jumps over the lazy dog";
const dist = tokenFreqDistIdsAscii(text);

console.log(dist.tokens);
// ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]

console.log(dist.counts);
// [2, 1, 1, 1, 1, 1, 1, 1]

console.log(dist.tokenToId.get("the"));
// 0

console.log(dist.totalTokens);
// 9
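
The returned fields combine naturally into relative frequencies and ID sequences. The sketch below assumes, based on the example output above, that `counts[i]` belongs to `tokens[i]` and that `tokenToId` maps each token to that same index. `TokenFreqDistIdsLike`, `relativeFrequency`, and `encodeTokens` are hypothetical names introduced for illustration.

```typescript
// Assumed shape, inferred from the fields printed in the example above.
interface TokenFreqDistIdsLike {
  tokens: string[];
  counts: number[];
  tokenToId: Map<string, number>;
  totalTokens: number;
}

// Share of all token occurrences taken by one token; 0 for unseen tokens.
function relativeFrequency(dist: TokenFreqDistIdsLike, token: string): number {
  const id = dist.tokenToId.get(token);
  if (id === undefined || dist.totalTokens === 0) return 0;
  return dist.counts[id] / dist.totalTokens;
}

// Map tokens to their IDs, using unknownId for tokens missing from the vocabulary.
function encodeTokens(dist: TokenFreqDistIdsLike, tokens: string[], unknownId = -1): number[] {
  return tokens.map((t) => dist.tokenToId.get(t) ?? unknownId);
}
```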

Use Cases

  • Building vocabulary indices for ML models
  • Efficient token-to-ID mapping for further processing
  • Analyzing word frequency distributions
  • Preparing data for n-gram analysis

ngramFreqDistHashAscii

Computes an n-gram frequency distribution using a hash-based representation.
function ngramFreqDistHashAscii(text: string, n: number): Map<bigint, number>

Parameters

text (string, required): Input text to analyze.
n (number, required): N-gram size (e.g., 2 for bigrams, 3 for trigrams). Must be a positive integer.

Returns

Map<bigint, number>: Map where keys are n-gram hashes and values are occurrence counts.

Example

import { ngramFreqDistHashAscii } from 'bun_nltk';

const text = "the quick brown fox jumps over the lazy dog";

// Bigram frequencies
const bigrams = ngramFreqDistHashAscii(text, 2);
console.log(`Found ${bigrams.size} unique bigrams`);

// Trigram frequencies
const trigrams = ngramFreqDistHashAscii(text, 3);
console.log(`Found ${trigrams.size} unique trigrams`);

for (const [hash, count] of bigrams) {
  console.log(`N-gram hash ${hash}: ${count} occurrences`);
}

Performance

The hash-based representation gives average constant-time lookups and insertions and avoids storing the n-gram strings themselves, which keeps processing efficient for large texts.
Hashes do not preserve the original n-gram text. Use ngramsAsciiNative() if you need the actual token sequences.
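
To illustrate why hashed keys are cheap but lossy, here is a pure-TypeScript sketch of hash-based n-gram counting. It is not the library's implementation: the hash function (64-bit FNV-1a), the lowercasing, and the whitespace tokenizer are all assumptions chosen for the illustration, and `fnv1a64` and `ngramHashCounts` are hypothetical names. Hashes produced here should not be expected to match those returned by ngramFreqDistHashAscii.

```typescript
// 64-bit FNV-1a constants (offset basis and prime).
const FNV_OFFSET = BigInt("0xcbf29ce484222325");
const FNV_PRIME = BigInt("0x100000001b3");

// 64-bit FNV-1a over the low byte of each character (sufficient for ASCII input).
function fnv1a64(s: string): bigint {
  let h = FNV_OFFSET;
  for (let i = 0; i < s.length; i++) {
    h ^= BigInt(s.charCodeAt(i) & 0xff);
    h = BigInt.asUintN(64, h * FNV_PRIME);
  }
  return h;
}

// Count n-grams keyed by hash. Assumed tokenizer: lowercase, split on whitespace.
function ngramHashCounts(text: string, n: number): Map<bigint, number> {
  if (!Number.isInteger(n) || n < 1) {
    throw new RangeError("n must be a positive integer");
  }
  const tokens = text.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
  const counts = new Map<bigint, number>();
  for (let i = 0; i + n <= tokens.length; i++) {
    // Only the hash is stored; the joined token text is discarded after hashing.
    const key = fnv1a64(tokens.slice(i, i + n).join(" "));
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}
```

Each insertion is a single map update keyed by a fixed-size integer, which is where the constant-time behavior comes from. The trade-off is that a count can only be looked up by re-hashing a known n-gram; the map itself cannot be turned back into text.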
