NativeFreqDistStream
A streaming frequency distribution collector that efficiently tracks token, bigram, and conditional frequencies without storing all text in memory.
class NativeFreqDistStream {
constructor()
update(text: string): void
flush(): void
tokenUniqueCount(): number
bigramUniqueCount(): number
conditionalUniqueCount(): number
tokenFreqDistHash(): Map<bigint, number>
bigramFreqDistHash(): StreamBigramFreq[]
conditionalFreqDistHash(): StreamConditionalFreq[]
toJson(): string
dispose(): void
}
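For intuition, the streaming approach keeps only running counts in memory, never the accumulated text. The sketch below is plain TypeScript, not the native implementation: the tokenization rule is an assumption, and the native class stores token hashes rather than strings.

```typescript
// Plain-TypeScript sketch of the streaming idea (not the native implementation):
// only running counts are kept, never the full text.
class StreamingFreqSketch {
  private tokenCounts = new Map<string, number>();
  private bigramCounts = new Map<string, number>();

  update(text: string): void {
    // Naive whitespace tokenization; the native tokenizer likely differs.
    const tokens = text.toLowerCase().split(/\s+/).filter(Boolean);
    let prev: string | null = null;
    for (const tok of tokens) {
      this.tokenCounts.set(tok, (this.tokenCounts.get(tok) ?? 0) + 1);
      if (prev !== null) {
        // \u0000 separator avoids key collisions like ("a b","c") vs ("a","b c").
        const key = prev + "\u0000" + tok;
        this.bigramCounts.set(key, (this.bigramCounts.get(key) ?? 0) + 1);
      }
      prev = tok;
    }
  }

  tokenUniqueCount(): number {
    return this.tokenCounts.size;
  }

  bigramUniqueCount(): number {
    return this.bigramCounts.size;
  }
}
```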
Constructor
Creates a new frequency distribution stream.
const stream = new NativeFreqDistStream();
Methods
update
Processes text and updates frequency distributions.
update(text: string): void
text: Text to process and add to the distributions.
Example:
const stream = new NativeFreqDistStream();
stream.update("The quick brown fox");
stream.update("The lazy dog");
flush
Flushes internal buffers. Call this after the last update() before reading results.
flush(): void
Example:
stream.update("Final text");
stream.flush();
const counts = stream.tokenUniqueCount();
tokenUniqueCount
Returns the number of unique tokens observed.
tokenUniqueCount(): number
Returns: Count of unique tokens.
Example:
const uniqueTokens = stream.tokenUniqueCount();
console.log(`Found ${uniqueTokens} unique tokens`);
bigramUniqueCount
Returns the number of unique bigrams (token pairs) observed.
bigramUniqueCount(): number
Returns: Count of unique bigrams.
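A bigram here is an adjacent token pair, so the unique count is the number of distinct pairs observed. A small plain-TypeScript illustration of the idea, separate from the native class:

```typescript
// Count distinct adjacent token pairs in a token sequence.
function uniqueBigramCount(tokens: string[]): number {
  const seen = new Set<string>();
  for (let i = 0; i + 1 < tokens.length; i++) {
    // \u0000 separator avoids key collisions like ("a b","c") vs ("a","b c").
    seen.add(tokens[i] + "\u0000" + tokens[i + 1]);
  }
  return seen.size;
}
```

For example, "the lazy dog chases the lazy cat" has six adjacent pairs, but (the, lazy) repeats, so five are unique.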
conditionalUniqueCount
Returns the number of unique conditional frequencies (POS tag + token combinations).
conditionalUniqueCount(): number
Returns: Count of unique conditional pairs.
tokenFreqDistHash
Returns token frequencies as a map of token hashes to counts.
tokenFreqDistHash(): Map<bigint, number>
Returns: Map where keys are token hashes (bigint) and values are occurrence counts.
Example:
const freqs = stream.tokenFreqDistHash();
for (const [hash, count] of freqs) {
console.log(`Hash ${hash}: ${count} occurrences`);
}
bigramFreqDistHash
Returns bigram frequencies.
bigramFreqDistHash(): StreamBigramFreq[]
Returns: Array of bigram frequency objects.
StreamBigramFreq properties:
- leftHash: Hash of the first token.
- rightHash: Hash of the second token.
- count: Number of times this bigram occurred.
Example:
const bigrams = stream.bigramFreqDistHash();
for (const bg of bigrams) {
console.log(`Bigram (${bg.leftHash}, ${bg.rightHash}): ${bg.count}`);
}
conditionalFreqDistHash
Returns conditional frequencies (POS tags with associated tokens).
conditionalFreqDistHash(): StreamConditionalFreq[]
Returns: Array of conditional frequency objects.
StreamConditionalFreq properties:
- tagId: ID of the POS tag.
- tokenHash: Hash of the token.
- count: Frequency count for this tag-token pair.
Example:
const conditional = stream.conditionalFreqDistHash();
for (const cf of conditional) {
console.log(`Tag ${cf.tagId} + Token ${cf.tokenHash}: ${cf.count}`);
}
toJson
Serializes all distributions to JSON format.
toJson(): string
Returns: JSON string containing tokens, bigrams, and conditional tags.
Example:
const json = stream.toJson();
const data = JSON.parse(json);
console.log(data.tokens);
console.log(data.bigrams);
console.log(data.conditional_tags);
dispose
Frees native resources. Call this when done with the stream to prevent memory leaks.
dispose(): void
Example:
const stream = new NativeFreqDistStream();
try {
stream.update("Some text");
stream.flush();
const counts = stream.tokenFreqDistHash();
} finally {
stream.dispose();
}
Complete Example
import { NativeFreqDistStream } from 'bun_nltk';
const stream = new NativeFreqDistStream();
try {
// Process multiple documents
stream.update("The quick brown fox jumps over the lazy dog");
stream.update("The lazy dog sleeps all day");
stream.update("The quick fox runs fast");
// Flush before reading
stream.flush();
// Get statistics
console.log(`Unique tokens: ${stream.tokenUniqueCount()}`);
console.log(`Unique bigrams: ${stream.bigramUniqueCount()}`);
// Get frequency distributions
const tokenFreqs = stream.tokenFreqDistHash();
console.log(`Total distinct tokens tracked: ${tokenFreqs.size}`);
const bigrams = stream.bigramFreqDistHash();
console.log(`Total bigrams: ${bigrams.length}`);
// Export to JSON
const json = stream.toJson();
console.log("Serialized data:", json);
} finally {
// Always dispose to free native memory
stream.dispose();
}
tokenFreqDistIdsAscii
Computes token frequency distribution with ID mapping for ASCII text.
function tokenFreqDistIdsAscii(text: string): TokenFreqDistIds
Parameters
text: Input text to analyze. Must be ASCII-compatible.
Returns
Token frequency data with ID mappings.
TokenFreqDistIds properties:
- tokens: Array of unique tokens in order of first occurrence.
- counts: Frequency count for each token (parallel to the tokens array).
- tokenToId: Map from token string to its ID/index.
- totalTokens: Total number of tokens (including duplicates).
Example
import { tokenFreqDistIdsAscii } from 'bun_nltk';
const text = "the quick brown fox jumps over the lazy dog";
const dist = tokenFreqDistIdsAscii(text);
console.log(dist.tokens);
// ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
console.log(dist.counts);
// [2, 1, 1, 1, 1, 1, 1, 1]
console.log(dist.tokenToId.get("the"));
// 0
console.log(dist.totalTokens);
// 9
Use Cases
- Building vocabulary indices for ML models
- Efficient token-to-ID mapping for further processing
- Analyzing word frequency distributions
- Preparing data for n-gram analysis
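As a sketch of the vocabulary-index use case: a map shaped like tokenToId can encode a token stream into integer IDs for an ML pipeline. The encodeTokens helper and the unknown-token handling below are hypothetical, not part of bun_nltk.

```typescript
// Hypothetical helper (not part of bun_nltk): encode tokens to integer IDs
// using a tokenToId map like the one returned by tokenFreqDistIdsAscii.
function encodeTokens(tokens: string[], tokenToId: Map<string, number>): number[] {
  // Unknown tokens map to -1 here; a real pipeline would reserve an <unk> ID.
  return tokens.map((t) => tokenToId.get(t) ?? -1);
}
```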
ngramFreqDistHashAscii
Computes n-gram frequency distribution using hash-based representation.
function ngramFreqDistHashAscii(text: string, n: number): Map<bigint, number>
Parameters
text: Input text to analyze. Must be ASCII-compatible.
n: N-gram size (e.g., 2 for bigrams, 3 for trigrams). Must be a positive integer.
Returns
Map where keys are n-gram hashes and values are occurrence counts.
Example
import { ngramFreqDistHashAscii } from 'bun_nltk';
const text = "the quick brown fox jumps over the lazy dog";
// Bigram frequencies
const bigrams = ngramFreqDistHashAscii(text, 2);
console.log(`Found ${bigrams.size} unique bigrams`);
// Trigram frequencies
const trigrams = ngramFreqDistHashAscii(text, 3);
console.log(`Found ${trigrams.size} unique trigrams`);
for (const [hash, count] of bigrams) {
console.log(`N-gram hash ${hash}: ${count} occurrences`);
}
The hash-based representation allows average constant-time lookups and insertions, which keeps processing efficient for large texts.
Hashes do not preserve the original n-gram text. Use ngramsAsciiNative() if you need the actual token sequences.
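The exact hash function the native code uses is not specified here. Purely as an illustration of a hash-keyed n-gram map, the sketch below uses 64-bit FNV-1a over the joined n-gram in plain TypeScript; the native library's hashing almost certainly differs.

```typescript
// Illustrative only: 64-bit FNV-1a over a string, masked to 64 bits.
function fnv1a64(s: string): bigint {
  const PRIME = 0x100000001b3n;
  let h = 0xcbf29ce484222325n;
  for (let i = 0; i < s.length; i++) {
    h ^= BigInt(s.charCodeAt(i));
    h = (h * PRIME) & 0xffffffffffffffffn;
  }
  return h;
}

// Build a hash -> count map over all n-grams of a token sequence.
function ngramFreqSketch(tokens: string[], n: number): Map<bigint, number> {
  const freqs = new Map<bigint, number>();
  for (let i = 0; i + n <= tokens.length; i++) {
    // \u0000 separator avoids collisions between differently split token runs.
    const key = fnv1a64(tokens.slice(i, i + n).join("\u0000"));
    freqs.set(key, (freqs.get(key) ?? 0) + 1);
  }
  return freqs;
}
```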