TextFeatureVectorizer

Converts text into sparse numerical feature vectors using n-gram tokenization. Used internally by classifiers like decision trees and linear models.

Constructor

new TextFeatureVectorizer(options?: {
  ngramMin?: number;
  ngramMax?: number;
  binary?: boolean;
  maxFeatures?: number;
})
Parameters:
  • ngramMin (optional): Minimum n-gram size (default: 1, minimum: 1)
  • ngramMax (optional): Maximum n-gram size (default: 1, minimum: ngramMin)
  • binary (optional): Use binary features (presence/absence) instead of counts (default: false)
  • maxFeatures (optional): Maximum vocabulary size (default: 12000, minimum: 64)
Example:
import { TextFeatureVectorizer } from "bun_nltk";

const vectorizer = new TextFeatureVectorizer({
  ngramMin: 1,
  ngramMax: 2,  // Use unigrams and bigrams
  binary: false,  // Count features
  maxFeatures: 5000,
});

Properties

featureCount

Get the number of features in the vocabulary.
get featureCount(): number
Example:
console.log(vectorizer.featureCount); // 0 before fitting
vectorizer.fit(["hello world", "hello there"]);
console.log(vectorizer.featureCount); // 3 after fitting ("hello", "world", "there")

Methods

fit()

Build the vocabulary from a corpus of texts.
fit(texts: string[]): this
Parameters:
  • texts: Array of text documents
Returns: The vectorizer instance (for chaining)
Details:
  • Extracts n-grams from all texts
  • Selects the most frequent n-grams up to maxFeatures
  • Builds internal feature-to-id mapping
Example:
const corpus = [
  "machine learning is fun",
  "deep learning is powerful",
  "learning algorithms",
];

vectorizer.fit(corpus);
console.log(vectorizer.vocabulary());
// ["learning", "is", "machine", "deep", ...]

transform()

Convert a single text into a sparse feature vector.
transform(text: string): SparseVector
Parameters:
  • text: The text to vectorize
Returns: SparseVector with parallel indices and values arrays
Example:
const vector = vectorizer.transform("machine learning");
console.log(vector);
// {
//   indices: Uint32Array [0, 2],  // Feature IDs
//   values: Float64Array [1, 1]   // Feature values
// }

transformMany()

Convert multiple texts into sparse vectors.
transformMany(texts: string[]): SparseVector[]
Parameters:
  • texts: Array of texts to vectorize
Returns: Array of sparse vectors
Example:
const vectors = vectorizer.transformMany([
  "deep learning",
  "machine learning",
]);
console.log(vectors.length); // 2

vocabulary()

Get the ordered list of features.
vocabulary(): string[]
Returns: Array of feature strings sorted by feature ID
Example:
const vocab = vectorizer.vocabulary();
console.log(vocab);
// ["learning", "is", "machine", "deep", "fun", "powerful", ...]

toJSON()

Serialize the vectorizer to JSON.
toJSON(): VectorizerSerialized
Returns: Serialized vectorizer object
Example:
const data = vectorizer.toJSON();
await Bun.write("vectorizer.json", JSON.stringify(data));

fromJSON()

Load a vectorizer from serialized data.
static fromJSON(payload: VectorizerSerialized): TextFeatureVectorizer
Parameters:
  • payload: Serialized vectorizer data (version must be 1)
Returns: Loaded vectorizer instance
Throws: Error if version is unsupported
Example:
const data = await Bun.file("vectorizer.json").json();
const vectorizer = TextFeatureVectorizer.fromJSON(data);

Utility Functions

flattenSparseBatch()

Flatten a batch of sparse vectors into a compact representation for efficient batch processing.
flattenSparseBatch(rows: SparseVector[]): {
  docOffsets: Uint32Array;
  featureIds: Uint32Array;
  featureValues: Float64Array;
}
Parameters:
  • rows: Array of sparse vectors
Returns: Object with flattened representation:
  • docOffsets: Cumulative offsets for each document (length = rows.length + 1)
  • featureIds: Concatenated feature indices
  • featureValues: Concatenated feature values
Example:
import { flattenSparseBatch, TextFeatureVectorizer } from "bun_nltk";

const vectorizer = new TextFeatureVectorizer();
vectorizer.fit(["hello world", "hello there"]);

const vectors = vectorizer.transformMany(["hello", "world"]);
const batch = flattenSparseBatch(vectors);

console.log(batch);
// {
//   docOffsets: Uint32Array [0, 1, 2],
//   featureIds: Uint32Array [0, 1],
//   featureValues: Float64Array [1, 1]
// }
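The `docOffsets` array works like the row-pointer array of a CSR matrix: document `i` owns the entries in `[docOffsets[i], docOffsets[i + 1])`. A minimal sketch (standalone, not using bun_nltk) of walking the flattened layout back into per-document slices:

```typescript
// Walk a flattened batch document by document. `iterateDocs` is a
// hypothetical helper shown for illustration; the types mirror the
// return shape of flattenSparseBatch().
type FlatBatch = {
  docOffsets: Uint32Array;
  featureIds: Uint32Array;
  featureValues: Float64Array;
};

function* iterateDocs(
  batch: FlatBatch
): Generator<{ ids: Uint32Array; values: Float64Array }> {
  for (let i = 0; i + 1 < batch.docOffsets.length; i++) {
    const start = batch.docOffsets[i]!;
    const end = batch.docOffsets[i + 1]!;
    yield {
      // subarray() creates views, so no data is copied per document
      ids: batch.featureIds.subarray(start, end),
      values: batch.featureValues.subarray(start, end),
    };
  }
}

// Same batch as in the example above: two documents, one feature each
const batch: FlatBatch = {
  docOffsets: new Uint32Array([0, 1, 2]),
  featureIds: new Uint32Array([0, 1]),
  featureValues: new Float64Array([1, 1]),
};

for (const doc of iterateDocs(batch)) {
  console.log(doc.ids, doc.values); // one (ids, values) pair per document
}
```

Because the slices are typed-array views rather than copies, this iteration is cheap even for large batches.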

Types

SparseVector

type SparseVector = {
  indices: Uint32Array;  // Feature IDs (sorted)
  values: Float64Array;  // Feature values
};
Represents a document as a sparse vector where:
  • indices[i] is the feature ID
  • values[i] is the corresponding value (count or 1 if binary)
  • Arrays are parallel and sorted by feature ID
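When a consumer needs a dense representation (e.g. for a model that expects fixed-length input), the sparse layout expands straightforwardly. The following `toDense` helper is hypothetical, not part of bun_nltk, and assumes indices are in range:

```typescript
// Local copy of the SparseVector shape documented above
type SparseVector = {
  indices: Uint32Array; // Feature IDs (sorted)
  values: Float64Array; // Feature values
};

// Hypothetical helper: expand a SparseVector into a dense Float64Array
// of length `featureCount`. Absent features stay at 0.
function toDense(vector: SparseVector, featureCount: number): Float64Array {
  const dense = new Float64Array(featureCount); // zero-initialized
  for (let i = 0; i < vector.indices.length; i++) {
    dense[vector.indices[i]!] = vector.values[i]!;
  }
  return dense;
}

const sparse: SparseVector = {
  indices: new Uint32Array([0, 2]),
  values: new Float64Array([1, 2]),
};
console.log(toDense(sparse, 4)); // Float64Array [1, 0, 2, 0]
```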

VectorizerSerialized

type VectorizerSerialized = {
  version: number;
  ngramMin: number;
  ngramMax: number;
  binary: boolean;
  maxFeatures: number;
  vocabulary: string[];
};

VectorizerOptions

type VectorizerOptions = {
  ngramMin?: number;
  ngramMax?: number;
  binary?: boolean;
  maxFeatures?: number;
};

Complete Example

import { TextFeatureVectorizer, flattenSparseBatch } from "bun_nltk";

// Create vectorizer with bigrams
const vectorizer = new TextFeatureVectorizer({
  ngramMin: 1,
  ngramMax: 2,
  binary: false,  // Use counts
  maxFeatures: 1000,
});

// Training corpus
const corpus = [
  "natural language processing",
  "machine learning algorithms",
  "deep learning neural networks",
  "text classification and sentiment analysis",
];

// Fit vocabulary
vectorizer.fit(corpus);
console.log(`Vocabulary size: ${vectorizer.featureCount}`);

// Get vocabulary
const vocab = vectorizer.vocabulary();
console.log("Top features:", vocab.slice(0, 10));

// Transform single document
const doc = "machine learning and deep learning";
const vector = vectorizer.transform(doc);
console.log("Sparse vector:");
for (let i = 0; i < vector.indices.length; i++) {
  const featureId = vector.indices[i]!;
  const value = vector.values[i]!;
  const feature = vocab[featureId];
  console.log(`  ${feature}: ${value}`);
}
// Output:
// machine: 1
// learning: 2
// and: 1
// deep: 1
// machine\u0001learning: 1
// deep\u0001learning: 1

// Transform batch
const documents = [
  "natural language",
  "machine learning",
  "deep learning",
];
const vectors = vectorizer.transformMany(documents);
console.log(`Transformed ${vectors.length} documents`);

// Flatten for batch processing
const batch = flattenSparseBatch(vectors);
console.log("Flattened batch:");
console.log(`  Documents: ${batch.docOffsets.length - 1}`);
console.log(`  Total features: ${batch.featureIds.length}`);

// Save vectorizer
const data = vectorizer.toJSON();
await Bun.write("vectorizer.json", JSON.stringify(data));

// Load vectorizer
const loaded = TextFeatureVectorizer.fromJSON(
  await Bun.file("vectorizer.json").json()
);
const newVector = loaded.transform("natural language processing");
console.log("Loaded vectorizer works!", newVector.indices.length);

N-gram Configuration

Unigrams Only (1,1)

const vectorizer = new TextFeatureVectorizer({
  ngramMin: 1,
  ngramMax: 1,
});

vectorizer.fit(["hello world"]);
console.log(vectorizer.vocabulary());
// ["hello", "world"]

Bigrams Only (2,2)

const vectorizer = new TextFeatureVectorizer({
  ngramMin: 2,
  ngramMax: 2,
});

vectorizer.fit(["hello world there"]);
console.log(vectorizer.vocabulary());
// ["hello\u0001world", "world\u0001there"]

Unigrams + Bigrams (1,2)

const vectorizer = new TextFeatureVectorizer({
  ngramMin: 1,
  ngramMax: 2,
});

vectorizer.fit(["hello world"]);
console.log(vectorizer.vocabulary());
// ["hello", "world", "hello\u0001world"]

Unigrams + Bigrams + Trigrams (1,3)

const vectorizer = new TextFeatureVectorizer({
  ngramMin: 1,
  ngramMax: 3,
});

vectorizer.fit(["hello world there"]);
console.log(vectorizer.vocabulary());
// [
//   "hello", "world", "there",
//   "hello\u0001world", "world\u0001there",
//   "hello\u0001world\u0001there"
// ]
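The vocabulary listings above suggest that multi-token features are formed by joining tokens with the `"\u0001"` separator. The sketch below models that observable behavior; it is not the library's actual implementation:

```typescript
// Model of n-gram extraction as the vocabulary outputs above suggest:
// tokenize, then emit every run of n tokens for n in [min, max],
// joined with "\u0001" (assumption based on the documented outputs).
function ngrams(text: string, min: number, max: number): string[] {
  const tokens = (text.match(/[A-Za-z0-9']+/g) ?? []).map((t) =>
    t.toLowerCase()
  );
  const grams: string[] = [];
  for (let n = min; n <= max; n++) {
    for (let i = 0; i + n <= tokens.length; i++) {
      grams.push(tokens.slice(i, i + n).join("\u0001"));
    }
  }
  return grams;
}

console.log(ngrams("hello world there", 1, 2));
// ["hello", "world", "there", "hello\u0001world", "world\u0001there"]
```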

Binary vs. Count Features

Count Features (binary: false)

const vectorizer = new TextFeatureVectorizer({ binary: false });
vectorizer.fit(["hello world"]);

const vector = vectorizer.transform("hello hello world");
// vector.values = [2, 1]  // "hello" appears twice, "world" once

Binary Features (binary: true)

const vectorizer = new TextFeatureVectorizer({ binary: true });
vectorizer.fit(["hello world"]);

const vector = vectorizer.transform("hello hello world");
// vector.values = [1, 1]  // Just presence/absence

Tokenization

The vectorizer uses the regex /[A-Za-z0-9']+/g to tokenize text:
  • Extracts alphanumeric sequences and apostrophes
  • Converts to lowercase
  • Splits on whitespace and punctuation
Examples:
  • "Hello, world!" → ["hello", "world"]
  • "it's great" → ["it's", "great"]
  • "user@email.com" → ["user", "email", "com"]
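The same behavior can be reproduced standalone with the documented regex, for cases where you need matching tokenization outside the vectorizer (e.g. for debugging which tokens a document contributes):

```typescript
// Standalone reproduction of the documented tokenization:
// extract runs matching /[A-Za-z0-9']+/, then lowercase each token.
function tokenize(text: string): string[] {
  return (text.match(/[A-Za-z0-9']+/g) ?? []).map((t) => t.toLowerCase());
}

console.log(tokenize("Hello, world!")); // ["hello", "world"]
console.log(tokenize("it's great"));    // ["it's", "great"]
console.log(tokenize("..."));           // [] (no alphanumeric runs)
```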
