TextFeatureVectorizer
Converts text into sparse numerical feature vectors using n-gram tokenization. Used internally by classifiers like decision trees and linear models.
Constructor
- ngramMin (optional): Minimum n-gram size (default: 1, minimum: 1)
- ngramMax (optional): Maximum n-gram size (default: 1, minimum: ngramMin)
- binary (optional): Use binary features (presence/absence) instead of counts (default: false)
- maxFeatures (optional): Maximum vocabulary size (default: 12000, minimum: 64)
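The defaults and lower bounds listed above can be captured in a small helper. This is a sketch only: the field names mirror the options above, but resolveOptions is a hypothetical helper, and clamping out-of-range values (rather than throwing) is an assumption about how the library behaves.

```typescript
// Field names follow the option list above; every field is optional.
interface VectorizerOptions {
  ngramMin?: number;
  ngramMax?: number;
  binary?: boolean;
  maxFeatures?: number;
}

// Hypothetical helper: applies the documented defaults and lower bounds.
// Assumption: out-of-range values are clamped rather than rejected.
function resolveOptions(opts: VectorizerOptions = {}): Required<VectorizerOptions> {
  const ngramMin = Math.max(1, Math.floor(opts.ngramMin ?? 1));
  // ngramMax may never be smaller than the resolved ngramMin.
  const ngramMax = Math.max(ngramMin, Math.floor(opts.ngramMax ?? 1));
  const maxFeatures = Math.max(64, Math.floor(opts.maxFeatures ?? 12000));
  return { ngramMin, ngramMax, binary: opts.binary ?? false, maxFeatures };
}
```

Under these assumptions, resolveOptions({ ngramMin: 2 }) yields ngramMax 2, because the default of 1 is raised to the minimum size.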
Properties
featureCount
Get the number of features in the vocabulary.
Methods
fit()
Build the vocabulary from a corpus of texts.
- texts: Array of text documents
Fitting performs three steps:
- Extracts n-grams from all texts
- Selects the most frequent n-grams, up to maxFeatures
- Builds the internal feature-to-id mapping
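The fitting steps can be sketched as follows. This is an illustration, not the library's code: buildVocabulary is a hypothetical stand-in for fit(), and joining n-grams with a space, assigning ids in frequency order, and breaking ties alphabetically are all assumptions.

```typescript
// Tokenizer as described in the Tokenization section: lowercase, then match /[A-Za-z0-9']+/g.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[A-Za-z0-9']+/g) ?? [];
}

// All n-grams with ngramMin <= n <= ngramMax, joined with a single space (an assumption).
function extractNgrams(tokens: string[], ngramMin: number, ngramMax: number): string[] {
  const grams: string[] = [];
  for (let n = ngramMin; n <= ngramMax; n++) {
    for (let i = 0; i + n <= tokens.length; i++) {
      grams.push(tokens.slice(i, i + n).join(" "));
    }
  }
  return grams;
}

// Hypothetical stand-in for fit(): counts n-grams over the corpus, keeps the
// maxFeatures most frequent, and assigns ids in that order.
function buildVocabulary(
  texts: string[],
  ngramMin = 1,
  ngramMax = 1,
  maxFeatures = 12000,
): Map<string, number> {
  const counts = new Map<string, number>();
  for (const text of texts) {
    for (const gram of extractNgrams(tokenize(text), ngramMin, ngramMax)) {
      counts.set(gram, (counts.get(gram) ?? 0) + 1);
    }
  }
  const ranked: Array<[string, number]> = [];
  counts.forEach((count, gram) => ranked.push([gram, count]));
  // Descending frequency; ties broken alphabetically so the result is deterministic.
  ranked.sort((a, b) => b[1] - a[1] || (a[0] < b[0] ? -1 : 1));
  const featureToId = new Map<string, number>();
  ranked.slice(0, maxFeatures).forEach(([gram], id) => featureToId.set(gram, id));
  return featureToId;
}
```

With the corpus ["the cat sat", "the cat ran", "the dog"] and a cap of two features, only "the" (id 0) and "cat" (id 1) survive; the rarer n-grams are dropped.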
transform()
Convert a single text into a sparse feature vector.
- text: The text to vectorize
Returns a SparseVector with indices and values arrays.
Example:
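A minimal sketch of the transformation, not the library's implementation. transformSketch is a hypothetical stand-in for transform(), limited to unigrams for brevity; the hand-built vocabulary, the use of plain number arrays, and the skipping of unknown tokens are assumptions.

```typescript
interface SparseVector {
  indices: number[];
  values: number[];
}

// Hypothetical stand-in for transform(). vocabulary maps each known feature to its id;
// tokens that are not in the vocabulary are ignored (an assumption).
function transformSketch(
  text: string,
  vocabulary: Map<string, number>,
  binary = false,
): SparseVector {
  const counts = new Map<number, number>();
  const tokens = text.toLowerCase().match(/[A-Za-z0-9']+/g) ?? [];
  for (const token of tokens) {
    const id = vocabulary.get(token);
    if (id === undefined) continue;
    counts.set(id, binary ? 1 : (counts.get(id) ?? 0) + 1);
  }
  // Emit parallel arrays sorted by feature id.
  const indices: number[] = [];
  counts.forEach((_value, id) => indices.push(id));
  indices.sort((a, b) => a - b);
  return { indices, values: indices.map((id) => counts.get(id) ?? 0) };
}

const vocabulary = new Map<string, number>([["good", 0], ["movie", 1], ["bad", 2]]);
const vector = transformSketch("Good movie, really good!", vocabulary);
// vector.indices is [0, 1] and vector.values is [2, 1]
```

The token "really" is absent from the vocabulary, so it contributes nothing to the vector.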
transformMany()
Convert multiple texts into sparse vectors.
- texts: Array of texts to vectorize
vocabulary()
Get the ordered list of features.
toJSON()
Serialize the vectorizer to JSON.
fromJSON()
Load a vectorizer from serialized data.
- payload: Serialized vectorizer data (version must be 1)
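The version requirement can be illustrated with a round trip. Only the version field and its required value of 1 come from the description above; the SerializedSketch shape, the toPayload and fromPayload helpers, and the choice to throw on a mismatch are hypothetical.

```typescript
// Hypothetical payload shape: only `version` is documented; `vocabulary` is illustrative.
interface SerializedSketch {
  version: number;
  vocabulary: string[];
}

function toPayload(vocabulary: string[]): SerializedSketch {
  return { version: 1, vocabulary: vocabulary.slice() };
}

// Rejects anything that is not a version-1 payload, then rebuilds the feature-to-id mapping.
function fromPayload(payload: SerializedSketch): Map<string, number> {
  if (payload.version !== 1) {
    throw new Error("Unsupported vectorizer payload version: " + payload.version);
  }
  const featureToId = new Map<string, number>();
  payload.vocabulary.forEach((feature, id) => featureToId.set(feature, id));
  return featureToId;
}

// Round trip through a JSON string, as one would when persisting a fitted vectorizer.
const restored = fromPayload(JSON.parse(JSON.stringify(toPayload(["good", "movie"]))));
```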
Utility Functions
flattenSparseBatch()
Flatten a batch of sparse vectors into a compact representation for efficient batch processing.
- rows: Array of sparse vectors
Returns an object with:
- docOffsets: Cumulative offsets for each document (length = rows.length + 1)
- featureIds: Concatenated feature indices
- featureValues: Concatenated feature values
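The layout described here is a CSR-style (compressed sparse row) encoding. The sketch below shows one way to produce it; flattenSketch and the FlatBatch name are hypothetical, and plain number arrays are assumed where the library may use typed arrays.

```typescript
interface SparseVector {
  indices: number[];
  values: number[];
}

// Hypothetical name for the returned shape; the three field names come from the description above.
interface FlatBatch {
  docOffsets: number[];
  featureIds: number[];
  featureValues: number[];
}

// Row i occupies positions docOffsets[i] .. docOffsets[i + 1] - 1 of the two flat arrays.
function flattenSketch(rows: SparseVector[]): FlatBatch {
  const docOffsets: number[] = [0];
  const featureIds: number[] = [];
  const featureValues: number[] = [];
  for (const row of rows) {
    for (const id of row.indices) featureIds.push(id);
    for (const value of row.values) featureValues.push(value);
    // The running total marks where the next row starts.
    docOffsets.push(featureIds.length);
  }
  return { docOffsets, featureIds, featureValues };
}

const flat = flattenSketch([
  { indices: [0, 3], values: [2, 1] },
  { indices: [], values: [] },
  { indices: [1], values: [4] },
]);
// flat.docOffsets    is [0, 2, 2, 3]
// flat.featureIds    is [0, 3, 1]
// flat.featureValues is [2, 1, 4]
```

An empty document shows up as two equal consecutive offsets, which is why docOffsets needs rows.length + 1 entries.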
Types
SparseVector
- indices[i] is the feature ID
- values[i] is the corresponding value (count, or 1 if binary)
- Arrays are parallel and sorted by feature ID
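These invariants can be written down as a type plus two small helpers. The interface below assumes plain number arrays (the library may use typed arrays), and isWellFormed and valueAt are hypothetical helpers that additionally assume each feature ID appears at most once.

```typescript
interface SparseVector {
  indices: number[]; // feature IDs in ascending order
  values: number[]; // values[i] belongs to indices[i]
}

// Checks the invariants listed above: equal lengths and strictly ascending feature IDs.
function isWellFormed(vector: SparseVector): boolean {
  if (vector.indices.length !== vector.values.length) return false;
  for (let i = 1; i < vector.indices.length; i++) {
    if (vector.indices[i] <= vector.indices[i - 1]) return false;
  }
  return true;
}

// Because indices are sorted, a lookup can use binary search; absent features read as 0.
function valueAt(vector: SparseVector, featureId: number): number {
  let lo = 0;
  let hi = vector.indices.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const id = vector.indices[mid];
    if (id === featureId) return vector.values[mid];
    if (id < featureId) lo = mid + 1;
    else hi = mid - 1;
  }
  return 0;
}
```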
VectorizerSerialized
VectorizerOptions
Complete Example
N-gram Configuration
Unigrams Only (1,1)
Bigrams Only (2,2)
Unigrams + Bigrams (1,2)
Unigrams + Bigrams + Trigrams (1,3)
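The four configurations above differ only in which window sizes are emitted. The sketch below makes that concrete; ngrams is a hypothetical helper, and joining consecutive tokens with a single space is an assumption about how the library names its features.

```typescript
// Emits every run of n consecutive tokens for ngramMin <= n <= ngramMax.
// Assumption: the tokens of an n-gram are joined by a single space.
function ngrams(tokens: string[], ngramMin: number, ngramMax: number): string[] {
  const out: string[] = [];
  for (let n = ngramMin; n <= ngramMax; n++) {
    for (let i = 0; i + n <= tokens.length; i++) {
      out.push(tokens.slice(i, i + n).join(" "));
    }
  }
  return out;
}

const sample = ["not", "very", "good"];
const unigrams = ngrams(sample, 1, 1); // ["not", "very", "good"]
const bigrams = ngrams(sample, 2, 2); // ["not very", "very good"]
const uniBi = ngrams(sample, 1, 2); // ["not", "very", "good", "not very", "very good"]
const uniBiTri = ngrams(sample, 1, 3); // the (1,2) list plus "not very good"
```

Wider ranges capture more context ("not very good" as a single feature) at the cost of a larger candidate vocabulary, which is where maxFeatures comes into play.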
Binary vs. Count Features
Count Features (binary: false)
Binary Features (binary: true)
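The difference between the two modes can be shown on a single token list. This is a sketch: featureValues is a hypothetical helper, and it keys values by token rather than by feature ID to keep the example short.

```typescript
// Counts occurrences per token, or records presence only when binary is true.
function featureValues(tokens: string[], binary: boolean): Map<string, number> {
  const values = new Map<string, number>();
  for (const token of tokens) {
    values.set(token, binary ? 1 : (values.get(token) ?? 0) + 1);
  }
  return values;
}

const repeated = ["spam", "spam", "spam", "eggs"];
const counted = featureValues(repeated, false); // spam → 3, eggs → 1
const presence = featureValues(repeated, true); // spam → 1, eggs → 1
```

Counts preserve how often a feature repeats, while binary features record only whether it occurs, so repetition within a document has no effect on the vector.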
Tokenization
The vectorizer uses the regex /[A-Za-z0-9']+/g to tokenize text:
- Extracts alphanumeric sequences and apostrophes
- Converts to lowercase
- Splits on whitespace and punctuation (apostrophes are kept inside tokens)
"Hello, world!" → ["hello", "world"]
"it's great" → ["it's", "great"]
"user@email.com" → ["user", "email", "com"]