Overview
The vectorization module provides methods to convert preprocessed text into numerical feature vectors suitable for machine learning models. The system supports multiple vectorization strategies optimized for language detection.

TF-IDF Vectorization
TfidfVectorizer
The primary vectorization method, using Term Frequency-Inverse Document Frequency (TF-IDF).

Parameters
- max_features: Maximum number of features (vocabulary size). Commonly set to 5000 for language detection tasks.
- min_df: Minimum document frequency. Features appearing in fewer documents are ignored. Recommended value: 2.
- max_df: Maximum document frequency. Features appearing in more than this proportion of documents are ignored. Recommended value: 0.95, to filter common words.
- ngram_range: Range of n-grams to extract. (1, 2) extracts unigrams and bigrams.
- analyzer: Analysis level: 'word' for word-level features or 'char' for character-level features.
- use_idf: Enable inverse-document-frequency reweighting.
- smooth_idf: Smooth IDF weights by adding one to document frequencies.
Methods
fit(X, y=None)
Learn vocabulary and IDF weights from training data.
- X: An iterable of text documents (list of strings or a pandas Series).
- y: Target labels (not used; present for API consistency).
Returns: self (the fitted vectorizer).
transform(X)
Transform documents to a TF-IDF feature matrix.
- X: An iterable of text documents to transform.
Returns: TF-IDF-weighted document-term matrix with shape (n_documents, n_features).

fit_transform(X, y=None)
Learn vocabulary and IDF weights, then transform the documents.
- X: An iterable of text documents.
- y: Target labels (not used).
Returns: TF-IDF-weighted document-term matrix.
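The fit/transform workflow can be sketched as follows; the toy corpus here is purely illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["bonjour le monde", "hello world", "hola mundo"]
test_docs = ["hello there world"]

vec = TfidfVectorizer()            # default word-level settings
vec.fit(train_docs)                # learn vocabulary and IDF weights
X_train = vec.transform(train_docs)
X_test = vec.transform(test_docs)  # reuses the fitted vocabulary

# fit_transform is equivalent to fit followed by transform:
X_train2 = vec.fit_transform(train_docs)

print(X_train.shape)  # (n_documents, n_features)
```

Note that transform on unseen documents only counts terms already in the fitted vocabulary; out-of-vocabulary words are silently dropped.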
Vectorization Strategies
Character-based TF-IDF
Optimal for language detection, since character-level patterns are often unique to each language.
- Captures language-specific character patterns
- Robust to spelling variations
- Works well with short texts
Recommended settings:
- analyzer='char' - Use character-level features
- ngram_range=(2, 4) - Extract 2-grams, 3-grams, and 4-grams
Word-based TF-IDF
Uses word-level features with n-grams.
- Captures word choice patterns
- Lower dimensionality than character n-grams
- Better interpretability
Recommended settings:
- analyzer='word' - Use word-level features
- ngram_range=(1, 2) - Extract unigrams and bigrams
- max_features=5000 - Limit vocabulary size
- min_df=2 - Remove rare words
Custom Vectorizers
LetterFrequencyVectorizer
A custom vectorizer that computes letter frequency distributions. Its constructor takes the list of characters to compute frequencies for, which includes accented characters common in European languages.
Methods
fit(X, y=None) - No-op; present for API consistency.
transform(X) - Compute letter frequency vectors.
- X: An iterable of text documents.
Returns: Array of shape (n_documents, n_letters) containing normalized letter frequencies.
Example
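A minimal sketch of such a vectorizer. The class name follows this document; the exact 34-character alphabet is an assumption here (26 ASCII letters plus 8 accented characters), chosen to match the 34 features per document mentioned under Performance Considerations:

```python
import numpy as np

class LetterFrequencyVectorizer:
    """Compute normalized letter-frequency vectors (sketch;
    the real module's character list may differ)."""

    def __init__(self, letters="abcdefghijklmnopqrstuvwxyzàéèíóúüñ"):
        self.letters = list(letters)
        self.index = {c: i for i, c in enumerate(self.letters)}

    def fit(self, X, y=None):
        return self  # no-op, kept for API consistency

    def transform(self, X):
        out = np.zeros((len(X), len(self.letters)))
        for row, doc in enumerate(X):
            for ch in doc.lower():
                i = self.index.get(ch)
                if i is not None:
                    out[row, i] += 1
            total = out[row].sum()
            if total:
                out[row] /= total  # normalize counts to frequencies
        return out

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

v = LetterFrequencyVectorizer()
M = v.fit(["abc"]).transform(["aab", "ñ ñ"])
```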
Alternative Vectorizers
HashingVectorizer
Memory-efficient vectorization using feature hashing.

Parameters
- n_features: Number of features (hash buckets). Use 2**18 (262,144) for language detection.
- alternate_sign: Set to False to ensure all feature values are non-negative (required for some algorithms like Naive Bayes).

Advantages
- Constant memory footprint
- No vocabulary needed
- Fast transformation

Limitations
- Hash collisions may occur
- Features are not interpretable
- Cannot use inverse_transform()
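A short sketch of the hashing setup described above; because the vectorizer is stateless, documents can be transformed without any fitting step:

```python
from sklearn.feature_extraction.text import HashingVectorizer

hasher = HashingVectorizer(
    n_features=2**18,      # 262,144 hash buckets
    alternate_sign=False,  # keep values non-negative, e.g. for Naive Bayes
)
docs = ["hello world", "bonjour le monde"]
X = hasher.transform(docs)  # no fit needed: there is no vocabulary to learn
```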
CountVectorizer
Simple word count vectorization.

Complete Vectorization Function
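One way to tie the strategies above together is a factory function. This is a sketch under assumptions: the name `build_vectorizer` and its strategy strings are hypothetical, not the module's actual API:

```python
from sklearn.feature_extraction.text import (
    CountVectorizer,
    HashingVectorizer,
    TfidfVectorizer,
)

def build_vectorizer(strategy="char_tfidf"):
    """Return a vectorizer for one of the strategies described above.
    Hypothetical helper; the real module's function may differ."""
    if strategy == "char_tfidf":
        return TfidfVectorizer(analyzer="char", ngram_range=(2, 4))
    if strategy == "word_tfidf":
        return TfidfVectorizer(analyzer="word", ngram_range=(1, 2),
                               max_features=5000, min_df=2, max_df=0.95)
    if strategy == "hashing":
        return HashingVectorizer(n_features=2**18, alternate_sign=False)
    if strategy == "count":
        return CountVectorizer()
    raise ValueError(f"unknown strategy: {strategy}")
```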
Example Usage
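An end-to-end sketch, pairing character TF-IDF with a Naive Bayes classifier; the four-sentence corpus and labels are invented for illustration, and real training data would come from the preprocessing step:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical mini-corpus of English and German sentences.
texts = [
    "the quick brown fox",
    "the lazy dog sleeps",
    "der schnelle braune fuchs",
    "der faule hund schläft",
]
labels = ["en", "en", "de", "de"]

vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 4))
X = vectorizer.fit_transform(texts)

clf = MultinomialNB().fit(X, labels)
pred = clf.predict(vectorizer.transform(["the brown dog"]))
print(pred[0])
```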
Performance Considerations
Memory Usage
- TfidfVectorizer: Memory proportional to vocabulary size
- HashingVectorizer: Fixed memory footprint
- LetterFrequencyVectorizer: Minimal memory (34 features per document)
Speed
- Character n-grams: Slower than word n-grams but more accurate
- Hashing: Fastest transformation
- Letter frequency: Fast but lower accuracy
Accuracy
For language detection, accuracy typically ranks as follows:
- Character TF-IDF (highest accuracy)
- Word TF-IDF
- Letter frequency (lowest accuracy but fastest)
Related Documentation
- Preprocessing API - Prepare text before vectorization
- Models API - Train classifiers on vectorized features
- Training Guide - Optimize vectorization parameters