
Overview

The vectorization module provides methods to convert preprocessed text into numerical feature vectors suitable for machine learning models. The system supports multiple vectorization strategies optimized for language detection.

TF-IDF Vectorization

TfidfVectorizer

The primary vectorization method using Term Frequency-Inverse Document Frequency (TF-IDF).
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    max_features=5000,
    min_df=2,
    max_df=0.95,
    ngram_range=(1, 2),
    analyzer='word',
    use_idf=True,
    smooth_idf=True
)

Parameters

  • max_features (int, default: None): Maximum number of features (vocabulary size). Commonly set to 5000 for language detection tasks.
  • min_df (int | float, default: 1): Minimum document frequency. Features appearing in fewer documents are ignored. Recommended value: 2.
  • max_df (float, default: 1.0): Maximum document frequency. Features appearing in more than this proportion of documents are ignored. Recommended value: 0.95 to filter common words.
  • ngram_range (tuple, default: (1, 1)): Range of n-grams to extract. (1, 2) extracts unigrams and bigrams.
  • analyzer (str, default: 'word'): Analysis level: 'word' for word-level features or 'char' for character-level features.
  • use_idf (bool, default: True): Enable inverse-document-frequency reweighting.
  • smooth_idf (bool, default: True): Smooth IDF weights by adding one to document frequencies.

Methods

fit(X, y=None)

Learn vocabulary and IDF weights from training data.
  • X (iterable, required): An iterable of text documents (list of strings or pandas Series).
  • y (array-like, default: None): Target labels (not used, present for API consistency).
  • Returns (TfidfVectorizer): self, the fitted vectorizer.

transform(X)

Transform documents to TF-IDF feature matrix.
  • X (iterable, required): An iterable of text documents to transform.
  • Returns (scipy.sparse.csr_matrix): TF-IDF-weighted document-term matrix with shape (n_documents, n_features).

fit_transform(X, y=None)

Learn vocabulary and IDF weights, then transform documents.
  • X (iterable, required): An iterable of text documents.
  • y (array-like, default: None): Target labels (not used).
  • Returns (scipy.sparse.csr_matrix): TF-IDF-weighted document-term matrix.
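
The fit/transform split above can be illustrated with a minimal round trip; the texts here are made up, and the point is only that transform reuses the feature space learned during fit (unseen words in new texts are simply ignored):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["hola mundo", "hello world", "bonjour le monde"]
new_texts = ["hola amigo"]  # "amigo" is out-of-vocabulary and will be dropped

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # learns vocabulary + IDF
X_new = vectorizer.transform(new_texts)          # reuses the learned vocabulary

# Both matrices share the same columns (one per learned term).
print(X_train.shape, X_new.shape)
```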

Vectorization Strategies

Character-based TF-IDF

Often the most accurate option for language detection, because character n-gram patterns are highly distinctive for each language.
vectorizer = TfidfVectorizer(
    analyzer='char',
    ngram_range=(2, 4)
)

X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)
Advantages:
  • Captures language-specific character patterns
  • Robust to spelling variations
  • Works well with short texts
Configuration:
  • analyzer='char' - Use character-level features
  • ngram_range=(2, 4) - Extract 2-grams, 3-grams, and 4-grams

Word-based TF-IDF

Uses word-level features with n-grams.
vectorizer = TfidfVectorizer(
    analyzer='word',
    ngram_range=(1, 2),
    max_features=5000,
    min_df=2
)

X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)
Advantages:
  • Captures word choice patterns
  • Lower dimensionality than character n-grams
  • Better interpretability
Configuration:
  • analyzer='word' - Use word-level features
  • ngram_range=(1, 2) - Extract unigrams and bigrams
  • max_features=5000 - Limit vocabulary size
  • min_df=2 - Remove rare words

Custom Vectorizers

LetterFrequencyVectorizer

A custom vectorizer that computes letter frequency distributions.
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

class LetterFrequencyVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self):
        # 26 ASCII letters plus accented characters common in European languages
        self.alphabet = list("abcdefghijklmnopqrstuvwxyzáéíóúüñ")
    
    def fit(self, X, y=None):
        # Stateless: nothing to learn, present for scikit-learn API consistency
        return self
    
    def transform(self, X):
        vectors = []
        for text in X:
            text = text.lower()
            total = len(text)
            # Relative frequency of each alphabet character (0 for empty texts)
            freq = [text.count(c) / total if total > 0 else 0 for c in self.alphabet]
            vectors.append(freq)
        return np.array(vectors)
  • alphabet (list, default: a-z plus á, é, í, ó, ú, ü, ñ): List of characters to compute frequencies for. Includes accented characters common in European languages.

Methods

fit(X, y=None)

No-op, present for scikit-learn API consistency. Returns self.

transform(X)

Computes letter frequency vectors.
  • X (iterable, required): An iterable of text documents.
  • Returns (np.ndarray): Array of shape (n_documents, n_letters) containing normalized letter frequencies.

Example

vectorizer = LetterFrequencyVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)

# Train a classifier
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(X_train_vec, y_train)

Alternative Vectorizers

HashingVectorizer

Memory-efficient vectorization using hashing.
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(
    analyzer='char',
    ngram_range=(2, 4),
    n_features=2**18,
    alternate_sign=False
)

X_train_vec = vectorizer.transform(X_train)
X_val_vec = vectorizer.transform(X_val)
  • n_features (int, default: 2**20): Number of features (hash buckets). Use 2**18 (262,144) for language detection.
  • alternate_sign (bool, default: True): Set to False to ensure all feature values are non-negative (required for some algorithms like Naive Bayes).
Advantages:
  • Constant memory footprint
  • No vocabulary needed
  • Fast transformation
Disadvantages:
  • Hash collisions may occur
  • Features are not interpretable
  • Cannot use inverse_transform()
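
A minimal sketch (with a deliberately small n_features for the demo) shows the two properties discussed above: the column count is fixed in advance regardless of the corpus, and alternate_sign=False keeps every value non-negative:

```python
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(
    analyzer='char',
    ngram_range=(2, 4),
    n_features=2**10,    # small bucket count, demo only
    alternate_sign=False,
)

# No fit step needed: hashing maps n-grams straight to column indices.
X = vectorizer.transform(["hola mundo", "hello world"])

print(X.shape)          # column count equals n_features, always
print(X.min() >= 0)     # alternate_sign=False => non-negative values
```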

CountVectorizer

Simple word count vectorization.
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
    max_features=5000,
    min_df=2,
    ngram_range=(1, 2)
)
Similar to TfidfVectorizer but without IDF weighting.

Complete Vectorization Function

import time
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer

def vectorizar_datos(X_train, X_val, tipo_vectorizacion):
    """
    Applies specified vectorization and returns vectorized data.
    
    Args:
        X_train: Training texts
        X_val: Validation texts
        tipo_vectorizacion: Type of vectorization
            - 'chars-tf-idf': Character-based TF-IDF
            - 'words-tf-idf': Word-based TF-IDF
            - 'letter-frequency': Letter frequency distribution
            - 'hashing': Hashing vectorizer
    
    Returns:
        tuple: (X_train_vec, X_val_vec, vectorizer, tiempo_vectorizacion)
    """
    inicio = time.time()
    
    if tipo_vectorizacion == 'chars-tf-idf':
        vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(2, 4))
    
    elif tipo_vectorizacion == 'words-tf-idf':
        vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 2))
    
    elif tipo_vectorizacion == 'letter-frequency':
        vectorizer = LetterFrequencyVectorizer()
    
    elif tipo_vectorizacion == 'hashing':
        vectorizer = HashingVectorizer(
            analyzer='char',
            ngram_range=(2, 4),
            n_features=2**18,
            alternate_sign=False
        )
    
    else:
        raise ValueError(f"Unknown vectorization type: '{tipo_vectorizacion}'")
    
    # Fit and transform
    X_train_vec = vectorizer.fit_transform(X_train)
    X_val_vec = vectorizer.transform(X_val)
    
    tiempo_vectorizacion = time.time() - inicio
    print(f"Vectorization time: {tiempo_vectorizacion:.2f} seconds")
    
    return X_train_vec, X_val_vec, vectorizer, tiempo_vectorizacion

Example Usage

X_train_vec, X_val_vec, vectorizer, _ = vectorizar_datos(
    X_train, X_val, 'chars-tf-idf'
)
print(f"Training matrix shape: {X_train_vec.shape}")
print(f"Validation matrix shape: {X_val_vec.shape}")

Performance Considerations

Memory Usage

  • TfidfVectorizer: Memory proportional to vocabulary size
  • HashingVectorizer: Fixed memory footprint
  • LetterFrequencyVectorizer: Minimal memory (33 features per document)

Speed

  • Character n-grams: Slower than word n-grams but typically more accurate for language detection
  • Hashing: Fastest transformation
  • Letter frequency: Fast but lower accuracy

Accuracy

For language detection:
  1. Character TF-IDF (highest accuracy)
  2. Word TF-IDF
  3. Letter frequency (lowest accuracy but fastest)
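
To check these speed trade-offs on your own data, a rough timing sketch along the lines of vectorizar_datos can be used; the corpus below is synthetic and the absolute numbers are not meaningful benchmarks, only the relative ordering on your real corpus matters:

```python
import time
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer

# Illustrative corpus; substitute your own training texts.
corpus = ["hola mundo como estas", "hello world how are you",
          "bonjour le monde comment vas tu"] * 200

candidates = {
    "chars-tf-idf": TfidfVectorizer(analyzer='char', ngram_range=(2, 4)),
    "words-tf-idf": TfidfVectorizer(analyzer='word', ngram_range=(1, 2)),
    "hashing": HashingVectorizer(analyzer='char', ngram_range=(2, 4),
                                 n_features=2**18, alternate_sign=False),
}

timings = {}
for name, vec in candidates.items():
    start = time.time()
    X = vec.fit_transform(corpus)  # HashingVectorizer's fit is a no-op
    timings[name] = time.time() - start
    print(f"{name}: {X.shape[1]} features in {timings[name]:.3f}s")
```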
