
Overview

The vectorization module provides methods to convert preprocessed text into numerical feature vectors suitable for machine learning models. The system supports multiple vectorization strategies optimized for language detection.

TF-IDF Vectorization

TfidfVectorizer

The primary vectorization method using Term Frequency-Inverse Document Frequency (TF-IDF).
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    max_features=5000,
    min_df=2,
    max_df=0.95,
    ngram_range=(1, 2),
    analyzer='word',
    use_idf=True,
    smooth_idf=True
)

Parameters

  • max_features (int, default: None): Maximum number of features (vocabulary size). Commonly set to 5000 for language detection tasks.
  • min_df (int | float, default: 1): Minimum document frequency. Features appearing in fewer documents are ignored. Recommended value: 2.
  • max_df (float, default: 1.0): Maximum document frequency. Features appearing in more than this proportion of documents are ignored. Recommended value: 0.95 to filter common words.
  • ngram_range (tuple, default: (1, 1)): Range of n-grams to extract. (1, 2) extracts unigrams and bigrams.
  • analyzer (str, default: 'word'): Analysis level: 'word' for word-level features or 'char' for character-level features.
  • use_idf (bool, default: True): Enable inverse-document-frequency reweighting.
  • smooth_idf (bool, default: True): Smooth IDF weights by adding one to document frequencies.

Methods

fit(X, y=None)

Learn vocabulary and IDF weights from training data.
  • X (iterable, required): An iterable of text documents (list of strings or pandas Series).
  • y (array-like, default: None): Target labels (not used, present for API consistency).
  • Returns (TfidfVectorizer): self, the fitted vectorizer.

transform(X)

Transform documents to TF-IDF feature matrix.
  • X (iterable, required): An iterable of text documents to transform.
  • Returns (scipy.sparse.csr_matrix): TF-IDF-weighted document-term matrix with shape (n_documents, n_features).

fit_transform(X, y=None)

Learn vocabulary and IDF weights, then transform documents.
  • X (iterable, required): An iterable of text documents.
  • y (array-like, default: None): Target labels (not used).
  • Returns (scipy.sparse.csr_matrix): TF-IDF-weighted document-term matrix.
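
The fit/transform split above can be illustrated with a minimal round trip; the texts here are made up, and the point is only that transform reuses the feature space learned during fit (unseen words in new texts are simply ignored):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["hola mundo", "hello world", "bonjour le monde"]
new_texts = ["hola amigo"]  # "amigo" is out-of-vocabulary and will be dropped

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # learns vocabulary + IDF
X_new = vectorizer.transform(new_texts)          # reuses the learned vocabulary

# Both matrices share the same columns (one per learned term).
print(X_train.shape, X_new.shape)
```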

Vectorization Strategies

Character-based TF-IDF

Often the most accurate option for language detection, because character n-gram patterns are highly distinctive for each language.
vectorizer = TfidfVectorizer(
    analyzer='char',
    ngram_range=(2, 4)
)

X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)
Advantages:
  • Captures language-specific character patterns
  • Robust to spelling variations
  • Works well with short texts
Configuration:
  • analyzer='char' - Use character-level features
  • ngram_range=(2, 4) - Extract 2-grams, 3-grams, and 4-grams

Word-based TF-IDF

Uses word-level features with n-grams.
vectorizer = TfidfVectorizer(
    analyzer='word',
    ngram_range=(1, 2),
    max_features=5000,
    min_df=2
)

X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)
Advantages:
  • Captures word choice patterns
  • Lower dimensionality than character n-grams
  • Better interpretability
Configuration:
  • analyzer='word' - Use word-level features
  • ngram_range=(1, 2) - Extract unigrams and bigrams
  • max_features=5000 - Limit vocabulary size
  • min_df=2 - Remove rare words

Custom Vectorizers

LetterFrequencyVectorizer

A custom vectorizer that computes letter frequency distributions.
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

class LetterFrequencyVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self):
        # 26 ASCII letters plus accented characters common in European languages
        self.alphabet = list("abcdefghijklmnopqrstuvwxyzáéíóúüñ")
    
    def fit(self, X, y=None):
        # Stateless: nothing to learn, present for scikit-learn API consistency
        return self
    
    def transform(self, X):
        vectors = []
        for text in X:
            text = text.lower()
            total = len(text)
            # Relative frequency of each alphabet character (0 for empty texts)
            freq = [text.count(c) / total if total > 0 else 0 for c in self.alphabet]
            vectors.append(freq)
        return np.array(vectors)
  • alphabet (list, default: a-z plus á, é, í, ó, ú, ü, ñ): List of characters to compute frequencies for. Includes accented characters common in European languages.

Methods

fit(X, y=None)

No-op, present for scikit-learn API consistency. Returns self.

transform(X)

Computes letter frequency vectors.
  • X (iterable, required): An iterable of text documents.
  • Returns (np.ndarray): Array of shape (n_documents, n_letters) containing normalized letter frequencies.

Example

vectorizer = LetterFrequencyVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)

# Train a classifier
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(X_train_vec, y_train)

Alternative Vectorizers

HashingVectorizer

Memory-efficient vectorization using hashing.
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(
    analyzer='char',
    ngram_range=(2, 4),
    n_features=2**18,
    alternate_sign=False
)

X_train_vec = vectorizer.transform(X_train)
X_val_vec = vectorizer.transform(X_val)
  • n_features (int, default: 2**20): Number of features (hash buckets). Use 2**18 (262,144) for language detection.
  • alternate_sign (bool, default: True): Set to False to ensure all feature values are non-negative (required for some algorithms like Naive Bayes).
Advantages:
  • Constant memory footprint
  • No vocabulary needed
  • Fast transformation
Disadvantages:
  • Hash collisions may occur
  • Features are not interpretable
  • Cannot use inverse_transform()
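
A minimal sketch (with a deliberately small n_features for the demo) shows the two properties discussed above: the column count is fixed in advance regardless of the corpus, and alternate_sign=False keeps every value non-negative:

```python
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(
    analyzer='char',
    ngram_range=(2, 4),
    n_features=2**10,    # small bucket count, demo only
    alternate_sign=False,
)

# No fit step needed: hashing maps n-grams straight to column indices.
X = vectorizer.transform(["hola mundo", "hello world"])

print(X.shape)          # column count equals n_features, always
print(X.min() >= 0)     # alternate_sign=False => non-negative values
```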

CountVectorizer

Simple word count vectorization.
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
    max_features=5000,
    min_df=2,
    ngram_range=(1, 2)
)
Similar to TfidfVectorizer but without IDF weighting.

Complete Vectorization Function

import time
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer

def vectorizar_datos(X_train, X_val, tipo_vectorizacion):
    """
    Applies specified vectorization and returns vectorized data.
    
    Args:
        X_train: Training texts
        X_val: Validation texts
        tipo_vectorizacion: Type of vectorization
            - 'chars-tf-idf': Character-based TF-IDF
            - 'words-tf-idf': Word-based TF-IDF
            - 'letter-frequency': Letter frequency distribution
            - 'hashing': Hashing vectorizer
    
    Returns:
        tuple: (X_train_vec, X_val_vec, vectorizer, tiempo_vectorizacion)
    """
    inicio = time.time()
    
    if tipo_vectorizacion == 'chars-tf-idf':
        vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(2, 4))
    
    elif tipo_vectorizacion == 'words-tf-idf':
        vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 2))
    
    elif tipo_vectorizacion == 'letter-frequency':
        vectorizer = LetterFrequencyVectorizer()
    
    elif tipo_vectorizacion == 'hashing':
        vectorizer = HashingVectorizer(
            analyzer='char',
            ngram_range=(2, 4),
            n_features=2**18,
            alternate_sign=False
        )
    
    else:
        raise ValueError(f"Unknown vectorization type: '{tipo_vectorizacion}'")
    
    # Fit and transform
    X_train_vec = vectorizer.fit_transform(X_train)
    X_val_vec = vectorizer.transform(X_val)
    
    tiempo_vectorizacion = time.time() - inicio
    print(f"Vectorization time: {tiempo_vectorizacion:.2f} seconds")
    
    return X_train_vec, X_val_vec, vectorizer, tiempo_vectorizacion

Example Usage

X_train_vec, X_val_vec, vectorizer, _ = vectorizar_datos(
    X_train, X_val, 'chars-tf-idf'
)
print(f"Training matrix shape: {X_train_vec.shape}")
print(f"Validation matrix shape: {X_val_vec.shape}")

Performance Considerations

Memory Usage

  • TfidfVectorizer: Memory proportional to vocabulary size
  • HashingVectorizer: Fixed memory footprint
  • LetterFrequencyVectorizer: Minimal memory (33 features per document)

Speed

  • Character n-grams: Slower than word n-grams but typically more accurate for language detection
  • Hashing: Fastest transformation
  • Letter frequency: Fast but lower accuracy

Accuracy

For language detection:
  1. Character TF-IDF (highest accuracy)
  2. Word TF-IDF
  3. Letter frequency (lowest accuracy but fastest)
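
To check these speed trade-offs on your own data, a rough timing sketch along the lines of vectorizar_datos can be used; the corpus below is synthetic and the absolute numbers are not meaningful benchmarks, only the relative ordering on your real corpus matters:

```python
import time
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer

# Illustrative corpus; substitute your own training texts.
corpus = ["hola mundo como estas", "hello world how are you",
          "bonjour le monde comment vas tu"] * 200

candidates = {
    "chars-tf-idf": TfidfVectorizer(analyzer='char', ngram_range=(2, 4)),
    "words-tf-idf": TfidfVectorizer(analyzer='word', ngram_range=(1, 2)),
    "hashing": HashingVectorizer(analyzer='char', ngram_range=(2, 4),
                                 n_features=2**18, alternate_sign=False),
}

timings = {}
for name, vec in candidates.items():
    start = time.time()
    X = vec.fit_transform(corpus)  # HashingVectorizer's fit is a no-op
    timings[name] = time.time() - start
    print(f"{name}: {X.shape[1]} features in {timings[name]:.3f}s")
```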
