Overview

Text vectorization transforms cleaned text into numerical features that the machine learning model can process. This project uses TF-IDF (Term Frequency-Inverse Document Frequency) vectorization with optimized parameters for high performance.

TF-IDF Vectorizer Configuration

The model uses scikit-learn’s TfidfVectorizer with specific parameters chosen for optimal accuracy:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X_tfidf = vectorizer.fit_transform(X)
Reference: fake_news_ia.py:82-83

Configuration Parameters

max_features
int
default:"5000"
Maximum number of features (terms) to extract from the corpus. This limits the vocabulary size to the 5000 most frequent terms across the documents.
Why 5000? This value balances:
  • Performance: Smaller feature space = faster training
  • Accuracy: Captures the most discriminative terms
  • Memory: Prevents excessive memory usage with sparse matrices
ngram_range
tuple
default:"(1, 2)"
Range of n-gram sizes to extract. (1, 2) means the vectorizer will extract both:
  • Unigrams (single words): e.g., “president”, “fake”, “news”
  • Bigrams (two-word phrases): e.g., “breaking news”, “donald trump”, “fake news”
Why bigrams? Fake news often uses specific phrase patterns that single words can’t capture. For example:
  • “you won’t believe” (common in fake news)
  • “according to” (common in real news)
  • “sources say” (context-dependent)

How TF-IDF Works

1

Term Frequency (TF)

Measures how often a term appears in a document:
TF(term, document) = (Number of times term appears) / (Total terms in document)
Higher TF means the term is more important to that specific document.
2

Inverse Document Frequency (IDF)

Measures how rare a term is across all documents:
IDF(term) = log(Total documents / Documents containing term)
Higher IDF means the term is more distinctive and informative.
3

TF-IDF Score

Combines both metrics:
TF-IDF(term, document) = TF(term, document) × IDF(term)
This score is high when a term appears frequently in a document but rarely across the corpus, making it discriminative.
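The three formulas above can be computed by hand on a toy corpus. Note this is the textbook formulation shown in the steps; scikit-learn's TfidfVectorizer uses a smoothed IDF and L2-normalizes each row, so its exact numbers differ:

```python
import math

# Toy corpus of tokenized documents (illustrative only)
docs = [
    ["fake", "news", "spreads", "fast"],
    ["real", "news", "reports", "facts"],
    ["fake", "stories", "spread", "fast"],
]

def tf(term, doc):
    # TF = occurrences of term / total terms in the document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # IDF = log(total documents / documents containing the term)
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "news" appears in 2 of 3 documents, "spreads" in only 1,
# so "spreads" gets the higher (more discriminative) score
print(round(tf_idf("news", docs[0], docs), 3))
print(round(tf_idf("spreads", docs[0], docs), 3))
```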

Vectorization Process

1

Prepare Clean Text

Use the preprocessed text column that combines title and body:
X = df["clean_text"]  # Cleaned text from preprocessing
y = df["label"]       # Target labels (fake/real)
Reference: fake_news_ia.py:78-79
2

Fit and Transform

Train the vectorizer on the corpus and transform text to numerical features:
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X_tfidf = vectorizer.fit_transform(X)
This creates a sparse matrix where:
  • Rows = documents (news articles)
  • Columns = features (terms/n-grams)
  • Values = TF-IDF scores
Reference: fake_news_ia.py:82-83
3

Verify Feature Count

After vectorization, the feature matrix has:
Shape: (number_of_articles, 5000)
Example output (the script logs in Spanish; "Tamaño del set de entrenamiento" means "Training set size"):
Tamaño del set de entrenamiento: 35200 (5000 features)
Reference: fake_news_ia.py:89

Why TF-IDF Over Other Approaches?

TF-IDF vs. Bag of Words (BoW)

TF-IDF is superior to simple Bag of Words because it down-weights common terms (like “the”, “is”, “and”) that appear in many documents, while emphasizing distinctive terms that are more informative for classification.
Feature                   Bag of Words    TF-IDF
Common words              High weight     Low weight (penalized by IDF)
Rare distinctive words    Equal weight    High weight (boosted by IDF)
Accuracy for this task    ~92%            98.5%

TF-IDF vs. Word Embeddings (Word2Vec, GloVe)

  • Speed: TF-IDF is much faster to train and predict
  • Interpretability: TF-IDF weights are directly interpretable
  • Resource requirements: No need for pre-trained embeddings or large models
  • Performance: For this binary classification task, TF-IDF achieves 98.5% accuracy, which is comparable to more complex approaches

Feature Matrix Characteristics

Sparse Matrix Format

The TF-IDF output is a sparse matrix (CSR format by default):
print(type(X_tfidf))  # a scipy.sparse CSR matrix (exact class name varies by SciPy version)
print(X_tfidf.shape)  # (44000, 5000) - depends on dataset size
Sparse matrices only store non-zero values, making them memory-efficient. Most documents only use a small subset of the 5000 features, so sparse storage saves significant memory.
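The sparsity can be measured directly via the matrix's nnz attribute (count of stored non-zeros). A minimal sketch on a toy corpus, not the project dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import scipy.sparse as sp

# Toy corpus (illustrative only)
corpus = [
    "breaking news about the election",
    "officials say the results are final",
    "you won't believe this story",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)

# Only non-zero entries are stored; each document uses
# a small slice of the full vocabulary
assert sp.issparse(X)
total_cells = X.shape[0] * X.shape[1]
print(f"{X.nnz} non-zeros stored out of {total_cells} cells "
      f"({X.nnz / total_cells:.0%} dense)")
```

On the real 44000 × 5000 matrix the density is far lower than in this toy example, which is where the memory savings come from.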

Feature Examples

With ngram_range=(1, 2), the vectorizer learns features like:
Unigrams (1-word):
  • “trump”, “election”, “government”, “president”, “breaking”
Bigrams (2-word):
  • “fake news”, “white house”, “according officials”, “sources say”

Saving the Vectorizer

Critical: The vectorizer must be saved and reused for predictions on new data to ensure the same feature space:
import joblib
joblib.dump(vectorizer, 'vectorizer_tfidf.pkl')
Reference: fake_news_ia.py:101
When making predictions on new articles, always use the saved vectorizer:
vectorizer = joblib.load('vectorizer_tfidf.pkl')
new_text_vec = vectorizer.transform([new_article])  # Use transform, NOT fit_transform
Reference: fake_news_ia.py:128
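The save/load round trip can be verified end to end: a reloaded vectorizer must map new text to exactly the same feature vector as the original. A minimal sketch using a toy corpus and a temporary file (not the project's actual data or path):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit on a toy corpus (illustrative only, not the project dataset)
corpus = ["breaking news today", "officials say the report is real"]
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
vectorizer.fit(corpus)

# Persist and reload, as the project does with vectorizer_tfidf.pkl
path = os.path.join(tempfile.mkdtemp(), "vectorizer_tfidf.pkl")
joblib.dump(vectorizer, path)
loaded = joblib.load(path)

# transform (not fit_transform) maps new text into the SAME feature space
new_article = "breaking report from officials"
original_vec = vectorizer.transform([new_article]).toarray()
reloaded_vec = loaded.transform([new_article]).toarray()
assert np.allclose(original_vec, reloaded_vec)  # identical feature vectors
```

Calling fit_transform on new data instead would rebuild the vocabulary from scratch and produce features the trained model cannot interpret.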

Performance Impact

The TF-IDF configuration contributes significantly to the model’s performance:
  • 5000 features: Optimal balance between accuracy and speed
  • Bigrams (1,2): Captures phrase patterns, improving accuracy by ~3-4% over unigrams alone
  • Sparse matrix: Enables efficient processing of large datasets

Data Splitting

After vectorization, the data is split into training and test sets:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, y, test_size=0.2, random_state=42
)
Reference: fake_news_ia.py:86-88
test_size
float
default:"0.2"
20% of data reserved for testing, 80% for training
random_state
int
default:"42"
Ensures reproducible splits across runs
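Both parameters are easy to check on synthetic data (the arrays below are placeholders, not the project's TF-IDF matrix): test_size=0.2 reserves one fifth of the rows, and a fixed random_state reproduces the same split on every run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and labels
X = np.arange(100).reshape(50, 2)
y = np.array([0, 1] * 25)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 40 10

# Re-splitting with the same random_state yields the identical test set
X_train2, X_test2, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
assert (X_test == X_test2).all()
```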

Next Steps

After vectorization, proceed to Model Configuration to learn about the logistic regression setup and training process.

Build docs developers (and LLMs) love