Overview

Text vectorization transforms cleaned text into numerical features that the machine learning model can process. This project uses TF-IDF (Term Frequency-Inverse Document Frequency) vectorization with optimized parameters for high performance.

TF-IDF Vectorizer Configuration

The model uses scikit-learn’s TfidfVectorizer with specific parameters chosen for optimal accuracy:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X_tfidf = vectorizer.fit_transform(X)
Reference: fake_news_ia.py:82-83

Configuration Parameters

max_features
int
default:"5000"
Maximum number of features (terms) to extract from the corpus. This limits the vocabulary size to the 5000 most frequent terms across the documents.
Why 5000? This value balances:
  • Performance: Smaller feature space = faster training
  • Accuracy: Captures the most discriminative terms
  • Memory: Prevents excessive memory usage with sparse matrices
ngram_range
tuple
default:"(1, 2)"
Range of n-gram sizes to extract. (1, 2) means the vectorizer will extract both:
  • Unigrams (single words): e.g., “president”, “fake”, “news”
  • Bigrams (two-word phrases): e.g., “breaking news”, “donald trump”, “fake news”
Why bigrams? Fake news often uses specific phrase patterns that single words can’t capture. For example:
  • “you won’t believe” (common in fake news)
  • “according to” (common in real news)
  • “sources say” (context-dependent)

How TF-IDF Works

1

Term Frequency (TF)

Measures how often a term appears in a document:
TF(term, document) = (Number of times term appears) / (Total terms in document)
Higher TF means the term is more important to that specific document.
2

Inverse Document Frequency (IDF)

Measures how rare a term is across all documents:
IDF(term) = log(Total documents / Documents containing term)
Higher IDF means the term is more distinctive and informative.
3

TF-IDF Score

Combines both metrics:
TF-IDF(term, document) = TF(term, document) × IDF(term)
This score is high when a term appears frequently in a document but rarely across the corpus, making it discriminative.
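The three formulas above can be computed by hand on a toy corpus. Note this is the textbook formulation shown in the steps; scikit-learn's TfidfVectorizer uses a smoothed IDF and L2-normalizes each row, so its exact numbers differ:

```python
import math

# Toy corpus of tokenized documents (illustrative only)
docs = [
    ["fake", "news", "spreads", "fast"],
    ["real", "news", "reports", "facts"],
    ["fake", "stories", "spread", "fast"],
]

def tf(term, doc):
    # TF = occurrences of term / total terms in the document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # IDF = log(total documents / documents containing the term)
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "news" appears in 2 of 3 documents, "spreads" in only 1,
# so "spreads" gets the higher (more discriminative) score
print(round(tf_idf("news", docs[0], docs), 3))
print(round(tf_idf("spreads", docs[0], docs), 3))
```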

Vectorization Process

1

Prepare Clean Text

Use the preprocessed text column that combines title and body:
X = df["clean_text"]  # Cleaned text from preprocessing
y = df["label"]       # Target labels (fake/real)
Reference: fake_news_ia.py:78-79
2

Fit and Transform

Train the vectorizer on the corpus and transform text to numerical features:
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X_tfidf = vectorizer.fit_transform(X)
This creates a sparse matrix where:
  • Rows = documents (news articles)
  • Columns = features (terms/n-grams)
  • Values = TF-IDF scores
Reference: fake_news_ia.py:82-83
3

Verify Feature Count

After vectorization, the feature matrix has:
Shape: (number_of_articles, 5000)
Example output (the script logs in Spanish; "Tamaño del set de entrenamiento" means "Training set size"):
Tamaño del set de entrenamiento: 35200 (5000 features)
Reference: fake_news_ia.py:89

Why TF-IDF Over Other Approaches?

TF-IDF vs. Bag of Words (BoW)

TF-IDF is superior to simple Bag of Words because it down-weights common terms (like “the”, “is”, “and”) that appear in many documents, while emphasizing distinctive terms that are more informative for classification.
Feature                   Bag of Words    TF-IDF
Common words              High weight     Low weight (penalized by IDF)
Rare distinctive words    Equal weight    High weight (boosted by IDF)
Accuracy for this task    ~92%            98.5%

TF-IDF vs. Word Embeddings (Word2Vec, GloVe)

  • Speed: TF-IDF is much faster to train and predict
  • Interpretability: TF-IDF weights are directly interpretable
  • Resource requirements: No need for pre-trained embeddings or large models
  • Performance: For this binary classification task, TF-IDF achieves 98.5% accuracy, which is comparable to more complex approaches

Feature Matrix Characteristics

Sparse Matrix Format

The TF-IDF output is a sparse matrix (CSR format by default):
print(type(X_tfidf))  # a scipy.sparse CSR matrix (exact class name varies by SciPy version)
print(X_tfidf.shape)  # (44000, 5000) - depends on dataset size
Sparse matrices only store non-zero values, making them memory-efficient. Most documents only use a small subset of the 5000 features, so sparse storage saves significant memory.
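The sparsity can be measured directly via the matrix's nnz attribute (count of stored non-zeros). A minimal sketch on a toy corpus, not the project dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import scipy.sparse as sp

# Toy corpus (illustrative only)
corpus = [
    "breaking news about the election",
    "officials say the results are final",
    "you won't believe this story",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)

# Only non-zero entries are stored; each document uses
# a small slice of the full vocabulary
assert sp.issparse(X)
total_cells = X.shape[0] * X.shape[1]
print(f"{X.nnz} non-zeros stored out of {total_cells} cells "
      f"({X.nnz / total_cells:.0%} dense)")
```

On the real 44000 × 5000 matrix the density is far lower than in this toy example, which is where the memory savings come from.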

Feature Examples

With ngram_range=(1, 2), the vectorizer learns features like:
Unigrams (1-word):
  • “trump”, “election”, “government”, “president”, “breaking”
Bigrams (2-word):
  • “fake news”, “white house”, “according officials”, “sources say”

Saving the Vectorizer

Critical: The vectorizer must be saved and reused for predictions on new data to ensure the same feature space:
import joblib
joblib.dump(vectorizer, 'vectorizer_tfidf.pkl')
Reference: fake_news_ia.py:101
When making predictions on new articles, always use the saved vectorizer:
vectorizer = joblib.load('vectorizer_tfidf.pkl')
new_text_vec = vectorizer.transform([new_article])  # Use transform, NOT fit_transform
Reference: fake_news_ia.py:128
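The save/load round trip can be verified end to end: a reloaded vectorizer must map new text to exactly the same feature vector as the original. A minimal sketch using a toy corpus and a temporary file (not the project's actual data or path):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit on a toy corpus (illustrative only, not the project dataset)
corpus = ["breaking news today", "officials say the report is real"]
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
vectorizer.fit(corpus)

# Persist and reload, as the project does with vectorizer_tfidf.pkl
path = os.path.join(tempfile.mkdtemp(), "vectorizer_tfidf.pkl")
joblib.dump(vectorizer, path)
loaded = joblib.load(path)

# transform (not fit_transform) maps new text into the SAME feature space
new_article = "breaking report from officials"
original_vec = vectorizer.transform([new_article]).toarray()
reloaded_vec = loaded.transform([new_article]).toarray()
assert np.allclose(original_vec, reloaded_vec)  # identical feature vectors
```

Calling fit_transform on new data instead would rebuild the vocabulary from scratch and produce features the trained model cannot interpret.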

Performance Impact

The TF-IDF configuration contributes significantly to the model’s performance:
  • 5000 features: Optimal balance between accuracy and speed
  • Bigrams (1,2): Captures phrase patterns, improving accuracy by ~3-4% over unigrams alone
  • Sparse matrix: Enables efficient processing of large datasets

Data Splitting

After vectorization, the data is split into training and test sets:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, y, test_size=0.2, random_state=42
)
Reference: fake_news_ia.py:86-88
test_size
float
default:"0.2"
20% of data reserved for testing, 80% for training
random_state
int
default:"42"
Ensures reproducible splits across runs
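Both parameters are easy to check on synthetic data (the arrays below are placeholders, not the project's TF-IDF matrix): test_size=0.2 reserves one fifth of the rows, and a fixed random_state reproduces the same split on every run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and labels
X = np.arange(100).reshape(50, 2)
y = np.array([0, 1] * 25)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 40 10

# Re-splitting with the same random_state yields the identical test set
X_train2, X_test2, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
assert (X_test == X_test2).all()
```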

Next Steps

After vectorization, proceed to Model Configuration to learn about the logistic regression setup and training process.

Build docs developers (and LLMs) love