
Overview

The TfidfVectorizer converts cleaned text into numerical TF-IDF (Term Frequency-Inverse Document Frequency) features that the Logistic Regression model can process. The configuration uses unigram and bigram features and a capped vocabulary to balance expressiveness with computational efficiency.

Vectorizer Initialization

fake_news_ia.py:82
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))

Hyperparameters

max_features
int
default:"5000"
Maximum number of features (vocabulary size) to extract from the corpus.
What it does: After tokenizing the training data, only the 5000 terms with the highest frequency across the corpus are kept as the vocabulary; all other n-grams are discarded. This limits the feature space to the most common, and typically most informative, terms.
Why 5000?
  • Reduces dimensionality: Without this limit, the vocabulary could exceed 100,000+ terms, making the model slower and prone to overfitting
  • Keeps signal: 5000 features capture the most discriminative vocabulary while filtering out rare, noisy terms
  • Optimal performance: Empirically chosen to balance model accuracy (98.5%) with training/prediction speed
  • Memory efficient: Sparse matrix remains manageable even with ~40,000 training samples
Trade-offs:
  • Too low (e.g., 1000): May miss important discriminative features
  • Too high (e.g., 20,000): Slower training, risk of overfitting on rare terms
ngram_range
tuple
default:"(1, 2)"
Range of n-gram sizes to extract. (1, 2) means both unigrams (single words) and bigrams (word pairs) are generated.
What it does:
  • Unigrams (1): Individual words like "president", "announced", "fake"
  • Bigrams (2): Word pairs like "federal reserve", "climate change", "breaking news"
Why (1, 2)?
  • Captures context: Bigrams preserve local word order and multi-word phrases that are strong indicators of real vs. fake news
  • Real news patterns: Phrases like "federal reserve", "white house", "according to" are common in legitimate journalism
  • Fake news patterns: Sensational phrases like "you won't believe", "secret meeting", "doctors hate" help identify fake content
  • Balanced approach: Unigrams provide broad coverage while bigrams add contextual nuance
Examples:
  • Original text: "The Federal Reserve announced interest rate changes"
  • Unigrams extracted: federal, reserve, announced, interest, rate, changes
  • Bigrams extracted: federal reserve, reserve announced, announced interest, interest rate, rate changes
Trade-offs:
  • (1, 1) (unigrams only): Faster but loses contextual information
  • (1, 3) (up to trigrams): More context, but a much larger vocabulary, slower training, and a higher risk of sparsity
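The extraction above can be reproduced with the vectorizer's own analyzer. A sketch using build_analyzer(), which applies the same tokenization and n-gram logic the vectorizer uses internally:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
analyze = vectorizer.build_analyzer()  # tokenizer + n-gram generator

ngrams = analyze("The Federal Reserve announced interest rate changes")
print(ngrams)
# Unigrams first, then bigrams, all lowercased.
# Note: "the" survives here because the vectorizer itself does no stopword
# removal (stop_words=None); in this project stopwords are stripped earlier,
# by limpiar_texto, before the text reaches the vectorizer.
```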

Default Parameters (Not Explicitly Set)

Important default values that affect vectorizer behavior:
lowercase
bool
default:"True"
Convert all text to lowercase before tokenization. This is redundant here since limpiar_texto already lowercases text, but ensures consistency.
stop_words
string or list
default:"None"
Stopword removal strategy. Set to None because stopwords are already filtered in the limpiar_texto function.
max_df
float
default:"1.0"
Ignore terms that appear in more than this proportion of documents. The default (1.0) means no upper limit.
min_df
int
default:"1"
Ignore terms that appear in fewer than this number of documents. The default (1) includes even rare terms, relying on max_features for filtering.
norm
string
default:"'l2'"
Normalization applied to TF-IDF vectors. L2 normalization ensures all document vectors have unit length, making cosine similarity meaningful.
use_idf
bool
default:"True"
Enable inverse-document-frequency weighting. This downweights terms that appear frequently across many documents.
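The norm='l2' default can be checked directly: every non-empty document row comes out with unit Euclidean norm. A sketch on a toy corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "federal reserve announces rate decision",
    "secret meeting doctors hate this trick",
]
X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(docs)

# With norm='l2' (the default), each row vector has length 1,
# which is what makes cosine similarity between documents meaningful
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
print(row_norms)  # → approximately [1. 1.]
```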

Training the Vectorizer

fake_news_ia.py:83
X = df["clean_text"]
X_tfidf = vectorizer.fit_transform(X)
Process:
  1. fit(): Learns vocabulary (top 5000 features) and IDF weights from the training corpus
  2. transform(): Converts each document into a sparse TF-IDF vector
  3. Output: Sparse matrix of shape (n_documents, 5000)
Only fit on training data: fit_transform() should only be called during training. For prediction, use transform() to apply the learned vocabulary.
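As a sanity check, fit_transform() is equivalent to calling fit() followed by transform() on the same data. A sketch with a toy corpus (illustrative only):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "president announced new policy",
    "breaking news secret meeting exposed",
    "federal reserve raises interest rate",
]

v1 = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
A = v1.fit_transform(docs)        # learn vocabulary + IDF weights, then transform

v2 = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
B = v2.fit(docs).transform(docs)  # same two steps, spelled out

print(A.shape)  # (3, n_features) sparse matrix; both paths agree
```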

Using the Vectorizer for Prediction

app.py:63
# Transform new text using learned vocabulary (DO NOT fit again)
noticia_vec = vectorizer.transform([noticia_limpia])
Key Difference:
  • Training: vectorizer.fit_transform(texts) - Learns vocabulary AND transforms
  • Prediction: vectorizer.transform([text]) - Only transforms using learned vocabulary
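One consequence of transform-only prediction: terms absent from the learned vocabulary are silently ignored, so a text with no known terms maps to an all-zero vector. A sketch with a toy vocabulary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
vectorizer.fit([
    "federal reserve announced rate changes",
    "president held press conference",
])

known = vectorizer.transform(["federal reserve press conference"])
unknown = vectorizer.transform(["xylophone zeppelin"])  # no vocabulary overlap

print(known.nnz)    # several non-zero TF-IDF entries
print(unknown.nnz)  # → 0: every term was out-of-vocabulary
```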

Understanding TF-IDF Scores

Term Frequency (TF)

How often a term appears in a document, normalized by document length:
TF(term, doc) = (count of term in doc) / (total terms in doc)

Inverse Document Frequency (IDF)

How rare a term is across all documents:
IDF(term) = log((total documents) / (documents containing term))

TF-IDF Score

Combination that highlights important terms:
TF-IDF(term, doc) = TF(term, doc) × IDF(term)
Result: Common words like “the” (appears in every document) get low scores, while distinctive terms get high scores.
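A worked instance of the formulas above. Note this uses the textbook definition shown here; scikit-learn's actual default applies a smoothed IDF, log((1 + N) / (1 + df)) + 1, plus L2 normalization, so its numbers will differ:

```python
import math

# Toy setup: corpus of 3 documents; "fake" appears in 1 of them
n_docs = 3
docs_containing_term = 1

# One document has 5 terms total, with "fake" appearing once
tf = 1 / 5                                      # TF = 0.2
idf = math.log(n_docs / docs_containing_term)   # IDF = log(3) ≈ 1.0986
tfidf = tf * idf
print(f"{tfidf:.4f}")  # → 0.2197
```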

Feature Matrix Structure

# After vectorization
print(X_tfidf.shape)  # (44898, 5000)
# 44,898 news articles
# 5,000 features (unigrams + bigrams)

# Sparse matrix (most values are 0)
print(f"Sparsity: {(1.0 - X_tfidf.nnz / (X_tfidf.shape[0] * X_tfidf.shape[1])):.2%}")
# Typical sparsity: ~99% (each article uses only ~1% of vocabulary)

Accessing Learned Vocabulary

# Get feature names
features = vectorizer.get_feature_names_out()
print(f"Total features: {len(features)}")
print(f"First 10 features: {features[:10]}")

# Example output:
# ['according', 'according officials', 'announced', 'breaking', 'breaking news', ...]

# Get IDF scores for each feature
idf_scores = vectorizer.idf_
for feature, idf in zip(features[:10], idf_scores[:10]):
    print(f"{feature}: {idf:.4f}")

Why TF-IDF for Fake News Detection?

Advantages

  1. Captures semantic importance: Not just word frequency, but how distinctive words are
  2. Reduces noise: Common words (“the”, “and”, “is”) automatically get low weights
  3. Sparse and efficient: Most articles use only a small subset of vocabulary
  4. Works well with Logistic Regression: Linear models perform excellently on TF-IDF features
  5. Interpretable: Feature weights correspond to actual words/phrases

Comparison with Alternatives

Vectorization Method | Pros | Cons
TF-IDF | Fast, interpretable, proven | Ignores word order (beyond bigrams)
Bag-of-Words | Simple | Doesn't account for term importance
Word2Vec | Captures semantics | Requires pre-training, loses interpretability
BERT Embeddings | State-of-the-art semantics | Very slow, requires GPU, not interpretable

Optimal Configuration Example

from sklearn.feature_extraction.text import TfidfVectorizer

# Production-ready configuration for fake news detection
vectorizer = TfidfVectorizer(
    max_features=5000,      # Top 5000 most important terms
    ngram_range=(1, 2),     # Unigrams + bigrams
    lowercase=True,         # Normalize case (redundant with limpiar_texto)
    max_df=0.95,            # Ignore terms in >95% of docs (optional improvement)
    min_df=2,               # Ignore terms in <2 docs (optional improvement)
    norm='l2',              # L2 normalization
    use_idf=True            # Enable IDF weighting
)

# Fit on training data only
vectorizer.fit(X_train)

# Transform both train and test
X_train_tfidf = vectorizer.transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

Persistence

fake_news_ia.py:101
import joblib

# Save fitted vectorizer
joblib.dump(vectorizer, 'vectorizer_tfidf.pkl')

# Load for prediction
vectorizer = joblib.load('vectorizer_tfidf.pkl')
Critical for Production: The saved vectorizer must be loaded during prediction to ensure the same vocabulary and IDF weights are used. Creating a new vectorizer will produce incompatible features.

Official Documentation

For complete API reference and advanced parameters: scikit-learn TfidfVectorizer Documentation
