
Overview

The TfidfVectorizer converts cleaned text into numerical TF-IDF (Term Frequency-Inverse Document Frequency) features that the Logistic Regression model can process. The configuration uses unigram and bigram features and a capped vocabulary to balance expressiveness with computational efficiency.

Vectorizer Initialization

fake_news_ia.py:82
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))

Hyperparameters

max_features
int
default:"5000"
Maximum number of features (vocabulary size) to extract from the corpus.
What it does: After tokenizing the training data, only the 5000 terms with the highest frequency across the corpus are kept as the vocabulary; all other n-grams are discarded. This limits the feature space to the most common, and typically most informative, terms.
Why 5000?
  • Reduces dimensionality: Without this limit, the vocabulary could exceed 100,000+ terms, making the model slower and prone to overfitting
  • Keeps signal: 5000 features capture the most discriminative vocabulary while filtering out rare, noisy terms
  • Optimal performance: Empirically chosen to balance model accuracy (98.5%) with training/prediction speed
  • Memory efficient: Sparse matrix remains manageable even with ~40,000 training samples
Trade-offs:
  • Too low (e.g., 1000): May miss important discriminative features
  • Too high (e.g., 20,000): Slower training, risk of overfitting on rare terms
ngram_range
tuple
default:"(1, 2)"
Range of n-gram sizes to extract. (1, 2) means both unigrams (single words) and bigrams (word pairs) are generated.
What it does:
  • Unigrams (1): Individual words like "president", "announced", "fake"
  • Bigrams (2): Word pairs like "federal reserve", "climate change", "breaking news"
Why (1, 2)?
  • Captures context: Bigrams preserve local word order and multi-word phrases that are strong indicators of real vs. fake news
  • Real news patterns: Phrases like "federal reserve", "white house", "according to" are common in legitimate journalism
  • Fake news patterns: Sensational phrases like "you won't believe", "secret meeting", "doctors hate" help identify fake content
  • Balanced approach: Unigrams provide broad coverage while bigrams add contextual nuance
Examples:
  • Original text: "The Federal Reserve announced interest rate changes"
  • Unigrams extracted: federal, reserve, announced, interest, rate, changes
  • Bigrams extracted: federal reserve, reserve announced, announced interest, interest rate, rate changes
Trade-offs:
  • (1, 1) (unigrams only): Faster but loses contextual information
  • (1, 3) (up to trigrams): More context, but a much larger vocabulary, slower training, and a higher risk of sparsity
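The extraction above can be reproduced with the vectorizer's own analyzer. A sketch using build_analyzer(), which applies the same tokenization and n-gram logic the vectorizer uses internally:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
analyze = vectorizer.build_analyzer()  # tokenizer + n-gram generator

ngrams = analyze("The Federal Reserve announced interest rate changes")
print(ngrams)
# Unigrams first, then bigrams, all lowercased.
# Note: "the" survives here because the vectorizer itself does no stopword
# removal (stop_words=None); in this project stopwords are stripped earlier,
# by limpiar_texto, before the text reaches the vectorizer.
```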

Default Parameters (Not Explicitly Set)

Important default values that affect vectorizer behavior:
lowercase
bool
default:"True"
Convert all text to lowercase before tokenization. This is redundant here since limpiar_texto already lowercases text, but ensures consistency.
stop_words
string or list
default:"None"
Stopword removal strategy. Set to None because stopwords are already filtered in the limpiar_texto function.
max_df
float
default:"1.0"
Ignore terms that appear in more than this proportion of documents. The default (1.0) means no upper limit.
min_df
int
default:"1"
Ignore terms that appear in fewer than this number of documents. The default (1) includes even rare terms, relying on max_features for filtering.
norm
string
default:"'l2'"
Normalization applied to TF-IDF vectors. L2 normalization ensures all document vectors have unit length, making cosine similarity meaningful.
use_idf
bool
default:"True"
Enable inverse-document-frequency weighting. This downweights terms that appear frequently across many documents.
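The norm='l2' default can be checked directly: every non-empty document row comes out with unit Euclidean norm. A sketch on a toy corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "federal reserve announces rate decision",
    "secret meeting doctors hate this trick",
]
X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(docs)

# With norm='l2' (the default), each row vector has length 1,
# which is what makes cosine similarity between documents meaningful
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
print(row_norms)  # → approximately [1. 1.]
```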

Training the Vectorizer

fake_news_ia.py:83
X = df["clean_text"]
X_tfidf = vectorizer.fit_transform(X)
Process:
  1. fit(): Learns vocabulary (top 5000 features) and IDF weights from the training corpus
  2. transform(): Converts each document into a sparse TF-IDF vector
  3. Output: Sparse matrix of shape (n_documents, 5000)
Only fit on training data: fit_transform() should only be called during training. For prediction, use transform() to apply the learned vocabulary.
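As a sanity check, fit_transform() is equivalent to calling fit() followed by transform() on the same data. A sketch with a toy corpus (illustrative only):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "president announced new policy",
    "breaking news secret meeting exposed",
    "federal reserve raises interest rate",
]

v1 = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
A = v1.fit_transform(docs)        # learn vocabulary + IDF weights, then transform

v2 = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
B = v2.fit(docs).transform(docs)  # same two steps, spelled out

print(A.shape)  # (3, n_features) sparse matrix; both paths agree
```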

Using the Vectorizer for Prediction

app.py:63
# Transform new text using learned vocabulary (DO NOT fit again)
noticia_vec = vectorizer.transform([noticia_limpia])
Key Difference:
  • Training: vectorizer.fit_transform(texts) - Learns vocabulary AND transforms
  • Prediction: vectorizer.transform([text]) - Only transforms using learned vocabulary
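One consequence of transform-only prediction: terms absent from the learned vocabulary are silently ignored, so a text with no known terms maps to an all-zero vector. A sketch with a toy vocabulary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
vectorizer.fit([
    "federal reserve announced rate changes",
    "president held press conference",
])

known = vectorizer.transform(["federal reserve press conference"])
unknown = vectorizer.transform(["xylophone zeppelin"])  # no vocabulary overlap

print(known.nnz)    # several non-zero TF-IDF entries
print(unknown.nnz)  # → 0: every term was out-of-vocabulary
```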

Understanding TF-IDF Scores

Term Frequency (TF)

How often a term appears in a document, normalized by document length:
TF(term, doc) = (count of term in doc) / (total terms in doc)

Inverse Document Frequency (IDF)

How rare a term is across all documents:
IDF(term) = log((total documents) / (documents containing term))

TF-IDF Score

Combination that highlights important terms:
TF-IDF(term, doc) = TF(term, doc) × IDF(term)
Result: Common words like “the” (appears in every document) get low scores, while distinctive terms get high scores.
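A worked instance of the formulas above. Note this uses the textbook definition shown here; scikit-learn's actual default applies a smoothed IDF, log((1 + N) / (1 + df)) + 1, plus L2 normalization, so its numbers will differ:

```python
import math

# Toy setup: corpus of 3 documents; "fake" appears in 1 of them
n_docs = 3
docs_containing_term = 1

# One document has 5 terms total, with "fake" appearing once
tf = 1 / 5                                      # TF = 0.2
idf = math.log(n_docs / docs_containing_term)   # IDF = log(3) ≈ 1.0986
tfidf = tf * idf
print(f"{tfidf:.4f}")  # → 0.2197
```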

Feature Matrix Structure

# After vectorization
print(X_tfidf.shape)  # (44898, 5000)
# 44,898 news articles
# 5,000 features (unigrams + bigrams)

# Sparse matrix (most values are 0)
print(f"Sparsity: {(1.0 - X_tfidf.nnz / (X_tfidf.shape[0] * X_tfidf.shape[1])):.2%}")
# Typical sparsity: ~99% (each article uses only ~1% of vocabulary)

Accessing Learned Vocabulary

# Get feature names
features = vectorizer.get_feature_names_out()
print(f"Total features: {len(features)}")
print(f"First 10 features: {features[:10]}")

# Example output:
# ['according', 'according officials', 'announced', 'breaking', 'breaking news', ...]

# Get IDF scores for each feature
idf_scores = vectorizer.idf_
for feature, idf in zip(features[:10], idf_scores[:10]):
    print(f"{feature}: {idf:.4f}")

Why TF-IDF for Fake News Detection?

Advantages

  1. Captures semantic importance: Not just word frequency, but how distinctive words are
  2. Reduces noise: Common words (“the”, “and”, “is”) automatically get low weights
  3. Sparse and efficient: Most articles use only a small subset of vocabulary
  4. Works well with Logistic Regression: Linear models perform excellently on TF-IDF features
  5. Interpretable: Feature weights correspond to actual words/phrases

Comparison with Alternatives

Vectorization Method | Pros | Cons
TF-IDF | Fast, interpretable, proven | Ignores word order (beyond bigrams)
Bag-of-Words | Simple | Doesn't account for term importance
Word2Vec | Captures semantics | Requires pre-training, loses interpretability
BERT Embeddings | State-of-the-art semantics | Very slow, requires GPU, not interpretable

Optimal Configuration Example

from sklearn.feature_extraction.text import TfidfVectorizer

# Production-ready configuration for fake news detection
vectorizer = TfidfVectorizer(
    max_features=5000,      # Top 5000 most important terms
    ngram_range=(1, 2),     # Unigrams + bigrams
    lowercase=True,         # Normalize case (redundant with limpiar_texto)
    max_df=0.95,            # Ignore terms in >95% of docs (optional improvement)
    min_df=2,               # Ignore terms in <2 docs (optional improvement)
    norm='l2',              # L2 normalization
    use_idf=True            # Enable IDF weighting
)

# Fit on training data only
vectorizer.fit(X_train)

# Transform both train and test
X_train_tfidf = vectorizer.transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

Persistence

fake_news_ia.py:101
import joblib

# Save fitted vectorizer
joblib.dump(vectorizer, 'vectorizer_tfidf.pkl')

# Load for prediction
vectorizer = joblib.load('vectorizer_tfidf.pkl')
Critical for Production: The saved vectorizer must be loaded during prediction to ensure the same vocabulary and IDF weights are used. Creating a new vectorizer will produce incompatible features.

Official Documentation

For complete API reference and advanced parameters: scikit-learn TfidfVectorizer Documentation
