Overview
Text vectorization transforms cleaned text into numerical features that the machine learning model can process. This project uses TF-IDF (Term Frequency-Inverse Document Frequency) vectorization with optimized parameters for high performance.TF-IDF Vectorizer Configuration
The model uses scikit-learn’sTfidfVectorizer with specific parameters chosen for optimal accuracy:
fake_news_ia.py:82-83
Configuration Parameters
Maximum number of features (terms) to extract from the corpus. This limits the vocabulary size to the 5000 most important terms based on term frequency across documents.Why 5000? This value balances:
- Performance: Smaller feature space = faster training
- Accuracy: Captures the most discriminative terms
- Memory: Prevents excessive memory usage with sparse matrices
Range of n-gram sizes to extract.
(1, 2) means the vectorizer will extract both:- Unigrams (single words): e.g., “president”, “fake”, “news”
- Bigrams (two-word phrases): e.g., “breaking news”, “donald trump”, “fake news”
- “you won’t believe” (common in fake news)
- “according to” (common in real news)
- “sources say” (context-dependent)
How TF-IDF Works
Term Frequency (TF)
Measures how often a term appears in a document:Higher TF means the term is more important to that specific document.
Inverse Document Frequency (IDF)
Measures how rare a term is across all documents:Higher IDF means the term is more distinctive and informative.
Vectorization Process
Prepare Clean Text
Use the preprocessed text column that combines title and body:Reference:
fake_news_ia.py:78-79Fit and Transform
Train the vectorizer on the corpus and transform text to numerical features:This creates a sparse matrix where:
- Rows = documents (news articles)
- Columns = features (terms/n-grams)
- Values = TF-IDF scores
fake_news_ia.py:82-83Why TF-IDF Over Other Approaches?
TF-IDF vs. Bag of Words (BoW)
TF-IDF is superior to simple Bag of Words because it down-weights common terms (like “the”, “is”, “and”) that appear in many documents, while emphasizing distinctive terms that are more informative for classification.
| Feature | Bag of Words | TF-IDF |
|---|---|---|
| Common words | High weight | Low weight (penalized by IDF) |
| Rare distinctive words | Equal weight | High weight (boosted by IDF) |
| Accuracy for this task | ~92% | 98.5% |
TF-IDF vs. Word Embeddings (Word2Vec, GloVe)
- Speed: TF-IDF is much faster to train and predict
- Interpretability: TF-IDF weights are directly interpretable
- Resource requirements: No need for pre-trained embeddings or large models
- Performance: For this binary classification task, TF-IDF achieves 98.5% accuracy, which is comparable to more complex approaches
Feature Matrix Characteristics
Sparse Matrix Format
The TF-IDF output is a sparse matrix (CSR format by default):Sparse matrices only store non-zero values, making them memory-efficient. Most documents only use a small subset of the 5000 features, so sparse storage saves significant memory.
Feature Examples
Withngram_range=(1, 2), the vectorizer learns features like:
Unigrams (1-word):
- “trump”, “election”, “government”, “president”, “breaking”
- “fake news”, “white house”, “according officials”, “sources say”
Saving the Vectorizer
Performance Impact
The TF-IDF configuration contributes significantly to the model’s performance:- 5000 features: Optimal balance between accuracy and speed
- Bigrams (1,2): Captures phrase patterns, improving accuracy by ~3-4% over unigrams alone
- Sparse matrix: Enables efficient processing of large datasets
Data Splitting
After vectorization, the data is split into training and test sets:fake_news_ia.py:86-88
20% of data reserved for testing, 80% for training
Ensures reproducible splits across runs