Overview
The TfidfVectorizer converts cleaned text into numerical TF-IDF (Term Frequency-Inverse Document Frequency) features that the Logistic Regression model can process. The configuration uses unigram and bigram features and a capped vocabulary to balance expressiveness with computational efficiency.
Vectorizer Initialization
fake_news_ia.py:82
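A minimal sketch of the initialization described in this section (the exact code at fake_news_ia.py:82 may differ):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Cap the vocabulary at the 5000 most frequent terms and extract
# both unigrams and bigrams (sketch of the documented configuration).
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
```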
Hyperparameters
Maximum number of features (vocabulary size) to extract from the corpus.

What it does: After tokenizing the training data, only the top 5000 n-grams ranked by term frequency across the corpus are kept in the vocabulary. This limits the feature space to the most common, and typically most informative, terms.

Why 5000?
- Reduces dimensionality: Without this limit, the vocabulary could exceed 100,000+ terms, making the model slower and prone to overfitting
- Keeps signal: 5000 features capture the most discriminative vocabulary while filtering out rare, noisy terms
- Optimal performance: Empirically chosen to balance model accuracy (98.5%) with training/prediction speed
- Memory efficient: Sparse matrix remains manageable even with ~40,000 training samples
Trade-offs:
- Too low (e.g., 1000): may miss important discriminative features
- Too high (e.g., 20,000): slower training and a risk of overfitting on rare terms
Range of n-gram sizes to extract. (1, 2) means both unigrams (single words) and bigrams (word pairs).

What it does:
- Unigrams (1): individual words like "president", "announced", "fake"
- Bigrams (2): word pairs like "federal reserve", "climate change", "breaking news"
Why bigrams help:
- Captures context: bigrams preserve local word order and multi-word phrases that are strong indicators of real vs. fake news
- Real news patterns: phrases like "federal reserve", "white house", "according to" are common in legitimate journalism
- Fake news patterns: sensational phrases like "you won't believe", "secret meeting", "doctors hate" help identify fake content
- Balanced approach: unigrams provide broad coverage while bigrams add contextual nuance
Example:
- Original text: "The Federal Reserve announced interest rate changes"
- Unigrams extracted: federal, reserve, announced, interest, rate, changes
- Bigrams extracted: federal reserve, reserve announced, announced interest, interest rate, rate changes
Alternatives:
- (1, 1) (unigrams only): faster, but loses contextual information
- (1, 3) (up to trigrams): more context, but a much larger vocabulary, slower training, and a risk of sparsity
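The extraction shown in the example above can be reproduced with the vectorizer's own analyzer (a sketch; note that scikit-learn's default tokenizer keeps "the", which in the real pipeline is removed earlier by limpiar_texto):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# build_analyzer() returns the callable the vectorizer uses internally
# to turn a raw string into lowercased unigrams and bigrams.
analyzer = TfidfVectorizer(ngram_range=(1, 2)).build_analyzer()
ngrams = analyzer("The Federal Reserve announced interest rate changes")
print(ngrams)  # unigrams like 'federal', then bigrams like 'federal reserve'
```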
Default Parameters (Not Explicitly Set)
Important default values that affect vectorizer behavior:
- lowercase=True: convert all text to lowercase before tokenization. This is redundant here since limpiar_texto already lowercases text, but it ensures consistency.
- stop_words=None: stopword removal strategy. Left at None because stopwords are already filtered in the limpiar_texto function.
- max_df=1.0: ignore terms that appear in more than this proportion of documents. The default (1.0) means no upper limit.
- min_df=1: ignore terms that appear in fewer than this number of documents. The default (1) includes even rare terms, relying on max_features for filtering.
- norm='l2': normalization applied to TF-IDF vectors. L2 normalization ensures all document vectors have unit length, making cosine similarity meaningful.
- use_idf=True: enable inverse-document-frequency weighting. This downweights terms that appear frequently across many documents.
Training the Vectorizer
fake_news_ia.py:83
- fit(): learns the vocabulary (top 5000 features) and IDF weights from the training corpus
- transform(): converts each document into a sparse TF-IDF vector
- Output: sparse matrix of shape (n_documents, 5000)
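A sketch of the fit/transform step on a toy two-document corpus (variable names are illustrative, not taken from the project file):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "federal reserve announced interest rate changes",
    "secret meeting reveals shocking truth doctors hate",
]
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))

# fit_transform() learns the vocabulary and IDF weights, then converts
# each document into a sparse TF-IDF row vector.
X = vectorizer.fit_transform(texts)
print(X.shape)  # (n_documents, n_features), n_features capped at 5000
print(X.nnz)    # number of nonzero entries: the matrix is sparse
```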
Using the Vectorizer for Prediction
app.py:63
Key Difference:
- Training: vectorizer.fit_transform(texts) learns the vocabulary AND transforms
- Prediction: vectorizer.transform([text]) only transforms, using the vocabulary learned during training
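The prediction-time pattern can be sketched as follows (toy corpus for illustration; the actual call lives at app.py:63):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit once at training time:
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
vectorizer.fit(["federal reserve announced rate changes",
                "secret meeting doctors hate"])

# At prediction time: transform() only, never fit(), so the vocabulary
# and IDF weights learned during training are preserved.
features = vectorizer.transform(["federal reserve rate decision"])
print(features.shape)  # (1, n_features): one row per input document
```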
Understanding TF-IDF Scores
Term Frequency (TF)
How often a term appears in a document, normalized by document length.

Inverse Document Frequency (IDF)

How rare a term is across all documents.

TF-IDF Score

A combination that highlights terms frequent in a document but rare across the corpus.

Feature Matrix Structure
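In the smoothed form scikit-learn applies by default (smooth_idf=True), the three quantities above can be written as:

```latex
\mathrm{tf}(t, d) = \text{count of term } t \text{ in document } d
\qquad
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1
\qquad
\text{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

where n is the number of documents and df(t) is the number of documents containing term t; each document's row vector is then L2-normalized. The resulting feature matrix is a scipy.sparse matrix of shape (n_documents, 5000), in which most entries are zero.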
Accessing Learned Vocabulary
Why TF-IDF for Fake News Detection?
Advantages
- Captures semantic importance: Not just word frequency, but how distinctive words are
- Reduces noise: Common words (“the”, “and”, “is”) automatically get low weights
- Sparse and efficient: Most articles use only a small subset of vocabulary
- Works well with Logistic Regression: Linear models perform excellently on TF-IDF features
- Interpretable: Feature weights correspond to actual words/phrases
Comparison with Alternatives
| Vectorization Method | Pros | Cons |
|---|---|---|
| TF-IDF | Fast, interpretable, proven | Ignores word order (beyond bi-grams) |
| Bag-of-Words | Simple | Doesn’t account for term importance |
| Word2Vec | Captures semantics | Requires pre-training, loses interpretability |
| BERT Embeddings | State-of-the-art semantics | Very slow, requires GPU, not interpretable |
Optimal Configuration Example
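A sketch of the full configuration with the defaults written out explicitly (the project file only sets max_features and ngram_range; the remaining values are scikit-learn defaults, shown here for completeness):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    max_features=5000,   # keep the 5000 most frequent terms
    ngram_range=(1, 2),  # unigrams and bigrams
    lowercase=True,      # default; redundant after limpiar_texto
    stop_words=None,     # default; stopwords removed upstream
    max_df=1.0,          # default; no upper document-frequency cutoff
    min_df=1,            # default; keep rare terms, rely on max_features
    norm="l2",           # default; unit-length document vectors
    use_idf=True,        # default; apply IDF weighting
)
```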
Persistence
fake_news_ia.py:101
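Persistence is typically handled with joblib (a sketch; the filename here is hypothetical and the actual code at fake_news_ia.py:101 may differ):

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
vectorizer.fit(["federal reserve announced interest rate changes"])

# Save the fitted vectorizer so app.py can reuse the exact same
# vocabulary and IDF weights at prediction time.
joblib.dump(vectorizer, "vectorizer.pkl")  # hypothetical filename

loaded = joblib.load("vectorizer.pkl")
print(loaded.transform(["federal reserve"]).shape)
```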
Official Documentation
For the complete API reference and advanced parameters: scikit-learn TfidfVectorizer Documentation

See Also
- limpiar_texto Function - Text preprocessing before vectorization
- LogisticRegression Configuration - Model that uses TF-IDF features
- Model Training Workflow - Complete pipeline
- Prediction Function - Using the vectorizer for inference