Overview
The model training phase converts preprocessed text into numerical features, trains a Logistic Regression classifier, and evaluates its performance. This stage is where the system achieves its 98.5% accuracy on fake news detection.

Feature Extraction: TF-IDF Vectorization
What is TF-IDF?
TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic that reflects how important a word is to a document in a collection:
- TF (Term Frequency) - How often a word appears in a document
- IDF (Inverse Document Frequency) - How rare/common a word is across all documents
Why TF-IDF is perfect for fake news detection
TF-IDF identifies words that are distinctive to fake vs. real news:
- High TF-IDF - Words that appear frequently in one article but rarely elsewhere (distinctive)
- Low TF-IDF - Common words that appear everywhere (less informative)
For example:
- “president” might appear in both fake and real news (low discrimination)
- “shocking” or “revealed” might be more common in fake news (high discrimination)
- “according” or “officials” might be more common in real news (high discrimination)
Vectorization Configuration
The system uses an optimized TF-IDF configuration (see fake_news_ia.py):
| Parameter | Value | Purpose |
|---|---|---|
| max_features | 5000 | Limit to the 5,000 most important features (words/phrases) |
| ngram_range | (1, 2) | Capture both single words (unigrams) and two-word phrases (bi-grams) |
Why 5,000 Features?
- Performance - Smaller feature space means faster training and prediction
- Quality - The 5,000 most important features capture the essential patterns
- Overfitting Prevention - Limiting features prevents the model from memorizing noise
- Memory Efficiency - Manageable memory footprint for production deployment
Why N-grams (1, 2)?
Using both unigrams and bi-grams captures different linguistic patterns. Unigrams capture single distinctive words, while bi-grams capture context that single words miss. For example:
- “not good” has opposite meaning to “good”
- “fake news” as a phrase has different significance than “fake” alone
- “breaking news” is a common pattern in sensationalist headlines
Vectorization Output
After vectorization, the text data becomes a sparse matrix (see fake_news_ia.py):
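The sparse-matrix property can be inspected directly; this sketch uses toy documents rather than the project's dataset:

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X = vec.fit_transform(["fake news spreads fast", "real news is verified"])

# The output is a SciPy CSR sparse matrix: only nonzero TF-IDF values
# are stored, which keeps 5,000-dimensional vectors memory-efficient
print(sparse.issparse(X))  # True
print(X.shape)             # (2, vocabulary size)
```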
Model Selection: Logistic Regression
Why Logistic Regression?
The project uses Logistic Regression for several strategic reasons:
- Simplicity - Easy to implement, understand, and debug
- Speed - Trains in seconds on 44,000 samples, predicts instantly
- Interpretability - Feature weights can be inspected to understand decision-making
- Effectiveness - Excellent performance on text classification (98.5% accuracy)
Logistic Regression is a linear classifier that works exceptionally well for high-dimensional text data where classes are often linearly separable in TF-IDF space.
Training Configuration
The model is trained with specific hyperparameters (see fake_news_ia.py):
| Parameter | Value | Purpose |
|---|---|---|
| max_iter | 1000 | Maximum iterations for convergence (prevents early stopping) |
| solver | 'liblinear' | Optimization algorithm, effective for small-to-medium datasets |
| random_state | 42 | Ensures reproducible results across runs |
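These hyperparameters map directly onto scikit-learn's LogisticRegression constructor. The tiny fit and the label encoding (1 = fake, 0 = real) below are illustrative assumptions; the variable name modelo follows the object mentioned later on this page:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hyperparameters from the table above
modelo = LogisticRegression(max_iter=1000, solver="liblinear", random_state=42)

# Tiny illustrative fit (assumed label encoding: 1 = fake, 0 = real)
vec = TfidfVectorizer()
X = vec.fit_transform(["shocking secret revealed", "officials confirmed the report"])
modelo.fit(X, [1, 0])
print(modelo.classes_)  # [0 1]
```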
Why solver='liblinear'?
Understanding solver choice
Scikit-learn offers multiple solvers for Logistic Regression:
- lbfgs - Default, good for large datasets
- liblinear - Fast for small-medium datasets, supports L1/L2 regularization
- newton-cg - Good for large datasets
- sag/saga - Stochastic solvers for very large datasets
For this dataset, liblinear offers:
- Fast convergence
- Memory efficiency
- Proven reliability for text classification
Why max_iter=1000?
Logistic Regression uses iterative optimization. Setting max_iter=1000 ensures:
- The algorithm has enough iterations to converge
- No premature stopping warnings
- Stable, optimal model parameters
In practice, the model likely converges in fewer than 1000 iterations, but this value provides a safe margin.
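The claim that convergence happens well inside the ceiling can be checked via the fitted estimator's n_iter_ attribute; this is a toy-data sketch, not the project's training code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

X = TfidfVectorizer().fit_transform(["aa bb cc", "dd ee ff"])
clf = LogisticRegression(max_iter=1000, solver="liblinear").fit(X, [1, 0])

# n_iter_ reports how many iterations the solver actually needed,
# typically far below the 1000-iteration ceiling
print(clf.n_iter_)
```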
Model Persistence
After training, both the model and vectorizer are saved (see fake_news_ia.py):
Model Evaluation
Accuracy Score
The primary metric is classification accuracy (see fake_news_ia.py):
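Accuracy is simply the fraction of correct predictions; a hand-checkable sketch with made-up labels (1 = fake, 0 = real is an assumed encoding):

```python
from sklearn.metrics import accuracy_score

y_test = [1, 0, 1, 1, 0, 0, 1, 0]  # true labels (illustrative)
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]  # model predictions (one mistake)
print(accuracy_score(y_test, y_pred))  # 0.875 (7 of 8 correct)
```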
Classification Report
The detailed report provides per-class metrics.

Understanding precision, recall, and F1-score
Precision:
- Fake: 99% of articles predicted as “fake” are actually fake
- Real: 98% of articles predicted as “real” are actually real

Recall:
- Fake: 98% of actual fake articles are correctly identified
- Real: 99% of actual real articles are correctly identified

F1-score:
- Harmonic mean of precision and recall
- Both classes achieve ~0.98-0.99, indicating excellent balanced performance

Support:
- Number of actual examples in each class
- 4,696 fake and 4,104 real in the test set (relatively balanced)
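These per-class metrics come from scikit-learn's classification_report; a minimal sketch with toy labels (the 0 = real, 1 = fake encoding and target_names ordering are assumptions):

```python
from sklearn.metrics import classification_report

# Toy labels; target_names maps classes in sorted order (0 = real, 1 = fake)
y_test = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1]
print(classification_report(y_test, y_pred, target_names=["real", "fake"]))
```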
How 98.5% Accuracy is Achieved
The exceptional performance results from combining multiple factors.

Inference: Predicting New Articles
The trained model can classify new articles (see fake_news_ia.py):
Critical: New articles must go through:
- Same preprocessing - the limpiar_texto function
- Same vectorizer - use vectorizer.transform() (NOT fit_transform())
- Same model - the trained modelo object
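The three requirements above can be sketched as a single inference helper. The cleaner here is a placeholder standing in for the real limpiar_texto, and the predecir helper is hypothetical, not a function from fake_news_ia.py:

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def limpiar_texto(texto):
    # Placeholder cleaner: lowercase and strip non-letters
    # (the real limpiar_texto in fake_news_ia.py may do more)
    return re.sub(r"[^a-záéíóúñü\s]", " ", texto.lower())

def predecir(articulo, vectorizer, modelo):
    limpio = limpiar_texto(articulo)     # 1. same preprocessing
    X = vectorizer.transform([limpio])   # 2. transform(), NOT fit_transform()
    return modelo.predict(X)[0]          # 3. same trained model

# Minimal end-to-end check with a toy model (assumed labels: 1 = fake, 0 = real)
vec = TfidfVectorizer()
X_train = vec.fit_transform(["shocking secret exposed", "officials released verified report"])
modelo = LogisticRegression(solver="liblinear", random_state=42).fit(X_train, [1, 0])
print(predecir("SHOCKING!! Secret exposed...", vec, modelo))
```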
Production Deployment
The Streamlit app loads and uses the saved artifacts (see app.py):
- User pastes article text in Streamlit interface
- Text is cleaned with
limpiar_texto - Cleaned text is vectorized with saved
vectorizer - Model predicts “real” or “fake”
- Result displayed with appropriate UI (✅ or ❌)
Model Limitations
While the model achieves 98.5% accuracy, it has limitations.

Next Steps
Architecture
Review the overall system architecture
Data Pipeline
Return to data loading and preparation
NLP Preprocessing
Review text cleaning techniques