Overview
The fake news detector uses Logistic Regression as its classification algorithm. Despite being a relatively simple model, it achieves 98.5% accuracy thanks to optimized hyperparameters and quality feature engineering.
Model Selection
Why Logistic Regression?
Logistic Regression was chosen for this project because:
- High Accuracy: Achieves 98.5% on the test set
- Fast Training: Trains in seconds even on large datasets
- Fast Predictions: Real-time inference for production use
- Interpretable: Feature weights show which terms indicate fake vs. real news
- Low Resource Requirements: No GPU needed, minimal memory usage
- Production-Ready: Simple to deploy and maintain
For binary text classification with good features (TF-IDF), Logistic Regression often matches or exceeds the performance of more complex models like Random Forests or Neural Networks, while being much faster and easier to interpret.
Model Configuration
fake_news_ia.py:95-96
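The configuration at the lines above can be sketched as follows (a minimal sketch; the variable name `model` is an assumption, not necessarily the project's actual identifier):

```python
from sklearn.linear_model import LogisticRegression

# Sketch of the detector's model configuration (variable name assumed)
model = LogisticRegression(
    max_iter=1000,       # give the solver room to converge (default is 100)
    solver='liblinear',  # efficient on sparse, high-dimensional TF-IDF matrices
    random_state=42,     # reproducible results across runs
)
```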
Hyperparameters
max_iter=1000
Maximum number of iterations allowed for the optimization algorithm to converge.
Why 1000?
- Default is 100, which may not converge for complex datasets
- 1000 ensures the solver finds the optimal solution
- Training typically converges in 200-400 iterations
- No significant performance penalty for allowing more iterations
solver='liblinear'
Algorithm used for optimization.
Why ‘liblinear’?
- Best for smaller datasets (under 100k samples) with high-dimensional features
- Supports L1 and L2 regularization for feature selection
- Faster convergence than ‘lbfgs’ or ‘saga’ for this use case
- Handles sparse matrices efficiently (our TF-IDF output)
Alternative solvers:
- 'lbfgs': Better for large datasets (>100k samples)
- 'saga': Good for very large datasets, supports L1 regularization
- 'newton-cg': Fast but only supports L2 regularization
random_state=42
Seed for the random number generator to ensure reproducible results.
Why 42?
- Standard convention in data science (reference to “The Hitchhiker’s Guide to the Galaxy”)
- Ensures consistent results across multiple runs
- Critical for reproducible research and debugging
Default Parameters (Not Explicitly Set)
The model also uses scikit-learn’s default values for other important parameters:
penalty='l2'
Regularization type to prevent overfitting. L2 (Ridge) regularization is applied by default.
C=1.0
Inverse of regularization strength. Smaller values mean stronger regularization. The default value of 1.0 works well for this dataset. You can tune it if needed:
- C < 1.0: More regularization, prevents overfitting
- C > 1.0: Less regularization, allows model to fit training data more closely
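For example, a more strongly regularized variant might look like this (a hypothetical configuration for experimentation, not the project's setting):

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical variant: stronger L2 regularization via a smaller C
regularized = LogisticRegression(
    C=0.5,               # C < 1.0 shrinks feature weights more aggressively
    max_iter=1000,
    solver='liblinear',
    random_state=42,
)
```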
class_weight
Not set, assuming balanced classes. If you have imbalanced data, use class_weight='balanced'.
Training Process
Prepare Training Data
Ensure you have vectorized features and labels.
Reference: fake_news_ia.py:86-88
Initialize Model
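The full flow from vectorized features to a fitted model can be sketched end to end as follows (the toy corpus and variable names such as `X_train` are illustrative assumptions, not the project's actual data or identifiers):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy corpus standing in for the real dataset (labels: 1 = real, 0 = fake)
texts = ["reuters said officials confirmed the report",
         "shocking secret the mainstream media hides",
         "government announced the new policy today",
         "you won believe this must see video"]
labels = [1, 0, 1, 0]

# Vectorize, split, initialize, and fit
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=1000, solver='liblinear', random_state=42)
model.fit(X_train, y_train)
```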
Create the Logistic Regression model with optimized hyperparameters.
Reference: fake_news_ia.py:95
Model Persistence
fake_news_ia.py:100-102
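A sketch of persisting the fitted artifacts with joblib (the file names and the minimal training stand-in below are assumptions, not the project's actual paths or pipeline):

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Minimal fitted artifacts standing in for the real training pipeline
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(["reuters said officials", "shocking must see"])
model = LogisticRegression(max_iter=1000, solver='liblinear', random_state=42)
model.fit(X, [1, 0])

# Persist both the classifier and the vectorizer; predictions need both
joblib.dump(model, 'fake_news_model.pkl')
joblib.dump(vectorizer, 'tfidf_vectorizer.pkl')
```

Saving the vectorizer alongside the model matters: new text must pass through the exact same vocabulary and IDF weights the model was trained on.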
Loading Saved Models
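A minimal sketch of loading the saved artifacts and running a prediction (the file names and the self-contained setup are assumptions):

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Recreate saved artifacts so this sketch is self-contained (1 = real, 0 = fake)
vec = TfidfVectorizer()
X = vec.fit_transform(["reuters said officials confirmed",
                       "shocking must see video"])
clf = LogisticRegression(max_iter=1000, solver='liblinear', random_state=42)
clf.fit(X, [1, 0])
joblib.dump(clf, 'fake_news_model.pkl')
joblib.dump(vec, 'tfidf_vectorizer.pkl')

# Load and predict on new text: vectorize first, then classify
model = joblib.load('fake_news_model.pkl')
vectorizer = joblib.load('tfidf_vectorizer.pkl')
prediction = model.predict(
    vectorizer.transform(["officials said the report was confirmed"]))
```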
To use the saved models for predictions, load both the vectorizer and the classifier, then vectorize new text before calling predict.
What the Model Learns
During training, Logistic Regression learns feature weights (coefficients) for each of the 5000 TF-IDF features:
- Positive weights: Terms associated with real news
- Negative weights: Terms associated with fake news
- Weights near zero: Terms that don’t help distinguish fake from real
Example Feature Weights
You can inspect the most important features by pairing the model’s coefficients with the vectorizer’s vocabulary. Common patterns:
- Real news indicators: “reuters”, “said officials”, “according”, “government announced”
- Fake news indicators: “shocking”, “you won believe”, “must see”, “mainstream media”
Hyperparameter Tuning (Optional)
If you want to experiment with different configurations, keep in mind that the current hyperparameters (max_iter=1000, solver='liblinear') already achieve 98.5% accuracy, so tuning may only provide marginal improvements.
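If you do experiment, a grid search is the usual approach. The sketch below uses GridSearchCV over a small parameter grid (the toy corpus, grid values, and 2-fold CV are illustrative assumptions; use 5-fold CV on the real dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy corpus standing in for the real dataset (1 = real, 0 = fake)
texts = ["reuters officials said", "shocking must see",
         "government announced policy", "you won believe this",
         "according to the report", "mainstream media secret"]
labels = [1, 0, 1, 0, 1, 0]
X = TfidfVectorizer().fit_transform(texts)

# Candidate values for regularization strength and solver
param_grid = {
    'C': [0.1, 1.0, 10.0],
    'solver': ['liblinear', 'lbfgs'],
}
search = GridSearchCV(
    LogisticRegression(max_iter=1000, random_state=42),
    param_grid,
    cv=2,                 # tiny toy corpus; prefer cv=5 on real data
    scoring='accuracy',
)
search.fit(X, labels)
print(search.best_params_)
```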