Overview
The Fake News Detector is built on a classic machine learning pipeline architecture that achieves 98.5% accuracy using Natural Language Processing (NLP) and Logistic Regression. The system processes approximately 44,000 news articles from Kaggle’s Fake and Real News dataset.

Architecture Pipeline
The system follows a six-stage pipeline from data ingestion to inference.

System Components
Training Pipeline (fake_news_ia.py)
The training script orchestrates the complete machine learning workflow:
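The workflow can be sketched as follows. This is a minimal, self-contained illustration, not the actual `fake_news_ia.py`: a tiny inline sample stands in for Kaggle’s `Fake.csv`/`True.csv`, and the column names, artifact file names, and split parameters are assumptions.

```python
import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Tiny inline stand-in for the Kaggle dataset (label 0 = fake, 1 = real)
fake = pd.DataFrame({
    "title": ["Shocking miracle cure found", "Aliens endorse candidate"] * 4,
    "text": ["Doctors hate this one trick that cures everything.",
             "Sources say extraterrestrials back the campaign."] * 4,
    "label": 0,
})
real = pd.DataFrame({
    "title": ["Senate passes budget bill", "Central bank holds rates"] * 4,
    "text": ["The chamber approved the measure after a lengthy debate.",
             "Policymakers left interest rates unchanged on Wednesday."] * 4,
    "label": 1,
})
df = pd.concat([fake, real], ignore_index=True)

# Combine title and text for maximum semantic context
df["content"] = (df["title"] + " " + df["text"]).str.lower()

X_train, X_test, y_train, y_test = train_test_split(
    df["content"], df["label"], test_size=0.25, random_state=42, stratify=df["label"]
)

# TF-IDF with uni- and bi-grams, capped at 5,000 features, stopwords removed
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=5000, stop_words="english")
model = LogisticRegression(max_iter=1000).fit(vectorizer.fit_transform(X_train), y_train)
print(model.score(vectorizer.transform(X_test), y_test))

# Persist both artifacts so inference reuses identical preprocessing
joblib.dump(model, "model.joblib")
joblib.dump(vectorizer, "vectorizer.joblib")
```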
Inference Application (app.py)
The Streamlit web application provides a user-friendly interface for real-time classification:
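A sketch of what the inference app might look like. The widget labels, artifact file names, and the `classify` helper are assumptions; in the actual `app.py` the UI code would run at the top level under `streamlit run`.

```python
import joblib

def classify(text, model, vectorizer):
    # Transform with the *saved* vectorizer so inference preprocessing
    # exactly matches training preprocessing
    proba = model.predict_proba(vectorizer.transform([text.lower()]))[0]
    return ("Real" if proba[1] >= 0.5 else "Fake", float(max(proba)))

def main():
    import streamlit as st  # imported lazily so classify() is usable on its own

    st.title("Fake News Detector")
    model = joblib.load("model.joblib")            # artifacts saved by the training script
    vectorizer = joblib.load("vectorizer.joblib")
    article = st.text_area("Paste a news article")
    if st.button("Classify") and article.strip():
        label, confidence = classify(article, model, vectorizer)
        st.metric("Prediction", label)
        st.write(f"Confidence: {confidence:.1%}")

# main()  # enable when launching via: streamlit run app.py
```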
Key Design Decisions
Title + Text Combination
Combining title and text fields provides maximum semantic context, improving model accuracy by capturing both headline patterns and article content.
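As a quick illustration (the `title`/`text` column names follow the Kaggle dataset; the sample row is invented):

```python
import pandas as pd

# Toy row standing in for one article from the dataset
df = pd.DataFrame({
    "title": ["Senate passes budget bill"],
    "text": ["The chamber approved the measure after a lengthy debate."],
})
# Headlines and bodies carry different signals, so the model trains on both
df["content"] = df["title"].str.strip() + " " + df["text"].str.strip()
print(df.loc[0, "content"])
```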
Metadata Removal
Removing source metadata (e.g., “WASHINGTON (REUTERS) -”) prevents the model from learning source bias, forcing it to focus on content quality.
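A dateline like this can be stripped with a regular expression. The exact pattern below is an assumption for illustration; the real script may match differently or cover more agencies.

```python
import re

# Leading source/dateline metadata, e.g. "WASHINGTON (Reuters) - "
# (pattern is an assumption; extend the alternation for other agencies)
DATELINE = re.compile(r"^[A-Z][A-Za-z .,/]*\((?:Reuters|REUTERS)\)\s*-\s*")

def strip_metadata(text: str) -> str:
    """Remove a leading dateline so the model can't learn source bias."""
    return DATELINE.sub("", text)

print(strip_metadata("WASHINGTON (Reuters) - Lawmakers voted on Tuesday."))
```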
TF-IDF Vectorization
TF-IDF with bi-grams (1,2) and 5,000 features balances performance and dimensionality, capturing key phrases while remaining computationally efficient.
Model Persistence
Saving both the model and vectorizer with joblib ensures consistent preprocessing and prediction in production.
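A sketch of the round trip, with a throwaway model trained on invented data (the `model.joblib`/`vectorizer.joblib` file names are assumptions):

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["fake sensational hoax"] * 5 + ["official verified statement"] * 5
labels = [0] * 5 + [1] * 5

vectorizer = TfidfVectorizer()
model = LogisticRegression().fit(vectorizer.fit_transform(texts), labels)

# Save both artifacts: a model without its matching vectorizer cannot
# reproduce the training-time feature space at inference
joblib.dump(model, "model.joblib")
joblib.dump(vectorizer, "vectorizer.joblib")

# Later, in the app:
model = joblib.load("model.joblib")
vectorizer = joblib.load("vectorizer.joblib")
print(model.predict(vectorizer.transform(["official verified statement"])))
```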
Data Flow Diagram
Here’s how data flows through the system:

Performance Metrics

The system achieves exceptional performance on the test set.

Accuracy: 98.5% - The model correctly classifies approximately 98.5 out of every 100 news articles as real or fake. Several factors contribute to this result:
- Quality training data - 44,000 labeled articles
- Effective feature engineering - Title + text combination
- Robust preprocessing - Metadata removal and stopword filtering
- Optimized vectorization - TF-IDF with bi-grams
- Appropriate algorithm - Logistic Regression for binary text classification
Technology Stack
| Component | Technology | Purpose |
|---|---|---|
| Data Processing | Pandas | Dataset manipulation and cleaning |
| NLP | NLTK | Stopwords, tokenization |
| Vectorization | scikit-learn (TF-IDF) | Text-to-numerical conversion |
| Model | scikit-learn (LogisticRegression) | Binary classification |
| Persistence | joblib | Model serialization |
| Web Interface | Streamlit | User-facing application |
Next Steps
- Data Pipeline - Learn about data loading, cleaning, and preparation
- NLP Preprocessing - Explore text cleaning and preprocessing techniques
- Model Training - Understand the training process and evaluation