Overview
The data pipeline is the foundation of the fake news detector, responsible for ingesting, cleaning, and preparing approximately 44,000 news articles for model training. This stage ensures high-quality input data that directly impacts the model’s 98.5% accuracy.
Data Sources
The project uses the Fake and Real News Dataset from Kaggle, consisting of two CSV files:
- Fake.csv - Contains fake news articles with title, text, subject, and date fields
- True.csv - Contains real news articles with the same structure
Pipeline Stages
1. Dataset Loading
The first step loads both CSV files and prepares them for combination (see fake_news_ia.py).
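The original snippet is not shown here; below is a minimal sketch of the loading step with pandas. Tiny stand-in CSVs are written first so the sketch runs on its own (the real pipeline reads the full Kaggle files, and the DataFrame names are assumptions):

```python
import pandas as pd

# Stand-in files so this sketch is self-contained; the real pipeline
# reads the full Kaggle CSVs (Fake.csv and True.csv) from disk.
pd.DataFrame({
    "title": ["Shocking claim!"], "text": ["Body of a fake article."],
    "subject": ["News"], "date": ["Jan 1, 2017"],
}).to_csv("Fake.csv", index=False)
pd.DataFrame({
    "title": ["Senate passes bill"], "text": ["Body of a real article."],
    "subject": ["politicsNews"], "date": ["Jan 1, 2017"],
}).to_csv("True.csv", index=False)

# Load both datasets, keeping them separate until labels are assigned
fake_df = pd.read_csv("Fake.csv")
true_df = pd.read_csv("True.csv")

print(fake_df.shape, true_df.shape)
```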
2. Label Assignment
Each dataset is assigned a binary label to enable supervised learning:
- Fake news → "fake" label
- Real news → "real" label
3. Dataset Combination
Both datasets are merged into a single DataFrame, selecting only the necessary columns. Using ignore_index=True ensures the combined DataFrame has a continuous index from 0 to n-1, preventing index conflicts.
4. Feature Engineering: Title + Text Combination
This is the key design decision that significantly boosts model accuracy: the pipeline combines the title and text fields into a single full_text column. This combination matters for three reasons:
- Headlines contain crucial signals - Fake news often uses sensationalist or clickbait titles
- Maximum semantic context - The model sees both the hook (title) and the content (body)
- Pattern recognition - Combining fields helps identify writing style inconsistencies
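The combination and feature-engineering steps might look like the following sketch (variable names are assumptions; the concat uses ignore_index=True as described above):

```python
import pandas as pd

# Tiny labeled stand-ins for the two datasets
fake_df = pd.DataFrame({"title": ["Shocking claim!"],
                        "text": ["Fake body."], "label": ["fake"]})
true_df = pd.DataFrame({"title": ["Senate passes bill"],
                        "text": ["Real body."], "label": ["real"]})

# Merge into one DataFrame; ignore_index=True gives a continuous 0..n-1 index
df = pd.concat([fake_df, true_df], ignore_index=True)
df = df[["title", "text", "label"]].copy()

# Key step: combine title and text so the model sees hook and body together
df["full_text"] = df["title"] + " " + df["text"]

print(df[["full_text", "label"]])
```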
5. Missing Value Handling
The pipeline implements a two-step cleaning process:
- Drop critical nulls - Remove rows where full_text or label is missing (these cannot be used for training)
- Fill remaining nulls - Replace any other null values with empty strings to prevent errors
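The two steps above map directly onto pandas dropna and fillna; a minimal sketch (column contents are illustrative):

```python
import pandas as pd

# Stand-in data with missing values in different columns
df = pd.DataFrame({
    "full_text": ["some article", None, "another article"],
    "label": ["fake", "real", None],
    "subject": [None, "News", "News"],
})

# Step 1: drop rows missing the fields required for training
df = df.dropna(subset=["full_text", "label"])

# Step 2: fill any other nulls with empty strings to avoid downstream errors
df = df.fillna("")

print(df)
```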
6. Data Distribution Analysis
After cleaning, the pipeline displays label distribution to check for class imbalance.
Train/Test Split
After preprocessing, the data is split into training and testing sets:
- 80% training - Approximately 35,200 articles for model learning
- 20% testing - Approximately 8,800 articles for unbiased evaluation
- random_state=42 - Ensures reproducible splits across runs
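The distribution check and the split could be sketched as follows, assuming scikit-learn's train_test_split is used (variable names are assumptions; a small balanced stand-in replaces the ~44,000 cleaned articles):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Small balanced stand-in for the cleaned dataset
df = pd.DataFrame({
    "full_text": [f"article {i}" for i in range(10)],
    "label": ["fake", "real"] * 5,
})

# Check for class imbalance before splitting
print(df["label"].value_counts())

# 80/20 split; random_state=42 makes the split reproducible across runs
X_train, X_test, y_train, y_test = train_test_split(
    df["full_text"], df["label"], test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))
```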
- Training Set - ~35,200 articles used to train the Logistic Regression model
- Test Set - ~8,800 articles held out for final performance evaluation
Data Quality Assurance
The pipeline includes several quality checks, inspecting sample rows to verify:
- Original title
- Original text
- Cleaned text (after NLP preprocessing)
- Assigned label
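A sketch of such a spot check, printing the four listed fields for a sample row (the clean_text helper here is a hypothetical stand-in for the real NLP preprocessing stage):

```python
import pandas as pd

def clean_text(s: str) -> str:
    # Hypothetical stand-in for the real NLP preprocessing step
    return s.lower().strip()

# Stand-in row from the combined dataset
df = pd.DataFrame({
    "title": ["Shocking Claim!"],
    "text": ["Body of the article."],
    "label": ["fake"],
})
df["clean_text"] = df["text"].apply(clean_text)

# Display a sample row so each field can be eyeballed for sanity
sample = df.iloc[0]
print("Original title:", sample["title"])
print("Original text:", sample["text"])
print("Cleaned text:", sample["clean_text"])
print("Assigned label:", sample["label"])
```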
Key Design Choices
Why Combine Title and Text?
Fake news detection benefits from analyzing both components:
| Component | Signals |
|---|---|
| Title | Sensationalism, clickbait patterns, emotional manipulation |
| Text | Factual inconsistencies, writing quality, source citations |
| Combined | Coherence between headline and content |
Why 80/20 Split?
The 80/20 split is a standard machine learning practice that:
- Maximizes training data - More data for the model to learn patterns (80%)
- Ensures valid evaluation - Sufficient test data for statistical significance (20%)
- Prevents overfitting - Held-out test set detects if the model memorizes training data
- Industry standard - Widely accepted for datasets of this size (44K samples)
Pipeline Output
After completion, the data pipeline produces:
- X_train - TF-IDF vectorized training features (shape: ~35,200 × 5,000)
- X_test - TF-IDF vectorized test features (shape: ~8,800 × 5,000)
- y_train - Training labels (“fake” or “real”)
- y_test - Test labels for evaluation
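The TF-IDF vectorization behind these outputs might look like this sketch (max_features=5000 matches the stated 5,000-feature dimension; everything else, including the tiny stand-in texts, is an assumption):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny stand-ins for the split full_text data
train_texts = ["fake shocking claim", "senate passes bill", "another real story"]
test_texts = ["shocking senate story"]

# Cap the vocabulary at 5,000 terms, matching the stated feature dimension
vectorizer = TfidfVectorizer(max_features=5000)

# Fit the vocabulary on training text only, then transform both splits
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

print(X_train.shape, X_test.shape)
```

Fitting only on the training split prevents vocabulary information from the test set leaking into training.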
Error Handling
The pipeline includes robust error handling.
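The original snippet is not shown; below is a plausible sketch of defensive dataset loading, assuming the pipeline exits with a clear message when a CSV is missing or empty (the load_dataset helper is hypothetical):

```python
import sys
import pandas as pd

def load_dataset(path: str) -> pd.DataFrame:
    """Load a CSV, exiting with a clear message if it is missing or empty."""
    try:
        df = pd.read_csv(path)
    except FileNotFoundError:
        sys.exit(f"Error: {path} not found. Download the Kaggle dataset first.")
    except pd.errors.EmptyDataError:
        sys.exit(f"Error: {path} is empty.")
    if df.empty:
        sys.exit(f"Error: {path} contains no rows.")
    return df
```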
Next Steps
- NLP Preprocessing - Learn how text is cleaned and prepared for vectorization
- Model Training - Understand how the Logistic Regression model is trained