Overview
The data pipeline is the foundation of the fake news detector, responsible for ingesting, cleaning, and preparing approximately 44,000 news articles for model training. This stage ensures high-quality input data that directly impacts the model’s 98.5% accuracy.
Data Sources
The project uses the Fake and Real News Dataset from Kaggle, consisting of two CSV files:
- Fake.csv - Contains fake news articles with title, text, subject, and date fields
- True.csv - Contains real news articles with the same structure
Pipeline Stages
1. Dataset Loading
The first step loads both CSV files and prepares them for combination (see fake_news_ia.py).
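The original snippet is not shown here; below is a minimal sketch of the loading step with pandas. Tiny stand-in CSVs are written first so the sketch runs on its own (the real pipeline reads the full Kaggle files, and the DataFrame names are assumptions):

```python
import pandas as pd

# Stand-in files so this sketch is self-contained; the real pipeline
# reads the full Kaggle CSVs (Fake.csv and True.csv) from disk.
pd.DataFrame({
    "title": ["Shocking claim!"], "text": ["Body of a fake article."],
    "subject": ["News"], "date": ["Jan 1, 2017"],
}).to_csv("Fake.csv", index=False)
pd.DataFrame({
    "title": ["Senate passes bill"], "text": ["Body of a real article."],
    "subject": ["politicsNews"], "date": ["Jan 1, 2017"],
}).to_csv("True.csv", index=False)

# Load both datasets, keeping them separate until labels are assigned
fake_df = pd.read_csv("Fake.csv")
true_df = pd.read_csv("True.csv")

print(fake_df.shape, true_df.shape)
```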
2. Label Assignment
Each dataset is assigned a binary label to enable supervised learning:
- Fake news → "fake" label
- Real news → "real" label
3. Dataset Combination
Both datasets are merged into a single DataFrame, selecting only the necessary columns. Using ignore_index=True ensures the combined DataFrame has a continuous index from 0 to n-1, preventing index conflicts.
4. Feature Engineering: Title + Text Combination
This is the key design decision that significantly boosts model accuracy: the pipeline combines the title and text fields into a single full_text column. This combination matters for three reasons:
- Headlines contain crucial signals - Fake news often uses sensationalist or clickbait titles
- Maximum semantic context - The model sees both the hook (title) and the content (body)
- Pattern recognition - Combining fields helps identify writing style inconsistencies
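The combination and feature-engineering steps might look like the following sketch (variable names are assumptions; the concat uses ignore_index=True as described above):

```python
import pandas as pd

# Tiny labeled stand-ins for the two datasets
fake_df = pd.DataFrame({"title": ["Shocking claim!"],
                        "text": ["Fake body."], "label": ["fake"]})
true_df = pd.DataFrame({"title": ["Senate passes bill"],
                        "text": ["Real body."], "label": ["real"]})

# Merge into one DataFrame; ignore_index=True gives a continuous 0..n-1 index
df = pd.concat([fake_df, true_df], ignore_index=True)
df = df[["title", "text", "label"]].copy()

# Key step: combine title and text so the model sees hook and body together
df["full_text"] = df["title"] + " " + df["text"]

print(df[["full_text", "label"]])
```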
5. Missing Value Handling
The pipeline implements a two-step cleaning process:
- Drop critical nulls - Remove rows where full_text or label is missing (these cannot be used for training)
- Fill remaining nulls - Replace any other null values with empty strings to prevent errors
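The two steps above map directly onto pandas dropna and fillna; a minimal sketch (column contents are illustrative):

```python
import pandas as pd

# Stand-in data with missing values in different columns
df = pd.DataFrame({
    "full_text": ["some article", None, "another article"],
    "label": ["fake", "real", None],
    "subject": [None, "News", "News"],
})

# Step 1: drop rows missing the fields required for training
df = df.dropna(subset=["full_text", "label"])

# Step 2: fill any other nulls with empty strings to avoid downstream errors
df = df.fillna("")

print(df)
```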
6. Data Distribution Analysis
After cleaning, the pipeline displays label distribution to check for class imbalance.
Train/Test Split
After preprocessing, the data is split into training and testing sets:
- 80% training - Approximately 35,200 articles for model learning
- 20% testing - Approximately 8,800 articles for unbiased evaluation
- random_state=42 - Ensures reproducible splits across runs
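The distribution check and the split could be sketched as follows, assuming scikit-learn's train_test_split is used (variable names are assumptions; a small balanced stand-in replaces the ~44,000 cleaned articles):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Small balanced stand-in for the cleaned dataset
df = pd.DataFrame({
    "full_text": [f"article {i}" for i in range(10)],
    "label": ["fake", "real"] * 5,
})

# Check for class imbalance before splitting
print(df["label"].value_counts())

# 80/20 split; random_state=42 makes the split reproducible across runs
X_train, X_test, y_train, y_test = train_test_split(
    df["full_text"], df["label"], test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))
```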
- Training Set - ~35,200 articles used to train the Logistic Regression model
- Test Set - ~8,800 articles held out for final performance evaluation
Data Quality Assurance
The pipeline includes several quality checks, inspecting sample rows to verify:
- Original title
- Original text
- Cleaned text (after NLP preprocessing)
- Assigned label
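A sketch of such a spot check, printing the four listed fields for a sample row (the clean_text helper here is a hypothetical stand-in for the real NLP preprocessing stage):

```python
import pandas as pd

def clean_text(s: str) -> str:
    # Hypothetical stand-in for the real NLP preprocessing step
    return s.lower().strip()

# Stand-in row from the combined dataset
df = pd.DataFrame({
    "title": ["Shocking Claim!"],
    "text": ["Body of the article."],
    "label": ["fake"],
})
df["clean_text"] = df["text"].apply(clean_text)

# Display a sample row so each field can be eyeballed for sanity
sample = df.iloc[0]
print("Original title:", sample["title"])
print("Original text:", sample["text"])
print("Cleaned text:", sample["clean_text"])
print("Assigned label:", sample["label"])
```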
Key Design Choices
Why Combine Title and Text?
Fake news detection benefits from analyzing both components:
| Component | Signals |
|---|---|
| Title | Sensationalism, clickbait patterns, emotional manipulation |
| Text | Factual inconsistencies, writing quality, source citations |
| Combined | Coherence between headline and content |
Why 80/20 Split?
The 80/20 split is a standard machine learning practice that:
- Maximizes training data - More data for the model to learn patterns (80%)
- Ensures valid evaluation - Sufficient test data for statistical significance (20%)
- Prevents overfitting - Held-out test set detects if the model memorizes training data
- Industry standard - Widely accepted for datasets of this size (44K samples)
Pipeline Output
After completion, the data pipeline produces:
- X_train - TF-IDF vectorized training features (shape: ~35,200 × 5,000)
- X_test - TF-IDF vectorized test features (shape: ~8,800 × 5,000)
- y_train - Training labels (“fake” or “real”)
- y_test - Test labels for evaluation
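The TF-IDF vectorization behind these outputs might look like this sketch (max_features=5000 matches the stated 5,000-feature dimension; everything else, including the tiny stand-in texts, is an assumption):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny stand-ins for the split full_text data
train_texts = ["fake shocking claim", "senate passes bill", "another real story"]
test_texts = ["shocking senate story"]

# Cap the vocabulary at 5,000 terms, matching the stated feature dimension
vectorizer = TfidfVectorizer(max_features=5000)

# Fit the vocabulary on training text only, then transform both splits
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

print(X_train.shape, X_test.shape)
```

Fitting only on the training split prevents vocabulary information from the test set leaking into training.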
Error Handling
The pipeline includes robust error handling.
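The original snippet is not shown; below is a plausible sketch of defensive dataset loading, assuming the pipeline exits with a clear message when a CSV is missing or empty (the load_dataset helper is hypothetical):

```python
import sys
import pandas as pd

def load_dataset(path: str) -> pd.DataFrame:
    """Load a CSV, exiting with a clear message if it is missing or empty."""
    try:
        df = pd.read_csv(path)
    except FileNotFoundError:
        sys.exit(f"Error: {path} not found. Download the Kaggle dataset first.")
    except pd.errors.EmptyDataError:
        sys.exit(f"Error: {path} is empty.")
    if df.empty:
        sys.exit(f"Error: {path} contains no rows.")
    return df
```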
Next Steps
- NLP Preprocessing - Learn how text is cleaned and prepared for vectorization
- Model Training - Understand how the Logistic Regression model is trained