Overview

The Fake News Detector is built on a classic machine learning pipeline architecture that achieves 98.5% accuracy using Natural Language Processing (NLP) and Logistic Regression. The system processes approximately 44,000 news articles from Kaggle’s Fake and Real News dataset.

Architecture Pipeline

The system follows a six-stage pipeline from data ingestion to inference:
1. Data Loading: Load and combine the datasets from Fake.csv and True.csv
2. Data Cleaning & Preparation: Merge title and text fields, handle missing values, assign labels
3. NLP Preprocessing: Clean text by removing metadata, stopwords, and special characters
4. Vectorization: Convert cleaned text to numerical features using TF-IDF
5. Model Training: Train a Logistic Regression classifier and evaluate performance
6. Persistence & Inference: Save model artifacts and enable real-time prediction

System Components

Training Pipeline (fake_news_ia.py)

The training script orchestrates the complete machine learning workflow:
fake_news_ia.py
import pandas as pd

# Load datasets
fake = pd.read_csv("Fake.csv")
true = pd.read_csv("True.csv")

# Assign labels
fake["label"] = "fake"
true["label"] = "real"

# Combine datasets
df = pd.concat([fake[['title', 'text', 'label']], 
                true[['title', 'text', 'label']]], ignore_index=True)

# KEY: Combine title and text for maximum context
df['full_text'] = df['title'].astype(str) + ' ' + df['text'].astype(str)
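The later stages of the same script (vectorization, splitting, and training) can be sketched as follows. The parameter values match the documented pipeline (5,000 TF-IDF features, bi-grams, 80/20 split), but the tiny stand-in DataFrame and variable names are illustrative, not taken verbatim from the script:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in for the combined DataFrame built in the loading step above
df = pd.DataFrame({
    'full_text': ['shocking miracle cure doctors hate',
                  'senate passes budget bill on tuesday',
                  'you will not believe this one trick',
                  'central bank holds interest rates steady'] * 10,
    'label': ['fake', 'real', 'fake', 'real'] * 10,
})

# Stage 4: TF-IDF over uni- and bi-grams, capped at 5,000 features
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X = vectorizer.fit_transform(df['full_text'])
y = df['label']

# 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Stage 5: train and evaluate the Logistic Regression classifier
modelo = LogisticRegression(max_iter=1000)
modelo.fit(X_train, y_train)
print(f"Test accuracy: {modelo.score(X_test, y_test):.3f}")
```

Because the vectorizer is fitted on the training text, the exact same fitted object must be reused at inference time, which is why it is persisted alongside the model.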

Inference Application (app.py)

The Streamlit web application provides a user-friendly interface for real-time classification:
app.py
import joblib

# Load pre-trained artifacts
modelo = joblib.load('modelo_fake_news.pkl')
vectorizer = joblib.load('vectorizer_tfidf.pkl')

# Process the user's input from the Streamlit text area
noticia_limpia = limpiar_texto(noticia_input)
noticia_vec = vectorizer.transform([noticia_limpia])
prediccion = modelo.predict(noticia_vec)[0]
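The limpiar_texto helper is defined in the project's own code; a minimal sketch of what such a cleaning function does, assuming the steps listed under NLP Preprocessing (lowercasing, metadata removal, special-character stripping, stopword filtering — the tiny stopword set here stands in for NLTK's full corpus):

```python
import re

# Small illustrative stopword set; the real pipeline uses NLTK's stopwords
STOPWORDS = {'the', 'a', 'an', 'and', 'or', 'of', 'to', 'in', 'on', 'is'}

def limpiar_texto(texto: str) -> str:
    """Sketch of the cleaning applied before vectorization."""
    texto = texto.lower()
    # Drop leading source metadata such as "washington (reuters) - "
    texto = re.sub(r'^[a-z/,\. ]+\([a-z ]+\)\s*-\s*', '', texto)
    # Remove anything that is not a letter or whitespace
    texto = re.sub(r'[^a-z\s]', ' ', texto)
    # Collapse whitespace and filter stopwords
    tokens = [t for t in texto.split() if t not in STOPWORDS]
    return ' '.join(tokens)

limpiar_texto("WASHINGTON (Reuters) - The senate passed a bill.")
```

Keeping this function identical between training and inference is essential: the TF-IDF vectorizer only recognizes tokens it saw during fitting.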

Key Design Decisions

Title + Text Combination

Combining title and text fields provides maximum semantic context, improving model accuracy by capturing both headline patterns and article content.

Metadata Removal

Removing source metadata (e.g., “WASHINGTON (REUTERS) -”) prevents the model from learning source bias, forcing it to focus on content quality.

TF-IDF Vectorization

TF-IDF with bi-grams (1,2) and 5,000 features balances performance and dimensionality, capturing key phrases while remaining computationally efficient.

Model Persistence

Saving both the model and vectorizer with joblib ensures consistent preprocessing and prediction in production.
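The save/load round trip looks like this; the data is made up, and the artifacts are written to a temporary directory here, whereas the real script writes modelo_fake_news.pkl and vectorizer_tfidf.pkl to the project root:

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Fit a tiny model/vectorizer pair on made-up examples
vec = TfidfVectorizer()
X = vec.fit_transform(['shocking fake claim', 'official government report'])
clf = LogisticRegression().fit(X, ['fake', 'real'])

# Save BOTH artifacts; a model without its exact fitted vectorizer is unusable
outdir = tempfile.mkdtemp()
joblib.dump(clf, os.path.join(outdir, 'modelo_fake_news.pkl'))
joblib.dump(vec, os.path.join(outdir, 'vectorizer_tfidf.pkl'))

# Later, e.g. in app.py: reload the pair and predict
clf2 = joblib.load(os.path.join(outdir, 'modelo_fake_news.pkl'))
vec2 = joblib.load(os.path.join(outdir, 'vectorizer_tfidf.pkl'))
pred = clf2.predict(vec2.transform(['shocking claim']))[0]
```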

Data Flow Diagram

Here’s how data flows through the system:
┌─────────────────┐
│  Fake.csv       │
│  True.csv       │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Data Loading   │ ──► Combine datasets, add labels
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Preprocessing  │ ──► Clean text, remove stopwords
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Vectorization  │ ──► TF-IDF (5000 features, bi-grams)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Train/Test     │ ──► 80/20 split
│  Split          │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Model Training │ ──► Logistic Regression
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Persistence    │ ──► Save .pkl files
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Inference API  │ ──► Streamlit App
└─────────────────┘

Performance Metrics

The system achieves strong performance on the held-out test set:
Accuracy: 98.5% — the model correctly classifies approximately 98.5 out of every 100 news articles as real or fake.
This high accuracy is achieved through:
  1. Quality training data - 44,000 labeled articles
  2. Effective feature engineering - Title + text combination
  3. Robust preprocessing - Metadata removal and stopword filtering
  4. Optimized vectorization - TF-IDF with bi-grams
  5. Appropriate algorithm - Logistic Regression for binary text classification
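The accuracy figure comes from comparing predictions against held-out labels; a small illustration with scikit-learn's metrics (the five labels and predictions below are made up for demonstration):

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical held-out labels vs. model predictions
y_test = ['fake', 'real', 'fake', 'real', 'fake']
y_pred = ['fake', 'real', 'fake', 'real', 'real']

acc = accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc:.1%}")
print(classification_report(y_test, y_pred))
```

classification_report additionally breaks accuracy down into per-class precision, recall, and F1, which is worth checking even when overall accuracy is high.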

Technology Stack

ComponentTechnologyPurpose
Data ProcessingPandasDataset manipulation and cleaning
NLPNLTKStopwords, tokenization
Vectorizationscikit-learn (TF-IDF)Text-to-numerical conversion
Modelscikit-learn (LogisticRegression)Binary classification
PersistencejoblibModel serialization
Web InterfaceStreamlitUser-facing application

Next Steps

- Data Pipeline: Learn about data loading, cleaning, and preparation
- NLP Preprocessing: Explore text cleaning and preprocessing techniques
- Model Training: Understand the training process and evaluation
