Detect fake news with machine learning

This project is an AI-powered fake news detector that uses Natural Language Processing (NLP) and Logistic Regression to classify news articles as real or fake, with a reported accuracy of 98.5%. It is built with scikit-learn and deployed as an interactive Streamlit web application. The model is trained on approximately 44,000 news articles and uses TF-IDF vectorization with unigrams and bigrams to capture word and phrase patterns in fake news content.

Key features

98.5% accuracy

Reaches high classification accuracy through tuned TF-IDF vectorization and Logistic Regression

TF-IDF with N-grams

Uses unigrams and bigrams with 5,000 features to capture key phrases and patterns

44,000 training samples

Trained on a large dataset from Kaggle’s Fake and Real News collections

Interactive Streamlit UI

Easy-to-use web interface for real-time news classification

Anti-bias design

Removes source metadata to ensure the model focuses on content, not publisher

NLP preprocessing

Advanced text cleaning with stopword removal and metadata filtering

How it works

The fake news detector uses a machine learning pipeline with five stages:
  1. Data preparation - Combines the title and text fields and removes source metadata such as “WASHINGTON (REUTERS) -” to prevent bias
  2. NLP preprocessing - Tokenizes the text and removes stopwords and punctuation
  3. TF-IDF vectorization - Converts cleaned text into numerical features using Term Frequency-Inverse Document Frequency
  4. Logistic Regression - Fast and interpretable classification model
  5. Model persistence - Saves trained model and vectorizer using joblib for deployment
The model focuses on content analysis rather than source verification, making it capable of detecting fake news patterns regardless of publisher.
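
The five stages above can be sketched in a few lines of scikit-learn. This is a minimal illustration, not the project's actual training script: the `clean_text` and `train` helpers, the dateline pattern, the toy articles, the label convention (1 = real, 0 = fake), the output file names, and the use of scikit-learn's built-in English stopword list are all assumptions made for the example. Only the n-gram range, the 5,000-feature cap, Logistic Regression, and joblib come from the description above.

```python
import re
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Assumed pattern for leading datelines such as "WASHINGTON (Reuters) - ".
DATELINE = re.compile(r"^[A-Z][A-Z ,./]*\(reuters\)\s*-\s*", re.IGNORECASE)


def clean_text(title: str, text: str) -> str:
    """Steps 1-2: merge title and body, drop the dateline, keep letters only."""
    body = DATELINE.sub("", text.strip())
    combined = f"{title} {body}".lower()
    combined = re.sub(r"[^a-z\s]", " ", combined)  # strip punctuation and digits
    return " ".join(combined.split())              # collapse whitespace


def train(titles, texts, labels):
    """Steps 3-4: TF-IDF over unigrams and bigrams, then Logistic Regression."""
    docs = [clean_text(t, b) for t, b in zip(titles, texts)]
    vectorizer = TfidfVectorizer(
        ngram_range=(1, 2),    # unigrams and bigrams
        max_features=5000,     # 5,000-feature cap from the key features
        stop_words="english",  # assumption: scikit-learn's built-in stopword list
    )
    features = vectorizer.fit_transform(docs)
    model = LogisticRegression(max_iter=1000)
    model.fit(features, labels)
    return model, vectorizer


# Toy data for illustration only; labels assume 1 = real, 0 = fake.
titles = ["Senate passes budget", "Aliens run the bank",
          "Court upholds ruling", "Miracle pill cures all"]
texts = ["WASHINGTON (Reuters) - The Senate approved the budget bill.",
         "Shocking insiders reveal aliens secretly control every bank.",
         "LONDON (Reuters) - The court upheld the earlier ruling.",
         "Doctors hate this shocking miracle pill that cures everything."]
labels = [1, 0, 1, 0]

model, vectorizer = train(titles, texts, labels)

# Step 5: persist both artifacts; the file names here are placeholders.
out_dir = Path(tempfile.mkdtemp())
joblib.dump(model, out_dir / "model.joblib")
joblib.dump(vectorizer, out_dir / "vectorizer.joblib")

# Reload and classify a new article with the same cleaning function.
loaded_model = joblib.load(out_dir / "model.joblib")
loaded_vectorizer = joblib.load(out_dir / "vectorizer.joblib")
sample = clean_text("Budget vote", "WASHINGTON (Reuters) - The Senate voted today.")
prediction = loaded_model.predict(loaded_vectorizer.transform([sample]))[0]
```

Saving the vectorizer alongside the model matters because the classifier's coefficients only make sense against the exact vocabulary and IDF weights learned during training.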

Get started

Quickstart

Train the model and classify your first news article in minutes

Installation

Set up your Python environment and install dependencies

Training

Learn how the model is trained and optimized

API Reference

Explore the prediction API and integration options

Architecture

The project consists of three main components:
  • fake_news_ia.py - Training script that processes approximately 44,000 articles, trains the model, and saves artifacts
  • app.py - Streamlit web application for interactive news classification
  • predict_news.py - Command-line interface for batch predictions
All components use identical preprocessing functions to ensure consistency between training and inference.
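
One way to get that consistency is to route both the training path and the inference path through a single shared function. The sketch below is an assumption about how this could be structured, not the project's actual code: `preprocess`, `to_features`, and the dateline pattern are hypothetical names and choices introduced for illustration.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer

# Assumed pattern for leading datelines such as "WASHINGTON (Reuters) - ".
_DATELINE = re.compile(r"^[A-Z][A-Z ,./]*\([A-Za-z]+\)\s*-\s*")


def preprocess(text: str) -> str:
    """Single cleaning function shared by training, the web app, and the CLI."""
    text = _DATELINE.sub("", text.strip())
    text = re.sub(r"[^a-z\s]", " ", text.lower())  # keep letters only
    return " ".join(text.split())                  # collapse whitespace


def to_features(texts, vectorizer, fit=False):
    """Clean every text the same way, then fit or apply the vectorizer."""
    cleaned = [preprocess(t) for t in texts]
    if fit:
        return vectorizer.fit_transform(cleaned)  # training path
    return vectorizer.transform(cleaned)          # inference path


# Training path: the dateline is stripped before the vectorizer sees the text.
train_texts = [
    "WASHINGTON (Reuters) - Senate passes budget bill",
    "Aliens secretly control the weather, insiders say",
]
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=5000)
X_train = to_features(train_texts, vectorizer, fit=True)

# Inference path: the same article without a dateline maps to the same vector.
X_new = to_features(["Senate passes budget bill"], vectorizer)
```

Because both paths call `preprocess`, an article yields the same feature vector whether or not it carries a publisher dateline, which is the property the anti-bias design relies on.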
