
Overview

The fake news detector uses two Kaggle datasets containing labeled fake and real news articles. Proper dataset preparation is crucial to reaching the model's reported 98.5% accuracy.

Dataset Sources

The training data comes from two CSV files:
  • Fake.csv - CSV file containing fake news articles from Kaggle's Fake News dataset
  • True.csv - CSV file containing real news articles from Kaggle's Real News dataset

Required CSV Structure

Each CSV file must contain the following fields:
  • title - The headline of the news article
  • text - The full body text of the article
  • Additional metadata fields (optional, not used in training)
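Before training, it can help to confirm that both files actually expose the required fields. The following is a minimal sketch (the `check_columns` helper and the in-memory sample frame are illustrative, not part of fake_news_ia.py):

```python
import pandas as pd

REQUIRED_COLUMNS = {"title", "text"}

def check_columns(df: pd.DataFrame, name: str) -> None:
    """Raise a descriptive error if a required column is missing."""
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"{name} is missing required columns: {sorted(missing)}")

# In-memory frame standing in for Fake.csv (illustrative data only)
sample = pd.DataFrame({"title": ["Headline"], "text": ["Body"], "subject": ["news"]})
check_columns(sample, "Fake.csv")  # passes; extra metadata columns are allowed
```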

Data Loading Process

1. Load Individual Datasets

Load both fake and real news datasets using pandas:
import pandas as pd

fake = pd.read_csv("Fake.csv")
true = pd.read_csv("True.csv")
Reference: fake_news_ia.py:19-20
2. Add Labels

Add label columns to distinguish fake from real news:
fake["label"] = "fake"
true["label"] = "real"
Reference: fake_news_ia.py:23-24
3. Concatenate Datasets

Merge both datasets while keeping only necessary columns:
df = pd.concat(
    [fake[['title', 'text', 'label']], 
     true[['title', 'text', 'label']]], 
    ignore_index=True
)
Reference: fake_news_ia.py:27
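On toy stand-ins for the two labeled datasets, the concatenation from the steps above produces a single frame with a `label` column (the sample rows below are illustrative only):

```python
import pandas as pd

# Toy stand-ins for the labeled Fake.csv and True.csv frames
fake = pd.DataFrame({"title": ["Fake A"], "text": ["body a"], "label": ["fake"]})
true = pd.DataFrame({"title": ["Real B", "Real C"],
                     "text": ["body b", "body c"],
                     "label": ["real", "real"]})

df = pd.concat(
    [fake[['title', 'text', 'label']],
     true[['title', 'text', 'label']]],
    ignore_index=True
)
print(df.shape)               # (3, 3)
print(df["label"].tolist())   # ['fake', 'real', 'real']
```

`ignore_index=True` rebuilds a clean 0..n-1 index so the original row positions from the two files don't collide.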

Key Feature Engineering

Critical Optimization: Combining title and text significantly improves model accuracy by providing more contextual information.

Combining Title and Text

The model concatenates the article title with its body text to create a richer feature set:
df['full_text'] = df['title'].astype(str) + ' ' + df['text'].astype(str)
Reference: fake_news_ia.py:30
This combination allows the model to learn from both the headline patterns (which often differ significantly between fake and real news) and the article content.
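On a single toy row, the combination looks like this (the sample headline and body are invented for illustration):

```python
import pandas as pd

# One illustrative article row
df = pd.DataFrame({"title": ["Aliens land in Ohio"],
                   "text": ["Witnesses report strange lights."]})

# Same combination as in fake_news_ia.py:30
df['full_text'] = df['title'].astype(str) + ' ' + df['text'].astype(str)
print(df.loc[0, 'full_text'])
# → Aliens land in Ohio Witnesses report strange lights.
```

The `astype(str)` casts guard against non-string cells (e.g. NaN) breaking the string concatenation.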

Data Cleaning

1. Remove Null Values

Drop rows with missing values in critical columns:
df.dropna(subset=['full_text', 'label'], inplace=True)
Reference: fake_news_ia.py:34
2. Fill Remaining Nulls

Replace any remaining null values with empty strings:
df.fillna('', inplace=True)
Reference: fake_news_ia.py:35
3. Verify Dataset

Check the total count and label distribution:
print(f"Total de noticias después de limpieza de nulos: {len(df)}")
print(df['label'].value_counts())
Reference: fake_news_ia.py:37-39

Text Preprocessing

After loading, the text undergoes NLP preprocessing to improve model performance:

Cleaning Function

The limpiar_texto() function performs the following operations:
def limpiar_texto(texto):
    # 1. Remove metadata/source patterns (e.g., WASHINGTON (REUTERS) - )
    texto = re.sub(r'([A-Z\s]+)\s*\((REUTERS|AP|AFP)\)\s*\-\s*', '', str(texto), flags=re.IGNORECASE)
    
    # 2. Convert to lowercase
    texto = str(texto).lower()
    
    # 3. Remove punctuation, numbers, and special characters
    texto = re.sub(r'[^a-z\s]', '', texto)
    
    # 4. Tokenize with split()
    tokens = texto.split()
    
    # 5. Filter stopwords and single-letter tokens
    tokens = [t for t in tokens if t not in stop_words and len(t) > 1]
    return " ".join(tokens)
Reference: fake_news_ia.py:54-69
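To see what the function produces, here is a self-contained sketch of the same pipeline. It substitutes a small hard-coded stopword set for NLTK's list so it runs without downloads; the real script uses the full NLTK English stopwords:

```python
import re

# Tiny stand-in for NLTK's English stopword list (assumption for this demo)
stop_words = {"the", "a", "an", "of", "to", "and", "in", "is"}

def limpiar_texto(texto):
    # Strip source prefixes like "WASHINGTON (Reuters) - "
    texto = re.sub(r'([A-Z\s]+)\s*\((REUTERS|AP|AFP)\)\s*\-\s*', '', str(texto),
                   flags=re.IGNORECASE)
    texto = str(texto).lower()
    texto = re.sub(r'[^a-z\s]', '', texto)   # drop punctuation, digits, symbols
    tokens = texto.split()
    return " ".join(t for t in tokens if t not in stop_words and len(t) > 1)

print(limpiar_texto("WASHINGTON (Reuters) - The president signed 3 bills!"))
# → president signed bills
```

Note how the source prefix, the digit, and the stopword "the" are all removed before the remaining tokens are rejoined.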

Applying Cleaning

Apply the cleaning function to the combined text:
df["clean_text"] = df["full_text"].apply(limpiar_texto)
Reference: fake_news_ia.py:72
Ensure NLTK stopwords are downloaded before running the preprocessing:
python3 -c 'import nltk; nltk.download("stopwords"); nltk.download("punkt")'

Dataset Statistics

After preparation, the dataset typically contains:
  • Total articles: ~44,000 (after null removal)
  • Train set: 80% (~35,200 articles)
  • Test set: 20% (~8,800 articles)
  • Label distribution: Approximately balanced between fake and real
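The 80/20 partition above can be sketched with scikit-learn's `train_test_split`; the exact call in fake_news_ia.py may differ, and the toy frame below is illustrative. Stratifying on the label keeps the fake/real balance identical in both partitions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the cleaned ~44,000-article dataset
df = pd.DataFrame({
    "clean_text": [f"article {i}" for i in range(100)],
    "label": ["fake", "real"] * 50,
})

X_train, X_test, y_train, y_test = train_test_split(
    df["clean_text"], df["label"],
    test_size=0.2,          # 20% held out for evaluation
    random_state=42,        # reproducible split
    stratify=df["label"],   # preserve the fake/real ratio in both sets
)
print(len(X_train), len(X_test))  # 80 20
```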

Error Handling

The dataset preparation includes error handling for missing files:
try:
    fake = pd.read_csv("Fake.csv")
    true = pd.read_csv("True.csv")
except FileNotFoundError:
    print("Error: Asegúrate de que los archivos 'Fake.csv' y 'True.csv' estén en la misma carpeta.")
    sys.exit()
Reference: fake_news_ia.py:42-44

Next Steps

After preparing the dataset, proceed to Text Vectorization to learn how the cleaned text is converted into numerical features.
