Overview
The fake news detector uses two Kaggle datasets of labeled fake and real news articles. Proper dataset preparation is essential to the model's reported 98.5% accuracy.
Dataset Sources
The training data comes from two CSV files:
Fake.csv - Contains fake news articles from Kaggle's Fake News dataset
True.csv - Contains real news articles from Kaggle's Real News dataset
Required CSV Structure
Each CSV file must contain the following fields:
title - The headline of the news article
text - The full body text of the article
Additional metadata fields - Optional; not used in training
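A quick way to confirm a downloaded CSV has the expected structure is to load it and check the columns. This is an illustrative sketch (the in-memory sample and the `subject`/`date` metadata columns are assumptions, not part of fake_news_ia.py):

```python
import io
import pandas as pd

# Stand-in for one of the Kaggle CSVs; real files have thousands of rows.
sample_csv = io.StringIO(
    "title,text,subject,date\n"
    "Example headline,Example body text,politics,2017-01-01\n"
)
df = pd.read_csv(sample_csv)

# Only 'title' and 'text' are used for training; extra columns are ignored.
required = {"title", "text"}
assert required <= set(df.columns)
```

Running this check before training surfaces a malformed or mislabeled CSV early, instead of failing later in the pipeline.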
Data Loading Process
Load Individual Datasets
Load both fake and real news datasets using pandas:
import pandas as pd
fake = pd.read_csv("Fake.csv")
true = pd.read_csv("True.csv")
Reference: fake_news_ia.py:19-20
Add Labels
Add label columns to distinguish fake from real news:
fake["label"] = "fake"
true["label"] = "real"
Reference: fake_news_ia.py:23-24
Concatenate Datasets
Merge both datasets while keeping only the necessary columns:
df = pd.concat(
    [fake[['title', 'text', 'label']],
     true[['title', 'text', 'label']]],
    ignore_index=True
)
Reference: fake_news_ia.py:27
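The label-and-concatenate steps above can be sketched end to end on toy data (the two-row frames are invented for illustration):

```python
import pandas as pd

# Miniature stand-ins for the Kaggle CSVs
fake = pd.DataFrame({"title": ["Fake headline"], "text": ["Fake body."]})
true = pd.DataFrame({"title": ["Real headline"], "text": ["Real body."]})

# Label each source before merging
fake["label"] = "fake"
true["label"] = "real"

# Merge, keeping only the columns used downstream
df = pd.concat(
    [fake[["title", "text", "label"]], true[["title", "text", "label"]]],
    ignore_index=True,  # reindex 0..n-1 instead of repeating each frame's index
)
print(df["label"].value_counts())
```

`ignore_index=True` matters here: without it the merged frame would carry duplicate index values from the two source frames.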
Key Feature Engineering
Critical Optimization: Combining title and text significantly improves model accuracy by providing more contextual information.
Combining Title and Text
The model concatenates the article title with its body text to create a richer feature set:
df['full_text'] = df['title'].astype(str) + ' ' + df['text'].astype(str)
Reference: fake_news_ia.py:30
This combination allows the model to learn from both the headline patterns (which often differ significantly between fake and real news) and the article content.
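On a toy row, the combination looks like this (sample data invented for illustration; the column names match the script):

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["Shocking claim revealed"],
    "text": ["The article body goes here."],
})

# astype(str) guards against non-string values (e.g. NaN) before joining
df["full_text"] = df["title"].astype(str) + " " + df["text"].astype(str)
print(df["full_text"][0])
# → Shocking claim revealed The article body goes here.
```

Note that `astype(str)` turns a missing value into the literal string "nan", which is one reason the later cleaning and null-handling steps still matter.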
Data Cleaning
Remove Null Values
Drop rows with missing values in critical columns:
df.dropna(subset=['full_text', 'label'], inplace=True)
Reference: fake_news_ia.py:34
Fill Remaining Nulls
Replace any remaining null values with empty strings:
df.fillna('', inplace=True)
Reference: fake_news_ia.py:35
Verify Dataset
Check the total count and label distribution (the print message is Spanish for "Total articles after null cleanup"):
print(f"Total de noticias después de limpieza de nulos: {len(df)}")
print(df['label'].value_counts())
Reference: fake_news_ia.py:37-39
Text Preprocessing
After loading, the text undergoes NLP preprocessing to improve model performance:
Cleaning Function
The limpiar_texto() function performs the following operations:
def limpiar_texto(texto):
    # 1. Remove metadata/source patterns (e.g., "WASHINGTON (REUTERS) - ")
    texto = re.sub(r'([A-Z\s]+)\s*\((REUTERS|AP|AFP)\)\s*\-\s*', '', str(texto), flags=re.IGNORECASE)
    # 2. Convert to lowercase
    texto = str(texto).lower()
    # 3. Remove punctuation, numbers, and special characters
    texto = re.sub(r'[^a-z\s]', '', texto)
    # 4. Tokenize with split()
    tokens = texto.split()
    # 5. Filter stopwords and single-letter tokens
    tokens = [t for t in tokens if t not in stop_words and len(t) > 1]
    return " ".join(tokens)
Reference: fake_news_ia.py:54-69
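A self-contained usage sketch of the cleaning function is shown below. A tiny hardcoded stopword set stands in for NLTK's English stopwords (which the real script loads), so the output differs slightly from production:

```python
import re

# Stand-in for nltk.corpus.stopwords.words("english")
stop_words = {"the", "a", "in", "of", "is"}

def limpiar_texto(texto):
    # Strip agency prefixes like "WASHINGTON (Reuters) - "
    texto = re.sub(r'([A-Z\s]+)\s*\((REUTERS|AP|AFP)\)\s*\-\s*', '', str(texto), flags=re.IGNORECASE)
    texto = str(texto).lower()
    texto = re.sub(r'[^a-z\s]', '', texto)  # keep only letters and whitespace
    tokens = texto.split()
    return " ".join(t for t in tokens if t not in stop_words and len(t) > 1)

print(limpiar_texto("WASHINGTON (Reuters) - The Senate voted 51-49 on Tuesday."))
# → senate voted on tuesday
```

Note how the agency prefix, the vote tally, and the punctuation are all removed before tokenization, leaving only lowercase word tokens.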
Applying Cleaning
Apply the cleaning function to the combined text:
df["clean_text"] = df["full_text"].apply(limpiar_texto)
Reference: fake_news_ia.py:72
Ensure NLTK stopwords are downloaded before running the preprocessing:
python3 -c 'import nltk; nltk.download("stopwords"); nltk.download("punkt")'
Dataset Statistics
After preparation, the dataset typically contains:
- Total articles: ~44,000 (after null removal)
- Train set: 80% (~35,200 articles)
- Test set: 20% (~8,800 articles)
- Label distribution: Approximately balanced between fake and real
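The 80/20 split can be illustrated with a pandas-only sketch (the actual script's split call is not shown in this section, and the toy frame below is invented; `random_state` is an assumed parameter for reproducibility):

```python
import pandas as pd

# Toy stand-in for the cleaned dataset (~44,000 rows in reality)
df = pd.DataFrame({
    "clean_text": [f"article {i}" for i in range(100)],
    "label": ["fake", "real"] * 50,
})

train = df.sample(frac=0.8, random_state=42)  # 80% for training
test = df.drop(train.index)                   # remaining 20% for evaluation
print(len(train), len(test))
# → 80 20
```

In practice a stratified split (e.g. scikit-learn's `train_test_split` with `stratify=df["label"]`) keeps the fake/real ratio identical in both sets.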
Error Handling
The dataset preparation includes error handling for missing files:
try:
    fake = pd.read_csv("Fake.csv")
    true = pd.read_csv("True.csv")
except FileNotFoundError:
    print("Error: Asegúrate de que los archivos 'Fake.csv' y 'True.csv' estén en la misma carpeta.")
    sys.exit()
The Spanish error message tells the user to make sure 'Fake.csv' and 'True.csv' are in the same folder as the script.
Reference: fake_news_ia.py:42-44
Next Steps
After preparing the dataset, proceed to Text Vectorization to learn how the cleaned text is converted into numerical features.