Overview
The training workflow implements a complete machine learning pipeline that loads raw CSV datasets, applies NLP preprocessing, trains a Logistic Regression classifier, and saves the trained artifacts for production use. The final model reaches approximately 98.5% accuracy on the held-out test set.
Training Pipeline Steps
1. Data Loading and Preparation
# Load separate datasets for fake and real news
fake = pd.read_csv("Fake.csv")
true = pd.read_csv("True.csv")
# Add binary labels
fake["label"] = "fake"
true["label"] = "real"
# Combine datasets
df = pd.concat(
    [fake[['title', 'text', 'label']], true[['title', 'text', 'label']]],
    ignore_index=True
)
# KEY IMPROVEMENT: Combine title and text for richer context
df['full_text'] = df['title'].astype(str) + ' ' + df['text'].astype(str)
# Clean null values
df.dropna(subset=['full_text', 'label'], inplace=True)
df.fillna('', inplace=True)
Dataset Structure: Expects two CSV files (Fake.csv and True.csv), each with title and text columns. The workflow adds the label column automatically.
2. Text Preprocessing
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))
# Apply limpiar_texto function to combined title + text
df["clean_text"] = df["full_text"].apply(limpiar_texto)
Applies the 5-step limpiar_texto cleaning pipeline to create the clean_text column used for training.
3. Vectorization and Data Splitting
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
X = df["clean_text"]
y = df["label"]
# Vectorize with optimized TF-IDF parameters
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X_tfidf = vectorizer.fit_transform(X)
# 80/20 train-test split with fixed random seed
X_train, X_test, y_train, y_test = train_test_split(
X_tfidf, y, test_size=0.2, random_state=42
)
- Limits the vocabulary to the 5000 terms with the highest frequency across the corpus, for efficiency and generalization
- Captures both unigrams (single words) and bigrams (word pairs) for a richer feature representation
- Reserves 20% of the data for testing model performance
- A fixed seed ensures reproducible train/test splits
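The reproducibility claim is easy to verify: with the same random_state, train_test_split returns identical partitions every run (toy data below, not the news corpus):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(10, 1)  # 10 toy samples
y = np.arange(10)

# Two independent calls with the same seed...
X_tr1, X_te1, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
X_tr2, X_te2, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)

# ...produce identical 80/20 splits
print((X_tr1 == X_tr2).all(), len(X_tr1), len(X_te1))  # True 8 2
```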
4. Model Training
from sklearn.linear_model import LogisticRegression
modelo = LogisticRegression(max_iter=1000, solver='liblinear', random_state=42)
modelo.fit(X_train, y_train)
Trains a Logistic Regression classifier with optimized hyperparameters. See LogisticRegression Configuration for parameter details.
5. Model Persistence
import joblib
joblib.dump(modelo, 'modelo_fake_news.pkl')
joblib.dump(vectorizer, 'vectorizer_tfidf.pkl')
Both artifacts must be saved: The trained model AND the fitted vectorizer are required for prediction. The vectorizer’s vocabulary must match the training data.
Saved Files:
- modelo_fake_news.pkl - Trained LogisticRegression classifier
- vectorizer_tfidf.pkl - Fitted TfidfVectorizer with learned vocabulary
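The save/load round trip can be sketched with a toy model (two made-up training sentences stand in for the real corpus; the file names match the ones saved above):

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the real training artifacts
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(["obvious hoax story", "verified factual report"])
modelo = LogisticRegression().fit(X, ["fake", "real"])

# Persist both artifacts, exactly as the training script does
joblib.dump(modelo, "modelo_fake_news.pkl")
joblib.dump(vectorizer, "vectorizer_tfidf.pkl")

# At prediction time: load both, and transform with the *fitted* vectorizer
# (never call fit_transform on new inputs -- that would rebuild the vocabulary)
modelo = joblib.load("modelo_fake_news.pkl")
vectorizer = joblib.load("vectorizer_tfidf.pkl")
pred = modelo.predict(vectorizer.transform(["verified factual report"]))[0]
print(pred)  # real
```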
6. Model Evaluation
from sklearn.metrics import accuracy_score, classification_report
y_pred = modelo.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
Expected Performance:
- Overall Accuracy: ~98.5%
- High precision and recall for both “fake” and “real” classes
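A confusion matrix (not part of the original script, but a natural companion to classification_report) shows where the residual errors fall. The labels below are toy values chosen to illustrate the layout:

```python
from sklearn.metrics import confusion_matrix

# Rows = true class, columns = predicted class, in the order given by `labels`
y_true = ["fake", "fake", "real", "real"]
y_pred = ["fake", "real", "real", "real"]
cm = confusion_matrix(y_true, y_pred, labels=["fake", "real"])
print(cm)  # [[1 1]
           #  [0 2]]  -> one fake article misclassified as real
```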
Complete Training Script
The full training workflow is implemented in fake_news_ia.py. To execute:
# Ensure the NLTK stopword list is installed
python3 -c 'import nltk; nltk.download("stopwords")'
# Run the training script
python3 fake_news_ia.py
Required Dependencies
import pandas as pd
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import re
import joblib
Output Artifacts
- modelo_fake_news.pkl: Serialized LogisticRegression model trained on 80% of the dataset. Contains the learned feature weights for fake/real classification.
- vectorizer_tfidf.pkl: Serialized TfidfVectorizer with its fitted 5000-feature vocabulary and IDF weights. Required for transforming new text inputs.
Key Design Decisions
Title + Text Concatenation
df['full_text'] = df['title'].astype(str) + ' ' + df['text'].astype(str)
Combining the article title with body text provides richer semantic context and improves classification accuracy.
Bi-gram Features
Capturing word pairs (e.g., “federal reserve”, “climate change”) helps the model learn multi-word phrases that are strong indicators of real or fake news.
Feature Limit
Balances model expressiveness with computational efficiency. Prevents overfitting while capturing the most discriminative vocabulary.
See Also