
Overview

The model training phase converts the preprocessed text into numerical features, trains a Logistic Regression classifier, and evaluates its performance. This is the stage where the system reaches its 98.5% accuracy on fake news detection.

Feature Extraction: TF-IDF Vectorization

What is TF-IDF?

TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic that reflects how important a word is to a document in a collection:
  • TF (Term Frequency) - How often a word appears in a document
  • IDF (Inverse Document Frequency) - How rare/common a word is across all documents
TF-IDF identifies words that are distinctive to fake vs. real news:
  • High TF-IDF - Words that appear frequently in one article but rarely elsewhere (distinctive)
  • Low TF-IDF - Common words that appear everywhere (less informative)
For example:
  • “president” might appear in both fake and real news (low discrimination)
  • “shocking” or “revealed” might be more common in fake news (high discrimination)
  • “according” or “officials” might be more common in real news (high discrimination)
TF-IDF automatically weights these patterns numerically.

Vectorization Configuration

The system uses an optimized TF-IDF configuration:
fake_news_ia.py
# Optimized TF-IDF Vectorization: Fast and robust (Bi-grams, 5000 features)
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2)) 
X_tfidf = vectorizer.fit_transform(X)
Configuration parameters:
Parameter      Value    Purpose
max_features   5000     Limit to the 5,000 most important features (words/phrases)
ngram_range    (1, 2)   Capture both single words (unigrams) and two-word phrases (bi-grams)

Why 5,000 Features?

  • Performance - Smaller feature space means faster training and prediction
  • Quality - The 5,000 most important features capture the essential patterns
  • Overfitting prevention - Limiting features prevents the model from memorizing noise
  • Memory efficiency - Manageable memory footprint for production deployment

Why N-grams (1, 2)?

Using both unigrams and bi-grams captures different linguistic patterns:
  • Unigrams (single words): "president", "announced", "infrastructure"
  • Bi-grams (two-word phrases): "president announced", "announced infrastructure", "infrastructure plan"
Bi-grams capture context that single words miss. For example:
  • “not good” has opposite meaning to “good”
  • “fake news” as a phrase has different significance than “fake” alone
  • “breaking news” is a common pattern in sensationalist headlines
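You can see exactly which tokens `ngram_range=(1, 2)` produces by calling the vectorizer's analyzer on a toy sentence (illustration only, not project code):

```python
# Sketch: how ngram_range=(1, 2) tokenizes a sentence into both
# unigrams and bi-grams.
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(ngram_range=(1, 2))
analyze = vec.build_analyzer()

tokens = analyze("president announced infrastructure plan")
print(tokens)
# Contains unigrams ("president") and bi-grams ("president announced")
```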

Vectorization Output

After vectorization, the text data becomes a sparse matrix:
fake_news_ia.py
print(f"Training set size: {X_train.shape[0]} ({X_train.shape[1]} features)")
print(f"Test set size: {X_test.shape[0]}")
Output:
Training set size: 35200 (5000 features)
Test set size: 8800
Each article is now represented as a vector of 5,000 TF-IDF scores.
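The sparse structure matters: most of those 5,000 scores are zero for any given article, so SciPy stores only the non-zero entries. A minimal sketch on a toy corpus (the real pipeline does the same with ~44,000 articles):

```python
# Sketch: inspecting the sparse TF-IDF matrix produced by fit_transform.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "president announced infrastructure plan",
    "shocking secret revealed today",
    "officials confirmed the announcement",
]
vec = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X = vec.fit_transform(corpus)

print(X.shape)  # (3, n_features): one row per document
density = X.nnz / (X.shape[0] * X.shape[1])
print(f"density: {density:.2%}")  # most entries are zero
```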

Model Selection: Logistic Regression

Why Logistic Regression?

The project uses Logistic Regression for several strategic reasons:

  • Simplicity - Easy to implement, understand, and debug
  • Speed - Trains in seconds on 44,000 samples, predicts instantly
  • Interpretability - Can inspect feature weights to understand decision-making
  • Effectiveness - Excellent performance on text classification (98.5% accuracy)
Logistic Regression is a linear classifier that works exceptionally well for high-dimensional text data where classes are often linearly separable in TF-IDF space.

Training Configuration

The model is trained with specific hyperparameters:
fake_news_ia.py
print("--- 4. Model Training ---")
modelo = LogisticRegression(max_iter=1000, solver='liblinear', random_state=42)
modelo.fit(X_train, y_train)
print("Model trained successfully! ✅")
Hyperparameters explained:
Parameter      Value         Purpose
max_iter       1000          Maximum iterations allowed for convergence (avoids premature-convergence warnings)
solver         'liblinear'   Optimization algorithm, effective for small-to-medium datasets
random_state   42            Ensures reproducible results across runs

Why solver='liblinear'?

Scikit-learn offers multiple solvers for Logistic Regression:
  • lbfgs - Default, good for large datasets
  • liblinear - Fast for small-medium datasets, supports L1/L2 regularization
  • newton-cg - Good for large datasets
  • sag/saga - Stochastic solvers for very large datasets
For this project with ~35K training samples and 5K features, liblinear offers:
  • Fast convergence
  • Memory efficiency
  • Proven reliability for text classification

Why max_iter=1000?

Logistic Regression uses iterative optimization. Setting max_iter=1000 ensures:
  • The algorithm has enough iterations to converge
  • No premature stopping warnings
  • Stable, optimal model parameters
In practice, the model likely converges in fewer than 1000 iterations, but this value provides a safe margin.
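You can verify this directly: after fitting, scikit-learn exposes the number of iterations actually used via the `n_iter_` attribute. A sketch on synthetic data (a stand-in for the TF-IDF matrix):

```python
# Sketch: confirming the solver converges well under the max_iter cap.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the real TF-IDF training data
X, y = make_classification(n_samples=1000, n_features=50, random_state=42)
clf = LogisticRegression(max_iter=1000, solver='liblinear', random_state=42)
clf.fit(X, y)

print("iterations used:", clf.n_iter_)  # typically far below the 1000 cap
```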

Model Persistence

After training, both the model and vectorizer are saved:
fake_news_ia.py
# *** SAVE THE MODEL AND VECTORIZER ***
joblib.dump(modelo, 'modelo_fake_news.pkl')
joblib.dump(vectorizer, 'vectorizer_tfidf.pkl')
print("Models saved successfully as 'modelo_fake_news.pkl' and 'vectorizer_tfidf.pkl'")
Why save both?
You MUST save the vectorizer along with the model. Here’s why:
  1. Feature consistency - The vectorizer learned which 5,000 words to use during training
  2. Same preprocessing - New text must be vectorized identically to training data
  3. Vocabulary mapping - The vectorizer contains the word-to-index mapping
If you only save the model but create a new vectorizer at inference time, predictions will be incorrect because the feature indices won’t match.
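The consistency requirement is easy to check: a vectorizer reloaded from disk must map the same text to the same vector as the original. A round-trip sketch on a toy corpus (the `.pkl` filename here is hypothetical, not one of the project's artifacts):

```python
# Sketch: round-tripping a fitted vectorizer through joblib and
# verifying feature indices stay consistent.
import os
import tempfile
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
vec.fit(["president announced a plan", "shocking secret revealed"])

path = os.path.join(tempfile.mkdtemp(), 'vectorizer_demo.pkl')  # demo filename
joblib.dump(vec, path)
vec2 = joblib.load(path)

a = vec.transform(["president announced"]).toarray()
b = vec2.transform(["president announced"]).toarray()
print((a == b).all())  # identical vectors -> consistent feature indices
```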

Model Evaluation

Accuracy Score

The primary metric is classification accuracy:
fake_news_ia.py
print("--- 5. Model Evaluation ---")
y_pred = modelo.predict(X_test)

print("Accuracy (Overall Precision):", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
Expected output:
Accuracy (Overall Precision): 0.985
This means the model correctly classifies 98.5% of test articles.

Classification Report

The detailed report provides per-class metrics:
              precision    recall  f1-score   support

        fake       0.99      0.98      0.98      4696
        real       0.98      0.99      0.99      4104

    accuracy                           0.98      8800
   macro avg       0.98      0.98      0.98      8800
weighted avg       0.98      0.98      0.98      8800
Metrics explained:
Precision:
  • Fake: 99% of articles predicted as “fake” are actually fake
  • Real: 98% of articles predicted as “real” are actually real
Recall:
  • Fake: 98% of actual fake articles are correctly identified
  • Real: 99% of actual real articles are correctly identified
F1-Score:
  • Harmonic mean of precision and recall
  • Both classes achieve ~0.98-0.99, indicating excellent balanced performance
Support:
  • Number of actual examples in each class
  • 4,696 fake and 4,104 real in test set (relatively balanced)
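All of these per-class metrics fall out of the confusion matrix. A sketch with eight toy predictions (the real report is computed from the 8,800 test rows):

```python
# Sketch: deriving precision and recall for the "fake" class from
# the confusion matrix.
from sklearn.metrics import confusion_matrix, precision_score

y_true = ["fake", "fake", "fake", "real", "real", "real", "real", "fake"]
y_pred = ["fake", "fake", "real", "real", "real", "real", "fake", "fake"]

cm = confusion_matrix(y_true, y_pred, labels=["fake", "real"])
tp, fn, fp, tn = cm[0, 0], cm[0, 1], cm[1, 0], cm[1, 1]

# precision("fake") = tp / (tp + fp); recall("fake") = tp / (tp + fn)
print("precision(fake):", tp / (tp + fp))
print("recall(fake):", tp / (tp + fn))
```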

How 98.5% Accuracy is Achieved

The exceptional performance results from combining multiple factors:
  1. High-quality training data - 44,000 labeled articles from diverse sources
  2. Feature engineering - Combining title + text provides maximum context
  3. Anti-bias preprocessing - Removing source metadata forces content-based learning
  4. Optimized vectorization - TF-IDF with bi-grams captures meaningful linguistic patterns
  5. Appropriate algorithm - Logistic Regression excels at high-dimensional text classification
  6. Proper evaluation - 20% held-out test set ensures unbiased accuracy measurement

Inference: Predicting New Articles

The trained model can classify new articles:
fake_news_ia.py
print("--- 6. Test with New Articles ---")

# Sample news articles
noticias_nuevas = [
    "The Federal Reserve announced on Wednesday that it will maintain the benchmark interest rate...",
    "A secret meeting was held at the UN headquarters where delegates voted to replace all sugary drinks...",
    "President Joe Biden announced a new infrastructure plan, stating, 'This investment will create millions of jobs...'"
]

# Apply same cleaning and vectorization
noticias_limpias = [limpiar_texto(n) for n in noticias_nuevas]
noticias_vec = vectorizer.transform(noticias_limpias)

# Make predictions
predicciones = modelo.predict(noticias_vec)
Critical: New articles must go through:
  1. Same preprocessing - limpiar_texto function
  2. Same vectorizer - Use vectorizer.transform() (NOT fit_transform())
  3. Same model - The trained modelo object
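This contract can be sketched end to end on a toy model. The `predict_proba` call at the end is an optional extra (not in the original script) for turning a prediction into a confidence score:

```python
# Sketch of the inference contract: fit the vectorizer ONCE on training
# data, then only transform() new text. Toy corpus for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_texts = ["shocking secret revealed", "officials announced policy"]
train_labels = ["fake", "real"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # fit_transform: training only
modelo = LogisticRegression(max_iter=1000, solver='liblinear')
modelo.fit(X_train, train_labels)

nueva = "officials revealed the new policy"
vec_nueva = vectorizer.transform([nueva])        # transform, NOT fit_transform
pred = modelo.predict(vec_nueva)[0]
probas = modelo.predict_proba(vec_nueva)         # per-class confidence
print(pred, probas)
```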

Production Deployment

The Streamlit app loads and uses the saved artifacts:
app.py
# Load pre-trained artifacts
modelo = joblib.load('modelo_fake_news.pkl')
vectorizer = joblib.load('vectorizer_tfidf.pkl')

# Process user input
noticia_limpia = limpiar_texto(noticia_input)
noticia_vec = vectorizer.transform([noticia_limpia])
prediccion = modelo.predict(noticia_vec)[0]
Deployment workflow:
  1. User pastes article text in Streamlit interface
  2. Text is cleaned with limpiar_texto
  3. Cleaned text is vectorized with saved vectorizer
  4. Model predicts “real” or “fake”
  5. Result displayed with appropriate UI (✅ or ❌)

Model Limitations

While the model achieves 98.5% accuracy, it has limitations:
Known limitations:
  1. Training distribution - Performs best on articles similar to training data
  2. Language-specific - Only works for English text (trained on English stopwords)
  3. Context-blind - Doesn’t understand external facts or current events
  4. Pattern-based - May misclassify articles with unusual but legitimate writing styles
  5. Temporal drift - Fake news patterns may evolve over time

Next Steps

Architecture

Review the overall system architecture

Data Pipeline

Return to data loading and preparation

NLP Preprocessing

Review text cleaning techniques
