Skip to main content

Overview

The fake news detector uses scikit-learn’s LogisticRegression classifier with carefully tuned hyperparameters to achieve 98.5% accuracy. Logistic Regression was chosen for its interpretability, training efficiency, and strong performance on text classification tasks with TF-IDF features.

Model Initialization

fake_news_ia.py:95
from sklearn.linear_model import LogisticRegression

modelo = LogisticRegression(max_iter=1000, solver='liblinear', random_state=42)

Hyperparameters

max_iter
int
default:"1000"
Maximum number of iterations for the solver to converge.Why 1000? The default value (100) is often insufficient for high-dimensional TF-IDF features (5000 features). Setting max_iter=1000 ensures the optimization algorithm has enough iterations to reach convergence, preventing warnings and suboptimal performance.
solver
string
default:"'liblinear'"
Optimization algorithm used to find model parameters.Why ‘liblinear’? This solver is specifically optimized for small to medium-sized datasets and performs excellently with high-dimensional sparse features like TF-IDF matrices. It’s faster than alternatives (‘lbfgs’, ‘saga’) for datasets of this size (~40,000 samples) and supports L1, L2, and elastic-net regularization.Alternatives:
  • 'lbfgs' - Better for large datasets but slower with sparse features
  • 'saga' - Supports L1 penalty but slower convergence
  • 'newton-cg' - Only supports L2 penalty
random_state
int
default:"42"
Seed for random number generation to ensure reproducibility.Why 42? This arbitrary but conventional seed ensures that:
  • Random initialization of weights is consistent across runs
  • Results are reproducible for debugging and comparison
  • Other team members get identical results with the same data

Default Parameters (Not Explicitly Set)

The following parameters use scikit-learn defaults but are important for understanding model behavior:
penalty
string
default:"'l2'"
Regularization type. The default L2 penalty (ridge regression) prevents overfitting by penalizing large coefficient values.
C
float
default:"1.0"
Inverse of regularization strength. Smaller values specify stronger regularization. The default value (1.0) provides moderate regularization suitable for most text classification tasks.
class_weight
string or dict
default:"None"
Weights associated with classes. None means all classes have weight 1. The dataset is balanced (roughly equal fake/real samples), so no class weighting is needed.
multi_class
string
default:"'auto'"
Multi-class strategy. With ‘liblinear’ solver and binary classification, this automatically uses one-vs-rest.

Training the Model

fake_news_ia.py:96
modelo.fit(X_train, y_train)
Inputs:
  • X_train: Sparse TF-IDF matrix of shape (n_samples, 5000) from the training set
  • y_train: Binary labels ("fake" or "real") for training samples
Output:
  • Fitted LogisticRegression object with learned feature weights

Model Performance

fake_news_ia.py:107-110
y_pred = modelo.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(classification_report(y_test, y_pred))
Expected Metrics:
  • Accuracy: ~98.5%
  • Precision (fake): ~98-99%
  • Precision (real): ~98-99%
  • Recall (fake): ~98-99%
  • Recall (real): ~98-99%
  • F1-score: ~98.5% for both classes

Making Predictions

# Binary classification
prediction = modelo.predict(X_new)  # Returns array: ['fake'] or ['real']

# Probability scores
probabilities = modelo.predict_proba(X_new)  # Returns [[p_fake, p_real]]

Why Logistic Regression?

Advantages for Fake News Detection

  1. Interpretability: Feature coefficients directly indicate which words/phrases are most indicative of fake vs. real news
  2. Speed: Trains in seconds even on ~40,000 articles; predictions are near-instantaneous
  3. Low Resource Requirements: Minimal memory footprint; runs on standard hardware
  4. Robust to High Dimensionality: Performs well with sparse TF-IDF features (5000 dimensions)
  5. Probabilistic Output: predict_proba() provides confidence scores for predictions

Comparison with Alternatives

ModelAccuracyTraining TimeInterpretability
Logistic Regression98.5%FastHigh
Naive Bayes~95%Very FastHigh
Random Forest~97%SlowMedium
SVM (Linear)~98%MediumLow
Neural Networks~98-99%Very SlowVery Low
Logistic Regression provides the best balance of accuracy, speed, and interpretability for this application.

Feature Importance

Access learned feature weights to understand model decisions:
# Get feature names from vectorizer
feature_names = vectorizer.get_feature_names_out()

# Get model coefficients
coefficients = modelo.coef_[0]

# Find top predictive features for FAKE news (negative weights)
fake_indices = coefficients.argsort()[:20]
print("Top fake news indicators:")
for idx in fake_indices:
    print(f"{feature_names[idx]}: {coefficients[idx]:.4f}")

# Find top predictive features for REAL news (positive weights)
real_indices = coefficients.argsort()[-20:][::-1]
print("\nTop real news indicators:")
for idx in real_indices:
    print(f"{feature_names[idx]}: {coefficients[idx]:.4f}")

Hyperparameter Tuning (Optional)

To optimize hyperparameters for your specific dataset:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.1, 1.0, 10.0],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']
}

grid_search = GridSearchCV(
    LogisticRegression(max_iter=1000, random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy'
)

grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation accuracy: {grid_search.best_score_:.4f}")

Official Documentation

For complete API reference and additional parameters: scikit-learn LogisticRegression Documentation

See Also

Build docs developers (and LLMs) love