
Overview

While the baseline model achieves 98.5% accuracy, you may want to customize the system for:
  • Different types of content (social media posts, blog articles, etc.)
  • Domain-specific fake news detection (health, finance, politics)
  • Experimentation with new features and techniques
  • Performance optimization for your specific use case
This guide shows you how to extend and customize the fake news detector.

Customizing Text Preprocessing

The limpiar_texto function (fake_news_ia.py:54-69) is the foundation of the pipeline. Here’s how to extend it:

Current Preprocessing Pipeline

def limpiar_texto(texto):
    # 1. Remove metadata
    texto = re.sub(r'([A-Z\s]+)\s*\((REUTERS|AP|AFP)\)\s*\-\s*', '', str(texto), flags=re.IGNORECASE)
    
    # 2. Lowercase
    texto = str(texto).lower()
    
    # 3. Remove punctuation and numbers
    texto = re.sub(r'[^a-z\s]', '', texto) 
    
    # 4. Tokenize
    tokens = texto.split() 

    # 5. Remove stopwords and short tokens
    tokens = [t for t in tokens if t not in stop_words and len(t) > 1]
    return " ".join(tokens)

Extension 1: Preserve Capitalization Patterns

Fake news often has unusual capitalization (“BREAKING NEWS!!!”, “SHOCKING Discovery”):
def limpiar_texto_extended(texto):
    # NEW: Count capitalization before lowercasing
    all_caps_words = len(re.findall(r'\b[A-Z]{2,}\b', texto))
    
    # 1. Remove metadata
    texto = re.sub(r'([A-Z\s]+)\s*\((REUTERS|AP|AFP)\)\s*\-\s*', '', str(texto), flags=re.IGNORECASE)
    
    # 2. Lowercase
    texto = str(texto).lower()
    
    # Add capitalization marker if excessive (lowercase, so the
    # [^a-z\s] filter below doesn't strip it out)
    if all_caps_words > 3:
        texto = "capsmarker " + texto
    
    # ... rest of pipeline
    texto = re.sub(r'[^a-z\s]', '', texto)
    tokens = texto.split()
    tokens = [t for t in tokens if t not in stop_words and len(t) > 1]
    return " ".join(tokens)
This adds a special token when articles have excessive capitalization.

Extension 2: Preserve Exclamation Marks

Fake news often uses sensational punctuation:
def limpiar_texto_with_emphasis(texto):
    # Count exclamation marks before removing punctuation
    exclamation_count = texto.count('!')
    
    # Standard preprocessing
    texto = re.sub(r'([A-Z\s]+)\s*\((REUTERS|AP|AFP)\)\s*\-\s*', '', str(texto), flags=re.IGNORECASE)
    texto = str(texto).lower()
    
    # Add emphasis marker
    if exclamation_count > 2:
        texto = "emphasismarker " + texto
    
    texto = re.sub(r'[^a-z\s]', '', texto)
    tokens = texto.split()
    tokens = [t for t in tokens if t not in stop_words and len(t) > 1]
    return " ".join(tokens)

Extension 3: Add Stemming or Lemmatization

Reduce words to their root forms:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def limpiar_texto_stemmed(texto):
    texto = re.sub(r'([A-Z\s]+)\s*\((REUTERS|AP|AFP)\)\s*\-\s*', '', str(texto), flags=re.IGNORECASE)
    texto = str(texto).lower()
    texto = re.sub(r'[^a-z\s]', '', texto)
    tokens = texto.split()
    
    # Remove stopwords
    tokens = [t for t in tokens if t not in stop_words and len(t) > 1]
    
    # NEW: Apply stemming
    tokens = [stemmer.stem(t) for t in tokens]
    
    return " ".join(tokens)
Stemming can reduce vocabulary size and may improve performance, but test it on your specific dataset; it sometimes reduces accuracy by collapsing meaningful distinctions.
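To see both the benefit and the risk in miniature, run the stemmer on a few words. Note that "universal" and "university" collapse to the same stem, which is exactly the kind of lost distinction the caveat above refers to:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Stemming shrinks vocabulary ("running" -> "run"), but it can also merge
# unrelated words ("universal" and "university" share a stem)
for word in ["running", "policies", "announced", "universal", "university"]:
    print(word, "->", stemmer.stem(word))
```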

Customizing TF-IDF Vectorization

The vectorizer (fake_news_ia.py:82) has several tunable parameters:
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))

Experiment with N-gram Range

N-grams capture multi-word phrases:
Configuration       | Captures                      | Example Features
ngram_range=(1, 1)  | Single words only             | "president", "announced", "policy"
ngram_range=(1, 2)  | Current: words + bi-grams     | "president", "president announced"
ngram_range=(1, 3)  | Words + bi-grams + tri-grams  | "president", "president announced policy"
ngram_range=(2, 2)  | Bi-grams only                 | "president announced", "new policy"
Try tri-grams:
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 3))
Larger n-gram ranges increase feature dimensionality and training time. Start with (1, 2) and only increase if you have sufficient training data (50k+ examples).
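To get a feel for the growth, count the vocabulary a TfidfVectorizer builds at each range. The three short "headlines" below are purely illustrative stand-ins, not taken from the dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus to illustrate how the feature count grows with the n-gram range
corpus = [
    "president announced new policy today",
    "president announced tax policy changes",
    "shocking secret policy revealed today",
]

feature_counts = {}
for ngram_range in [(1, 1), (1, 2), (1, 3)]:
    vec = TfidfVectorizer(ngram_range=ngram_range)
    feature_counts[ngram_range] = vec.fit_transform(corpus).shape[1]
    print(ngram_range, "->", feature_counts[ngram_range], "features")
```

Even on three five-word texts the feature count triples between (1, 1) and (1, 3); on 40k articles the blow-up is far larger, which is why max_features matters.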

Adjust Maximum Features

The max_features=5000 parameter limits vocabulary size:
# Smaller vocabulary (faster, less memory, may reduce accuracy)
vectorizer = TfidfVectorizer(max_features=3000, ngram_range=(1, 2))

# Larger vocabulary (slower, more memory, may improve accuracy)
vectorizer = TfidfVectorizer(max_features=10000, ngram_range=(1, 2))

# No limit (use entire vocabulary - not recommended for large datasets)
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
Impact on the model:
  • More features: Model can capture more nuanced patterns, but risks overfitting
  • Fewer features: Faster training, lower memory, but may miss subtle signals
The current limit of 5,000 features is a well-tested balance. Only change it if you have a specific reason (e.g., limited memory, or a specialized domain with a small vocabulary).

Add Min/Max Document Frequency

Filter out very rare or very common terms:
vectorizer = TfidfVectorizer(
    max_features=5000,
    ngram_range=(1, 2),
    min_df=2,      # Ignore terms appearing in fewer than 2 documents
    max_df=0.95    # Ignore terms appearing in more than 95% of documents
)
This can improve robustness: the min_df cutoff removes typos and one-off terms, while the max_df cutoff drops near-universal terms that carry little signal.
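A quick way to see the filtering in action: with min_df=2 on the toy corpus below, every term that appears in only one document (including the deliberate misspelling) is dropped from the vocabulary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "president announced policy",
    "president announced rates",
    "president shoking typo",  # deliberate one-off misspelling and rare terms
]

vec_all = TfidfVectorizer().fit(corpus)
vec_filtered = TfidfVectorizer(min_df=2).fit(corpus)  # keep terms in >= 2 docs

print(sorted(vec_all.vocabulary_))       # every term survives
print(sorted(vec_filtered.vocabulary_))  # only 'announced' and 'president'
```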

Alternative Classifiers

The current model uses Logistic Regression (fake_news_ia.py:95):
modelo = LogisticRegression(max_iter=1000, solver='liblinear', random_state=42)

Why Logistic Regression Works Well

  • Fast training: Trains in seconds even on 40k+ articles
  • Interpretable: Feature weights show which words indicate fake/real
  • Low memory: Small model size (~5MB for 5,000 features)
  • Excellent baseline: Often achieves 95%+ accuracy on text classification
  • Low overfitting risk: Regularized linear models generalize well with sparse TF-IDF features

Experiment with Other Classifiers

1. Random Forest

Can capture non-linear patterns:
from sklearn.ensemble import RandomForestClassifier

modelo = RandomForestClassifier(
    n_estimators=100,
    max_depth=50,
    random_state=42,
    n_jobs=-1  # Use all CPU cores
)
modelo.fit(X_train, y_train)
Pros: May improve accuracy by 0.5-1%
Cons: Slower training, larger model size, less interpretable

2. Naive Bayes

Very fast, works well with text:
from sklearn.naive_bayes import MultinomialNB

modelo = MultinomialNB(alpha=1.0)
modelo.fit(X_train, y_train)
Pros: Extremely fast training, small model
Cons: Usually 1-2% lower accuracy than Logistic Regression

3. Support Vector Machine (SVM)

Powerful for high-dimensional text data:
from sklearn.svm import LinearSVC

modelo = LinearSVC(max_iter=1000, random_state=42)
modelo.fit(X_train, y_train)
Pros: Often matches Logistic Regression accuracy
Cons: No predict_proba method (wrap it in scikit-learn's CalibratedClassifierCV if you need probabilities), slower training on large datasets

4. Gradient Boosting (XGBoost)

State-of-the-art ensemble method:
import xgboost as xgb

modelo = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    random_state=42
)
modelo.fit(X_train, y_train)
Pros: May achieve the highest accuracy
Cons: Requires a separate installation (pip install xgboost), slower training, larger model
For this dataset, Logistic Regression already achieves 98.5% accuracy. Only switch classifiers if you have a specific need (e.g., squeezing out an extra 0.5% of accuracy) or are experimenting for learning purposes.
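If you do experiment, the simplest way to decide is to benchmark the candidates side by side on the same split. A sketch using a trivially separable stand-in corpus (XGBoost is omitted since it needs a separate install); substitute your real X_train/X_test:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Trivially separable stand-in corpus; swap in your own texts and labels
texts = ["shocking secret plot revealed"] * 10 + ["reserve announced steady rates"] * 10
labels = [1] * 10 + [0] * 10

X = TfidfVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, random_state=42, stratify=labels
)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000, solver='liblinear', random_state=42),
    "MultinomialNB": MultinomialNB(alpha=1.0),
    "LinearSVC": LinearSVC(max_iter=1000, random_state=42),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1),
}

scores = {}
for name, clf in candidates.items():
    clf.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: {scores[name]:.3f}")
```

On this toy corpus every classifier is perfect; the comparison only becomes informative on your real data.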

Hyperparameter Tuning

Optimize the Logistic Regression parameters:
from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'C': [0.1, 1.0, 10.0],           # Regularization strength
    'solver': ['liblinear', 'lbfgs'],
    'max_iter': [500, 1000, 1500]
}

# Grid search with cross-validation
grid_search = GridSearchCV(
    LogisticRegression(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)
print("Best CV accuracy:", grid_search.best_score_)

# Use best model
modelo = grid_search.best_estimator_

Adding Custom Features

Combine TF-IDF with hand-crafted features:
import numpy as np
from scipy.sparse import hstack

def extract_custom_features(texts):
    """Extract readability and style features"""
    features = []
    
    for text in texts:
        # Average word length
        words = text.split()
        avg_word_len = np.mean([len(w) for w in words]) if words else 0
        
        # Sentence count (approximate)
        sentence_count = text.count('.') + text.count('!') + text.count('?')
        
        # Exclamation mark ratio
        exclamation_ratio = text.count('!') / max(len(text), 1)
        
        features.append([avg_word_len, sentence_count, exclamation_ratio])
    
    return np.array(features)

# Get TF-IDF features
X_tfidf = vectorizer.fit_transform(df["clean_text"])

# Get custom features
X_custom = extract_custom_features(df["full_text"])

# Combine both
X_combined = hstack([X_tfidf, X_custom])

# Train on combined features
X_train, X_test, y_train, y_test = train_test_split(
    X_combined, df["label"], test_size=0.2, random_state=42
)

modelo.fit(X_train, y_train)
If you add custom features, you must compute the same features in production and ensure they’re in the correct order when combining with TF-IDF vectors.
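One way to keep training and serving consistent is to route every prediction through a single helper that applies the transforms in a fixed order. A sketch (clean_fn and custom_fn stand in for the limpiar_texto and extract_custom_features functions from this guide; they are passed in explicitly so the helper is self-contained and testable):

```python
from scipy.sparse import hstack

def predict_news(raw_texts, vectorizer, modelo, clean_fn, custom_fn):
    """Reproduce the training-time feature pipeline exactly."""
    clean = [clean_fn(t) for t in raw_texts]
    X_tfidf = vectorizer.transform(clean)   # transform only -- never refit
    X_custom = custom_fn(raw_texts)         # same custom-feature code as training
    X = hstack([X_tfidf, X_custom])         # same column order as training
    return modelo.predict(X)
```

Because the helper is the only prediction path, the column order and feature code can never silently drift from what the model was trained on.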

Testing Your Changes

Always validate modifications:
# After training your modified model
y_pred = modelo.predict(X_test)

from sklearn.metrics import accuracy_score, classification_report

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Test on specific examples
test_news = [
    "BREAKING: Secret government plot revealed!!!",
    "The Federal Reserve announced interest rates will remain steady."
]

test_clean = [limpiar_texto(n) for n in test_news]
test_vec = vectorizer.transform(test_clean)
predictions = modelo.predict(test_vec)

for news, pred in zip(test_news, predictions):
    print(f"\n{news[:60]}...")
    print(f"Predicted: {pred}")

Deployment Considerations

When deploying custom models:
  1. Save all components: If you add custom features, save the feature extractor too
  2. Document changes: Keep a changelog of modifications
  3. Version control: Use git to track changes to preprocessing and model code
  4. A/B testing: Run the new model alongside the baseline and compare performance
  5. Monitor performance: Track accuracy on real-world data over time
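Point 1 above can be sketched with joblib, which ships as a scikit-learn dependency. The artifacts and filenames below are illustrative toy stand-ins for the real trained pipeline:

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the trained artifacts
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(["shocking secret plot", "reserve announced rates"])
modelo = LogisticRegression(max_iter=1000).fit(X, [1, 0])

# Save every component the prediction path depends on (filenames illustrative)
joblib.dump(vectorizer, "vectorizer.joblib")
joblib.dump(modelo, "modelo.joblib")

# Reload (e.g., in the production service) and sanity-check the round trip
vec_loaded = joblib.load("vectorizer.joblib")
modelo_loaded = joblib.load("modelo.joblib")
print(modelo_loaded.predict(vec_loaded.transform(["shocking secret plot"])))
```

If you add custom features, persist the feature-extraction code (or its parameters) alongside these files so the serving side can rebuild the exact same feature matrix.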

Key Takeaways

  • The preprocessing function is highly extensible for domain-specific needs
  • TF-IDF parameters (n-grams, max_features) significantly impact performance
  • Logistic Regression is an excellent baseline - only change if you have a specific reason
  • Always validate changes on a held-out test set
  • Custom features can boost performance but add complexity
  • Document and version all customizations

Ready to contribute? Visit the GitHub repository to submit your improvements to the project.
