Overview
Natural Language Processing (NLP) preprocessing transforms raw news article text into clean, normalized tokens that the machine learning model can analyze effectively. This stage is critical to achieving the 98.5% accuracy: it removes noise and focuses the model on meaningful content.

The limpiar_texto Function
The core of the NLP pipeline is the limpiar_texto (clean text) function, which applies five sequential transformations:
fake_news_ia.py
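The snippet itself isn't reproduced in this extract. Below is a minimal sketch of the five stages described in this section, assuming a simplified dateline regex and a small illustrative stopword subset; the real function in fake_news_ia.py may differ in detail.

```python
import re

# Illustrative subset only; the project loads NLTK's full English stopword list.
STOPWORDS = {"the", "is", "at", "of", "in", "a", "an", "and", "to", "on"}

def limpiar_texto(texto: str) -> str:
    # 1. Strip a leading agency dateline like "WASHINGTON (REUTERS) - "
    texto = re.sub(r"^[A-Z][A-Z\s]+\((?:REUTERS|AP|AFP)\)\s*-\s*", "", texto)
    # 2. Lowercase everything
    texto = texto.lower()
    # 3. Keep only lowercase letters and whitespace
    texto = re.sub(r"[^a-z\s]", " ", texto)
    # 4. Tokenize on whitespace
    tokens = texto.split()
    # 5. Drop stopwords and single-character tokens
    tokens = [t for t in tokens if t not in STOPWORDS and len(t) > 1]
    return " ".join(tokens)

print(limpiar_texto("WASHINGTON (REUTERS) - The President signed 2 bills!"))
# -> "president signed bills"
```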
Preprocessing Stages
1. Metadata and Source Removal
This is a CRITICAL ANTI-BIAS measure that significantly improves model generalization.
- WASHINGTON (REUTERS) -
- NEW YORK (AP) -
- PARIS (AFP) -
- Similar patterns with news agency names
Understanding source bias elimination
If the model learns that articles starting with “REUTERS” or “AP” are usually real, it will rely on the source rather than the content to make predictions. This causes two problems:
- Poor generalization - The model fails on articles without source metadata
- Superficial learning - The model doesn’t learn actual fake news patterns
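One way to strip such datelines is a regular expression anchored at the start of the article. This pattern is hypothetical; the exact regex used in fake_news_ia.py may differ.

```python
import re

# Matches datelines such as "WASHINGTON (Reuters) - " or "PARIS (AFP) - "
# at the start of an article. Agency names are an illustrative subset.
DATELINE = re.compile(r"^[A-Z][A-Za-z\s,]*\((?:Reuters|REUTERS|AP|AFP)\)\s*-+\s*")

for article in [
    "WASHINGTON (Reuters) - Lawmakers voted today.",
    "PARIS (AFP) - Officials announced a plan.",
    "An article with no dateline.",  # left untouched
]:
    print(DATELINE.sub("", article))
```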
2. Lowercase Conversion
Converts all text to lowercase for normalization:
- Treats “President”, “president”, and “PRESIDENT” as the same word
- Reduces vocabulary size (fewer unique tokens)
- Improves pattern recognition across different capitalization styles
Lowercasing is standard in text classification tasks where case doesn’t carry semantic meaning.
3. Punctuation and Number Removal
Removes all characters except lowercase letters and spaces:
- Punctuation: . , ! ? ; : ' " ( ) [ ] { }
- Numbers: 0-9
- Special characters: @ # $ % & * + = / \ | ~
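A one-line character filter covers all three categories at once, applied after lowercasing. Note that stripping punctuation can leave stray single letters (the "m" from "$2m" below), which is why the later length filter exists.

```python
import re

# Replace everything that is not a lowercase letter or whitespace with a space.
cleaned = re.sub(r"[^a-z\s]", " ", "Breaking: 3 senators, $2M & #tags!".lower())
print(cleaned.split())  # ['breaking', 'senators', 'm', 'tags']
```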
4. Tokenization
Splits the cleaned text into individual words (tokens):
- Splits on whitespace characters (spaces, tabs, newlines)
- Creates a list of individual words
- Removes extra spaces automatically
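For example, Python's built-in str.split() with no arguments splits on any run of whitespace and discards empty strings, which handles the extra spaces automatically:

```python
# split() with no separator splits on runs of spaces, tabs, and newlines.
text = "clean  text\twith   extra\nwhitespace"
tokens = text.split()
print(tokens)  # ['clean', 'text', 'with', 'extra', 'whitespace']
```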
The function uses split() rather than NLTK’s advanced tokenizers for speed. For this use case, simple whitespace splitting is sufficient.

5. Stopword Filtering and Length Filter
The final step removes common words that don’t carry semantic meaning:
- Stopword removal - Removes words like “the”, “is”, “at”, “of”, etc.
- Length filter - Removes single-character tokens (often artifacts from punctuation removal)
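Both filters can be expressed in a single comprehension. The stopword set here is a tiny illustrative subset of NLTK's English list:

```python
# Keep tokens that are not stopwords and are longer than one character.
STOPWORDS = {"the", "is", "at", "of", "a", "an", "and"}
tokens = ["the", "president", "is", "at", "a", "summit", "s"]
kept = [t for t in tokens if t not in STOPWORDS and len(t) > 1]
print(kept)  # ['president', 'summit']
```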
Stopwords Configuration
Stopwords are loaded from NLTK’s English corpus:

fake_news_ia.py
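A typical loading pattern looks like the following sketch. The fallback set is an addition for environments where the NLTK corpus is unavailable; the project itself presumably just downloads and loads the corpus.

```python
# Load English stopwords from NLTK's corpus; fall back to a minimal
# built-in set if NLTK or its data is not available (fallback is
# illustrative, not part of the original project).
try:
    from nltk.corpus import stopwords
    STOPWORDS = set(stopwords.words("english"))
except Exception:
    STOPWORDS = {"the", "is", "at", "of", "a", "an", "and", "to", "in", "on"}

print(len(STOPWORDS))
```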
Complete Example
Here’s a full transformation example:

Application to Dataset
The cleaning function is applied to the combined full_text column:
fake_news_ia.py
- Each row’s full_text (title + text) is passed to limpiar_texto
- The function returns cleaned, tokenized text
- The result is stored in a new clean_text column
- Original columns are preserved for inspection
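The steps above correspond to a standard pandas pattern, sketched here with a simplified stand-in cleaner; column names follow the text, and the real limpiar_texto lives in fake_news_ia.py.

```python
import pandas as pd

# Simplified stand-in for limpiar_texto (lowercase + drop 1-char tokens).
def limpiar_texto(texto):
    return " ".join(t for t in texto.lower().split() if len(t) > 1)

df = pd.DataFrame({"title": ["Big News"], "text": ["A story unfolds"]})
# Combine title and text, then store the cleaned result in a new column;
# the original columns are left untouched for inspection.
df["full_text"] = df["title"] + " " + df["text"]
df["clean_text"] = df["full_text"].apply(limpiar_texto)
print(df[["full_text", "clean_text"]])
```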
Preprocessing in Production
The Streamlit app uses the identical preprocessing function:

app.py
Why These Specific Steps?
- Metadata removal - Prevents source bias: the model learns content patterns, not trusted sources
- Lowercase - Normalizes text: “Trump” and “trump” are treated as the same word
- Punctuation removal - Reduces noise: focuses the model on words, not formatting
- Stopword filtering - Removes common words: highlights meaningful content words
Performance Impact
Preprocessing directly contributes to the 98.5% accuracy:
- Metadata removal - Prevents overfitting to source names (+3-5% accuracy)
- Stopword filtering - Reduces dimensionality, improves signal-to-noise ratio
- Normalization - Ensures consistent feature representation
- Noise removal - Focuses model on linguistic content patterns
Next Steps
Data Pipeline
Return to data loading and preparation
Model Training
Learn how cleaned text is vectorized and used for training