Overview

Natural Language Processing (NLP) preprocessing transforms raw news article text into clean, normalized tokens that the machine learning model can analyze effectively. This stage is critical to the model's reported 98.5% accuracy: it removes noise and focuses the model on meaningful content.

The limpiar_texto Function

The core of the NLP pipeline is the limpiar_texto (clean text) function, which applies five sequential transformations:
fake_news_ia.py
def limpiar_texto(texto):
    # 1. REMOVE METADATA/SOURCE: Remove patterns like 'PLACE (AGENCY) - '
    texto = re.sub(r'([A-Z\s]+)\s*\((REUTERS|AP|AFP)\)\s*\-\s*', '', 
                   str(texto), flags=re.IGNORECASE)
    
    # 2. Convert to lowercase
    texto = str(texto).lower()
    
    # 3. Remove punctuation, numbers and special characters
    texto = re.sub(r'[^a-z\s]', '', texto) 
    
    # 4. Tokenization with split() 
    tokens = texto.split() 

    # 5. Filter stopwords and single-letter tokens
    tokens = [t for t in tokens if t not in stop_words and len(t) > 1]
    return " ".join(tokens)

Preprocessing Stages

1. Metadata and Source Removal

This is a CRITICAL ANTI-BIAS measure that significantly improves model generalization.
The first step removes news agency metadata patterns:
texto = re.sub(r'([A-Z\s]+)\s*\((REUTERS|AP|AFP)\)\s*\-\s*', '', 
               str(texto), flags=re.IGNORECASE)
What it removes:
  • WASHINGTON (REUTERS) -
  • NEW YORK (AP) -
  • PARIS (AFP) -
  • Similar patterns with news agency names
Why this matters:
If the model learns that articles starting with “REUTERS” or “AP” are usually real, it will rely on the source rather than the content to make predictions. This causes two problems:
  1. Poor generalization - The model fails on articles without source metadata
  2. Superficial learning - The model doesn’t learn actual fake news patterns
By removing source metadata, we force the model to analyze writing quality, factual consistency, and linguistic patterns instead of memorizing trusted sources.
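The anti-bias regex can be checked in isolation. The sentence below is a made-up example, but the pattern is the one from the script:

```python
import re

# Step 1 only: strip 'PLACE (AGENCY) - ' prefixes (hypothetical input text)
patron = r'([A-Z\s]+)\s*\((REUTERS|AP|AFP)\)\s*\-\s*'

texto = "WASHINGTON (REUTERS) - Lawmakers passed the bill on Friday."
limpio = re.sub(patron, '', texto, flags=re.IGNORECASE)
print(limpio)  # Lawmakers passed the bill on Friday.
```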

2. Lowercase Conversion

Converts all text to lowercase for normalization:
texto = str(texto).lower()
Benefits:
  • Treats “President”, “president”, and “PRESIDENT” as the same word
  • Reduces vocabulary size (fewer unique tokens)
  • Improves pattern recognition across different capitalization styles
Lowercasing is standard in text classification tasks where case doesn’t carry semantic meaning.

3. Punctuation and Number Removal

Removes all characters except lowercase letters and spaces:
texto = re.sub(r'[^a-z\s]', '', texto)
What gets removed:
  • Punctuation: . , ! ? ; : ' " ( ) [ ] { }
  • Numbers: 0-9
  • Special characters: @ # $ % & * + = / \ | ~
Example transformation:
Before: "The President said, 'We've invested $5.2 billion in infrastructure!'"
After:  "the president said weve invested  billion in infrastructure"
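The transformation above can be reproduced directly (step 3 in isolation, assuming lowercasing has already been applied):

```python
import re

# Keep only lowercase letters and whitespace; everything else is dropped.
texto = "the president said, 'we've invested $5.2 billion in infrastructure!'"
limpio = re.sub(r'[^a-z\s]', '', texto)
print(limpio)  # the president said weve invested  billion in infrastructure
```

Note the double space left behind where "$5.2" was removed; the later `split()` step absorbs it.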

4. Tokenization

Splits the cleaned text into individual words (tokens):
tokens = texto.split()
How it works:
  • Splits on whitespace characters (spaces, tabs, newlines)
  • Creates a list of individual words
  • Removes extra spaces automatically
Example:
texto = "president announced new infrastructure plan"
tokens = ["president", "announced", "new", "infrastructure", "plan"]
The function uses split() rather than NLTK’s advanced tokenizers for speed. For this use case, simple whitespace splitting is sufficient.

5. Stopword Filtering and Length Filter

The final step removes common words that don’t carry semantic meaning:
tokens = [t for t in tokens if t not in stop_words and len(t) > 1]
Two filters applied:
  1. Stopword removal - Removes words like “the”, “is”, “at”, “of”, etc.
  2. Length filter - Removes single-character tokens (often artifacts from punctuation removal)
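Both filters can be seen in a minimal sketch. The tiny inline stopword set below is a stand-in for NLTK's full English list:

```python
# Hand-picked subset of stopwords for illustration only.
stop_words = {"the", "a", "on", "of", "will"}

tokens = ["the", "president", "s", "plan", "will", "create", "jobs"]
# Drop stopwords and single-character artifacts like the stray "s".
tokens = [t for t in tokens if t not in stop_words and len(t) > 1]
print(tokens)  # ['president', 'plan', 'create', 'jobs']
```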

Stopwords Configuration

Stopwords are loaded from NLTK’s English corpus:
fake_news_ia.py
try:
    stop_words = set(stopwords.words("english"))
except LookupError:
    print("Error: Need to download NLTK stopwords.")
    print("Run: python3 -c 'import nltk; nltk.download(\"stopwords\")'")
    sys.exit(1)
NLTK stopwords must be downloaded before running the script. Use:
python3 -c 'import nltk; nltk.download("stopwords"); nltk.download("punkt")'
Example stopwords removed:
# Common English stopwords
['the', 'is', 'at', 'which', 'on', 'a', 'an', 'as', 'are', 
 'was', 'were', 'been', 'be', 'have', 'has', 'had', 'do', 
 'does', 'did', 'will', 'would', 'could', 'should', ...]

Complete Example

Here’s a full transformation example:
# Original text
original = "WASHINGTON (REUTERS) - President Biden announced a new $5.2 trillion infrastructure plan on Thursday, stating 'This will create millions of jobs!'"

# After Step 1: Metadata removal
# "President Biden announced a new $5.2 trillion infrastructure plan on Thursday, stating 'This will create millions of jobs!'"

# After Step 2: Lowercase
# "president biden announced a new $5.2 trillion infrastructure plan on thursday, stating 'this will create millions of jobs!'"

# After Step 3: Punctuation/number removal
# "president biden announced a new  trillion infrastructure plan on thursday stating this will create millions of jobs"

# After Step 4: Tokenization
# ["president", "biden", "announced", "a", "new", "trillion", "infrastructure", 
#  "plan", "on", "thursday", "stating", "this", "will", "create", "millions", 
#  "of", "jobs"]

# After Step 5: Stopword filtering
# ["president", "biden", "announced", "new", "trillion", "infrastructure", 
#  "plan", "thursday", "stating", "create", "millions", "jobs"]

clean = "president biden announced new trillion infrastructure plan thursday stating create millions jobs"
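The walkthrough above can be reproduced end to end with a self-contained copy of limpiar_texto. The inline stopword set is a small stand-in for NLTK's full list, chosen so the snippet runs without any downloads:

```python
import re

# Stand-in for NLTK's English stopwords (subset covering this example).
stop_words = {"a", "on", "this", "will", "of", "the"}

def limpiar_texto(texto):
    # 1. Remove 'PLACE (AGENCY) - ' metadata
    texto = re.sub(r'([A-Z\s]+)\s*\((REUTERS|AP|AFP)\)\s*\-\s*', '',
                   str(texto), flags=re.IGNORECASE)
    # 2. Lowercase
    texto = str(texto).lower()
    # 3. Keep only lowercase letters and whitespace
    texto = re.sub(r'[^a-z\s]', '', texto)
    # 4. Tokenize on whitespace
    tokens = texto.split()
    # 5. Drop stopwords and single-character tokens
    tokens = [t for t in tokens if t not in stop_words and len(t) > 1]
    return " ".join(tokens)

original = ("WASHINGTON (REUTERS) - President Biden announced a new $5.2 "
            "trillion infrastructure plan on Thursday, stating 'This will "
            "create millions of jobs!'")
print(limpiar_texto(original))
# president biden announced new trillion infrastructure plan thursday
# stating create millions jobs
```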

Application to Dataset

The cleaning function is applied to the combined full_text column:
fake_news_ia.py
# Apply cleaning to the column that combines title and text
df["clean_text"] = df["full_text"].apply(limpiar_texto)
print(df[["title", "text", "clean_text", "label"]].head())
Process flow:
  1. Each row’s full_text (title + text) is passed to limpiar_texto
  2. The function returns cleaned, tokenized text
  3. Result is stored in new clean_text column
  4. Original columns are preserved for inspection

Preprocessing in Production

The Streamlit app uses the identical preprocessing function:
app.py
def limpiar_texto(texto):
    # Same 5-step process as training script
    texto = re.sub(r'([A-Z\s]+)\s*\((REUTERS|AP|AFP)\)\s*\-\s*', '', 
                   str(texto), flags=re.IGNORECASE)
    texto = str(texto).lower()
    texto = re.sub(r'[^a-z\s]', '', texto) 
    tokens = texto.split() 
    tokens = [t for t in tokens if t not in stop_words and len(t) > 1]
    return " ".join(tokens)
Critical: The preprocessing function must be identical between training and inference. Any differences will cause prediction errors due to feature mismatch.
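One way to guarantee the two copies never drift apart (a hypothetical refactor, not shown in the source) is to keep limpiar_texto in a single shared module that both scripts import:

```python
# preprocessing.py -- hypothetical shared module imported by both
# fake_news_ia.py and app.py, so there is only one copy to maintain.
import re

def limpiar_texto(texto, stop_words=frozenset()):
    texto = re.sub(r'([A-Z\s]+)\s*\((REUTERS|AP|AFP)\)\s*\-\s*', '',
                   str(texto), flags=re.IGNORECASE)
    texto = str(texto).lower()
    texto = re.sub(r'[^a-z\s]', '', texto)
    return " ".join(t for t in texto.split()
                    if t not in stop_words and len(t) > 1)

# Both scripts would then use:
#   from preprocessing import limpiar_texto
print(limpiar_texto("PARIS (AFP) - Markets rallied today.", {"today"}))
# markets rallied
```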

Why These Specific Steps?

  • Metadata removal: prevents source bias; the model learns content patterns, not trusted sources
  • Lowercase: normalizes text; “Trump” and “trump” are treated as the same word
  • Punctuation removal: reduces noise; focuses the model on words, not formatting
  • Stopword filtering: removes common words; highlights meaningful content words

Performance Impact

Preprocessing directly contributes to the 98.5% accuracy:
  1. Metadata removal - Prevents overfitting to source names (+3-5% accuracy)
  2. Stopword filtering - Reduces dimensionality, improves signal-to-noise ratio
  3. Normalization - Ensures consistent feature representation
  4. Noise removal - Focuses model on linguistic content patterns

Next Steps

  • Data Pipeline: return to data loading and preparation
  • Model Training: learn how cleaned text is vectorized and used for training
