
Overview

The prediction pipeline loads the trained model and vectorizer artifacts, applies the same preprocessing used during training, and classifies new news articles as “real” or “fake”. This workflow is used in both the Streamlit web application (app.py) and the command-line prediction script (predict_news.py).

Prediction Workflow

1. Load Trained Models

app.py:9-12
import joblib
from nltk.corpus import stopwords

modelo = joblib.load('modelo_fake_news.pkl')
vectorizer = joblib.load('vectorizer_tfidf.pkl')
stop_words = set(stopwords.words("english"))
Model Files Required: Both modelo_fake_news.pkl and vectorizer_tfidf.pkl must be present in the working directory. These are generated by running fake_news_ia.py.
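The note above can be turned into a small pre-flight check so a missing artifact yields a clear message instead of a traceback. This is a sketch, not part of the repository; the helper name `artifacts_missing` is hypothetical:

```python
from pathlib import Path

REQUIRED_FILES = ["modelo_fake_news.pkl", "vectorizer_tfidf.pkl"]

def artifacts_missing(directory="."):
    """Return the list of required model files not present in `directory`."""
    return [f for f in REQUIRED_FILES if not (Path(directory) / f).exists()]

missing = artifacts_missing()
if missing:
    print(f"Missing artifacts: {missing}. Run fake_news_ia.py first to generate them.")
```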

2. Preprocess Input Text

app.py:60
noticia_limpia = limpiar_texto(noticia_input)
Applies the exact same limpiar_texto function used during training to ensure consistent preprocessing.

3. Vectorize Cleaned Text

app.py:63
noticia_vec = vectorizer.transform([noticia_limpia])
Use transform, not fit_transform: The vectorizer must only transform new text using the vocabulary learned during training. Never call fit() or fit_transform() on new data during prediction.
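A toy illustration of why this matters, using throwaway documents rather than the project's data: fitting fixes the vocabulary, so `transform()` on new text always produces a feature vector of the width the model expects, and unseen words are simply ignored:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["economy grows steadily", "aliens endorse candidate"]
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # fit ONLY on training data

new_doc = ["economy shrinks unexpectedly"]
X_new = vectorizer.transform(new_doc)           # transform ONLY at prediction time

# Same number of columns: the model's expected input width is preserved.
assert X_new.shape[1] == X_train.shape[1]
# Words unseen during training ("shrinks", "unexpectedly") get zero weight.
```

Calling `fit()` or `fit_transform()` on the new text would rebuild the vocabulary from scratch, producing vectors whose columns no longer line up with what the model learned.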

4. Get Prediction

app.py:66
prediccion = modelo.predict(noticia_vec)[0]
prediccion (string)
The classification result, either "real" or "fake". The [0] index extracts the single prediction from the returned array.

Complete Prediction Example

import joblib
import re
from nltk.corpus import stopwords

# Load models
modelo = joblib.load('modelo_fake_news.pkl')
vectorizer = joblib.load('vectorizer_tfidf.pkl')
stop_words = set(stopwords.words("english"))

# Define preprocessing function (must match training)
def limpiar_texto(texto):
    texto = re.sub(r'([A-Z\s]+)\s*\((REUTERS|AP|AFP)\)\s*\-\s*', '', str(texto), flags=re.IGNORECASE)
    texto = texto.lower()
    texto = re.sub(r'[^a-z\s]', '', texto)
    tokens = texto.split()
    tokens = [t for t in tokens if t not in stop_words and len(t) > 1]
    return " ".join(tokens)

# Classify a news article
new_article = "The Federal Reserve announced on Wednesday that it will maintain the benchmark interest rate within the current range of 5.25% to 5.50%, citing steady economic growth and easing inflation."

# Prediction pipeline
cleaned_text = limpiar_texto(new_article)
vectorized_text = vectorizer.transform([cleaned_text])
prediction = modelo.predict(vectorized_text)[0]

print(f"Prediction: {prediction.upper()}")
# Output: Prediction: REAL

Batch Prediction

For multiple articles, process them as a batch:
predict_news.py:60-64
noticias_nuevas = [
    "Article 1 text...",
    "Article 2 text...",
    "Article 3 text..."
]

# Apply preprocessing to all articles
noticias_limpias = [limpiar_texto(n) for n in noticias_nuevas]

# Vectorize all at once
noticias_vec = vectorizer.transform(noticias_limpias)

# Get all predictions
predicciones = modelo.predict(noticias_vec)

# Display results
for i, prediccion in enumerate(predicciones, start=1):
    print(f"Noticia {i}: {prediccion.upper()}")

Streamlit Integration

The web application implements the same prediction pipeline behind a user interface:
app.py:56-76
if st.button("Clasificar Noticia"):
    if noticia_input:
        with st.spinner('Clasificando...'):
            # 1. Clean the text
            noticia_limpia = limpiar_texto(noticia_input)
            
            # 2. Vectorize using trained vectorizer
            noticia_vec = vectorizer.transform([noticia_limpia])
            
            # 3. Get prediction
            prediccion = modelo.predict(noticia_vec)[0]
            
            # 4. Display result
            if prediccion == 'real':
                st.success(f"✅ La noticia es clasificada como **{prediccion.upper()}**")
                st.balloons()
            else:
                st.error(f"❌ La noticia es clasificada como **{prediccion.upper()}**")

Error Handling

predict_news.py:8-24
try:
    modelo = joblib.load('modelo_fake_news.pkl')
    vectorizer = joblib.load('vectorizer_tfidf.pkl')
    
    try:
        stop_words = set(stopwords.words("english"))
    except LookupError:
        print("Error: Necesitas descargar las stopwords de NLTK.")
        print("Ejecuta: python3 -c 'import nltk; nltk.download(\"stopwords\")'")
        sys.exit()
        
except FileNotFoundError:
    print("Error: Los archivos de modelo no se encontraron.")
    print("Ejecuta 'fake_news_ia.py' primero para entrenar y guardar los modelos.")
    sys.exit()

Return Value Format

prediction (string)
The model returns one of two string values:
  • "real" - the article is classified as legitimate news
  • "fake" - the article is classified as fake news

Prediction Confidence

To get prediction probabilities instead of binary classification:
# Get probability scores for each class
proba = modelo.predict_proba(noticia_vec)[0]

print(f"Fake probability: {proba[0]:.2%}")
print(f"Real probability: {proba[1]:.2%}")
The order of probabilities corresponds to the alphabetical order of the class labels, ["fake", "real"]. You can confirm the order programmatically via modelo.classes_ rather than hard-coding it.
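A self-contained sketch with a toy model (not the project's saved artifacts) showing how to read the column order from the fitted estimator instead of assuming it:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data: one numeric feature, string labels like the real model's.
X = np.array([[0.0], [1.0], [0.1], [0.9]])
y = np.array(["fake", "real", "fake", "real"])
modelo = LogisticRegression().fit(X, y)

# scikit-learn stores labels in sorted order; predict_proba columns follow it.
print(modelo.classes_)

proba = modelo.predict_proba([[0.8]])[0]
for label, p in zip(modelo.classes_, proba):
    print(f"{label}: {p:.2%}")
```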

Common Issues

Inconsistent Preprocessing

If the limpiar_texto function differs between training and prediction, the model will receive out-of-distribution inputs and produce unreliable predictions.
Solution: Use the exact same function definition in all scripts.
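One way to enforce this (an assumption about project layout, not the repo's current structure): keep limpiar_texto in a single shared module, e.g. a hypothetical preprocessing.py, and import it from both fake_news_ia.py and predict_news.py. In this sketch stop_words is passed explicitly so the function is self-contained:

```python
# preprocessing.py (hypothetical shared module)
import re

def limpiar_texto(texto, stop_words=frozenset()):
    # Strip agency datelines like "WASHINGTON (Reuters) - "
    texto = re.sub(r'([A-Z\s]+)\s*\((REUTERS|AP|AFP)\)\s*-\s*', '', str(texto),
                   flags=re.IGNORECASE)
    texto = texto.lower()
    # Keep only lowercase letters and whitespace
    texto = re.sub(r'[^a-z\s]', '', texto)
    tokens = [t for t in texto.split() if t not in stop_words and len(t) > 1]
    return " ".join(tokens)

# Both fake_news_ia.py and predict_news.py would then do:
# from preprocessing import limpiar_texto
```

With a single source of truth, a change to the cleaning logic automatically applies to both training and prediction.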

Missing NLTK Data

LookupError: Resource stopwords not found
Solution: Download NLTK stopwords corpus:
python3 -c 'import nltk; nltk.download("stopwords")'

Model Files Not Found

FileNotFoundError: modelo_fake_news.pkl
Solution: Run the training script first:
python fake_news_ia.py
