Overview
The prediction pipeline loads the trained model and vectorizer artifacts, applies the same preprocessing used during training, and classifies unseen news articles as "real" or "fake". This workflow is shared by the Streamlit web application (app.py) and the command-line prediction script (predict_news.py).
Prediction Workflow
1. Load Trained Models
import joblib
from nltk.corpus import stopwords

modelo = joblib.load('modelo_fake_news.pkl')
vectorizer = joblib.load('vectorizer_tfidf.pkl')
stop_words = set(stopwords.words("english"))
Model Files Required: Both modelo_fake_news.pkl and vectorizer_tfidf.pkl must be present in the working directory. These are generated by running fake_news_ia.py.
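If you want a clearer failure message than joblib's stack trace, you can check for the artifacts up front. A minimal sketch (this check is an addition, not part of the project scripts):

from pathlib import Path

for artefacto in ('modelo_fake_news.pkl', 'vectorizer_tfidf.pkl'):
    if not Path(artefacto).exists():
        raise FileNotFoundError(f"{artefacto} not found. Run fake_news_ia.py to generate it.")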
2. Preprocess Input Text
noticia_limpia = limpiar_texto(noticia_input)
Applies the exact same limpiar_texto function used during training to ensure consistent preprocessing.
3. Vectorize Cleaned Text
noticia_vec = vectorizer.transform([noticia_limpia])
Use transform, not fit_transform: The vectorizer must only transform new text using the vocabulary learned during training. Never call fit() or fit_transform() on new data during prediction.
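To see why, a toy illustration (not project code): refitting a fresh vectorizer on new text learns a different vocabulary, so the feature columns no longer line up with what the model was trained on.

from sklearn.feature_extraction.text import TfidfVectorizer

train_corpus = ["economy rates steady", "aliens endorse candidate"]
vec = TfidfVectorizer()
X_train = vec.fit_transform(train_corpus)          # learns a 6-term vocabulary

new_text = ["federal reserve holds rates"]
X_ok = vec.transform(new_text)                     # same 6 columns the model expects
X_bad = TfidfVectorizer().fit_transform(new_text)  # a different, 4-term feature space

print(X_train.shape[1], X_ok.shape[1], X_bad.shape[1])  # 6 6 4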
4. Get Prediction
prediccion = modelo.predict(noticia_vec)[0]
The classification result is either "real" or "fake". predict returns an array, so the [0] index extracts the single prediction.
Complete Prediction Example
import joblib
import re
from nltk.corpus import stopwords
# Load models
modelo = joblib.load('modelo_fake_news.pkl')
vectorizer = joblib.load('vectorizer_tfidf.pkl')
stop_words = set(stopwords.words("english"))
# Define preprocessing function (must match training)
def limpiar_texto(texto):
    # Strip agency datelines like "WASHINGTON (Reuters) - "
    texto = re.sub(r'([A-Z\s]+)\s*\((REUTERS|AP|AFP)\)\s*\-\s*', '', str(texto), flags=re.IGNORECASE)
    # Lowercase and keep only letters and whitespace
    texto = str(texto).lower()
    texto = re.sub(r'[^a-z\s]', '', texto)
    # Drop stopwords and single-character tokens
    tokens = texto.split()
    tokens = [t for t in tokens if t not in stop_words and len(t) > 1]
    return " ".join(tokens)
# Classify a news article
new_article = "The Federal Reserve announced on Wednesday that it will maintain the benchmark interest rate within the current range of 5.25% to 5.50%, citing steady economic growth and easing inflation."
# Prediction pipeline
cleaned_text = limpiar_texto(new_article)
vectorized_text = vectorizer.transform([cleaned_text])
prediction = modelo.predict(vectorized_text)[0]
print(f"Prediction: {prediction.upper()}")
# Output: Prediction: REAL
Batch Prediction
For multiple articles, process them as a batch:
noticias_nuevas = [
    "Article 1 text...",
    "Article 2 text...",
    "Article 3 text...",
]
# Apply preprocessing to all articles
noticias_limpias = [limpiar_texto(n) for n in noticias_nuevas]
# Vectorize all at once
noticias_vec = vectorizer.transform(noticias_limpias)
# Get all predictions
predicciones = modelo.predict(noticias_vec)
# Display results
for i, prediccion in enumerate(predicciones):
    print(f"Noticia {i+1}: {prediccion.upper()}")
Streamlit Integration
The web application wraps the same pipeline in a user interface:
if st.button("Clasificar Noticia"):
    if noticia_input:
        with st.spinner('Clasificando...'):
            # 1. Clean the text
            noticia_limpia = limpiar_texto(noticia_input)
            # 2. Vectorize using trained vectorizer
            noticia_vec = vectorizer.transform([noticia_limpia])
            # 3. Get prediction
            prediccion = modelo.predict(noticia_vec)[0]
            # 4. Display result
            if prediccion == 'real':
                st.success(f"✅ La noticia es clasificada como **{prediccion.upper()}**")
                st.balloons()
            else:
                st.error(f"❌ La noticia es clasificada como **{prediccion.upper()}**")
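The snippet above assumes modelo and vectorizer are already loaded at module level. One optional refinement (not in app.py) is Streamlit's st.cache_resource decorator, which loads the artifacts once per process instead of on every rerun:

import joblib
import streamlit as st

@st.cache_resource  # cache the loaded artifacts across Streamlit reruns
def cargar_modelos():
    modelo = joblib.load('modelo_fake_news.pkl')
    vectorizer = joblib.load('vectorizer_tfidf.pkl')
    return modelo, vectorizer

modelo, vectorizer = cargar_modelos()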
Error Handling
import sys

try:
    modelo = joblib.load('modelo_fake_news.pkl')
    vectorizer = joblib.load('vectorizer_tfidf.pkl')
    try:
        stop_words = set(stopwords.words("english"))
    except LookupError:
        print("Error: Necesitas descargar las stopwords de NLTK.")
        print("Ejecuta: python3 -c 'import nltk; nltk.download(\"stopwords\")'")
        sys.exit()
except FileNotFoundError:
    print("Error: Los archivos de modelo no se encontraron.")
    print("Ejecuta 'fake_news_ia.py' primero para entrenar y guardar los modelos.")
    sys.exit()
Output Values
The model returns one of two string values:
"real" - the article is classified as legitimate news
"fake" - the article is classified as fake news
Prediction Confidence
To get class probabilities instead of a hard label (available when the underlying classifier implements predict_proba):
# Get probability scores for each class
proba = modelo.predict_proba(noticia_vec)[0]
print(f"Fake probability: {proba[0]:.2%}")
print(f"Real probability: {proba[1]:.2%}")
The column order follows modelo.classes_. scikit-learn sorts string labels alphabetically during fit, so here the order is ["fake", "real"]; verify with print(modelo.classes_) rather than assuming it.
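Rather than hard-coding the column order, you can pair each probability with its label via modelo.classes_:

proba = modelo.predict_proba(noticia_vec)[0]
por_clase = dict(zip(modelo.classes_, proba))  # e.g. {'fake': 0.12, 'real': 0.88}
print(f"Fake probability: {por_clase['fake']:.2%}")
print(f"Real probability: {por_clase['real']:.2%}")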
Common Issues
Inconsistent Preprocessing
If the limpiar_texto function differs between training and prediction, the model will receive out-of-distribution inputs and produce unreliable predictions.
Solution: Use the exact same function definition in all scripts.
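One way to guarantee this (a suggested structure, not how the repository is currently laid out) is to keep the function in a single shared module:

# preprocessing.py -- hypothetical shared module holding the one definition
import re
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

def limpiar_texto(texto):
    texto = re.sub(r'([A-Z\s]+)\s*\((REUTERS|AP|AFP)\)\s*\-\s*', '', str(texto), flags=re.IGNORECASE)
    texto = str(texto).lower()
    texto = re.sub(r'[^a-z\s]', '', texto)
    tokens = texto.split()
    return " ".join(t for t in tokens if t not in stop_words and len(t) > 1)

fake_news_ia.py, predict_news.py, and app.py would then all use from preprocessing import limpiar_texto, so the definitions cannot drift apart.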
Missing NLTK Data
LookupError: Resource stopwords not found
Solution: Download NLTK stopwords corpus:
python3 -c 'import nltk; nltk.download("stopwords")'
Model Files Not Found
FileNotFoundError: modelo_fake_news.pkl
Solution: Run the training script first:
python3 fake_news_ia.py
See Also