Overview
Batch processing allows you to classify multiple news articles simultaneously, making it ideal for analyzing large datasets, news feeds, or content moderation workflows. This approach is significantly more efficient than processing articles one at a time.
Quick Start
The batch processing implementation is available in predict_news.py:46-70:
import joblib
import re
from nltk.corpus import stopwords

# Load models
modelo = joblib.load('modelo_fake_news.pkl')
vectorizer = joblib.load('vectorizer_tfidf.pkl')
stop_words = set(stopwords.words("english"))

# Define preprocessing function
def limpiar_texto(texto):
    texto = re.sub(r'([A-Z\s]+)\s*\((REUTERS|AP|AFP)\)\s*\-\s*', '', str(texto), flags=re.IGNORECASE)
    texto = texto.lower()
    texto = re.sub(r'[^a-z\s]', '', texto)
    tokens = texto.split()
    tokens = [t for t in tokens if t not in stop_words and len(t) > 1]
    return " ".join(tokens)

# List of articles to classify
noticias_nuevas = [
    "The Federal Reserve announced on Wednesday that it will maintain the benchmark interest rate...",
    "A secret meeting was held at the UN headquarters where delegates voted...",
    "President Joe Biden announced a new infrastructure plan..."
]

# Batch processing pipeline
noticias_limpias = [limpiar_texto(n) for n in noticias_nuevas]
noticias_vec = vectorizer.transform(noticias_limpias)
predicciones = modelo.predict(noticias_vec)

# Display results
for i, (noticia, prediccion) in enumerate(zip(noticias_nuevas, predicciones)):
    print(f"\nNoticia {i+1} (Inicio): {noticia[:50]}...")
    print(f"Predicción: {prediccion.upper()}")
Complete Example
Here’s the full batch processing example from the source code:
Prepare Article List
Create a list of news articles to classify (predict_news.py:46-55):

noticias_nuevas = [
    # Real news example
    "The Federal Reserve announced on Wednesday that it will maintain the benchmark interest rate within the current range of 5.25% to 5.50%, citing steady economic growth and easing inflation. Federal Reserve Chair Jerome Powell stated during a press briefing in Washington that future rate decisions will depend on labor market data and inflation trends over the coming months.",
    # Fake news example
    "A secret meeting was held at the UN headquarters where delegates voted to replace all sugary drinks with green juice to boost the global population by 500 years.",
    # Another real news example
    "President Joe Biden announced a new infrastructure plan, stating, 'This investment will create millions of jobs across the country.'",
    # Trade agreement example
    "The European Union formally approved a new trade agreement with Canada on Thursday following a vote in the European Parliament in Brussels. Officials said the agreement is expected to strengthen economic cooperation and reduce tariffs on industrial goods over the next five years.",
    # WHO example
    "The World Health Organization reported on Monday that global vaccination rates have increased by 12 percent compared to last year, according to data collected from member states. WHO Director-General Tedros Adhanom Ghebreyesus emphasized the importance of continued international cooperation to prevent future outbreaks.",
    # Tech news example
    "Apple Inc. unveiled its latest software update during a developer conference in California on Tuesday. The update introduces enhanced security features, improved battery management, and performance optimizations for supported devices. The company stated that the update will be available to the public next month.",
    # Science example
    "For the first time, scientists are tracking the migration of monarch butterflies across much of North America, actively monitoring individual insects on journeys from as far away as Ontario all the way to their overwintering colonies in central Mexico."
]
Clean All Articles
Apply preprocessing to all articles using a list comprehension (predict_news.py:60):

noticias_limpias = [limpiar_texto(n) for n in noticias_nuevas]

This runs every article through the same cleaning pipeline used during training.
Vectorize All Articles
Transform all cleaned articles into TF-IDF vectors (predict_news.py:61):

noticias_vec = vectorizer.transform(noticias_limpias)
Use transform(), not fit_transform(). The vectorizer was already fitted during training.
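As a small standalone illustration (using scikit-learn directly rather than the project's saved vectorizer), transform() reuses the vocabulary learned at fit time, so new documents land in the same feature space the model was trained on:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit on a small "training" corpus: the vocabulary is fixed at this point
train_docs = ["the fed raised rates", "secret meeting at the un"]
vec = TfidfVectorizer()
X_train = vec.fit_transform(train_docs)

# transform() maps new documents into that same fixed vocabulary...
X_new = vec.transform(["the fed held a secret meeting"])
assert X_new.shape[1] == X_train.shape[1]  # same number of features

# ...whereas calling fit_transform() on new data would relearn a different,
# incompatible vocabulary, breaking any model trained on X_train
```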
Predict All Classifications
Generate predictions for all articles at once (predict_news.py:64):

predicciones = modelo.predict(noticias_vec)
This returns a NumPy array with one prediction per article.
Display Results
Iterate through articles and their predictions (predict_news.py:67-69):

for i, (noticia, prediccion) in enumerate(zip(noticias_nuevas, predicciones)):
    print(f"\nNoticia {i+1} (Inicio): {noticia[:50]}...")
    print(f"Predicción: {prediccion.upper()}")
Expected Output
When running the batch processing script:
Modelos cargados exitosamente. Listo para clasificar. ✅
--- Clasificación de Noticias Nuevas (Rápida) ---
Noticia 1 (Inicio): The Federal Reserve announced on Wednesday that i...
Predicción: REAL
Noticia 2 (Inicio): A secret meeting was held at the UN headquarters...
Predicción: FAKE
Noticia 3 (Inicio): President Joe Biden announced a new infrastructu...
Predicción: REAL
Noticia 4 (Inicio): The European Union formally approved a new trade...
Predicción: REAL
Noticia 5 (Inicio): The World Health Organization reported on Monday...
Predicción: REAL
Noticia 6 (Inicio): Apple Inc. unveiled its latest software update d...
Predicción: REAL
Noticia 7 (Inicio): For the first time, scientists are tracking the ...
Predicción: REAL
--- FIN DEL PROYECTO ---
Batch processing offers significant advantages:
| Approach | Processing Time | Use Case |
|---|---|---|
| Single predictions | ~50ms per article | Interactive UI, single requests |
| Batch processing | ~5ms per article | Large datasets, scheduled jobs |
Processing 1000 articles in batch takes ~5 seconds vs ~50 seconds individually (10x faster).
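The speedup comes from amortizing the vectorization and prediction overhead across the whole batch; the classifications themselves are identical either way. A minimal self-contained check of that equivalence, using a toy scikit-learn model in place of the project's saved .pkl files:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny stand-in model (the real project loads its trained .pkl files)
docs = ["rates held steady by the fed", "aliens secretly control the un",
        "parliament approved the trade deal", "miracle cure hidden by doctors"]
labels = ["real", "fake", "real", "fake"]
vec = TfidfVectorizer()
model = LogisticRegression().fit(vec.fit_transform(docs), labels)

new_docs = ["the fed held rates steady", "doctors hide a miracle cure"]

# One vectorize + one predict call for the whole batch...
batch_preds = model.predict(vec.transform(new_docs))

# ...yields exactly the same labels as predicting one article at a time
single_preds = [model.predict(vec.transform([d]))[0] for d in new_docs]
assert list(batch_preds) == single_preds
```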
Advanced Usage
Processing from CSV Files
Read articles from a CSV and classify them:
import pandas as pd
import joblib
import re
from nltk.corpus import stopwords
# Load models
modelo = joblib.load('modelo_fake_news.pkl')
vectorizer = joblib.load('vectorizer_tfidf.pkl')
stop_words = set(stopwords.words("english"))
def limpiar_texto(texto):
    texto = re.sub(r'([A-Z\s]+)\s*\((REUTERS|AP|AFP)\)\s*\-\s*', '', str(texto), flags=re.IGNORECASE)
    texto = texto.lower()
    texto = re.sub(r'[^a-z\s]', '', texto)
    tokens = texto.split()
    tokens = [t for t in tokens if t not in stop_words and len(t) > 1]
    return " ".join(tokens)
# Load articles from CSV
df = pd.read_csv('articles_to_classify.csv')
# Batch process
df['clean_text'] = df['article_text'].apply(limpiar_texto)
X_vec = vectorizer.transform(df['clean_text'])
df['prediction'] = modelo.predict(X_vec)
# Save results
df.to_csv('classified_articles.csv', index=False)
print(f"Classified {len(df)} articles")
print(df['prediction'].value_counts())
Adding Confidence Scores
Get probability scores along with predictions. This assumes the saved model implements predict_proba (e.g. LogisticRegression or MultinomialNB); some linear classifiers, such as PassiveAggressiveClassifier, only expose decision_function:
# Get prediction probabilities
predicciones_proba = modelo.predict_proba(noticias_vec)
predicciones = modelo.predict(noticias_vec)
# Display with confidence scores
for i, (noticia, pred, proba) in enumerate(zip(noticias_nuevas, predicciones, predicciones_proba)):
    confidence = max(proba) * 100
    print(f"\nNoticia {i+1}: {noticia[:50]}...")
    print(f"Predicción: {pred.upper()} (Confianza: {confidence:.1f}%)")
Filtering by Classification
Separate articles by their classification:
# Process batch
noticias_limpias = [limpiar_texto(n) for n in noticias_nuevas]
noticias_vec = vectorizer.transform(noticias_limpias)
predicciones = modelo.predict(noticias_vec)
# Separate by classification
real_news = [n for n, p in zip(noticias_nuevas, predicciones) if p == 'real']
fake_news = [n for n, p in zip(noticias_nuevas, predicciones) if p == 'fake']
print(f"Real articles: {len(real_news)}")
print(f"Fake articles: {len(fake_news)}")
Integration Patterns
API Endpoint for Batch Processing
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)

# Load models once at startup
modelo = joblib.load('modelo_fake_news.pkl')
vectorizer = joblib.load('vectorizer_tfidf.pkl')
# NOTE: limpiar_texto (defined earlier) must also be defined or imported here

@app.route('/classify/batch', methods=['POST'])
def classify_batch():
    data = request.json
    articles = data.get('articles', [])
    if not articles:
        return jsonify({'error': 'No articles provided'}), 400
    # Batch processing
    cleaned = [limpiar_texto(a) for a in articles]
    vectorized = vectorizer.transform(cleaned)
    predictions = modelo.predict(vectorized)
    # Format response (str() converts NumPy strings to JSON-serializable values)
    results = [
        {'article': art[:100], 'classification': str(pred)}
        for art, pred in zip(articles, predictions)
    ]
    return jsonify({
        'total': len(articles),
        'results': results
    })

if __name__ == '__main__':
    app.run(debug=True)
Scheduled News Feed Monitoring
import feedparser
import schedule
import time

# Assumes modelo, vectorizer, and limpiar_texto are already loaded/defined as above

def monitor_news_feed():
    # Fetch RSS feed
    feed = feedparser.parse('https://example.com/rss')
    # Extract article summaries
    articles = [entry.summary for entry in feed.entries]
    if not articles:
        return
    # Batch classify
    cleaned = [limpiar_texto(a) for a in articles]
    vectorized = vectorizer.transform(cleaned)
    predictions = modelo.predict(vectorized)
    # Alert on fake news
    for article, pred in zip(articles, predictions):
        if pred == 'fake':
            print(f"ALERT: Potential fake news detected: {article[:100]}...")

# Run every hour
schedule.every(1).hours.do(monitor_news_feed)
while True:
    schedule.run_pending()
    time.sleep(60)
Best Practices
Memory Efficiency: For very large batches (10,000+ articles), process in chunks of 1,000 to avoid memory issues.
Preprocessing Consistency: Always use the exact same limpiar_texto function from training to ensure accurate results.
Transform vs Fit: Always use vectorizer.transform(), never fit_transform(), when processing new data.
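The chunking advice above can be sketched as follows; classify_chunk is a hypothetical callback standing in for the clean → vectorize → predict pipeline from earlier:

```python
def chunked(items, size=1000):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def classify_in_chunks(articles, classify_chunk, size=1000):
    """Run the clean/vectorize/predict pipeline one chunk at a time,
    so only `size` TF-IDF rows are held in memory at once."""
    predictions = []
    for chunk in chunked(articles, size):
        predictions.extend(classify_chunk(chunk))
    return predictions
```

In the project's terms, classify_chunk would be something like lambda chunk: modelo.predict(vectorizer.transform([limpiar_texto(a) for a in chunk])).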
Next Steps