Overview
After training a machine learning model, you need to persist it to disk so it can be loaded in production without retraining. This page explains how the fake news detector uses joblib to save and load both the trained classifier and the TF-IDF vectorizer.
Why Persistence Matters
Training the logistic regression model takes time and computational resources:
- Loading ~40,000+ news articles
- Text preprocessing and cleaning
- TF-IDF vectorization with 5,000 features and bi-grams
- Model training over 1,000 iterations
Persistence lets you pay this cost once:
- Train once (offline) → Save model artifacts
- Load artifacts (on app startup) → Use for all predictions
- Retrain periodically with new data → Update saved artifacts
What Gets Saved
The detector saves two critical objects (fake_news_ia.py:100-102):
1. The Trained Model (modelo_fake_news.pkl)
This is the LogisticRegression classifier after fitting on 80% of the dataset:
- Learned weights for each of the 5,000 TF-IDF features
- Intercept/bias term
- Hyperparameters (solver, max iterations, etc.)
2. The Fitted Vectorizer (vectorizer_tfidf.pkl)
This is the TfidfVectorizer after fitting on the training text:
- The vocabulary mapping (which 5,000 words/bi-grams were selected)
- IDF (Inverse Document Frequency) weights for each term
- Configuration (ngram_range, max_features, etc.)
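As a sketch (variable names and placeholder data assumed, not taken from fake_news_ia.py), the save step looks like this:

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholder corpus standing in for the ~40,000-article dataset.
texts = ["central bank raises interest rates", "celebrity clone spotted on mars"]
labels = [0, 1]  # 0 = real, 1 = fake (label convention assumed)

# Same configuration described above: 5,000 features, uni- and bi-grams.
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)

model = LogisticRegression(max_iter=1000)
model.fit(X, labels)

# Persist both artifacts - the model is useless without its vectorizer.
joblib.dump(model, "modelo_fake_news.pkl")
joblib.dump(vectorizer, "vectorizer_tfidf.pkl")
```

Note that the vectorizer is saved in its fitted state, vocabulary and IDF weights included.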
Why Both Are Required
Consider what happens during prediction if you fit a new vectorizer instead of loading the saved one:
- The new vectorizer has a different vocabulary (a different set of 5,000 words)
- Feature indices don’t match what the model expects
- IDF weights are completely different
- Predictions will be garbage
The vectorizer must be fitted on the training data and used only to transform new data. Never fit on production data!
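A small demonstration of the fit/transform distinction (toy texts, not the real dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["markets rally after earnings report",
               "aliens endorse presidential candidate"]
new_texts = ["markets rally again"]

# Correct: fit once on training data, then only transform new data.
fitted = TfidfVectorizer().fit(train_texts)
X_new = fitted.transform(new_texts)  # columns align with the training vocabulary

# Wrong: fitting a fresh vectorizer on new data builds a different vocabulary,
# so the same word can land in a different column than the model expects.
fresh = TfidfVectorizer().fit(new_texts)
print("'markets' column when fitted on training data:", fitted.vocabulary_["markets"])
print("'markets' column when fitted on new data:", fresh.vocabulary_["markets"])
```

The same token maps to different feature indices in the two vocabularies, which is exactly why predictions become garbage when the vectorizer is re-fitted.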
Loading in Production
The Streamlit app (app.py:10-11) loads both artifacts at startup:
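A hedged sketch of that loading logic follows (function name and message wording assumed, not copied from app.py):

```python
import sys

import joblib

def load_artifacts(model_path="modelo_fake_news.pkl",
                   vectorizer_path="vectorizer_tfidf.pkl"):
    """Load the trained classifier and fitted vectorizer at startup."""
    try:
        model = joblib.load(model_path)
        vectorizer = joblib.load(vectorizer_path)
    except FileNotFoundError:
        # In the Streamlit app this would surface via st.error() before stopping.
        sys.exit("Model files not found - run the training script first.")
    return model, vectorizer
```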
Error Handling
The try-except block ensures:
- A clear error message if the .pkl files are missing
- The app exits gracefully if models aren’t available
- Users are directed to train the model first
File Format: Why Joblib?
Python has several serialization options:

| Library | Pros | Cons |
|---|---|---|
| pickle | Built-in, simple | Slower for large numpy arrays |
| joblib | Optimized for ML models, efficient compression | Requires installation |
| json | Human-readable | Can’t serialize complex objects |
This project uses joblib because it provides:
- Efficient handling of large numpy arrays (like TF-IDF matrices)
- Better compression for model weights
- Standard in scikit-learn workflows
- Compatible with all scikit-learn objects
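For example, joblib can round-trip a large numpy array with on-disk compression (filename and array size are illustrative):

```python
import joblib
import numpy as np

# Stand-in for a model's weight vector over 5,000 TF-IDF features.
weights = np.random.rand(5000)

# compress=3 trades a little CPU time for a smaller file on disk.
joblib.dump(weights, "weights_demo.pkl", compress=3)
restored = joblib.load("weights_demo.pkl")
assert np.array_equal(weights, restored)
```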
Production Workflow
Here’s the complete training-to-deployment pipeline:
1. Train the model offline (load data, preprocess, vectorize, fit)
2. Save both artifacts with joblib.dump()
3. Ship the .pkl files alongside app.py
4. Load both artifacts at app startup with joblib.load()
5. Serve predictions using the loaded vectorizer and model
Model Versioning
For production systems, implement versioning: include a version number or date in each artifact filename (for example, modelo_fake_news_v2.pkl) and keep the previous version available as a rollback.
Retraining Strategy
When to retrain and update saved models:
- New data available: More fake/real news examples collected
- Performance degradation: Accuracy drops on recent articles
- Periodic schedule: Monthly or quarterly retraining
- Major events: Language patterns may shift after major news events
Deployment Checklist
Before deploying:
- modelo_fake_news.pkl exists and loads without errors
- vectorizer_tfidf.pkl exists and loads without errors
- Both files are in the same directory as app.py
- A test prediction on sample inputs works correctly
- Model version is documented
- Backup of previous model version is saved
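The prediction check from the list above can be automated as a small smoke test (filenames from this page; the sample text and return convention are assumptions):

```python
import joblib

def smoke_test(model_path="modelo_fake_news.pkl",
               vectorizer_path="vectorizer_tfidf.pkl",
               sample="Central bank publishes quarterly inflation report."):
    """Verify both artifacts load and produce one prediction per input."""
    model = joblib.load(model_path)
    vectorizer = joblib.load(vectorizer_path)
    preds = model.predict(vectorizer.transform([sample]))
    assert preds.shape == (1,), "expected exactly one label per input"
    return preds[0]
```

Running this once against the freshly saved artifacts catches missing files and model/vectorizer mismatches before users do.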
Common Issues
Issue: “No module named ‘sklearn’”
Cause: Scikit-learn is not installed, or its version differs between the training and production environments
Solution: Use the same Python environment and dependency versions in both, for example by pinning scikit-learn and joblib in a requirements.txt
Issue: Model predictions are random
Cause: Using a different vectorizer or creating a new one in production
Solution: Always load the saved vectorizer_tfidf.pkl - never create a new vectorizer in production
Key Takeaways
- Save both the model and vectorizer with joblib.dump()
- Load both in production with joblib.load()
- Never fit a new vectorizer in production - always use the saved one
- Implement versioning for production systems
- Test loaded models before deployment
Next: Learn how to extend the system with Custom Features.