Requirements
Before you begin, ensure you have:
- Python 3.7 or higher
- pip (Python package manager)
- Git (optional, for cloning the repository)
- 2GB of free disk space (for datasets and NLTK data)
Installation steps
Clone or download the project
If you’re using Git, clone the repository; otherwise, download and extract the project files to a directory.
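For the Git route, the commands typically look like the following (the repository URL and directory name are placeholders, since this guide does not specify them):

```shell
# Clone the repository and enter its directory.
# <repository-url> and <project-directory> are placeholders; substitute the real values.
git clone <repository-url>
cd <project-directory>
```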
Create a virtual environment
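A typical command sequence for this step, assuming Python’s built-in venv module (the activation line is for Unix-like shells; on Windows use `.venv\Scripts\activate`):

```shell
# Create a virtual environment in .venv and activate it.
python3 -m venv .venv
. .venv/bin/activate
```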
Create an isolated Python environment to avoid dependency conflicts. Once the environment is activated, you should see
(.venv) in your terminal prompt.
Install Python dependencies
Install all required packages using pip. This installs:
- pandas - Data manipulation and CSV loading
- nltk - Natural Language Toolkit for text preprocessing
- scikit-learn - Machine learning library with TF-IDF and Logistic Regression
- joblib - Model serialization and persistence
- streamlit - Web application framework
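Assuming the project ships a requirements.txt file (this guide does not confirm one), the installation is a single command; otherwise the packages can be installed by name:

```shell
# Preferred if the project provides a requirements file:
pip install -r requirements.txt

# Fallback: install the packages listed above directly.
pip install pandas nltk scikit-learn joblib streamlit
```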
Download NLTK data
The model requires NLTK’s stopwords and tokenizers. Downloading them retrieves:
- English stopwords (common words like “the”, “is”, “and” to be filtered out)
- Punkt tokenizer (for sentence and word tokenization)
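The download can be performed with NLTK’s built-in downloader, for example:

```python
# Fetch the NLTK resources used for text preprocessing.
import nltk

nltk.download("stopwords")  # common-word lists, including English
nltk.download("punkt")      # Punkt sentence/word tokenizer models
```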
NLTK data is downloaded to
~/nltk_data/ by default and is approximately 50 MB.
Verify installation
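One way to verify the setup, assuming the project does not ship its own check script, is to import each dependency and print its version:

```python
# Sanity check: import every dependency and report its version.
import pandas
import nltk
import sklearn
import joblib
import streamlit

for module in (pandas, nltk, sklearn, joblib, streamlit):
    print(f"{module.__name__} {module.__version__}")
```

If any import fails, reinstall that package inside the active virtual environment.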
Confirm everything is installed correctly before continuing.
Troubleshooting
NLTK stopwords not found
If you see an error about missing stopwords when running the scripts, repeat the NLTK data download step above.
Model files not found
If the Streamlit app reports missing model files, run the training step first to generate modelo_fake_news.pkl and vectorizer_tfidf.pkl, which the app requires.
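For context, the app presumably loads these artifacts with joblib, along these lines (a sketch, not the app’s actual code):

```python
# Sketch: load the persisted model and TF-IDF vectorizer from disk.
import joblib

model = joblib.load("modelo_fake_news.pkl")
vectorizer = joblib.load("vectorizer_tfidf.pkl")
```

Both files must exist in the directory where the app expects them before it can make predictions.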
CSV files not found
If training fails because it cannot find the CSV dataset files, make sure they are placed where the training script expects them.
Package versions
The project is tested with:
- pandas 1.3+
- nltk 3.6+
- scikit-learn 0.24+
- joblib 1.0+
- streamlit 1.0+