Overview
This project implements a complete data preparation pipeline for an e-commerce use case using Python, NumPy, and Pandas. The workflow starts with data from multiple sources (generated data, CSV/Excel files, web scraping), then cleans, transforms, and consolidates them into a final dataset ready for analysis and modeling.

Deliverables:

- Python script with the complete workflow (datos.py)
- Jupyter notebook with step-by-step explanation (datos.ipynb)
- Final structured dataset (dataset_final.csv and dataset_final.xlsx)
- Optional Streamlit dashboard for data exploration (dashboard.py)
Project Structure
Data Preparation Workflow
1. Data Generation with NumPy
Generate synthetic e-commerce customer data using NumPy:

- Customer ID
- Age (18-70 years)
- Purchase amount (500)
- Number of purchases (Poisson distribution)
- Customer segment (Bronze/Silver/Gold)
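The generation step above can be sketched as follows. The sample size, random seed, distribution parameters, and output file name are illustrative assumptions, not values taken from the project:

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded for reproducibility (seed is an assumption)
n = 1000

customer_id = np.arange(1, n + 1)
age = rng.integers(18, 71, size=n)                # 18-70 years, inclusive
purchase_amount = rng.uniform(10, 500, size=n)    # upper bound from the README; lower bound assumed
n_purchases = rng.poisson(lam=3, size=n)          # Poisson-distributed purchase counts
segment = rng.choice(["Bronze", "Silver", "Gold"], size=n, p=[0.5, 0.3, 0.2])

# Array saving, as listed in the learning outcomes (file name is hypothetical)
np.save("clientes_generados.npy", np.column_stack([customer_id, age]))
```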
2. Loading from CSV and Excel
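A minimal sketch of this loading step. Since the project's data files are not reproduced here, an in-memory CSV stands in for a hypothetical `clientes.csv`; Excel loading is shown as a comment because it requires the `openpyxl` dependency:

```python
from io import StringIO

import pandas as pd

# Stand-in for a CSV file on disk; in the project this would be
# pd.read_csv("clientes.csv") (file name assumed).
csv_data = StringIO("customer_id,age,segment\n1,34,Gold\n2,25,Bronze\n")
df_csv = pd.read_csv(csv_data)

# Excel loading is analogous:
# df_excel = pd.read_excel("clientes.xlsx", engine="openpyxl")

# NumPy array conversion, as listed in the learning outcomes
arr = df_csv[["customer_id", "age"]].to_numpy()
```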
Load existing customer data from multiple file formats.

3. Web Scraping
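The project lists `requests` and `lxml` for this step. The sketch below swaps in the standard-library XML parser and an inline HTML snippet so it runs offline; the table layout is an assumption:

```python
import xml.etree.ElementTree as ET

# In the project, the page would be fetched and parsed along these lines:
#   resp = requests.get(url)
#   tree = lxml.html.fromstring(resp.content)
# Here an inline, well-formed snippet is parsed with the stdlib instead.
html = """
<table>
  <tr><th>product</th><th>price</th></tr>
  <tr><td>Widget</td><td>19.99</td></tr>
  <tr><td>Gadget</td><td>34.50</td></tr>
</table>
"""
root = ET.fromstring(html)
# Collect cell text for every row after the header
rows = [[cell.text for cell in tr] for tr in root.findall("tr")[1:]]
```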
Extract additional data from web sources.

4. Data Consolidation
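Consolidation with pandas might look like this; the frame and column names are illustrative:

```python
import pandas as pd

clientes = pd.DataFrame({"customer_id": [1, 2, 3], "age": [34, 25, 41]})
compras = pd.DataFrame({"customer_id": [1, 2, 4], "amount": [120.0, 80.5, 60.0]})

# Stack rows from two sources; ignore_index avoids duplicate index labels
stacked = pd.concat([clientes, clientes], ignore_index=True)

# Column-wise merge on the shared key; how="left" keeps every customer,
# leaving NaN where the other source has no match
merged = clientes.merge(compras, on="customer_id", how="left")
```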
Merge data from multiple sources.

5. Data Cleaning
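A sketch of the cleaning step; the IQR clipping rule is a common convention, not something stated in the project:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [34.0, None, 41.0, 41.0, 300.0],  # a missing value and an implausible outlier
    "customer_id": [1, 2, 3, 3, 5],          # customer 3 appears twice
})

df = df.drop_duplicates(subset="customer_id")     # duplicate removal
df["age"] = df["age"].fillna(df["age"].median())  # median imputation

# IQR-based outlier handling: clip values beyond 1.5 * IQR from the quartiles
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
df["age"] = df["age"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
```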
Handle missing values, duplicates, and data quality issues.

6. Data Wrangling and Feature Engineering
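The wrangling step can be sketched like this; bin edges, labels, and column names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [19, 35, 52, 68],
    "total_spent": [100.0, 400.0, 250.0, 80.0],
    "n_purchases": [2, 8, 5, 1],
})

# Derived feature: average ticket per purchase
df["avg_ticket"] = df["total_spent"] / df["n_purchases"]

# Bin the continuous age variable into labelled groups
df["age_group"] = pd.cut(df["age"], bins=[18, 30, 50, 70],
                         labels=["young", "adult", "senior"])

# Data type transformation: downcast the count column
df["n_purchases"] = df["n_purchases"].astype("int32")
```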
Transform and create new features.

7. Final Dataset Export
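A minimal export sketch producing `dataset_final.csv` as named in the deliverables; the column selection and order are assumptions:

```python
import pandas as pd

df = pd.DataFrame({"customer_id": [1, 2], "age": [34, 25], "segment": ["Gold", "Bronze"]})

# Select and order the columns to export
columns = ["customer_id", "segment", "age"]
df[columns].to_csv("dataset_final.csv", index=False)  # index=False avoids a spurious index column

# Excel export works the same way but needs the openpyxl dependency:
# df[columns].to_excel("dataset_final.xlsx", index=False)
```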
Export the final prepared dataset.

Installation and Execution
1. Prerequisites
- Python 3.x
- Terminal access (CMD, PowerShell, Terminal)
- Git (optional)
2. Clone/Download Project
3. Create Virtual Environment
Windows:

4. Install Dependencies
- numpy - Numerical computing
- pandas - Data manipulation
- lxml - XML/HTML parsing
- requests - HTTP requests
- openpyxl - Excel support
- streamlit - Dashboard (optional)
5. Run Main Script
Outputs:

- dataset_final.csv
- dataset_final.xlsx
- Various intermediate files
6. Run Jupyter Notebook (Optional)
7. Run Streamlit Dashboard (Optional)
Streamlit Dashboard
Interactive dashboard for exploring the prepared data.

Key Learning Outcomes
- NumPy Data Generation
  - Creating synthetic datasets
  - Random number generation
  - Array operations and saving
- Multi-source Data Loading
  - CSV and Excel file reading
  - NumPy array conversion
  - Web scraping with requests/lxml
- Data Consolidation
  - Concatenating DataFrames
  - Merging datasets
  - Handling index conflicts
- Data Cleaning
  - Missing value imputation
  - Duplicate removal
  - Outlier detection and handling
- Feature Engineering
  - Creating derived features
  - Binning continuous variables
  - Data type transformations
- Data Export
  - Multiple format support
  - Column selection and ordering
  - Best practices for data storage
Best Practices Demonstrated
- Reproducible workflow: Complete pipeline in Python script
- Documentation: Jupyter notebook with explanations
- Intermediate files: Save data at each stage for debugging
- Error handling: Validate data at each step
- Visualization: Interactive dashboard for exploration
- Version control: Track changes to data preparation code