Overview
The bootcamp uses real-world and curated datasets to teach data science concepts. Each dataset is carefully selected to demonstrate specific techniques and business scenarios.
All datasets are included in the source repository under their respective module folders and the Bootcamp-main/ directory.
Project Datasets by Module
These are the primary datasets used in module projects:
Module A3: clientes_ecommerce (CSV/XLSX)
Location: 002_A3/PROYECTO/
Format: CSV and Excel (.xlsx)
Description: E-commerce customer data used for practicing Pandas data manipulation, cleaning, and transformation techniques.
Use Cases:
- Data loading and exploration
- Handling missing values
- Data type conversions
- Creating derived features
- Data aggregation and grouping
- Exporting cleaned data
Files:
- clientes_ecommerce.csv: Raw customer data
- clientes_ecommerce.xlsx: Excel version
- dataset_consolidado.csv: Merged dataset
- dataset_limpio.csv: Cleaned version
- dataset_final.csv/xlsx: Final processed data
- dataset_wrangle.csv: Data wrangling practice
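The Module A3 use cases (missing values, type conversion, derived features, grouping, export) can be sketched in a few lines of Pandas. This is a minimal illustration, not the project's actual notebook: the column names and the inline sample data are assumptions standing in for clientes_ecommerce.csv, whose real schema is not shown here.

```python
import io

import pandas as pd

# Stand-in for clientes_ecommerce.csv; the column names are hypothetical.
RAW_CSV = """customer_id,city,signup_date,orders,total_spent
1,Santiago,2023-01-15,4,200.0
2,,2023-02-01,2,
3,Lima,2023-02-20,0,0.0
4,Santiago,not a date,5,350.0
"""


def clean_customers(df: pd.DataFrame) -> pd.DataFrame:
    """Apply a few typical cleaning steps and return a new DataFrame."""
    out = df.copy()
    # Missing values: label unknown cities, fill missing spend with the median.
    out["city"] = out["city"].fillna("Unknown")
    out["total_spent"] = out["total_spent"].fillna(out["total_spent"].median())
    # Type conversion: unparseable dates become NaT instead of raising.
    out["signup_date"] = pd.to_datetime(out["signup_date"], errors="coerce")
    # Derived feature: average order value, left missing when there are no orders.
    out["avg_order_value"] = (out["total_spent"] / out["orders"]).where(out["orders"] > 0)
    return out


raw = pd.read_csv(io.StringIO(RAW_CSV))
clean = clean_customers(raw)

# Aggregation and grouping: total spend per city.
spend_by_city = clean.groupby("city")["total_spent"].sum()
print(spend_by_city)

# Export step, left commented out so the sketch writes no files:
# clean.to_csv("dataset_limpio.csv", index=False)
```

Whether the median is the right fill value depends on the column; dropping rows or flagging them may be more appropriate for the real data.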
Module A4: ecommerce_sales_data-2.csv
Location: 003_A4/PROYECTO/
Format: CSV
Description: E-commerce sales dataset for the “Comercio Ya” project, focused on exploratory data analysis and visualization.
Use Cases:
- Univariate and multivariate analysis
- Time series analysis of sales trends
- Customer segmentation visualization
- Statistical summaries and distributions
- Creating business insights dashboards
- Correlation analysis
- Descriptive statistics
- Matplotlib and Seaborn visualizations
- Box plots, histograms, and scatter plots
- Heatmaps for correlation matrices
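As a rough illustration of the descriptive statistics and correlation analysis listed above, the sketch below builds a small synthetic table and summarizes it. The columns (units, unit_price, revenue, channel) are assumptions made for the example and are not taken from ecommerce_sales_data-2.csv.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the sales data; column names are hypothetical.
rng = np.random.default_rng(seed=0)
units = rng.integers(1, 10, size=200)
unit_price = rng.uniform(5.0, 50.0, size=200)
sales = pd.DataFrame({
    "units": units,
    "unit_price": unit_price,
    "revenue": units * unit_price,
    "channel": rng.choice(["web", "store"], size=200),
})

# Univariate view: count, mean, std, and quartiles for each numeric column.
summary = sales.describe()

# Multivariate view: pairwise Pearson correlations between numeric columns.
corr = sales.select_dtypes(include="number").corr()

print(summary.loc[["mean", "std"]])
print(corr.round(2))
```

The resulting matrix can be passed to seaborn.heatmap(corr, annot=True) to draw the correlation heatmap, provided Seaborn and Matplotlib are installed.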
Module A6: amazon_sales_dataset.csv
Location: 005_A6/PROYECTO/
Format: CSV
Description: Amazon sales data used for building machine learning models to predict customer spending patterns.
Use Cases:
- Data preprocessing and feature engineering
- Training supervised learning models
- Regression analysis for spend prediction
- Model evaluation and comparison
- Cross-validation techniques
- Hyperparameter tuning
- Linear regression
- K-Nearest Neighbors
- Decision trees
- Random forests
- Gradient boosting
Files:
- Actual_transactions_from_UK_retailer.csv: Alternative retail dataset
- clientes_preprocesados.csv: Preprocessed customer data
Notebook: proyecto_modulo6_gasto_clientes_completo.ipynb
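Model comparison with cross-validation, as listed for Module A6, might look like the following sketch. It assumes scikit-learn is installed and uses synthetic features and a synthetic spend target, not the Amazon dataset; the choice of two models and five folds is only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for customer features and spend; not the project data.
rng = np.random.default_rng(seed=42)
X = rng.normal(size=(300, 3))
y = 50 + 20 * X[:, 0] - 10 * X[:, 1] + rng.normal(scale=5.0, size=300)

models = {
    "linear_regression": LinearRegression(),
    # KNN is distance-based, so features are scaled inside the pipeline.
    "knn": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)),
}

# The same shuffled 5-fold split is used for every model so scores are comparable.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
mean_r2 = {
    name: cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
    for name, model in models.items()
}

for name, score in mean_r2.items():
    print(f"{name}: mean R^2 = {score:.3f}")
```

Putting the scaler inside the pipeline means it is refit on each training fold, which keeps the validation fold out of the preprocessing step.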
Module A7: Train.csv & Test.csv
Location: 006_A7/PROYECTO/
Format: CSV (split into training and testing sets)
Description: Pre-split dataset for advanced machine learning techniques, focusing on proper model validation.
Use Cases:
- Unsupervised learning (clustering, PCA)
- Advanced feature engineering
- Model pipeline development
- Handling train/test split properly
- Avoiding data leakage
- Ensemble methods
- The data is already split to simulate real-world ML workflows
- Training set used for model development
- Test set reserved for final evaluation
- Proper validation strategies
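One common way to respect the pre-made split and avoid data leakage is to compute preprocessing statistics on the training set only and reuse them on the test set. The sketch below shows this for standardization; the two DataFrames are synthetic stand-ins for Train.csv and Test.csv, and the column names f1 and f2 are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for Train.csv and Test.csv; column names are hypothetical.
rng = np.random.default_rng(seed=1)
train = pd.DataFrame({"f1": rng.normal(10, 2, 100), "f2": rng.normal(0, 5, 100)})
test = pd.DataFrame({"f1": rng.normal(10, 2, 40), "f2": rng.normal(0, 5, 40)})

# Fit the preprocessing on the training set only...
train_mean = train.mean()
train_std = train.std(ddof=0)

# ...then apply the same statistics to both sets. Computing the mean and
# standard deviation on train + test together would leak test information.
train_scaled = (train - train_mean) / train_std
test_scaled = (test - train_mean) / train_std

print(train_scaled.mean().round(6))
print(test_scaled.mean().round(3))
```

scikit-learn's StandardScaler follows the same pattern: call fit (or fit_transform) on the training data and transform on the test data.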
Module A8: Fashion MNIST
Location: Loaded programmatically via TensorFlow/PyTorch
Format: Built-in dataset (28x28 grayscale images)
Description: Fashion MNIST is a dataset of 70,000 fashion product images (60,000 training + 10,000 testing) across 10 categories.
Categories:
- T-shirt/top
- Trouser
- Pullover
- Dress
- Coat
- Sandal
- Shirt
- Sneaker
- Bag
- Ankle boot
Projects:
- Building neural networks from scratch
- Image classification
- Comparing Keras/TensorFlow vs PyTorch implementations
- Understanding CNNs (Convolutional Neural Networks)
- Training, validation, and testing workflows
- Model optimization and regularization
Notebooks:
- proyecto_mod8_keras.ipynb: Keras/TensorFlow implementation
- proyecto_mod8_pytorch.ipynb: PyTorch implementation
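A typical first step with Fashion MNIST is mapping integer labels to category names and scaling pixel values. The sketch below is one possible way to do that and is not taken from the project notebooks; prepare_images is a hypothetical helper, and random arrays replace the real images so the example runs without downloading anything.

```python
import numpy as np

# Label index -> category name, in the dataset's standard order (0 to 9).
CLASS_NAMES = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]


def prepare_images(images: np.ndarray) -> np.ndarray:
    """Scale uint8 pixels to [0, 1] and add a trailing channel axis."""
    scaled = images.astype("float32") / 255.0
    return scaled[..., np.newaxis]


# With TensorFlow installed, the real arrays can be loaded with:
#   from tensorflow.keras.datasets import fashion_mnist
#   (x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()
# Random data of the same shape and dtype is used here so the sketch runs offline.
rng = np.random.default_rng(seed=0)
x_fake = rng.integers(0, 256, size=(8, 28, 28), dtype=np.uint8)
y_fake = rng.integers(0, 10, size=8)

x_ready = prepare_images(x_fake)
labels = [CLASS_NAMES[i] for i in y_fake]
print(x_ready.shape, x_ready.dtype, labels[:3])
```

The trailing channel axis matches the channels-last layout Keras expects; PyTorch convolution layers expect channels first, so the axis would go in position 1 instead.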
Supporting Datasets
Additional datasets used throughout the bootcamp:
Module A3: Pandas Practice Datasets
Titanic Dataset
Files: Multiple locations in Bootcamp-main/Modulo3/
Classic dataset for learning data cleaning, exploration, and basic ML.
Skills: Handling missing values, categorical encoding, survival analysis
Employee Datasets
Files: 002_A3/L2_analisis_caso/empleados*.csv/xlsx
HR data for practicing merge operations and data cleaning.
Skills: Joining tables, data consolidation, analysis
Transaction Data
Files: 002_A3/L5_analisis_caso/transacciones_*.csv
Financial transaction data for time series and aggregation practice.
Skills: Date/time handling, aggregations, transformations
Financial Data
Files: Bootcamp-main/Modulo3/Leccion3/AAPL_historico.csv
Apple stock historical data for time series analysis.
Skills: Working with financial data, time series operations
Module A4: Visualization Datasets
House EDA
File: Bootcamp-main/Modulo4/Leccion4/data/house_eda.csv
Housing data for exploratory analysis and visualization.
Sales Data
File: Bootcamp-main/Modulo4/Leccion1/data/ventas_simuladas.csv
Simulated sales data for visualization practice.
Performance Reports
File: Bootcamp-main/Modulo3/Leccion6/data/reporte_rendimiento.csv
Department performance data for creating reports.
E-commerce
File: Bootcamp-main/Modulo4/Leccion4/data/Ecommerce.csv
E-commerce data for multivariate analysis.
Module A5: Statistical Datasets
Distributions Examples
Files: 004_A5/EJERCICIOS/Leccion3/distributions_examples.csv
Data demonstrating various probability distributions.
Lightbulb Exercise
File: 004_A5/EJERCICIOS/Ejercicio_bombillas.xlsx
Quality control data for hypothesis testing practice.
Module A6: Machine Learning Datasets
Insurance
File: Bootcamp-main/Modulo6/Leccion4/insurance.csv
Medical insurance cost prediction for regression practice.
Auto MPG
Files: Bootcamp-main/Modulo6/Leccion4/auto-mpg.csv, autos.csv
Car fuel efficiency for regression modeling.
Bank Marketing
File: Bootcamp-main/Modulo6/Leccion5/bank-full.csv
Bank marketing campaign data for classification.
Breast Cancer
File: Bootcamp-main/Modulo6/Leccion5/Breast_Cancer.csv
Medical diagnosis data for binary classification.
Wine Quality
File: Bootcamp-main/Modulo6/Leccion5/WineQT.csv
Wine quality ratings for multiclass classification.
Telco Churn
File: Bootcamp-main/Modulo6/Leccion7/WA_Fn-UseC_-Telco-Customer-Churn.csv
Customer churn prediction for decision trees and ensemble methods.
Sports Classification
Files: Bootcamp-main/Modulo6/Leccion5/deportes*.csv
Sports categorization for KNN classification practice.
Bank Clients
File: Bootcamp-main/Modulo6/Leccion5/clientes_banco.csv
Customer segmentation and classification.
Dataset Categories
By Domain
Domains: E-commerce & Retail, Financial, Healthcare, Images, HR & Business
E-commerce & Retail:
- clientes_ecommerce (A3)
- ecommerce_sales_data-2 (A4)
- amazon_sales_dataset (A6)
- UK retailer transactions (A6)
- Various e-commerce files (Modulo3/4)
Working with Datasets
Loading Data
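Since the bootcamp datasets come as CSV and Excel files, a small loader that dispatches on the file extension can be convenient. The sketch below is one possible approach; load_dataset is a hypothetical helper, not something defined in the repository, and the demo uses a temporary CSV so it runs without the bootcamp files.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_dataset(path) -> pd.DataFrame:
    """Load a CSV or Excel file into a DataFrame based on its extension."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix in {".xlsx", ".xls"}:
        # Reading .xlsx needs an Excel engine such as openpyxl installed.
        return pd.read_excel(path)
    raise ValueError(f"Unsupported file type: {suffix}")


# Demo with a temporary CSV so the sketch runs without the repository.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo_path.write_text("id,amount\n1,10.5\n2,20.0\n", encoding="utf-8")
    df = load_dataset(demo_path)

df.info()         # column names, dtypes, and non-null counts
print(df.head())  # first rows
```

Assuming the working directory is the repository root, a call such as load_dataset("002_A3/PROYECTO/clientes_ecommerce.csv") would load the Module A3 project file.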
Data Quality Tips
Handle Missing Values
Check for nulls with df.isnull().sum() and decide on a strategy (drop, fill, interpolate).
Dataset Best Practices
Document Your Data: Create a data dictionary explaining each column, its type, and meaning. This is essential for projects.
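A data dictionary can be started programmatically and then completed by hand. The sketch below is a minimal example under assumed names: build_data_dictionary is a hypothetical helper, and the sample table and descriptions are invented for illustration. It also reuses the df.isnull().sum() check from the data quality tips, since missing counts are a natural column in the dictionary.

```python
import numpy as np
import pandas as pd


def build_data_dictionary(df: pd.DataFrame, descriptions: dict) -> pd.DataFrame:
    """Summarize each column: dtype, missing count, and a written description."""
    return pd.DataFrame({
        "column": df.columns,
        "dtype": [str(t) for t in df.dtypes],
        "missing": df.isnull().sum().to_numpy(),
        "description": [descriptions.get(col, "TODO: describe") for col in df.columns],
    })


# Hypothetical example table and descriptions.
df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "age": [34.0, np.nan, 51.0],
    "city": ["Lima", None, "Quito"],
})
data_dict = build_data_dictionary(
    df,
    {"customer_id": "Unique customer identifier", "age": "Age in years"},
)
print(data_dict)

# The missing counts point at columns that need a strategy, for example:
df["age"] = df["age"].interpolate()        # fill from neighboring rows
df["city"] = df["city"].fillna("Unknown")  # fill with a placeholder
```

Columns without an entry in the descriptions mapping are marked "TODO: describe" so gaps in the documentation stay visible.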
Recommended Folder Structure
Data Sources & Ethics
Dataset Origins
Curated Educational
Most datasets are educational versions created or modified for learning purposes.
Public Datasets
Some datasets (Titanic, Fashion MNIST) are well-known public datasets from Kaggle and ML libraries.
Simulated Data
Some datasets (e.g., ventas_simuladas.csv) are synthetically generated to teach specific concepts.
Anonymized Real Data
Where real data is used, it has been anonymized and aggregated for privacy.
Data Ethics
Next Steps
Jupyter Notebooks
See which notebooks use these datasets
Tools & Libraries
Learn how to work with these datasets
Glossary
Understand data science terminology
Contact Manager Project
Start with the first Python project