
Overview

The bootcamp uses real-world and curated datasets to teach data science concepts. Each dataset is carefully selected to demonstrate specific techniques and business scenarios.
All datasets are included in the source repository under their respective module folders and the Bootcamp-main/ directory.

Project Datasets by Module

These are the primary datasets used in module projects:
clientes_ecommerce (Module A3)

Location: 002_A3/PROYECTO/
Format: CSV and Excel (.xlsx)
Description: E-commerce customer data used for practicing Pandas data manipulation, cleaning, and transformation techniques.
Use Cases:
  • Data loading and exploration
  • Handling missing values
  • Data type conversions
  • Creating derived features
  • Data aggregation and grouping
  • Exporting cleaned data
Related Files:
  • clientes_ecommerce.csv - Raw customer data
  • clientes_ecommerce.xlsx - Excel version
  • dataset_consolidado.csv - Merged dataset
  • dataset_limpio.csv - Cleaned version
  • dataset_final.csv/xlsx - Final processed data
  • dataset_wrangle.csv - Data wrangling practice
Key Fields: Typically includes customer demographics, transaction history, and behavioral metrics.
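The cleaning workflow above can be sketched in a few lines of pandas. The column names below (`cliente_id`, `edad`, `total_gastado`, `ciudad`) are illustrative stand-ins, not the actual schema of `clientes_ecommerce.csv`:

```python
import pandas as pd

# Toy stand-in for clientes_ecommerce.csv; in practice: pd.read_csv(...)
df = pd.DataFrame({
    "cliente_id": [1, 2, 3, 4],
    "edad": [34, None, 29, 41],
    "total_gastado": ["120.50", "89.90", None, "300.00"],
    "ciudad": ["Madrid", "Lima", "Madrid", None],
})

df["edad"] = df["edad"].fillna(df["edad"].median())       # handle missing values
df["total_gastado"] = pd.to_numeric(df["total_gastado"])  # data type conversion
df["total_gastado"] = df["total_gastado"].fillna(0.0)
df["gasto_alto"] = df["total_gastado"] > 100              # derived feature
resumen = df.groupby("ciudad")["total_gastado"].sum()     # aggregation/grouping
df.to_csv("dataset_limpio_demo.csv", index=False)         # export cleaned data
```

Each line maps to one of the use cases listed above; the same sequence scales to the real file.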
ecommerce_sales_data-2 (Module A4)

Location: 003_A4/PROYECTO/
Format: CSV
Description: E-commerce sales dataset for the “Comercio Ya” project, focused on exploratory data analysis and visualization.
Use Cases:
  • Univariate and multivariate analysis
  • Time series analysis of sales trends
  • Customer segmentation visualization
  • Statistical summaries and distributions
  • Creating business insights dashboards
  • Correlation analysis
Analysis Techniques:
  • Descriptive statistics
  • Matplotlib and Seaborn visualizations
  • Box plots, histograms, and scatter plots
  • Heatmaps for correlation matrices
Project Goal: Analyze sales patterns and customer behavior to provide actionable business recommendations.
amazon_sales_dataset (Module A6)

Location: 005_A6/PROYECTO/
Format: CSV
Description: Amazon sales data used for building machine learning models to predict customer spending patterns.
Use Cases:
  • Data preprocessing and feature engineering
  • Training supervised learning models
  • Regression analysis for spend prediction
  • Model evaluation and comparison
  • Cross-validation techniques
  • Hyperparameter tuning
ML Techniques Applied:
  • Linear regression
  • K-Nearest Neighbors
  • Decision trees
  • Random forests
  • Gradient boosting
Related Dataset:
  • Actual_transactions_from_UK_retailer.csv - Alternative retail dataset
  • clientes_preprocesados.csv - Preprocessed customer data
Project: proyecto_modulo6_gasto_clientes_completo.ipynb
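The model comparison can be sketched with scikit-learn; synthetic features from `make_regression` stand in here for the real Amazon sales data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the spend-prediction features.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

models = {
    "linear": LinearRegression(),
    "knn": KNeighborsRegressor(n_neighbors=5),
    "tree": DecisionTreeRegressor(random_state=0),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "boosting": GradientBoostingRegressor(random_state=0),
}

# 5-fold cross-validated R² for each candidate model.
scores = {name: cross_val_score(m, X, y, cv=5, scoring="r2").mean()
          for name, m in models.items()}
best = max(scores, key=scores.get)
```

The same loop works unchanged once `X` and `y` come from the preprocessed customer file; cross-validation keeps the comparison fair before any hyperparameter tuning.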
Pre-split Train/Test Dataset (Module A7)

Location: 006_A7/PROYECTO/
Format: CSV (split into training and testing sets)
Description: Pre-split dataset for advanced machine learning techniques, focusing on proper model validation.
Use Cases:
  • Unsupervised learning (clustering, PCA)
  • Advanced feature engineering
  • Model pipeline development
  • Handling train/test split properly
  • Avoiding data leakage
  • Ensemble methods
Key Concepts:
  • The data is already split to simulate real-world ML workflows
  • Training set used for model development
  • Test set reserved for final evaluation
  • Proper validation strategies
Analysis Focus: Advanced ML techniques including dimensionality reduction and unsupervised learning.
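A scikit-learn `Pipeline` enforces the "fit preprocessing on training data only" discipline automatically, which is the main leakage safeguard this module teaches. A sketch using the built-in Iris data in place of the module's CSV files:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
# Mirror the module's setup: the test set stays untouched until the end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Scaler and PCA are fitted on X_train only inside the pipeline,
# so no test-set statistics leak into preprocessing.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("clf", LogisticRegression(max_iter=200)),
])
pipe.fit(X_train, y_train)
accuracy = pipe.score(X_test, y_test)  # final evaluation, done exactly once
```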
Fashion MNIST (Module A8)

Location: Loaded programmatically via TensorFlow/PyTorch
Format: Built-in dataset (28x28 grayscale images)
Description: Fashion MNIST is a dataset of 70,000 fashion product images (60,000 training + 10,000 testing) across 10 categories.
Categories:
  1. T-shirt/top
  2. Trouser
  3. Pullover
  4. Dress
  5. Coat
  6. Sandal
  7. Shirt
  8. Sneaker
  9. Bag
  10. Ankle boot
Use Cases:
  • Building neural networks from scratch
  • Image classification
  • Comparing Keras/TensorFlow vs PyTorch implementations
  • Understanding CNNs (Convolutional Neural Networks)
  • Training, validation, and testing workflows
  • Model optimization and regularization
Loading the Data:
# Keras/TensorFlow
from tensorflow.keras.datasets import fashion_mnist
(X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()

# PyTorch
from torchvision import datasets
train_data = datasets.FashionMNIST(root='data', train=True, download=True)
test_data = datasets.FashionMNIST(root='data', train=False, download=True)
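Whichever framework you use, the raw uint8 pixels are usually scaled to [0, 1] and, for Keras losses, the labels one-hot encoded. A minimal sketch with synthetic arrays matching Fashion MNIST's shapes (so it runs without downloading anything):

```python
import numpy as np

# Random stand-ins with Fashion MNIST's shapes: 28x28 grayscale, 10 classes.
X_train = np.random.randint(0, 256, size=(100, 28, 28), dtype=np.uint8)
y_train = np.random.randint(0, 10, size=100)

X_scaled = X_train.astype("float32") / 255.0   # scale pixels into [0, 1]
X_cnn = X_scaled[..., np.newaxis]              # add a channel axis for CNNs
y_onehot = np.eye(10)[y_train]                 # one-hot labels for Keras
```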
Projects:
  • proyecto_mod8_keras.ipynb - Keras/TensorFlow implementation
  • proyecto_mod8_pytorch.ipynb - PyTorch implementation

Supporting Datasets

Additional datasets used throughout the bootcamp:

Module A3: Pandas Practice Datasets

Titanic Dataset

Files: Multiple locations in Bootcamp-main/Modulo3/
Classic dataset for learning data cleaning, exploration, and basic ML.
Skills: Handling missing values, categorical encoding, survival analysis
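A toy version of the typical Titanic workflow, showing only three of the real columns:

```python
import pandas as pd

# Tiny Titanic-style frame; the real file has many more columns.
titanic = pd.DataFrame({
    "Age": [22.0, None, 38.0, None],
    "Sex": ["male", "female", "female", "male"],
    "Survived": [0, 1, 1, 0],
})

titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())      # missing values
encoded = pd.get_dummies(titanic, columns=["Sex"], drop_first=True)  # categorical encoding
survival_rate = titanic.groupby("Sex")["Survived"].mean()            # survival analysis
```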

Employee Datasets

Files: 002_A3/L2_analisis_caso/empleados*.csv/xlsx
HR data for practicing merge operations and data cleaning.
Skills: Joining tables, data consolidation, analysis

Transaction Data

Files: 002_A3/L5_analisis_caso/transacciones_*.csv
Financial transaction data for time series and aggregation practice.
Skills: Date/time handling, aggregations, transformations

Financial Data

Files: Bootcamp-main/Modulo3/Leccion3/AAPL_historico.csv
Apple stock historical data for time series analysis.
Skills: Working with financial data, time series operations
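The core time-series operations can be sketched with a synthetic price series in place of the real AAPL file (the `Close` column name mirrors typical stock exports):

```python
import numpy as np
import pandas as pd

# Synthetic daily closes standing in for AAPL_historico.csv.
fechas = pd.date_range("2024-01-01", periods=60, freq="D")
precios = pd.Series(np.linspace(180.0, 190.0, 60), index=fechas, name="Close")

media_movil = precios.rolling(window=7).mean()                  # 7-day moving average
retornos = precios.pct_change()                                 # daily returns
mensual = precios.groupby(precios.index.to_period("M")).mean()  # monthly mean
```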

Module A4: Visualization Datasets

House EDA

File: Bootcamp-main/Modulo4/Leccion4/data/house_eda.csv
Housing data for exploratory analysis and visualization.

Sales Data

File: Bootcamp-main/Modulo4/Leccion1/data/ventas_simuladas.csv
Simulated sales data for visualization practice.

Performance Reports

File: Bootcamp-main/Modulo3/Leccion6/data/reporte_rendimiento.csv
Department performance data for creating reports.

E-commerce

File: Bootcamp-main/Modulo4/Leccion4/data/Ecommerce.csv
E-commerce data for multivariate analysis.

Module A5: Statistical Datasets

Distributions Examples

Files: 004_A5/EJERCICIOS/Leccion3/distributions_examples.csv
Data demonstrating various probability distributions.

Lightbulb Exercise

File: 004_A5/EJERCICIOS/Ejercicio_bombillas.xlsx
Quality control data for hypothesis testing practice.
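A sketch of the hypothesis-testing workflow with SciPy; the lifetimes below are simulated, while the real exercise uses the measurements in the Excel file:

```python
import numpy as np
from scipy import stats

# Simulated lifetimes (hours) for two bulb batches.
rng = np.random.default_rng(1)
lote_a = rng.normal(1000, 50, 30)
lote_b = rng.normal(980, 50, 30)

# Two-sample t-test: H0 says both batches have the same mean lifetime.
t_stat, p_value = stats.ttest_ind(lote_a, lote_b)
rechazar_h0 = p_value < 0.05  # reject H0 at the 5% significance level
```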

Module A6: Machine Learning Datasets

Insurance

File: Bootcamp-main/Modulo6/Leccion4/insurance.csv
Medical insurance cost prediction for regression practice.

Auto MPG

Files: Bootcamp-main/Modulo6/Leccion4/auto-mpg.csv, autos.csv
Car fuel efficiency for regression modeling.

Bank Marketing

File: Bootcamp-main/Modulo6/Leccion5/bank-full.csv
Bank marketing campaign data for classification.

Breast Cancer

File: Bootcamp-main/Modulo6/Leccion5/Breast_Cancer.csv
Medical diagnosis data for binary classification.

Wine Quality

File: Bootcamp-main/Modulo6/Leccion5/WineQT.csv
Wine quality ratings for multiclass classification.

Telco Churn

File: Bootcamp-main/Modulo6/Leccion7/WA_Fn-UseC_-Telco-Customer-Churn.csv
Customer churn prediction for decision trees and ensemble methods.

Sports Classification

Files: Bootcamp-main/Modulo6/Leccion5/deportes*.csv
Sports categorization for KNN classification practice.

Bank Clients

File: Bootcamp-main/Modulo6/Leccion5/clientes_banco.csv
Customer segmentation and classification.

Dataset Categories

By Domain

E-commerce:
  • clientes_ecommerce (A3)
  • ecommerce_sales_data-2 (A4)
  • amazon_sales_dataset (A6)
  • UK retailer transactions (A6)
  • Various e-commerce files (Modulo3/4)
Common Fields: Customer ID, transaction date, product, price, quantity, category

Working with Datasets

Loading Data

import pandas as pd

# Load CSV
df = pd.read_csv('path/to/file.csv')

# Load with a specific encoding (e.g., latin-1 for some Spanish-language exports)
df = pd.read_csv('file.csv', encoding='latin-1')

# Load with a different delimiter
df = pd.read_csv('file.csv', sep=';')

# Load large files in chunks ('process' stands in for your own function)
chunks = pd.read_csv('large_file.csv', chunksize=10000)
for chunk in chunks:
    process(chunk)

Data Quality Tips

1. Inspect First: Always use df.head(), df.info(), and df.describe() before starting analysis.
2. Handle Missing Values: Check for nulls with df.isnull().sum() and decide on a strategy (drop, fill, or interpolate).
3. Verify Data Types: Ensure columns have the correct types (dates as datetime, numbers as numeric).
4. Check for Outliers: Use box plots and statistical methods to identify anomalies.
5. Document Changes: Keep track of data transformations for reproducibility.
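The tips above applied in order to a toy transactions table (column names are illustrative; in practice the frame comes from `pd.read_csv`):

```python
import pandas as pd

df = pd.DataFrame({
    "fecha": ["2024-01-01", "2024-01-02", "2024-01-03",
              "2024-01-04", "2024-01-05", None],
    "monto": [10.0, 12.0, 11.0, 13.0, 9.0, 500.0],
})

# 1. Inspect first
print(df.head())
df.info()

# 2. Count missing values before choosing a strategy
nulos = df.isnull().sum()

# 3. Verify data types: parse date strings into real datetimes
df["fecha"] = pd.to_datetime(df["fecha"])

# 4. Flag outliers with the 1.5·IQR rule
q1, q3 = df["monto"].quantile([0.25, 0.75])
iqr = q3 - q1
atipicos = df[(df["monto"] < q1 - 1.5 * iqr) | (df["monto"] > q3 + 1.5 * iqr)]

# 5. Document the changes alongside the notebook
cambios = ["parsed 'fecha' to datetime", f"flagged {len(atipicos)} outlier(s)"]
```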

Dataset Best Practices

Never Modify Original Files: Always work on copies of datasets. Keep raw data in a separate data/raw/ folder.
Use Version Control: Track changes to processed datasets. Name files descriptively: dataset_cleaned.csv, dataset_features_v2.csv.
Document Your Data: Create a data dictionary explaining each column, its type, and meaning. This is essential for projects.
project/
├── data/
│   ├── raw/              # Original, immutable data
│   ├── interim/          # Intermediate transformations
│   ├── processed/        # Final, model-ready data
│   └── external/         # Third-party data
├── notebooks/            # Jupyter notebooks
├── src/                  # Source code
└── reports/              # Generated analysis

Data Sources & Ethics

Dataset Origins

Curated Educational

Most datasets are educational versions created or modified for learning purposes.

Public Datasets

Some datasets (Titanic, Fashion MNIST) are well-known public datasets from Kaggle and ML libraries.

Simulated Data

Some datasets (e.g., ventas_simuladas.csv) are synthetically generated to teach specific concepts.

Anonymized Real Data

Where real data is used, it has been anonymized and aggregated for privacy.

Data Ethics

When working with data in real-world projects:
  • Always respect privacy and confidentiality
  • Check data licenses and usage terms
  • Be aware of bias in datasets
  • Consider ethical implications of your analyses
  • Follow data protection regulations (GDPR, etc.)

Next Steps

Jupyter Notebooks

See which notebooks use these datasets

Tools & Libraries

Learn how to work with these datasets

Glossary

Understand data science terminology

Contact Manager Project

Start with the first Python project
