Tools & Libraries - Data Science Bootcamp

Overview

The bootcamp uses industry-standard tools and libraries for data science and machine learning. This page covers installation, configuration, and essential usage for each tool.

All tools are open-source and widely used in professional data science environments.

Core Technology Stack

Python

Version: 3.8+
Core programming language

Jupyter

Tool: Jupyter Notebook/Lab
Interactive development environment

NumPy

Domain: Numerical Computing
Array operations and linear algebra

Pandas

Domain: Data Manipulation
DataFrames and data analysis

Matplotlib

Domain: Visualization
Static plotting library

Seaborn

Domain: Visualization
Statistical data visualization

scikit-learn

Domain: Machine Learning
Classical ML algorithms

TensorFlow

Domain: Deep Learning
Neural networks with Keras API

PyTorch

Domain: Deep Learning
Dynamic neural networks

Streamlit

Domain: Web Apps
Data app deployment

Keras

Domain: Deep Learning
High-level neural network API

lxml

Domain: Data Parsing
XML and HTML processing

Installation Guide

Method 1: Using pip (Recommended)

Install Python

Download Python 3.8 or higher from python.org.
Verify installation:

python --version
# or
python3 --version

Create Virtual Environment

It’s best practice to use a virtual environment:

# Create virtual environment
python -m venv bootcamp-env

# Activate (Windows)
bootcamp-env\Scripts\activate

# Activate (Mac/Linux)
source bootcamp-env/bin/activate
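
To confirm that activation worked, you can ask the interpreter itself. The following is a minimal sketch using only the standard library; the helper name in_virtualenv is illustrative and not part of the bootcamp code. It relies on the fact that inside a venv, sys.prefix points at the environment while sys.base_prefix still points at the base installation.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix is the environment directory, while
    # sys.base_prefix remains the base Python installation.
    return sys.prefix != sys.base_prefix


print(f"Interpreter: {sys.executable}")
print(f"Inside a virtual environment: {in_virtualenv()}")
```

Conda environments are not venv-based, so this check is expected to report False there even when the environment is active.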

Install Core Libraries

Install all required packages:

# Data manipulation and analysis
pip install numpy pandas

# Visualization
pip install matplotlib seaborn

# Machine Learning
pip install scikit-learn

# Deep Learning
pip install tensorflow keras torch torchvision

# Jupyter and utilities
pip install jupyter jupyterlab
pip install streamlit
pip install lxml requests

Verify Installation

Test that everything works:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import tensorflow as tf
import torch

print("All libraries imported successfully!")
print(f"NumPy: {np.__version__}")
print(f"Pandas: {pd.__version__}")
print(f"Scikit-learn: {sklearn.__version__}")
print(f"TensorFlow: {tf.__version__}")
print(f"PyTorch: {torch.__version__}")
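
The script above stops at the first import that fails, so it only reveals one missing package at a time. As a rough alternative, the sketch below lists every core module that cannot be located, without actually importing the heavy frameworks. The names CORE_MODULES and find_missing are illustrative assumptions.

```python
from importlib.util import find_spec

# Import names (not pip names) of the core libraries listed on this page.
CORE_MODULES = ["numpy", "pandas", "matplotlib", "seaborn",
                "sklearn", "tensorflow", "torch"]


def find_missing(modules):
    """Return the top-level modules that cannot be located in this environment."""
    # For a top-level name, find_spec returns None when the module is
    # missing and does not execute the module when it is present.
    return [name for name in modules if find_spec(name) is None]


missing = find_missing(CORE_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All core modules were found.")
```

Note that the import name can differ from the pip name: scikit-learn is imported as sklearn.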

Method 2: Using Anaconda

Anaconda Installation (Alternative)

Anaconda provides a pre-packaged data science environment:

Download Anaconda from anaconda.com
Create a new environment:

conda create -n bootcamp python=3.9
conda activate bootcamp

Install packages:

# Most packages included with Anaconda
conda install numpy pandas matplotlib seaborn scikit-learn jupyter

# Deep learning frameworks
conda install -c conda-forge tensorflow
conda install pytorch torchvision -c pytorch

# Additional tools
conda install streamlit

Launch Jupyter:

jupyter notebook
# or
jupyter lab

Using Requirements Files

The bootcamp includes requirements.txt files in project folders:

# Navigate to project directory
cd source/002_A3/PROYECTO/

# Install all requirements
pip install -r requirements.txt

Library Reference

Python

Python 3.8+

Official Documentation: python.org/doc
Purpose: Core programming language for all bootcamp activities
Key Features:

Easy-to-learn syntax
Extensive standard library
Rich ecosystem for data science
Cross-platform compatibility

Bootcamp Usage: Foundation for all modules (A1-A8)

Jupyter Notebook/Lab

Jupyter

Official Documentation: jupyter.org
Purpose: Interactive development environment for data science
Key Features:

Combine code, text, and visualizations
Cell-by-cell execution
Rich output display (plots, tables, HTML)
Markdown support
Easy sharing and collaboration

Launch Commands:

# Classic Notebook
jupyter notebook

# JupyterLab (modern interface)
jupyter lab

# Open specific notebook
jupyter notebook path/to/notebook.ipynb

Bootcamp Usage: All 111+ notebooks (A1-A8)
Tips:

Use JupyterLab for multi-file projects
Install extensions for enhanced functionality
Use %matplotlib inline for inline plots

NumPy

Official Documentation: numpy.org
Purpose: Numerical computing with multi-dimensional arrays
Key Features:

N-dimensional array object (ndarray)
Broadcasting for vectorized operations
Linear algebra functions
Random number generation
Fast mathematical operations

Common Operations:

import numpy as np

# Create arrays
arr = np.array([1, 2, 3, 4, 5])
matrix = np.array([[1, 2], [3, 4]])

# Array operations
result = arr * 2
mean = np.mean(arr)

# Linear algebra
dot_product = np.dot(matrix, matrix.T)

Bootcamp Usage: Module A3 (NumPy fundamentals), foundation for all numerical work
Version Requirement: 1.19+
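
Broadcasting is listed among the key features but not shown in the operations above. The following small sketch, with made-up numbers, illustrates the two most common cases: subtracting per-column statistics from a matrix, and combining a column with a row to form a grid.

```python
import numpy as np

# A (3, 2) matrix: three samples, two features.
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0]])

# Column means have shape (2,); broadcasting stretches them across all rows.
col_means = data.mean(axis=0)
centered = data - col_means

# A (3, 1) column combined with a (2,) row broadcasts to a (3, 2) grid.
column = np.array([[1], [2], [3]])
row = np.array([10, 20])
grid = column + row

print(centered)
print(grid)
```

No Python loop is needed in either case; NumPy aligns the shapes from the trailing dimension and repeats the smaller array virtually.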

Pandas

Official Documentation: pandas.pydata.org
Purpose: Data manipulation and analysis with DataFrames
Key Features:

DataFrame and Series data structures
Reading/writing various file formats (CSV, Excel, SQL)
Data cleaning and transformation
Group by operations
Time series functionality
Missing data handling

Common Operations:

import pandas as pd

# Load data
df = pd.read_csv('data.csv')

# Exploration
df.head()
df.info()
df.describe()

# Manipulation
df_clean = df.dropna()
df_grouped = df.groupby('category')['value'].sum()

# Save results
df.to_csv('output.csv', index=False)

Bootcamp Usage: Module A3 (primary focus), used throughout A4-A8
Version Requirement: 1.2+
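
The operations above depend on a data.csv file. For experimenting without one, here is a self-contained sketch on a small in-memory table (the data is invented for illustration) that contrasts two ways of handling a missing value before a group-by.

```python
import numpy as np
import pandas as pd

# Small in-memory table with one missing value (illustrative data).
df = pd.DataFrame({
    "category": ["a", "b", "a", "b", "a"],
    "value": [10.0, 20.0, np.nan, 40.0, 50.0],
})

# Option 1: drop incomplete rows before aggregating.
totals_dropped = df.dropna().groupby("category")["value"].sum()

# Option 2: fill the gap with the column mean instead of dropping the row.
filled = df.fillna({"value": df["value"].mean()})
totals_filled = filled.groupby("category")["value"].sum()

print(totals_dropped)
print(totals_filled)
```

The two strategies give different totals for category "a", which is why the choice of missing-data handling should be made deliberately rather than by default.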

Matplotlib

Official Documentation: matplotlib.org
Purpose: Comprehensive plotting and visualization
Key Features:

Publication-quality figures
Multiple plot types (line, scatter, bar, histogram, etc.)
Fine-grained control over plot elements
Subplots and figure layouts
Save plots in various formats

Common Operations:

import matplotlib.pyplot as plt

# Basic plot
plt.plot(x, y)
plt.xlabel('X Label')
plt.ylabel('Y Label')
plt.title('My Plot')
plt.show()

# Subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(x, y)
ax2.scatter(x, z)

# Save figure
plt.savefig('plot.png', dpi=300, bbox_inches='tight')

Bootcamp Usage: Module A4 (primary), used in A5-A8 for visualizing results
Version Requirement: 3.3+
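
The snippets above assume that x, y and z already exist. A self-contained variant might look like the sketch below, where the sine and cosine data and the output file name are illustrative choices. The Agg backend is selected so the script also runs without a display; omit that line inside a notebook.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Illustrative data standing in for x, y and z.
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)
z = np.cos(x)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(x, y)
ax1.set_title("sin(x)")
ax2.scatter(x, z, s=8)
ax2.set_title("cos(x)")

# Saving through the figure object makes it explicit which figure is written.
out_path = os.path.join(tempfile.gettempdir(), "bootcamp_plot.png")
fig.savefig(out_path, dpi=150, bbox_inches="tight")
plt.close(fig)
print(f"Saved {out_path}")
```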

Seaborn

Official Documentation: seaborn.pydata.org
Purpose: Statistical data visualization built on Matplotlib
Key Features:

Beautiful default styles
Statistical plotting functions
Integration with Pandas DataFrames
Complex visualizations with less code
Color palettes and themes

Common Operations:

import seaborn as sns
import matplotlib.pyplot as plt

# Set style
sns.set_style('whitegrid')

# Statistical plots
sns.histplot(data=df, x='column', kde=True)
sns.boxplot(data=df, x='category', y='value')
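# Correlate numeric columns only; pandas 2.0+ raises an error on non-numeric columns
sns.heatmap(df.select_dtypes('number').corr(), annot=True, cmap='coolwarm')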

# Pair plot for multivariate analysis
sns.pairplot(df, hue='target')

plt.show()

Bootcamp Usage: Module A4 (primary), enhances visualizations in A5-A8
Version Requirement: 0.11+

scikit-learn

Official Documentation: scikit-learn.org
Purpose: Machine learning algorithms and tools
Key Features:

Classification, regression, clustering algorithms
Model selection and evaluation
Data preprocessing and feature engineering
Pipeline construction
Cross-validation tools
Extensive algorithm library

Common Operations:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Preprocess
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Evaluate
y_pred = model.predict(X_test_scaled)
rmse = mean_squared_error(y_test, y_pred) ** 0.5  # root of MSE; squared=False is deprecated in newer scikit-learn releases
r2 = r2_score(y_test, y_pred)

Key Algorithms Used:

Linear/Logistic Regression
K-Nearest Neighbors (KNN)
Decision Trees and Random Forests
Gradient Boosting
K-Means Clustering
PCA (Principal Component Analysis)

Bootcamp Usage: Modules A6-A7 (primary), introduction to ML workflows
Version Requirement: 0.24+
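
Pipeline construction and cross-validation appear in the feature list but not in the operations above. The sketch below combines them on synthetic data; make_regression and its parameters are illustrative stand-ins for a real dataset.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (illustrative stand-in for X and y).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# The pipeline refits the scaler inside every fold, so no statistics
# from the validation part of a fold leak into preprocessing.
pipeline = make_pipeline(StandardScaler(), LinearRegression())

# 5-fold cross-validation; for regressors the default score is R^2.
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"R^2 per fold: {scores.round(3)}")
print(f"Mean R^2: {scores.mean():.3f}")
```

Wrapping the scaler and the model in one estimator is a design choice that keeps preprocessing and training in step: calling fit or predict on the pipeline applies both stages in the right order.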

TensorFlow & Keras

TensorFlow + Keras

Official Documentation:

Purpose: Deep learning framework with high-level Keras API
Key Features:

Sequential and Functional APIs
Pre-built layers and models
Automatic differentiation
GPU acceleration
Model saving and deployment
Extensive pre-trained models

Common Operations:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Build model
model = keras.Sequential([
    layers.Flatten(input_shape=(28, 28)),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(10, activation='softmax')
])

# Compile
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Train
history = model.fit(
    X_train, y_train,
    epochs=10,
    validation_split=0.2,
    batch_size=32
)

# Evaluate
test_loss, test_acc = model.evaluate(X_test, y_test)

# Save
model.save('model.h5')

Bootcamp Usage: Module A8 (proyecto_mod8_keras.ipynb)
Version Requirement: TensorFlow 2.4+
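
To make the softmax output layer and the sparse_categorical_crossentropy loss less opaque, the following NumPy sketch reproduces the arithmetic on a tiny batch. It is an illustration of the math only, not Keras's actual implementation, and the function names and numbers are assumptions chosen for the example.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax; subtracting the row maximum keeps exp() stable."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def sparse_categorical_crossentropy(labels, probs):
    """Mean negative log-probability assigned to the true integer class."""
    true_class_probs = probs[np.arange(len(labels)), labels]
    return float(-np.log(true_class_probs).mean())


# Two samples, three classes; labels are integer class indices, not one-hot.
logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])
labels = np.array([0, 1])

probs = softmax(logits)
loss = sparse_categorical_crossentropy(labels, probs)
print(probs.round(3))
print(f"Loss: {loss:.4f}")
```

The "sparse" variant is the one to use when labels are plain integers, as in the example; one-hot encoded labels call for categorical_crossentropy instead.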

PyTorch

Official Documentation: pytorch.org/docs
Purpose: Dynamic deep learning framework
Key Features:

Dynamic computational graphs
Pythonic and intuitive API
Strong GPU acceleration
Extensive neural network modules
Popular in research
TorchVision for computer vision

Common Operations:

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms

# Define model
class NeuralNet(nn.Module):
    def __init__(self):
        super(NeuralNet, self).__init__()
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(28*28, 128)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.2)
        self.fc2 = nn.Linear(128, 10)
    
    def forward(self, x):
        x = self.flatten(x)
        x = self.fc1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return x

# Initialize
model = NeuralNet()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters())

# Training loop
for epoch in range(num_epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

Bootcamp Usage: Module A8 (proyecto_mod8_pytorch.ipynb)
Version Requirement: PyTorch 1.8+, TorchVision 0.9+
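
The training loop above leaves train_loader and num_epochs undefined. One way to fill them in for a quick experiment is sketched below. The random tensors are placeholders shaped like 28x28 images, so the loss values carry no meaning beyond showing that the loop runs; the batch size and epoch count are arbitrary.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Random stand-in data shaped like 28x28 grayscale images with 10 classes.
images = torch.randn(256, 28, 28)
labels = torch.randint(0, 10, (256,))
train_loader = DataLoader(TensorDataset(images, labels), batch_size=32, shuffle=True)

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 128),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(128, 10),  # raw logits; CrossEntropyLoss applies log-softmax itself
)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters())

num_epochs = 3
epoch_losses = []
model.train()  # enables dropout during training
for epoch in range(num_epochs):
    running_loss = 0.0
    for data, target in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(data), target)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * data.size(0)
    epoch_losses.append(running_loss / len(train_loader.dataset))
    print(f"Epoch {epoch + 1}: mean loss {epoch_losses[-1]:.4f}")
```

For real image data, the TensorDataset would typically be replaced by a torchvision dataset, and model.eval() should be called before evaluation so that dropout is switched off.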

Streamlit

Official Documentation: docs.streamlit.io
Purpose: Build and deploy data apps quickly
Key Features:

Pure Python - no HTML/CSS/JS required
Instant hot-reload
Interactive widgets
Built-in charting
Easy deployment
Session state management

Common Operations:

import streamlit as st
import pandas as pd
import matplotlib.pyplot as plt

# Title and text
st.title('My Data App')
st.write('Welcome to my analysis!')

# Load and display data
df = pd.read_csv('data.csv')
st.dataframe(df)

# Interactive widgets
option = st.selectbox('Choose a column:', df.columns)
slider_value = st.slider('Select a value', 0, 100, 50)

# Display charts
st.line_chart(df[option])

# Matplotlib integration
fig, ax = plt.subplots()
ax.hist(df[option])
st.pyplot(fig)

Run Streamlit App:

streamlit run app.py

Bootcamp Usage: Used in multiple modules for creating interactive demos
Version Requirement: 1.0+

Additional Libraries

lxml

Purpose: XML and HTML processing
Used for parsing web data and working with Excel files

requests

Purpose: HTTP library for API calls
Used for fetching data from web APIs

yfinance

Purpose: Yahoo Finance data
Used in Module A3 for financial data analysis

openpyxl

Purpose: Excel file support
Backend for Pandas Excel operations

Version Requirements

Recommended minimum versions at the time the bootcamp was created:

Python >= 3.8
numpy >= 1.19.0
pandas >= 1.2.0
matplotlib >= 3.3.0
seaborn >= 0.11.0
scikit-learn >= 0.24.0
tensorflow >= 2.4.0
keras >= 2.4.0
torch >= 1.8.0
torchvision >= 0.9.0
streamlit >= 1.0.0
jupyter >= 1.0.0
lxml >= 4.6.0
requests >= 2.25.0

Check Your Versions

import sys
import numpy as np
import pandas as pd
import matplotlib
import seaborn as sns
import sklearn
import tensorflow as tf
import torch
import streamlit as st

print(f"Python: {sys.version}")
print(f"NumPy: {np.__version__}")
print(f"Pandas: {pd.__version__}")
print(f"Matplotlib: {matplotlib.__version__}")
print(f"Seaborn: {sns.__version__}")
print(f"Scikit-learn: {sklearn.__version__}")
print(f"TensorFlow: {tf.__version__}")
print(f"PyTorch: {torch.__version__}")
print(f"Streamlit: {st.__version__}")
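
Printing versions still leaves the comparison against the minimums to the reader. The sketch below automates it with the standard library only. The MINIMUMS dictionary copies the list above, while as_tuple and check_minimums are illustrative helpers. The version parsing is deliberately simple (it compares the first three numeric components and ignores pre-release tags), so treat the result as a hint rather than a strict PEP 440 check.

```python
import re
import sys
from importlib.metadata import PackageNotFoundError, version

# Minimum versions from the list above, keyed by pip distribution name.
MINIMUMS = {
    "numpy": "1.19.0", "pandas": "1.2.0", "matplotlib": "3.3.0",
    "seaborn": "0.11.0", "scikit-learn": "0.24.0", "tensorflow": "2.4.0",
    "keras": "2.4.0", "torch": "1.8.0", "torchvision": "0.9.0",
    "streamlit": "1.0.0", "jupyter": "1.0.0", "lxml": "4.6.0",
    "requests": "2.25.0",
}


def as_tuple(version_string):
    """Turn '1.19.0' or '2.1.0+cu118' into a tuple of three integers."""
    numbers = [int(n) for n in re.findall(r"\d+", version_string)[:3]]
    return tuple(numbers + [0] * (3 - len(numbers)))


def check_minimums(minimums):
    """Map each package to a status string: ok, too old, or missing."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        if as_tuple(installed) >= as_tuple(minimum):
            report[name] = f"ok ({installed})"
        else:
            report[name] = f"too old ({installed} < {minimum})"
    return report


print(f"Python >= 3.8: {sys.version_info >= (3, 8)}")
for package, status in check_minimums(MINIMUMS).items():
    print(f"{package}: {status}")
```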

Troubleshooting

Import Errors

Problem: ModuleNotFoundError: No module named 'package'
Solutions:

Install the package: pip install package-name
Check you’re using the correct Python environment
Restart Jupyter kernel after installation
Verify installation: pip show package-name (or pip list | grep package-name on Mac/Linux)
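
A frequent cause of this error is that the package was installed into a different environment than the one running the code, which happens easily with Jupyter kernels. The sketch below, run from the failing notebook or script, shows which interpreter is active and where a module would be loaded from. The helper name locate and the two example module names are illustrative.

```python
import sys
from importlib.util import find_spec


def locate(module_name):
    """Return where a top-level module would be imported from, or None if absent."""
    spec = find_spec(module_name)
    if spec is None:
        return None
    # Namespace packages have no single origin file.
    return spec.origin or "<namespace package>"


# sys.executable is the interpreter running this code; in Jupyter it is
# the kernel's interpreter, which may differ from the shell's `python`.
print(f"Interpreter: {sys.executable}")
print(f"Environment prefix: {sys.prefix}")

for name in ["json", "numpy"]:
    origin = locate(name)
    print(f"{name}: {origin if origin else 'not found in this environment'}")
```

If the interpreter path does not point into the environment where pip installed the package, installing with that exact interpreter (python -m pip install package-name) usually resolves the mismatch.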

TensorFlow/PyTorch Not Using GPU

Problem: Models training slowly on CPU
Solutions:

Check GPU availability:

# TensorFlow
import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))

# PyTorch
import torch
print(torch.cuda.is_available())

Install GPU versions:

# TensorFlow with GPU support (Linux/WSL2; the separate tensorflow-gpu package is deprecated)
pip install "tensorflow[and-cuda]"

# PyTorch GPU (check pytorch.org for your system)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118

Install CUDA and cuDNN drivers

Jupyter Kernel Issues

Problem: Kernel keeps dying or won’t start
Solutions:

Restart kernel: Kernel > Restart
Check for memory issues (close other apps)

Reinstall kernel:

pip install --upgrade jupyter ipykernel
python -m ipykernel install --user

Clear notebook output: Cell > All Output > Clear

Package Conflicts

Problem: Incompatible package versions
Solutions:

Create a fresh virtual environment
Install packages one by one to identify conflicts
Use pip install --upgrade package-name
Check compatibility with pip check
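
When the check needs to run from inside a notebook or a setup script, pip check can be invoked for the current interpreter as sketched below. The helper name run_pip_check is illustrative, and the sketch assumes pip is installed in the active environment.

```python
import subprocess
import sys


def run_pip_check():
    """Run `pip check` for the current interpreter and return (ok, output)."""
    # Calling pip as a module of sys.executable ensures that the check
    # targets the environment this script (or notebook kernel) runs in.
    result = subprocess.run(
        [sys.executable, "-m", "pip", "check"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, (result.stdout + result.stderr).strip()


ok, output = run_pip_check()
print("No conflicts reported." if ok else "pip check reported problems:")
print(output)
```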

Matplotlib Plots Not Showing

Problem: Plots don’t display in Jupyter
Solution: Add this magic command at the start of your notebook:

%matplotlib inline

For interactive plots:

%matplotlib notebook   # classic Notebook only (not JupyterLab or Notebook 7+)
# or
%matplotlib widget     # requires the ipympl package

Additional Resources

Learning Resources

Python

Official Python Tutorial

NumPy

NumPy Quickstart

Pandas

10 Minutes to Pandas

Matplotlib

Matplotlib Tutorials

Scikit-learn

Scikit-learn Tutorials

TensorFlow

TensorFlow Tutorials

PyTorch

PyTorch Tutorials

Streamlit

Streamlit Get Started

Cheat Sheets

Quick reference guides:

Next Steps

Jupyter Notebooks

Explore the 111+ notebooks that use these tools

Datasets

Work with bootcamp datasets

Glossary

Learn data science terminology

Setup Guide

Get started with environment setup
