
Overview

This glossary defines key terms and concepts used throughout the Data Science Bootcamp. Terms are organized alphabetically for quick reference.
Use Ctrl+F (or Cmd+F on Mac) to quickly search for specific terms.

A

Activation Function

A mathematical function applied to a neuron’s output in a neural network.
Purpose: Introduces non-linearity, enabling networks to learn complex patterns.
Common Types:
  • ReLU (Rectified Linear Unit): f(x) = max(0, x)
  • Sigmoid: f(x) = 1 / (1 + e^(-x))
  • Tanh: f(x) = tanh(x)
  • Softmax: Used for multi-class classification output
Module: A8 (Deep Learning)
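The ReLU and sigmoid formulas above can be sketched directly in NumPy; the function names here are illustrative, not part of the course materials:

```python
import numpy as np

def relu(x):
    # ReLU: zero out negative inputs, pass positives through
    return np.maximum(0, x)

def sigmoid(x):
    # Sigmoid: squash any real input into the range (0, 1)
    return 1 / (1 + np.exp(-x))

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))       # [0. 0. 3.]
print(sigmoid(0.0))  # 0.5
```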
Algorithm

A step-by-step procedure for solving a problem or performing a computation.
In Data Science: Algorithms process data to learn patterns (machine learning algorithms) or perform calculations (sorting, searching algorithms).
Examples: Linear regression, decision trees, K-means clustering
Modules: A5-A8
Array

A data structure that holds multiple values of the same type in a grid-like structure.
In NumPy: An n-dimensional array (ndarray) is the fundamental data structure.
Example:
import numpy as np
arr = np.array([1, 2, 3, 4, 5])  # 1D array
matrix = np.array([[1, 2], [3, 4]])  # 2D array
Properties: Shape, dtype (data type), size
Module: A3 (NumPy)

B

Backpropagation

The algorithm used to calculate gradients in neural networks for updating weights.
Process: Computes the gradient of the loss function with respect to each weight by applying the chain rule, propagating errors backwards through the network.
Purpose: Enables neural networks to learn from mistakes
Module: A8 (Deep Learning)
Batch Size

The number of training samples processed before updating model parameters.
Trade-offs:
  • Large batch: Faster training, more memory, less noise
  • Small batch: Slower training, less memory, more exploration
Common Values: 32, 64, 128, 256
Module: A8 (Deep Learning)
Bias (Statistical)

Systematic error that causes predictions to consistently deviate from true values.
Example: A model trained only on data from one demographic might be biased against others.
Related: Bias-Variance Tradeoff
Modules: A6-A7
Bias (Neural Networks)

An additional parameter in neural network layers that shifts the activation function.
Formula: output = weights * input + bias
Purpose: Allows the model to fit data that doesn’t pass through the origin
Module: A8
Broadcasting

NumPy’s mechanism for performing operations on arrays of different shapes.
Example:
import numpy as np
arr = np.array([1, 2, 3])
result = arr + 10  # Adds 10 to each element
# Result: [11, 12, 13]
Module: A3 (NumPy)

C

Classification

A supervised learning task where the goal is to predict discrete categories or classes.
Types:
  • Binary: Two classes (e.g., spam/not spam)
  • Multiclass: More than two classes (e.g., image classification)
Algorithms: Logistic regression, KNN, decision trees, neural networks
Modules: A6-A8
Clustering

An unsupervised learning technique that groups similar data points together.
Common Algorithms:
  • K-Means: Partitions data into K clusters
  • Hierarchical: Creates a tree of clusters
  • DBSCAN: Density-based clustering
Use Cases: Customer segmentation, anomaly detection, data exploration
Module: A7 (Advanced ML)
Confusion Matrix

A table showing the performance of a classification model.
Structure:
            Predicted
          Pos    Neg
Actual Pos  TP     FN
       Neg  FP     TN
  • TP (True Positive): Correctly predicted positive
  • TN (True Negative): Correctly predicted negative
  • FP (False Positive): Incorrectly predicted positive (Type I error)
  • FN (False Negative): Incorrectly predicted negative (Type II error)
Derived Metrics: Accuracy, precision, recall, F1-score
Module: A6
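A minimal scikit-learn sketch of the matrix above, with made-up labels. Note that scikit-learn sorts class labels in ascending order, so the negative class comes first: the layout is [[TN, FP], [FN, TP]], the reverse row order of the table shown:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)  # [[2 1]
           #  [1 2]]
```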
Cross-Validation

A technique for assessing model performance by splitting data into multiple train/test sets.
K-Fold Cross-Validation:
  1. Split data into K equal parts (folds)
  2. Train on K-1 folds, test on the remaining fold
  3. Repeat K times, using each fold as test set once
  4. Average the results
Purpose: Provides more reliable performance estimates, reduces overfitting
Common Value: K=5 or K=10
Module: A6 (Machine Learning)
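The four K-fold steps above are handled in one call by scikit-learn's `cross_val_score`; this sketch uses the bundled iris dataset as an illustrative example:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# cv=5 splits the data into 5 folds; each fold serves as the test set once
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)         # one accuracy score per fold
print(scores.mean())  # averaged performance estimate
```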

D

DataFrame

Pandas’ primary data structure: a 2-dimensional labeled table with columns of potentially different types.
Key Features:
  • Row and column labels (index)
  • Heterogeneous data types
  • Built-in methods for data manipulation
  • Easy filtering, grouping, and aggregation
Example:
import pandas as pd
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['NYC', 'LA', 'Chicago']
})
Module: A3 (Primary focus), used throughout A4-A8
Data Augmentation

Techniques to artificially increase training data by creating modified versions of existing data.
For Images: Rotation, flipping, cropping, color adjustment
Purpose: Reduces overfitting, improves model generalization
Module: A8 (Deep Learning)
Data Cleaning

The process of detecting and correcting (or removing) corrupt or inaccurate data.
Common Tasks:
  • Handling missing values
  • Removing duplicates
  • Correcting data types
  • Fixing inconsistencies
  • Removing outliers
Module: A3 (Primary), A4-A6
Decision Tree

A tree-like model that makes decisions by asking a series of questions about features.
Structure:
  • Root node: First decision
  • Internal nodes: Subsequent decisions
  • Leaf nodes: Final predictions
Advantages: Interpretable, handles non-linear relationships
Disadvantages: Prone to overfitting
Extensions: Random Forest, Gradient Boosting
Module: A6
Deep Learning

A subset of machine learning using neural networks with multiple layers (“deep” networks).
Key Concepts:
  • Multiple hidden layers
  • Automatic feature learning
  • Requires large datasets
  • GPU acceleration
Applications: Image recognition, NLP, speech recognition
Frameworks: TensorFlow/Keras, PyTorch
Module: A8
Dimensionality Reduction

Techniques to reduce the number of features while preserving important information.
Methods:
  • PCA (Principal Component Analysis): Linear transformation
  • t-SNE: Non-linear, good for visualization
  • UMAP: Non-linear, preserves global structure
Benefits: Faster training, visualization, noise reduction
Module: A7

E

Exploratory Data Analysis (EDA)

The process of analyzing datasets to summarize their main characteristics, often using visualizations.
Goals:
  • Understand data structure
  • Detect patterns and anomalies
  • Test assumptions
  • Identify relationships between variables
Common Techniques:
  • Summary statistics (df.describe())
  • Visualizations (histograms, box plots, scatter plots)
  • Correlation analysis
  • Distribution analysis
Module: A4 (Primary focus)
Ensemble Learning

Techniques that combine multiple models to improve prediction accuracy.
Types:
  • Bagging: Train models independently on random subsets (e.g., Random Forest)
  • Boosting: Train models sequentially, each correcting errors of previous (e.g., Gradient Boosting)
  • Stacking: Combine predictions from multiple models using a meta-model
Principle: “Wisdom of crowds” - multiple weak learners become a strong learner
Module: A6
Epoch

One complete pass through the entire training dataset during model training.
Example: Training for 10 epochs means the model sees every training sample 10 times.
Considerations:
  • Too few: Underfitting
  • Too many: Overfitting
  • Use early stopping to find optimal number
Module: A8 (Deep Learning)

F

Feature

An individual measurable property or characteristic of a phenomenon being observed.
Also called: Variable, predictor, independent variable, attribute
Example: In a house price dataset, features might include square footage, number of bedrooms, location, and age.
Types:
  • Numerical: Continuous (price) or discrete (count)
  • Categorical: Nominal (color) or ordinal (rating)
Modules: All modules A3-A8
Feature Engineering

The process of creating new features or transforming existing ones to improve model performance.
Techniques:
  • Creating interaction terms
  • Binning continuous variables
  • Encoding categorical variables
  • Extracting date/time components
  • Polynomial features
  • Domain-specific transformations
Example:
# Create BMI from height and weight
df['BMI'] = df['weight'] / (df['height'] ** 2)

# Extract month from date
df['month'] = df['date'].dt.month
Module: A6-A7
Feature Scaling

Normalizing or standardizing features to bring them to a similar scale.
Methods:
  • Standardization (Z-score): (x - mean) / std → Mean=0, Std=1
  • Min-Max Normalization: (x - min) / (max - min) → Range [0, 1]
Why Important: Many algorithms (KNN, neural networks) are sensitive to feature scales
Example:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
Module: A6
F1-Score

The harmonic mean of precision and recall, providing a single metric that balances both.
Formula: F1 = 2 * (precision * recall) / (precision + recall)
Range: 0 to 1 (higher is better)
When to Use: Imbalanced datasets where both false positives and false negatives are important
Module: A6
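The F1 formula can be worked through with a small helper function; the counts here (8 TP, 2 FP, 4 FN) are made up for illustration:

```python
def f1(tp, fp, fn):
    # Harmonic mean of precision and recall
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# precision = 8/10 = 0.8, recall = 8/12 ~ 0.667
print(f1(tp=8, fp=2, fn=4))  # ~0.727
```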

G

Gradient Descent

An optimization algorithm that iteratively adjusts model parameters to minimize the loss function.
Process:
  1. Calculate gradient (slope) of loss function
  2. Update parameters in opposite direction of gradient
  3. Repeat until convergence
Variants:
  • Batch GD: Uses entire dataset
  • Stochastic GD: Uses one sample at a time
  • Mini-batch GD: Uses small batches (most common)
Learning Rate: Controls step size (critical hyperparameter)
Module: A6-A8
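The three-step loop above can be sketched on a toy one-dimensional loss. This minimizes f(x) = (x - 3)^2, a made-up function chosen so the true minimum (x = 3) is known:

```python
# f(x) = (x - 3)^2 has gradient 2 * (x - 3)
x = 0.0
learning_rate = 0.1

for _ in range(100):
    gradient = 2 * (x - 3)        # slope of the loss at the current x
    x -= learning_rate * gradient  # step in the opposite direction

print(x)  # converges toward 3.0, the minimum
```

Setting `learning_rate` too high makes the updates overshoot and diverge; too low and convergence is very slow, which is why it is called out as a critical hyperparameter.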
Grid Search

An exhaustive search over specified parameter values to find the best hyperparameters.
Process:
  1. Define parameter grid
  2. Train model for each combination
  3. Use cross-validation to evaluate
  4. Return best parameters
Example:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15]
}

grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
Module: A6-A7

H

Hyperparameter

A parameter whose value is set before training and controls the learning process.
Examples:
  • Learning rate
  • Number of trees in random forest
  • K in K-nearest neighbors
  • Number of epochs
  • Batch size
Contrast with: Model parameters (learned during training, like weights)
Tuning Methods: Grid search, random search, Bayesian optimization
Module: A6-A8
Hypothesis Testing

A statistical method to make inferences about population parameters based on sample data.
Steps:
  1. State null hypothesis (H₀) and alternative hypothesis (H₁)
  2. Choose significance level (α, typically 0.05)
  3. Calculate test statistic
  4. Determine p-value
  5. Make decision: reject or fail to reject H₀
Common Tests: t-test, chi-square test, ANOVA
Module: A5 (Statistics)
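The five steps above map onto a two-sample t-test with SciPy; the two groups here are synthetic data generated with a fixed seed, purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=5.0, scale=1.0, size=50)
group_b = rng.normal(loc=5.8, scale=1.0, size=50)

# H0: both groups have the same mean; H1: the means differ
t_stat, p_value = stats.ttest_ind(group_a, group_b)

alpha = 0.05
if p_value < alpha:
    print("Reject H0: the means differ significantly")
else:
    print("Fail to reject H0")
```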

K

K-Fold Cross-Validation

See: Cross-Validation
K-Means Clustering

A clustering algorithm that partitions data into K clusters.
Algorithm:
  1. Initialize K cluster centroids randomly
  2. Assign each point to nearest centroid
  3. Recalculate centroids as mean of assigned points
  4. Repeat steps 2-3 until convergence
Choosing K: Elbow method, silhouette score
Limitations: Assumes spherical clusters, sensitive to initialization
Module: A7
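The four-step algorithm above is what scikit-learn's `KMeans` runs internally; this sketch uses four made-up points forming two obvious groups:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two visually separated groups of points
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [8.5, 9.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(labels)  # first two points share one label, last two the other
```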
K-Nearest Neighbors (KNN)

A simple algorithm that makes predictions based on the K closest training examples.
Classification: Majority vote of K neighbors
Regression: Average of K neighbors
Choosing K:
  • Small K: More sensitive to noise
  • Large K: Smoother boundaries
  • Odd K (for classification): Avoids ties
Requirement: Feature scaling is critical
Module: A6
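A minimal scikit-learn sketch of KNN classification with K=3, using a made-up one-feature dataset with two well-separated groups:

```python
from sklearn.neighbors import KNeighborsClassifier

# Single feature; class 0 near 0-2, class 1 near 10-12
X = [[0], [1], [2], [10], [11], [12]]
y = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# Each new point takes the majority vote of its 3 nearest neighbors
print(knn.predict([[1.5], [10.5]]))  # [0 1]
```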

L

Learning Rate

A hyperparameter that controls how much to update model parameters during training.
Impact:
  • Too high: Model may overshoot minimum, fail to converge
  • Too low: Very slow training, may get stuck in local minimum
Typical Values: 0.001, 0.01, 0.1
Advanced: Learning rate scheduling, adaptive methods (Adam, RMSprop)
Module: A6-A8
Logistic Regression

A classification algorithm that models the probability of an instance belonging to a class.
Output: Probability between 0 and 1 (using sigmoid function)
Decision Rule: Predict class 1 if probability > 0.5, else class 0
Despite the name: It’s classification, not regression
Use Case: Binary classification
Module: A6
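A minimal scikit-learn sketch of logistic regression on a made-up one-feature dataset, showing both the probability output and the thresholded class prediction:

```python
from sklearn.linear_model import LogisticRegression

X = [[0], [1], [2], [8], [9], [10]]
y = [0, 0, 0, 1, 1, 1]

clf = LogisticRegression()
clf.fit(X, y)

# predict_proba returns [P(class 0), P(class 1)] for each sample
print(clf.predict_proba([[1.0]]))
# predict applies the 0.5 threshold for you
print(clf.predict([[1.0], [9.0]]))  # [0 1]
```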
Loss Function

A function that measures how well a model’s predictions match the actual values.
Also called: Cost function, objective function
Common Loss Functions:
  • MSE (Mean Squared Error): For regression
  • Cross-Entropy: For classification
  • MAE (Mean Absolute Error): For regression
Goal: Minimize the loss during training
Module: A6-A8

M

Machine Learning

A field of study that gives computers the ability to learn from data without being explicitly programmed.
Types:
  • Supervised: Learn from labeled data (classification, regression)
  • Unsupervised: Find patterns in unlabeled data (clustering, dimensionality reduction)
  • Reinforcement: Learn through interaction and feedback
Module: A6-A8 (primary focus)
Model

A mathematical representation of a real-world process, learned from data.
In ML: An algorithm trained on data that can make predictions on new data
Lifecycle:
  1. Training: Learn patterns from data
  2. Validation: Tune hyperparameters
  3. Testing: Evaluate final performance
  4. Deployment: Use in production
Module: A6-A8

N

Neural Network

A machine learning model inspired by biological neural networks, consisting of interconnected nodes (neurons) organized in layers.
Structure:
  • Input layer: Receives features
  • Hidden layers: Process information
  • Output layer: Produces predictions
Components:
  • Weights: Connection strengths
  • Biases: Offset values
  • Activation functions: Introduce non-linearity
Types: Feedforward, Convolutional (CNN), Recurrent (RNN)
Module: A8
Normalization

See: Feature Scaling
NumPy

Python library for numerical computing with multi-dimensional arrays.
Key Features:
  • Fast array operations
  • Broadcasting
  • Linear algebra functions
  • Random number generation
Foundation for: Pandas, scikit-learn, TensorFlow, PyTorch
Module: A3 (primary), used throughout A4-A8

O

Outlier

A data point that differs significantly from other observations.
Detection Methods:
  • Box plots (IQR method)
  • Z-score
  • Isolation Forest
Handling:
  • Remove (if error)
  • Transform (log, square root)
  • Cap (winsorization)
  • Keep (if genuine)
Module: A4 (EDA)
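The IQR method listed above can be sketched in NumPy; the data values here are made up, with one deliberately extreme point:

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 95])  # 95 looks suspicious

# IQR rule: flag points beyond 1.5 * IQR from the quartiles
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print(outliers)  # [95]
```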
Overfitting

When a model learns training data too well, including noise, resulting in poor generalization to new data.
Signs:
  • High training accuracy, low test accuracy
  • Model is too complex
  • Training loss decreases but validation loss increases
Solutions:
  • Collect more data
  • Regularization
  • Reduce model complexity
  • Cross-validation
  • Early stopping
  • Dropout (neural networks)
Module: A6-A8

P

Pandas

Python library for data manipulation and analysis.
Key Data Structures:
  • DataFrame: 2D table
  • Series: 1D array
Capabilities:
  • Data loading (CSV, Excel, SQL, etc.)
  • Data cleaning and transformation
  • Grouping and aggregation
  • Time series analysis
Module: A3 (primary), used throughout A4-A8
Principal Component Analysis (PCA)

A dimensionality reduction technique that transforms data into a new coordinate system.
Purpose:
  • Reduce number of features
  • Remove correlations
  • Visualize high-dimensional data
  • Speed up training
Principal Components: New features that capture maximum variance
Example:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
Module: A7
Precision

The proportion of positive predictions that are actually correct.
Formula: Precision = TP / (TP + FP)
Interpretation: “Of all instances we predicted as positive, how many were actually positive?”
When Important: When false positives are costly (e.g., spam detection)
Related: Recall, F1-score
Module: A6

R

Random Forest

An ensemble learning method that constructs multiple decision trees and combines their predictions.
Process:
  1. Create multiple decision trees on random subsets of data
  2. Each tree uses random subset of features
  3. Aggregate predictions (majority vote or average)
Advantages:
  • Reduces overfitting compared to single decision tree
  • Handles non-linear relationships
  • Feature importance
Module: A6
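A minimal scikit-learn sketch of a random forest on the bundled iris dataset, also showing the feature-importance output mentioned above:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

# Importances sum to 1; larger values mean a feature was more useful
print(rf.feature_importances_)
```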
Recall

The proportion of actual positive instances that are correctly identified.
Formula: Recall = TP / (TP + FN)
Also called: Sensitivity, True Positive Rate
Interpretation: “Of all actual positive instances, how many did we find?”
When Important: When false negatives are costly (e.g., disease detection)
Module: A6
Regression

A supervised learning task where the goal is to predict continuous numerical values.
Types:
  • Linear Regression: Models linear relationships
  • Polynomial Regression: Models non-linear relationships
  • Multiple Regression: Multiple features
Examples: Predicting house prices, temperature, stock prices
Metrics: MSE, RMSE, MAE, R²
Module: A6
Regularization

Techniques to prevent overfitting by adding a penalty for model complexity.
Types:
  • L1 (Lasso): Adds sum of absolute weights → Sparse models
  • L2 (Ridge): Adds sum of squared weights → Distributes weights
  • Elastic Net: Combination of L1 and L2
  • Dropout: Randomly deactivate neurons (neural networks)
Effect: Simpler, more generalizable models
Module: A6-A8
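The L1-vs-L2 contrast above (sparse vs distributed weights) can be seen on synthetic data where only one of five features matters; the data and alpha values here are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only feature 0 actually drives the target
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)  # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty

print(lasso.coef_)  # irrelevant coefficients driven to (near) zero
print(ridge.coef_)  # irrelevant coefficients shrunk, but rarely exactly zero
```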
Root Mean Squared Error (RMSE)

A common regression metric that measures average prediction error.
Formula: RMSE = sqrt(mean((y_pred - y_true)²))
Properties:
  • Same units as target variable
  • Penalizes large errors more than MAE
  • Lower is better
Module: A6
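The RMSE formula translates directly into NumPy; the target and prediction values here are made up:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.5, 5.0, 4.0])

# Square the errors, average them, then take the square root
rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
print(rmse)  # ~1.19, in the same units as y
```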

S

Supervised Learning

Machine learning where the model learns from labeled data (input-output pairs).
Tasks:
  • Classification (discrete outputs)
  • Regression (continuous outputs)
Examples: Email spam detection, house price prediction
Requirements: Labeled training data
Module: A6 (primary)
Series

Pandas’ 1-dimensional labeled array, capable of holding any data type.
Like: A single column of a DataFrame, or a NumPy array with labels
Example:
import pandas as pd
s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
Module: A3

T

Test Set

A portion of data held out for final model evaluation.
Purpose: Provides an unbiased estimate of model performance on unseen data
Critical Rule: Never use the test set during model development or hyperparameter tuning
Typical Split: 80% train, 20% test (or 70-15-15 train-val-test)
Module: A6-A8
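The 80/20 split mentioned above is one call in scikit-learn; the arrays here are placeholder data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # 50 samples, 2 features
y = np.arange(50)

# Hold out 20% of the samples for final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))  # 40 10
```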
Training Set

The portion of data used to train the model.
Purpose: Model learns patterns from this data
Size: Typically 60-80% of total data
Module: A6-A8

U

Underfitting

When a model is too simple to capture the underlying patterns in the data.
Signs:
  • Low training accuracy
  • Low test accuracy
  • Model is too simple
Solutions:
  • Increase model complexity
  • Add more features
  • Reduce regularization
  • Train longer
Module: A6-A8
Unsupervised Learning

Machine learning where the model finds patterns in unlabeled data.
Tasks:
  • Clustering (grouping similar items)
  • Dimensionality reduction (simplifying data)
  • Anomaly detection (finding unusual patterns)
Examples: Customer segmentation, data compression
Module: A7

V

Validation Set

A portion of data used to tune hyperparameters and make model selection decisions.
Purpose: Evaluate the model during development without touching the test set
Alternative: Use cross-validation instead of a fixed validation set
Typical Split: 15-20% of data (when using a train-val-test split)
Module: A6-A8
Variance

A measure of how much model predictions change with different training data.
High Variance: Model is sensitive to training data (overfitting)
Low Variance: Model predictions are stable
Bias-Variance Tradeoff: Balance between bias (underfitting) and variance (overfitting)
Module: A6-A7
Vectorization

Performing operations on entire arrays rather than using loops.
Benefits:
  • Much faster execution
  • More readable code
  • Utilizes optimized C/Fortran code under the hood
Example:
import numpy as np
arr = np.array([1, 2, 3])

# Slow (loop)
result = []
for x in arr:
    result.append(x * 2)

# Fast (vectorized)
result = arr * 2
Module: A3 (NumPy)

Additional Terms

Accuracy

Proportion of correct predictions: (TP + TN) / Total

Anomaly Detection

Identifying unusual patterns that don’t conform to expected behavior

API

Application Programming Interface - way programs communicate

Batch Normalization

Technique to normalize layer inputs in neural networks

Categorical Variable

Variable with discrete categories (e.g., color, country)

Confusion Matrix

Table showing model prediction vs actual values

Correlation

Statistical relationship between two variables (-1 to +1)

Data Leakage

When training data contains information about test data

Dropout

Regularization by randomly ignoring neurons during training

Encoding

Converting categorical data to numerical (one-hot, label encoding)

False Negative (FN)

Incorrectly predicted negative (Type II error)

False Positive (FP)

Incorrectly predicted positive (Type I error)

Gradient

Direction and rate of steepest increase of a function

Histogram

Bar chart showing distribution of numerical data

Imputation

Filling in missing values in dataset

Label

The target variable in supervised learning (what we predict)

MAE

Mean Absolute Error - average of absolute differences

Matrix

2D array of numbers

MSE

Mean Squared Error - average of squared differences

One-Hot Encoding

Converting categories to binary columns

Pipeline

Chain of data processing steps

Prediction

Model’s output for a given input

R² Score

Coefficient of determination (0 to 1, higher is better)

Sampling

Selecting subset of data from larger population

Scaling

Transforming features to similar ranges

Silhouette Score

Metric for evaluating clustering quality

Target Variable

What we’re trying to predict (dependent variable, label)

True Negative (TN)

Correctly predicted negative

True Positive (TP)

Correctly predicted positive

Weight

Parameter that scales input to neuron or feature

Quick Reference by Module

Module A3: Array, Broadcasting, DataFrame, Series, Vectorization, Data Cleaning

Next Steps

Jupyter Notebooks

See these concepts in action across 111+ notebooks

Datasets

Practice with real datasets

Tools & Libraries

Learn the tools that implement these concepts

Module Overview

Understand how modules build on these concepts
