Overview
This glossary defines key terms and concepts used throughout the Data Science Bootcamp. Terms are organized alphabetically for quick reference.
A
Activation Function
A function applied to a neuron's output to introduce non-linearity, allowing a neural network to learn complex patterns. Common choices:
- ReLU (Rectified Linear Unit): f(x) = max(0, x)
- Sigmoid: f(x) = 1 / (1 + e^(-x))
- Tanh: f(x) = tanh(x)
- Softmax: Used for multi-class classification output
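As an illustration, three of the activations above can be sketched in NumPy (a minimal sketch; the function names are our own):

```python
import numpy as np

def relu(x):
    # ReLU: zero out negative values
    return np.maximum(0, x)

def sigmoid(x):
    # Sigmoid: squashes any input into the range (0, 1)
    return 1 / (1 + np.exp(-x))

def softmax(x):
    # Softmax: exponentiate and normalize so the outputs sum to 1
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))             # negative entries become 0
print(sigmoid(0.0))        # 0.5
print(softmax(x).sum())    # sums to 1 (a probability distribution)
```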
Algorithm
A step-by-step procedure for solving a problem or performing a computation.
Array
A data structure that stores elements of the same type, accessed by integer index; in NumPy, the ndarray.
B
Backpropagation
The algorithm for training neural networks: it applies the chain rule to propagate the loss gradient backward through the layers, computing how each weight contributed to the error.
Batch Size
The number of training samples processed before the model's weights are updated.
- Large batch: Faster training, more memory, less noise
- Small batch: Slower training, less memory, more exploration
Bias (Statistical)
A systematic error that causes estimates to deviate consistently from the true values.
Bias (Neural Network)
output = weights * input + bias
Purpose: Allows the model to fit data that doesn’t pass through the origin
Module: A8
Broadcasting
NumPy's mechanism for applying operations between arrays of different shapes by implicitly stretching the smaller array along compatible dimensions.
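A minimal NumPy sketch of broadcasting (the variable names are our own):

```python
import numpy as np

# A (3, 1) column vector and a (1, 4) row vector...
col = np.array([[1], [2], [3]])
row = np.array([[10, 20, 30, 40]])

# ...broadcast to a common (3, 4) shape when added:
# each scalar in `col` is paired with every scalar in `row`
grid = col + row
print(grid.shape)   # (3, 4)
print(grid[2, 3])   # 3 + 40 = 43
```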
C
Classification
- Binary: Two classes (e.g., spam/not spam)
- Multiclass: More than two classes (e.g., image classification)
Clustering
- K-Means: Partitions data into K clusters
- Hierarchical: Creates a tree of clusters
- DBSCAN: Density-based clustering
Confusion Matrix
- TP (True Positive): Correctly predicted positive
- TN (True Negative): Correctly predicted negative
- FP (False Positive): Incorrectly predicted positive (Type I error)
- FN (False Negative): Incorrectly predicted negative (Type II error)
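A minimal sketch of computing these four counts from label lists (pure Python; the example labels are our own):

```python
# True labels and model predictions for a binary task (1 = positive)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # Type I error
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # Type II error

print(tp, tn, fp, fn)  # 3 3 1 1
```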
Cross-Validation
- Split data into K equal parts (folds)
- Train on K-1 folds, test on the remaining fold
- Repeat K times, using each fold as test set once
- Average the results
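The steps above can be sketched with NumPy (a manual split over a stand-in dataset; in practice a library such as scikit-learn provides this):

```python
import numpy as np

data = np.arange(10)  # stand-in dataset of 10 samples
k = 5

# Split data into K equal folds
folds = np.array_split(data, k)

for i in range(k):
    # Use fold i as the test set, the rest as the training set
    test_fold = folds[i]
    train_folds = np.concatenate([folds[j] for j in range(k) if j != i])
    # ...train on train_folds, evaluate on test_fold, then average the K scores
    print(i, test_fold, len(train_folds))
```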
D
DataFrame
- Row and column labels (index)
- Heterogeneous data types
- Built-in methods for data manipulation
- Easy filtering, grouping, and aggregation
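A small sketch of these features, assuming pandas is installed (the data is our own):

```python
import pandas as pd

# A small DataFrame with heterogeneous column types and a labeled index
df = pd.DataFrame(
    {"city": ["Oslo", "Lima", "Oslo"], "temp": [5, 22, 7]},
    index=["a", "b", "c"],
)

# Filtering, grouping, and aggregation in one chain
mean_by_city = df[df["temp"] > 0].groupby("city")["temp"].mean()
print(mean_by_city["Oslo"])  # (5 + 7) / 2 = 6.0
```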
Data Augmentation
Creating additional training examples by applying transformations (e.g., rotating, cropping, or flipping images) to existing data.
Data Cleaning
- Handling missing values
- Removing duplicates
- Correcting data types
- Fixing inconsistencies
- Removing outliers
Decision Tree
- Root node: First decision
- Internal nodes: Subsequent decisions
- Leaf nodes: Final predictions
Deep Learning
- Multiple hidden layers
- Automatic feature learning
- Requires large datasets
- GPU acceleration
Dimensionality Reduction
- PCA (Principal Component Analysis): Linear transformation
- t-SNE: Non-linear, good for visualization
- UMAP: Non-linear, preserves global structure
E
EDA (Exploratory Data Analysis)
- Understand data structure
- Detect patterns and anomalies
- Test assumptions
- Identify relationships between variables
- Summary statistics (df.describe())
- Visualizations (histograms, box plots, scatter plots)
- Correlation analysis
- Distribution analysis
Ensemble Methods
- Bagging: Train models independently on random subsets (e.g., Random Forest)
- Boosting: Train models sequentially, each correcting errors of previous (e.g., Gradient Boosting)
- Stacking: Combine predictions from multiple models using a meta-model
Epoch
One complete pass of the training algorithm through the entire training dataset.
- Too few: Underfitting
- Too many: Overfitting
- Use early stopping to find optimal number
F
Feature
- Numerical: Continuous (price) or discrete (count)
- Categorical: Nominal (color) or ordinal (rating)
Feature Engineering
- Creating interaction terms
- Binning continuous variables
- Encoding categorical variables
- Extracting date/time components
- Polynomial features
- Domain-specific transformations
Feature Scaling
- Standardization (Z-score): (x - mean) / std → Mean=0, Std=1
- Min-Max Normalization: (x - min) / (max - min) → Range [0, 1]
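Both scalings can be sketched directly in NumPy (the example array is our own):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Standardization (Z-score): result has mean 0 and std 1
z = (x - x.mean()) / x.std()

# Min-Max normalization: result lies in [0, 1]
m = (x - x.min()) / (x.max() - x.min())

print(round(z.mean(), 10), round(z.std(), 10))  # 0.0 1.0
print(m.min(), m.max())                          # 0.0 1.0
```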
F1 Score
F1 = 2 * (precision * recall) / (precision + recall)
Range: 0 to 1 (higher is better)
When to Use: Imbalanced datasets where both false positives and false negatives are important
Module: A6
G
Gradient Descent
- Calculate gradient (slope) of loss function
- Update parameters in opposite direction of gradient
- Repeat until convergence
- Batch GD: Uses entire dataset
- Stochastic GD: Uses one sample at a time
- Mini-batch GD: Uses small batches (most common)
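The update loop above can be sketched for a one-parameter problem (minimizing the toy loss (w - 4)^2; the values are our own):

```python
# Minimize loss(w) = (w - 4)^2; its gradient is 2 * (w - 4)
w = 0.0
learning_rate = 0.1

for _ in range(100):
    grad = 2 * (w - 4)          # 1. calculate the gradient of the loss
    w -= learning_rate * grad   # 2. step in the opposite direction

print(round(w, 4))  # converges toward the minimum at w = 4
```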
GridSearchCV
- Define parameter grid
- Train model for each combination
- Use cross-validation to evaluate
- Return best parameters
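The exhaustive-search idea behind GridSearchCV can be sketched without scikit-learn; here `cv_score` is a hypothetical stand-in for "train and cross-validate", and the grid values are our own:

```python
from itertools import product

# Hypothetical parameter grid
grid = {"k": [1, 3, 5], "weight": ["uniform", "distance"]}

def cv_score(k, weight):
    # Stand-in scoring function; a real score would come from cross-validation
    return -abs(k - 3) + (0.5 if weight == "distance" else 0.0)

# Try every combination and keep the best-scoring parameters
best = max(
    (dict(zip(grid, combo)) for combo in product(*grid.values())),
    key=lambda p: cv_score(**p),
)
print(best)  # {'k': 3, 'weight': 'distance'}
```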
H
Hyperparameter
A model setting chosen before training rather than learned from the data. Examples:
- Learning rate
- Number of trees in random forest
- K in K-nearest neighbors
- Number of epochs
- Batch size
Hypothesis Testing
- State null hypothesis (H₀) and alternative hypothesis (H₁)
- Choose significance level (α, typically 0.05)
- Calculate test statistic
- Determine p-value
- Make decision: reject or fail to reject H₀
K
K-Fold Cross-Validation
The standard form of cross-validation in which the data is split into K folds, each serving once as the test set (see Cross-Validation).
K-Means
- Initialize K cluster centroids randomly
- Assign each point to nearest centroid
- Recalculate centroids as mean of assigned points
- Repeat steps 2-3 until convergence
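The four steps can be sketched in NumPy for 1-D data and K = 2 (a toy example; the data and fixed initial centroids are our own):

```python
import numpy as np

points = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
centroids = np.array([1.0, 12.0])  # step 1 (fixed here for clarity, not random)

for _ in range(10):
    # step 2: assign each point to its nearest centroid
    labels = np.argmin(np.abs(points[:, None] - centroids[None, :]), axis=1)
    # step 3: recompute each centroid as the mean of its assigned points
    centroids = np.array([points[labels == k].mean() for k in range(2)])
    # step 4: repeating this loop converges once assignments stop changing

print(centroids)  # centroids settle at 2.0 and 11.0
```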
KNN (K-Nearest Neighbors)
An algorithm that predicts a sample's label from the labels of its K closest training points.
- Small K: More sensitive to noise
- Large K: Smoother boundaries
- Odd K (for classification): Avoids ties
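A minimal pure-Python KNN classifier for 1-D points (the data and names are our own):

```python
from collections import Counter

# Training data: (feature, label) pairs
train = [(1.0, "A"), (2.0, "A"), (3.0, "A"), (8.0, "B"), (9.0, "B")]

def knn_predict(x, k=3):
    # Find the k nearest training points, then take a majority vote
    nearest = sorted(train, key=lambda pt: abs(pt[0] - x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict(2.5))  # "A": neighbors are 2.0, 3.0, 1.0
print(knn_predict(7.0))  # "B": neighbors are 8.0, 9.0, 3.0; vote is 2:1 for B
```

Note that k=3 is odd, which avoids ties in this two-class vote.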
L
Learning Rate
The step size used when updating model parameters during gradient descent.
- Too high: Model may overshoot minimum, fail to converge
- Too low: Very slow training, may get stuck in local minimum
Logistic Regression
A classification algorithm that applies the sigmoid function to a linear combination of features to model the probability of a class.
Loss Function
A function that measures how far a model's predictions are from the true values; training minimizes it. Common choices:
- MSE (Mean Squared Error): For regression
- Cross-Entropy: For classification
- MAE (Mean Absolute Error): For regression
M
Machine Learning
- Supervised: Learn from labeled data (classification, regression)
- Unsupervised: Find patterns in unlabeled data (clustering, dimensionality reduction)
- Reinforcement: Learn through interaction and feedback
Model
- Training: Learn patterns from data
- Validation: Tune hyperparameters
- Testing: Evaluate final performance
- Deployment: Use in production
N
Neural Network
- Input layer: Receives features
- Hidden layers: Process information
- Output layer: Produces predictions
- Weights: Connection strengths
- Biases: Offset values
- Activation functions: Introduce non-linearity
Normalization
Rescaling feature values to a common range, typically [0, 1] (see Feature Scaling).
NumPy
- Fast array operations
- Broadcasting
- Linear algebra functions
- Random number generation
O
Outlier
A data point that differs markedly from the rest of the data.
Detection methods:
- Box plots (IQR method)
- Z-score
- Isolation Forest
Handling options:
- Remove (if error)
- Transform (log, square root)
- Cap (winsorization)
- Keep (if genuine)
Overfitting
When a model learns the training data too closely, including its noise, and fails to generalize to new data.
Signs:
- High training accuracy, low test accuracy
- Model is too complex
- Training loss decreases but validation loss increases
Remedies:
- Collect more data
- Regularization
- Reduce model complexity
- Cross-validation
- Early stopping
- Dropout (neural networks)
P
Pandas
- DataFrame: 2D table
- Series: 1D array
- Data loading (CSV, Excel, SQL, etc.)
- Data cleaning and transformation
- Grouping and aggregation
- Time series analysis
PCA (Principal Component Analysis)
- Reduce number of features
- Remove correlations
- Visualize high-dimensional data
- Speed up training
Precision
Precision = TP / (TP + FP)
Interpretation: “Of all instances we predicted as positive, how many were actually positive?”
When Important: When false positives are costly (e.g., spam detection)
Related: Recall, F1-score
Module: A6
R
Random Forest
- Create multiple decision trees on random subsets of data
- Each tree uses random subset of features
- Aggregate predictions (majority vote or average)
- Reduces overfitting compared to single decision tree
- Handles non-linear relationships
- Feature importance
Recall
Recall = TP / (TP + FN)
Also called: Sensitivity, True Positive Rate
Interpretation: “Of all actual positive instances, how many did we find?”
When Important: When false negatives are costly (e.g., disease detection)
Module: A6
Regression
- Linear Regression: Models linear relationships
- Polynomial Regression: Models non-linear relationships
- Multiple Regression: Multiple features
Regularization
- L1 (Lasso): Adds sum of absolute weights → Sparse models
- L2 (Ridge): Adds sum of squared weights → Distributes weights
- Elastic Net: Combination of L1 and L2
- Dropout: Randomly deactivate neurons (neural networks)
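The effect of the L2 penalty can be sketched with NumPy's closed-form ridge solution (synthetic data; the names and lambda value are our own):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=20)

lam = 1.0
# Ridge (L2) closed form: (X^T X + lambda I)^-1 X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
# Ordinary least squares for comparison (no penalty)
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# The squared-weight penalty shrinks the solution toward zero
print(np.linalg.norm(w_ridge) < np.linalg.norm(w_ols))  # True
```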
RMSE (Root Mean Squared Error)
RMSE = sqrt(mean((y_pred - y_true)²))
Properties:
- Same units as target variable
- Penalizes large errors more than MAE
- Lower is better
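Computing RMSE directly from the formula in NumPy (the example arrays are our own):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

# Square the errors, take their mean, then the square root
rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
print(round(rmse, 4))  # errors are -1, 0, 2, so RMSE = sqrt(5/3)
```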
S
Supervised Learning
- Classification (discrete outputs)
- Regression (continuous outputs)
Series
A one-dimensional labeled array in pandas; each column of a DataFrame is a Series.
T
Test Set
Data held out from training and used only for the final evaluation of a model's performance.
Training Set
The portion of the data used to fit the model's parameters.
U
Underfitting
When a model is too simple to capture the underlying patterns in the data.
Signs:
- Low training accuracy
- Low test accuracy
- Model is too simple
Remedies:
- Increase model complexity
- Add more features
- Reduce regularization
- Train longer
Unsupervised Learning
- Clustering (grouping similar items)
- Dimensionality reduction (simplifying data)
- Anomaly detection (finding unusual patterns)
V
Validation Set
Data held out from training and used to tune hyperparameters and compare candidate models.
Variance
A measure of how spread out values are around their mean; in the bias-variance trade-off, a model's sensitivity to fluctuations in the training data.
Vectorization
Replacing explicit Python loops with whole-array operations. Benefits:
- Much faster execution
- More readable code
- Utilizes optimized C/Fortran code under the hood
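A sketch of the same computation written as a loop and vectorized (NumPy; the array is our own):

```python
import numpy as np

x = np.arange(100_000, dtype=np.float64)

# Loop version: one Python-level multiplication per element
loop_result = np.empty_like(x)
for i in range(len(x)):
    loop_result[i] = x[i] * 2.0

# Vectorized version: a single call into NumPy's optimized C loop
vec_result = x * 2.0

print(np.array_equal(loop_result, vec_result))  # True, and far faster
```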
Additional Terms
Accuracy
(TP + TN) / Total: proportion of correct predictions
Anomaly Detection
API
Batch Normalization
Categorical Variable
Confusion Matrix
Correlation
Data Leakage
Dropout
Encoding
False Negative (FN)
False Positive (FP)
Gradient
Histogram
Imputation
Label
MAE
Matrix
MSE
One-Hot Encoding
Pipeline
Prediction
R² Score
Sampling
Scaling
Silhouette Score
Target Variable
True Negative (TN)
True Positive (TP)
Weight
Quick Reference by Module
- A3: NumPy & Pandas
- A4: EDA
- A5: Statistics
- A6: Machine Learning
- A7: Advanced ML
- A8: Deep Learning