What is supervised learning?
Supervised learning is a machine learning paradigm where models learn from labeled training data to make predictions on new, unseen data. The “supervision” comes from having correct answers (labels) during training.
Key Characteristics
- Labeled data: Each training example includes both input features and the correct output
- Learning objective: Find patterns that map inputs to outputs
- Prediction: Apply learned patterns to make predictions on new data
- Evaluation: Measure performance using known correct answers on test data
Types of Supervised Learning
1. Regression
Predict continuous numerical values. Examples:
- Predicting house prices based on features (size, location, bedrooms)
- Forecasting sales revenue for an e-commerce business
- Estimating customer lifetime value
- Predicting total_sales per order
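The house-price example above can be sketched with scikit-learn; the data below is invented for illustration:

```python
# Minimal regression sketch: learn a size -> price mapping.
# Sizes in m² and prices in thousands (made-up, perfectly linear here).
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[50], [80], [120], [150], [200]])
y = np.array([150, 240, 360, 450, 600])

model = LinearRegression()
model.fit(X, y)                # fit the line to the labeled data
print(model.predict([[100]]))  # predicts ~300 for a 100 m² house
```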
2. Classification
Predict discrete categories or classes. Examples:
- Email spam detection (spam vs. not spam)
- Customer churn prediction (will leave vs. will stay)
- Image recognition (identify objects in photos)
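The spam-detection example can be sketched the same way; the two toy features (link count, ALL-CAPS word count) and all values are invented:

```python
# Minimal classification sketch: spam (1) vs. not spam (0).
from sklearn.linear_model import LogisticRegression

X = [[0, 1], [1, 0], [0, 0], [8, 9], [9, 7], [7, 8]]  # [links, caps words]
y = [0, 0, 0, 1, 1, 1]                                # labels

clf = LogisticRegression().fit(X, y)
print(clf.predict([[8, 8], [0, 1]]))  # expect spam (1), then not spam (0)
```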
The Machine Learning Pipeline
Data collection and exploration
Gather data and understand its structure, quality, and distributions.
- Load datasets (CSV, Excel, databases)
- Check for missing values, duplicates, outliers
- Visualize distributions and relationships
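The inspection steps above map to a few pandas one-liners; the toy frame is invented:

```python
# Basic data inspection with pandas.
import pandas as pd

df = pd.DataFrame({
    "price": [10.0, 12.5, None, 8.0],
    "category": ["a", "b", "b", "a"],
})
print(df.isna().sum())        # missing values per column
print(df.duplicated().sum())  # count of duplicate rows
print(df.describe())          # summary statistics for numeric columns
```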
Data preprocessing
Clean and transform data into a format suitable for machine learning.
For numeric features:
- Impute missing values (median, mean)
- Scale/normalize (StandardScaler, MinMaxScaler)
- Create derived features (e.g., date differences)
For categorical features:
- Encode as numbers (OneHotEncoder, LabelEncoder)
- Handle high-cardinality categories
- Impute missing categories (mode)
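The numeric and categorical steps above can be wired together in a single ColumnTransformer; the column names and toy data here are hypothetical:

```python
# Sketch: combined numeric + categorical preprocessing.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),   # fill numeric gaps
        ("scale", StandardScaler()),                    # standardize
    ]), ["price", "shipping_cost"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),  # mode imputation
        ("encode", OneHotEncoder(handle_unknown="ignore")),   # one-hot encode
    ]), ["category"]),
])

df = pd.DataFrame({
    "price": [10.0, np.nan, 8.0],
    "shipping_cost": [2.0, 3.0, 2.5],
    "category": ["a", "b", np.nan],
})
print(preprocess.fit_transform(df).shape)  # 2 numeric + one-hot columns
```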
Model selection and training
Choose appropriate algorithms and train models.
- Start with simple baseline models
- Try multiple algorithms
- Use cross-validation to assess stability
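The baseline-first approach above can be sketched by pitting a dummy model against a stronger one under cross-validation; the data is synthetic:

```python
# Sketch: compare a trivial baseline against a real model.
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

for name, model in [("baseline (mean)", DummyRegressor()),
                    ("random forest", RandomForestRegressor(random_state=0))]:
    scores = cross_val_score(model, X, y, cv=5)  # R² per fold by default
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```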
Model evaluation
Assess model performance using appropriate metrics.
Regression metrics: MAE, MSE, RMSE, R²
Classification metrics: Accuracy, Precision, Recall, F1-score
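All of these metrics are one call each in scikit-learn; the true/predicted values below are invented:

```python
# Computing the metrics above on toy values.
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, mean_absolute_error,
                             mean_squared_error, precision_score, r2_score,
                             recall_score)

# Regression
y_true, y_pred = [3.0, 5.0, 2.0], [2.5, 5.0, 3.0]
print(mean_absolute_error(y_true, y_pred))          # MAE = 0.5
print(mean_squared_error(y_true, y_pred))           # MSE
print(np.sqrt(mean_squared_error(y_true, y_pred)))  # RMSE
print(r2_score(y_true, y_pred))                     # R²

# Classification
c_true, c_pred = [1, 0, 1, 1, 0], [1, 0, 0, 1, 0]
print(accuracy_score(c_true, c_pred))   # 4 of 5 correct = 0.8
print(precision_score(c_true, c_pred))  # of predicted positives, how many real
print(recall_score(c_true, c_pred))     # of real positives, how many caught
print(f1_score(c_true, c_pred))         # harmonic mean of precision and recall
```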
Model optimization
Improve model performance through hyperparameter tuning.
- Grid search (GridSearchCV)
- Random search (RandomizedSearchCV)
- Feature engineering and selection
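Grid search can be sketched as follows; the data is synthetic and the parameter grid is chosen purely for demonstration:

```python
# Sketch: hyperparameter tuning with GridSearchCV.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5]},
    cv=5,  # each combination is evaluated with 5-fold cross-validation
)
grid.fit(X, y)
print(grid.best_params_)           # best combination found
print(round(grid.best_score_, 3))  # its mean cross-validated accuracy
```

RandomizedSearchCV follows the same pattern but samples a fixed number of combinations, which scales better to large grids.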
Key Concepts
Features and target variables
Features (X): Input variables used to make predictions
- Also called: independent variables, predictors, inputs
- Example: In e-commerce, features might include product category, price, discount, shipping cost
Target (y): The output variable you want to predict
- Also called: dependent variable, label, response
- Example: In e-commerce, target might be total_sales amount
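In pandas, separating X from y is usually one line each; the column names below echo the e-commerce example and are hypothetical:

```python
# Sketch: splitting a DataFrame into features X and target y.
import pandas as pd

df = pd.DataFrame({
    "price": [10.0, 20.0, 15.0],
    "discount": [0.1, 0.0, 0.2],
    "shipping_cost": [2.0, 3.0, 2.5],
    "total_sales": [12.0, 23.0, 14.0],
})
X = df.drop(columns=["total_sales"])  # features
y = df["total_sales"]                 # target
print(X.shape, y.shape)
```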
Training vs. testing data
Training set: Data used to fit the model
- Model learns patterns from this data
- Typically 70-80% of total data
Test set: Data used to evaluate the model
- Simulates real-world prediction on unseen data
- Typically 20-30% of total data
- Must never be used during training
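An 80/20 split like the one described above can be sketched with train_test_split (synthetic data):

```python
# Sketch: holding out 20% of the data for testing.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
print(len(X_train), len(X_test))  # 80 20
```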
Overfitting vs. underfitting
Overfitting: Model learns training data too well, including noise
- High training accuracy, low test accuracy
- Solution: More data, regularization, simpler model
Underfitting: Model is too simple to capture the underlying patterns
- Low training accuracy, low test accuracy
- Solution: More complex model, more features, less regularization
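The train/test gap that signals overfitting can be sketched with an unconstrained decision tree on deliberately noisy synthetic data:

```python
# Sketch: diagnosing overfitting via the train/test score gap.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y=0.2 injects label noise the model should NOT be able to memorize
X, y = make_classification(n_samples=300, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # no depth limit
print(tree.score(X_tr, y_tr))  # near-perfect on training data
print(tree.score(X_te, y_te))  # noticeably lower on test data -> overfitting
```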
Cross-validation
K-Fold cross-validation splits data into K parts (folds):
- Train on K-1 folds, test on 1 fold
- Repeat K times, each time with a different test fold
- Average the K test scores
Benefits:
- More reliable performance estimate
- All data used for both training and testing
- Reduces variance in performance metrics
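The K-Fold procedure above can be sketched with an explicit KFold splitter (K = 5, synthetic data):

```python
# Sketch: 5-fold cross-validation, one score per fold.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=100, n_features=5, noise=5, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=kf)  # one R² per fold
print(len(scores))    # 5 folds
print(scores.mean())  # the averaged estimate
```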
Scikit-learn: The Standard Library
Scikit-learn provides a consistent API for machine learning in Python.
Scikit-learn Pattern
All scikit-learn models follow the same pattern:
fit(), predict(), score(). This makes it easy to experiment with different algorithms.
Module A6: Hands-On Learning
In Module A6 of the bootcamp, you’ll work through:
Regression models
Linear regression, polynomial features, Ridge, Lasso, Gradient Boosting
Classification models
Logistic regression, KNN, decision trees, random forests
Model evaluation
Metrics, confusion matrices, ROC curves, model selection
E-commerce project
Predict sales using a real Amazon dataset with 10,000 orders
Best Practices
Feature engineering often matters more than algorithm choice. Spend time creating meaningful derived features from your data.
Next Steps
Learn regression modeling
Dive into linear regression, polynomial features, and ensemble methods