## Prerequisites
Before you begin, ensure you have:

- Python 3.8 or higher
- pip package manager
- Git (optional, for cloning the repository)
## Installation

### Install dependencies
Upgrade pip and install all required packages:
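A typical sequence, assuming dependencies are pinned in a requirements.txt at the repository root:

```shell
# Upgrade pip, then install the project's pinned dependencies
python -m pip install --upgrade pip
pip install -r requirements.txt
```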
### View key dependencies
The project uses these core libraries:
- FastAPI 0.115.0 - API server framework
- scikit-learn 1.5.2 - Machine learning models
- pandas 2.2.2 - Data processing
- ONNX Runtime 1.19.2 - Optimized inference
- uvicorn 0.30.6 - ASGI server
## Training Your First Model
The training pipeline trains multiple classifiers and selects the best performer based on cross-validation.

### Run Training
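The file references in this guide point to src/train.py; assuming that script is runnable as a module, training would be launched from the repository root like this:

```shell
# Run the training pipeline from the repository root
# (module invocation path is an assumption based on the src/train.py references)
python -m src.train
```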
### Expected Output
The training script will:

- Load data from ml_datasource.csv (src/data.py:28)
- Train 5 different models: Logistic Regression, KNN, SVM, Decision Tree, and Random Forest (src/train.py:58-97)
- Perform 5-fold stratified cross-validation (src/train.py:130-134)
- Select the best model based on ROC-AUC score (src/train.py:156)
- Calibrate decision threshold to target 90% precision (config.yaml:34)
The training process uses a configurable seed (default: 42) for reproducibility. All random operations are deterministic.
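Putting the documented defaults together, config.yaml might look roughly like the sketch below. The key names are assumptions; only the values (seed 42, the 90% precision target, and the drift settings described later in this guide) come from the documentation:

```yaml
# Hypothetical shape of config.yaml -- key names are illustrative
seed: 42                           # reproducibility seed for all random operations
threshold:
  target_precision: 0.90           # calibration target (config.yaml:34)
monitoring:
  min_samples: 50                  # minimum samples before drift checks (config.yaml:44)
  z_score_threshold: 3.0           # per-feature drift cutoff (config.yaml:45-46)
  min_drifted_features: 2
  max_prediction_rate_shift: 0.10  # vs. training prediction rate (config.yaml:47)
```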
### Generated Artifacts
Training creates several files in the artifacts/ directory (src/train.py:172-224):
| File | Description | Reference |
|---|---|---|
| best_model.joblib | Trained sklearn pipeline with preprocessor | config.yaml:37 |
| threshold.txt | Calibrated decision threshold | config.yaml:38 |
| metrics.json | Performance metrics and CV results | config.yaml:39 |
| lineage.json | SHA-256 hashes for reproducibility | config.yaml:41 |
| drift_baseline.json | Training distribution statistics | config.yaml:40 |
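After training completes, a quick shell check (paths assume the artifacts/ directory above) confirms the files exist and that metrics.json parses:

```shell
ls artifacts/
cat artifacts/threshold.txt                       # calibrated threshold (a single float)
python -m json.tool artifacts/metrics.json > /dev/null && echo "metrics.json OK"
```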
## Starting the API Server
The FastAPI server provides endpoints for predictions, batch inference, and drift monitoring.

### Launch the Server
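Given the uvicorn dependency and the src/api.py references, the server is presumably launched with uvicorn; the app module path here is an assumption:

```shell
# Serve the FastAPI app (assumes the app object is defined in src/api.py)
uvicorn src.api:app --host 0.0.0.0 --port 8000
```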
### Verify Server Health
Check that the model loaded successfully. If ready is false in the response, the model artifacts are missing; re-run training.
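Assuming a /health endpoint on the default port (both are assumptions; check the routes in src/api.py), the check might look like:

```shell
curl -s http://localhost:8000/health
# A healthy response reports a ready field, e.g. {"ready": true}
# (exact response shape is an assumption)
```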
## Making Your First Prediction
The API accepts student engagement features and returns purchase probability.

### Single Prediction
Send a POST request to /predict with student features:
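A request using the required fields listed under Input Validation below; the values are illustrative:

```shell
curl -s -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{
        "student_country": "US",
        "days_on_platform": 30,
        "minutes_watched": 1250.5,
        "courses_started": 3,
        "practice_exams_started": 2,
        "practice_exams_passed": 1,
        "minutes_spent_on_exams": 45.0
      }'
```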
### Understanding the response
- predicted_purchase_probability: Raw model probability (0.0 to 1.0)
- predicted_purchase: Binary prediction (0 or 1) using calibrated threshold
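The two fields are related through the calibrated threshold stored in artifacts/threshold.txt; with illustrative numbers, the binary label is derived like this:

```shell
# A probability above the threshold yields predicted_purchase = 1 (prints 1)
python -c "prob, threshold = 0.73, 0.62; print(int(prob >= threshold))"
```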
### Batch Predictions
Process multiple records in a single request:

## Input Validation
The API enforces schema validation using Pydantic (src/api.py:27-35):

### Required Fields
- student_country (2-64 chars)
- days_on_platform (≥0)
- minutes_watched (≥0)
- courses_started (≥0)
- practice_exams_started (≥0)
- practice_exams_passed (≥0)
- minutes_spent_on_exams (≥0)
### Business Rules
- practice_exams_passed cannot exceed practice_exams_started
- Validation occurs at src/api.py:243-248
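The cross-field rule can be illustrated standalone; this sketch mirrors the check at src/api.py:243-248 but is not the project's actual code:

```shell
python - <<'PY'
# A record violating the business rule: more exams passed than started
record = {"practice_exams_started": 2, "practice_exams_passed": 3}
ok = record["practice_exams_passed"] <= record["practice_exams_started"]
print("valid" if ok else "invalid")   # -> invalid
PY
```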
### Handling Validation Errors
Invalid requests return HTTP 422.

## Monitoring and Drift Detection
The API tracks prediction statistics and feature distributions in real time.

### Check Drift Status
Drift detection requires at least 50 samples (config.yaml:44). The system compares running feature means to training baselines using z-scores (src/api.py:91-172).
### Retraining Triggers
The system recommends retraining when:

- ≥2 features drift beyond z-score threshold of 3.0 (config.yaml:45-46)
- Prediction rate shifts >10% from training rate (config.yaml:47)
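The two triggers above can be sketched together. The feature names and statistics here are illustrative; only the thresholds (z > 3.0, ≥2 drifted features, 10% rate shift) come from the configuration described in this guide:

```shell
python - <<'PY'
# Illustrative drift check: compare running means to training baselines
baseline = {"minutes_watched": (120.0, 30.0), "days_on_platform": (40.0, 10.0)}  # (mean, std)
running_means = {"minutes_watched": 230.0, "days_on_platform": 75.0}

# Features whose running mean deviates by more than 3 standard deviations
drifted = [
    f for f, (mean, std) in baseline.items()
    if abs(running_means[f] - mean) / std > 3.0
]

# Relative shift of the positive-prediction rate vs. the training rate
rate_shift = abs(0.35 - 0.30) / 0.30

retrain = len(drifted) >= 2 or rate_shift > 0.10
print(drifted, round(rate_shift, 2), retrain)
PY
```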
## Troubleshooting
### Model artifacts missing

Error: RuntimeError: Model artifacts are missing

Solution: Run training first, then verify the files were created in the artifacts/ directory.
### Port already in use

Error: OSError: [Errno 98] Address already in use

Solution: Start the server on a different port (for example, pass --port 8001 to uvicorn).
### Import errors

Error: ModuleNotFoundError: No module named 'src'

Solution: Run all commands from the repository root directory.
### Low prediction accuracy

Issue: Model performance is poor.

Investigation steps:
- Check metrics.json for CV scores
- Verify data quality in ml_datasource.csv
- Adjust model hyperparameters in config.yaml
- Review feature engineering in src/features.py
## Next Steps

- System Architecture: Learn how components connect across the ML lifecycle
- Configuration Guide: Customize model training, features, and monitoring
- Deployment Guide: Export to ONNX and deploy optimized models
- Benchmarking: Measure latency, throughput, and accuracy trade-offs