Lifecycle Overview
The system follows a 5-stage ML lifecycle. The feedback loop from monitoring back to training enables data-driven retraining decisions based on drift detection and performance degradation.
Stage 1: Data Pipeline
Data Sources
The system ingests data from multiple sources:
- CSV files - Primary data source (ml_datasource.csv)
- SQL-backed assets - Schema-based data loaders (future extension)
- Event streams - Real-time pipeline simulation (real_time_pipelines/)
Data Loading
Data ingestion is centralized in src/data.py.
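A minimal sketch of what a config-driven loader in src/data.py might look like. The function names and config keys (`data.csv_path`, `data.target`) are assumptions for illustration, not the repo's actual API:

```python
# Hypothetical sketch of a config-driven loader; function and key names assumed.
from pathlib import Path

import pandas as pd
import yaml


def load_config(path: str = "config.yaml") -> dict:
    """Read the pipeline configuration."""
    with open(path) as f:
        return yaml.safe_load(f)


def load_dataset(config: dict) -> pd.DataFrame:
    """Load the CSV pointed to by the config and drop rows missing the target."""
    csv_path = Path(config["data"]["csv_path"])  # e.g. "ml_datasource.csv"
    df = pd.read_csv(csv_path)
    return df.dropna(subset=[config["data"]["target"]])
```

Centralizing loading this way means every stage (training, benchmarking, serving) reads identical data semantics from one place.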
Dataset Schema
The student engagement dataset includes the following features:
| Feature | Type | Description |
|---|---|---|
| student_country | string | Two-letter country code |
| days_on_platform | int | Days since registration |
| minutes_watched | float | Total video watch time |
| courses_started | int | Number of courses enrolled |
| practice_exams_started | int | Number of exams attempted |
| practice_exams_passed | int | Number of exams passed |
| minutes_spent_on_exams | float | Total exam time |
| purchased | int | Target variable (0 or 1) |
Configuration-Driven Execution
All data parameters are specified in config.yaml.
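An illustrative fragment of what the data section of config.yaml could contain. The key names below are assumptions about the schema, not the repo's actual keys:

```yaml
# Illustrative config.yaml fragment; key names are assumptions.
data:
  csv_path: ml_datasource.csv
  target: purchased
  test_size: 0.2
  random_state: 42
```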
Stage 2: Training Pipeline
Model Training Architecture
The training system uses sklearn pipelines with preprocessing and modeling stages.
Model Selection
The system trains 5 different classifiers in parallel:
Logistic Regression
Fast, interpretable baseline
- max_iter: 2000
- L2 regularization
K-Nearest Neighbors
Non-parametric classifier
- n_neighbors: 7
Support Vector Machine
Kernel-based classifier
- kernel: RBF
- C: 1.0
Decision Tree
Interpretable tree model
- max_depth: 8
- min_samples_leaf: 10
Random Forest
Ensemble classifier
- n_estimators: 400
- min_samples_leaf: 2
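The preprocessing-plus-modeling pipeline described above, shown with one of the five candidates, might be sketched like this. Column lists come from the dataset schema; the specific preprocessing choices (scaling numerics, one-hot encoding the country) are assumptions:

```python
# Sketch of a preprocessing + modeling sklearn pipeline; preprocessing choices assumed.
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC = [
    "days_on_platform", "minutes_watched", "courses_started",
    "practice_exams_started", "practice_exams_passed", "minutes_spent_on_exams",
]
CATEGORICAL = ["student_country"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), NUMERIC),
    ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
])

# One of the five candidates; hyperparameters taken from the list above.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", RandomForestClassifier(n_estimators=400, min_samples_leaf=2, random_state=42)),
])
```

Bundling preprocessing into the pipeline ensures the exact same transformations run at training and serving time.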
Cross-Validation Strategy
The training pipeline uses stratified k-fold cross-validation.
Threshold Calibration
The system calibrates decision thresholds to meet business precision targets.
Artifact Generation
Training produces multiple artifacts for lineage tracking.
Lineage tracking details
Each training run generates a unique run_id and captures SHA-256 hashes for:
- Dataset file (ml_datasource.csv)
- Configuration file (config.yaml)
- Model artifact (best_model.joblib)
- Threshold file (threshold.txt)
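The hashing step can be sketched as below. The record structure and helper names are assumptions; only the SHA-256 hashing and the run_id are taken from the description above:

```python
# Sketch of lineage capture: file hashing plus a unique run_id (structure assumed).
import hashlib
import uuid


def sha256_of(path: str) -> str:
    """Stream a file through SHA-256 so large datasets need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


def lineage_record(paths: dict) -> dict:
    """Build a lineage record mapping artifact names to content hashes."""
    return {
        "run_id": uuid.uuid4().hex,
        "hashes": {name: sha256_of(p) for name, p in paths.items()},
    }
```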
These hashes are verified by scripts/reproducibility_check.py.
Stage 3: Optimization and Benchmarking
Statistical Benchmarking
The benchmarking system measures repeated-run performance:
- Latency: p50, p95, p99 percentiles
- Throughput: samples/second
- Memory: peak RSS
- Accuracy: ROC-AUC, precision, recall
Benchmark results are hardware-dependent. Run on target deployment hardware for accurate measurements.
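A minimal latency benchmark along these lines: repeated single-sample timing after a warmup, then percentile aggregation. The function name and run counts are illustrative, not the repo's actual benchmark harness:

```python
# Minimal latency benchmark sketch: repeated single-sample timing, then percentiles.
import time

import numpy as np


def latency_percentiles(predict_fn, sample, n_runs: int = 200, warmup: int = 20) -> dict:
    """Time n_runs calls (after warmup) and report p50/p95/p99 in milliseconds."""
    for _ in range(warmup):
        predict_fn(sample)
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        predict_fn(sample)
        times.append((time.perf_counter() - start) * 1e3)
    p50, p95, p99 = np.percentile(times, [50, 95, 99])
    return {"p50_ms": p50, "p95_ms": p95, "p99_ms": p99}
```

Warmup runs matter here: they absorb one-time costs (JIT, caches, lazy imports) that would otherwise inflate the tail percentiles.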
Hardware-Aware Trade-offs
The trade-off analysis compares deployment options:
- sklearn vs ONNX Runtime latency
- FP32 vs INT8 quantization impact
- Batch size vs throughput curves
- Accuracy vs speed Pareto frontiers
ONNX Export and Quantization
The deployment pipeline converts sklearn models to ONNX format for portability:
- Export: deployment/export_onnx.py - Convert sklearn to ONNX
- Quantize: deployment/quantize_onnx.py - INT8 quantization for CPU inference
- Validate: deployment/parity_check.py - Numerical parity verification
Parity Checking
Parity checks enforce numerical equivalence:
- Absolute error per sample < 0.04
- Mean absolute error < 0.01
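The two tolerance rules can be expressed as a small comparison over the probability outputs of both runtimes. The function name is illustrative; the thresholds are the ones listed above:

```python
# Sketch of the parity tolerance check between sklearn and ONNX probabilities.
import numpy as np


def parity_ok(p_sklearn, p_onnx, max_abs: float = 0.04, max_mean: float = 0.01) -> bool:
    """Return True when both per-sample and mean absolute errors are within tolerance."""
    err = np.abs(np.asarray(p_sklearn) - np.asarray(p_onnx))
    return bool(err.max() < max_abs and err.mean() < max_mean)
```

Checking both a per-sample bound and a mean bound catches two different failure shapes: a single badly divergent sample, and a small but systematic bias across the whole set.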
Stage 4: Deployment Layer
FastAPI Service Architecture
The deployment layer exposes a RESTful API.
API Endpoints
| Endpoint | Method | Description | Reference |
|---|---|---|---|
| /health | GET | Service health check | src/api.py:275-281 |
| /predict | POST | Single prediction | src/api.py:284-289 |
| /batch_predict | POST | Batch predictions | src/api.py:292-297 |
| /monitoring/drift | GET | Drift detection status | src/api.py:300-302 |
| /monitoring/retraining_trigger | GET | Retraining recommendation | src/api.py:305-307 |
Request/Response Schemas
Pydantic models enforce type safety.
Startup Lifecycle
Artifact loading occurs during application startup:
- Configuration from config.yaml
- Trained model from artifacts/best_model.joblib
- Threshold from artifacts/threshold.txt
- Drift baseline from artifacts/drift_baseline.json
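A sketch of that startup loading, using the artifact paths listed above. The helper name and return structure are assumptions:

```python
# Sketch of startup artifact loading; helper name and dict layout assumed.
from pathlib import Path

import joblib
import yaml


def load_artifacts(artifact_dir: str = "artifacts", config_path: str = "config.yaml") -> dict:
    """Load everything the service needs before it accepts traffic."""
    base = Path(artifact_dir)
    return {
        "config": yaml.safe_load(Path(config_path).read_text()),
        "model": joblib.load(base / "best_model.joblib"),
        "threshold": float((base / "threshold.txt").read_text().strip()),
    }
```

Loading at startup (rather than per request) fails fast on missing or corrupt artifacts and keeps request latency free of I/O.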
Stage 5: Monitoring and Observability
Real-Time Drift Detection
The monitoring system tracks feature distributions during inference.
Drift Scoring
Drift is detected using z-score comparison.
Retraining Triggers
The system recommends retraining when:
- Feature Drift: ≥2 features exceed the z-score threshold of 3.0 (config: drift_min_features: 2, drift_zscore_threshold: 3.0)
- Prediction Shift: the predicted positive rate shifts >10% from training (config: class_rate_shift_threshold: 0.1)
Prediction Logging
All predictions are logged in JSONL format.
Runtime Metrics
Monitoring artifacts are written to artifacts/ for downstream analysis:
- prediction_log.jsonl - Per-request inference logs
- Drift baseline - Training distribution statistics
- Benchmark snapshots - Latency and throughput measurements
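The drift scoring and retraining triggers described above can be sketched as below. The function names are illustrative; the thresholds match the config values, and treating the prediction shift as an absolute rate difference is an assumption:

```python
# Sketch of z-score drift scoring plus the two retraining triggers.
# Absolute-rate interpretation of "prediction shift" is an assumption.
import numpy as np


def drift_features(live, baseline_mean, baseline_std, z_threshold: float = 3.0):
    """Boolean mask of features whose live mean drifted beyond the z threshold."""
    z = np.abs(live.mean(axis=0) - baseline_mean) / np.maximum(baseline_std, 1e-12)
    return z > z_threshold


def should_retrain(drift_mask, live_pos_rate, train_pos_rate,
                   min_features: int = 2, rate_shift_threshold: float = 0.1) -> bool:
    """Trigger on >= min_features drifted features or a large positive-rate shift."""
    return bool(drift_mask.sum() >= min_features
                or abs(live_pos_rate - train_pos_rate) > rate_shift_threshold)
```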
Component Interconnections
Directory Structure
The repository organizes components by lifecycle stage.
Configuration Flow
All components read from config.yaml.
Makefile Shortcuts
Common workflows are automated as make targets.
Design Principles
Reproducibility
Configuration-driven execution with deterministic seeds and lineage tracking
Portability
ONNX export enables deployment across runtimes and hardware targets
Observability
Drift detection, prediction logging, and performance benchmarking
Validation
Parity checks enforce numerical equivalence between sklearn and ONNX
Trade-offs and Limitations
Latency vs Accuracy
Quantized INT8 models improve latency on many CPU targets but may introduce small accuracy shifts. Always validate parity before production deployment.
Reference: systems_overview/workflow.md:24
Throughput vs Queue Delay
Streaming worker scaling improves throughput but can increase contention and queue backpressure.
Reference: systems_overview/workflow.md:25
Portability vs Feature Completeness
ONNX export improves portability but may not support all sklearn operators. Test conversion for custom estimators.
Reference: systems_overview/workflow.md:26
Hardware Dependency
Benchmark results vary by hardware. CPU models, cache sizes, and instruction sets affect performance.
Reference: README.md:79
Failure Modes
The following conditions are treated as release blockers:
- Parity drift: ONNX predictions differ from sklearn beyond tolerance
- Schema mismatch: Training and serving features have different names/types
- Queue saturation: Streaming pipeline backpressure exceeds capacity
- Drift detection failure: Monitoring system cannot compute drift scores
Next Steps
Quickstart Guide
Train your first model and make predictions in 10 minutes
Configuration Reference
Detailed breakdown of all config.yaml options
Deployment Guide
Production deployment patterns and ONNX optimization
Benchmarking Guide
Measure and optimize model performance