Overview
After model selection, the best model is evaluated on the test set with calibrated decision thresholds to meet business requirements. The evaluation focuses on precision-recall tradeoffs and ROC AUC performance.

Evaluation Metrics
Four primary metrics are computed on the test set:

1. ROC AUC (Area Under ROC Curve)
Measures overall classification performance across all thresholds. Range: 0.0 to 1.0 (higher is better).

Interpretation:
- 0.5: Random guessing
- 0.7-0.8: Acceptable performance
- 0.8-0.9: Good performance
- 0.9+: Excellent performance
Implementation: src/train.py:193
2. Precision
Proportion of predicted purchases that were actual purchases. Formula: TP / (TP + FP)
Why it matters: High precision means fewer false alarms (users predicted to purchase who don’t)
Implementation: src/train.py:194
3. Recall
Proportion of actual purchases that were correctly predicted. Formula: TP / (TP + FN)
Why it matters: High recall means capturing most potential purchasers
Implementation: src/train.py:195
4. F1 Score
Harmonic mean of precision and recall. Formula: 2 × (precision × recall) / (precision + recall)
Why it matters: Balances precision and recall into a single metric
Implementation: src/train.py:196
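The four metrics above follow directly from the confusion counts and the rank statistic. A minimal stdlib-only sketch (the project itself presumably uses scikit-learn's `metrics` module; the function names and sample labels here are illustrative):

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall_f1(y_true, y_pred):
    """Apply the three formulas from the sections above."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def roc_auc(y_true, y_score):
    """Rank-based AUC: probability a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
print(precision_recall_f1(y_true, y_pred))  # (0.75, 0.75, 0.75)
```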
Precision-Recall Threshold Calibration
The default 0.5 threshold is replaced with a calibrated threshold to meet business precision targets. Implementation: src/train.py:159-170
Threshold Selection Algorithm
- Generate precision-recall curve at all possible thresholds
- Filter candidates where precision ≥ target_precision
- Select threshold with maximum recall among candidates
- Fallback: If no candidates, use threshold with highest precision
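The four steps above can be sketched in plain Python (a stdlib-only illustration; the function name and the exact tie-breaking are assumptions, not the code in src/train.py):

```python
def calibrate_threshold(y_true, y_score, target_precision):
    """Pick the threshold with maximum recall whose precision meets the target.

    Falls back to the most precise threshold if no candidate qualifies.
    """
    n_pos = sum(y_true)
    candidates, all_points = [], []
    for thr in sorted(set(y_score)):  # every observed score is a candidate threshold
        y_pred = [1 if s >= thr else 0 for s in y_score]
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / n_pos if n_pos else 0.0
        all_points.append((precision, recall, thr))
        if precision >= target_precision:
            candidates.append((recall, thr))
    if candidates:
        return max(candidates)[1]  # max recall among qualifying thresholds
    return max(all_points)[2]      # fallback: highest precision overall

y_true  = [0, 0, 1, 0, 1, 1, 0, 1]
y_score = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.75, 0.9]
print(calibrate_threshold(y_true, y_score, target_precision=0.75))  # 0.6
```

At threshold 0.6 this toy data reaches precision 0.75 with recall 0.75; the only other qualifying threshold (0.9) has lower recall, so 0.6 wins.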
Business Configuration
Target precision is configured in config.yaml.
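The config.yaml fragment is not reproduced here; a plausible shape, with key names that are assumptions rather than the project's actual schema:

```yaml
# Hypothetical config.yaml fragment; key names are illustrative.
evaluation:
  target_precision: 0.80
```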
Why Calibrate Thresholds?
The default 0.5 threshold may not align with business goals:

- Marketing campaigns: High precision reduces wasted ad spend
- Sales outreach: Focus efforts on likely purchasers
- User experience: Avoid over-targeting uninterested users
Metrics Output Format
All metrics are saved to metrics.json.
Implementation: src/train.py:188-199
Metrics Structure
| Field | Type | Description |
|---|---|---|
| run_id | string | Unique identifier for training run |
| best_model_name | string | Name of selected model |
| calibration | object | Threshold calibration details |
| calibration.type | string | Calibration method ("threshold") |
| calibration.target_precision | float | Target precision from config |
| calibration.threshold | float | Selected decision threshold |
| accuracy | float | Overall classification accuracy |
| roc_auc | float | ROC AUC score |
| precision | float | Precision at calibrated threshold |
| recall | float | Recall at calibrated threshold |
| f1 | float | F1 score at calibrated threshold |
| cv_ranking | array | Cross-validation results for all models |
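An illustrative metrics.json matching the structure above (all values and the model name are invented for illustration):

```json
{
  "run_id": "2024-05-01T12-00-00",
  "best_model_name": "gradient_boosting",
  "calibration": {
    "type": "threshold",
    "target_precision": 0.8,
    "threshold": 0.62
  },
  "accuracy": 0.91,
  "roc_auc": 0.87,
  "precision": 0.81,
  "recall": 0.64,
  "f1": 0.72,
  "cv_ranking": [{"model": "gradient_boosting", "roc_auc": 0.86}]
}
```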
Artifacts Configuration
Output locations are configured in config.yaml.
Generated Files
- best_model.joblib: Trained scikit-learn pipeline
- threshold.txt: Calibrated threshold value (plain text)
- metrics.json: Complete evaluation metrics (JSON)
- drift_baseline.json: Training data statistics for drift detection
- lineage.json: Data and model provenance information
Model Persistence
The best model and threshold are saved for production use. Implementation: src/train.py:172-181
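A minimal persistence sketch, assuming an `artifacts/` output directory (the path, the threshold value, and the metrics are illustrative; the trained pipeline itself would be serialized with `joblib.dump`, matching the `.joblib` extension above):

```python
import json
from pathlib import Path

# Hypothetical output directory; the real paths come from config.yaml.
artifacts = Path("artifacts")
artifacts.mkdir(exist_ok=True)

# The trained scikit-learn pipeline would be saved with joblib, e.g.:
#   joblib.dump(best_model, artifacts / "best_model.joblib")

# The threshold is stored as plain text so downstream services can read
# it without deserializing the model.
threshold = 0.62
(artifacts / "threshold.txt").write_text(f"{threshold:.6f}")

# Evaluation metrics go to JSON (illustrative values).
metrics = {"roc_auc": 0.87, "precision": 0.81, "recall": 0.64, "f1": 0.72}
(artifacts / "metrics.json").write_text(json.dumps(metrics, indent=2))

print((artifacts / "threshold.txt").read_text())  # 0.620000
```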
Lineage Tracking
Data and model lineage is tracked with SHA256 hashes. Implementation: src/train.py:201-220
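A sketch of SHA256-based lineage using the stdlib `hashlib` (the record's field names and the sample bytes are assumptions; in the project the hashes would be computed over the actual data file, config.yaml, and the serialized model):

```python
import hashlib
import json

def sha256_of_bytes(data: bytes) -> str:
    """Hex digest of the SHA256 hash of a byte string."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical inputs standing in for the real training data and config.
raw_data = b"user_id,purchased\n1,0\n2,1\n"
config_text = b"evaluation:\n  target_precision: 0.8\n"

lineage = {
    "data_sha256": sha256_of_bytes(raw_data),
    "config_sha256": sha256_of_bytes(config_text),
}
print(json.dumps(lineage, indent=2))
```

Because the digest changes whenever a single byte of the input changes, matching hashes in lineage.json prove that a model was trained on exactly the recorded data and config.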
Lineage Benefits
- Reproducibility: Track exact data and config versions
- Auditability: Verify model provenance
- Debugging: Identify which data produced which model
- Rollback: Match models to their training data
Usage Example
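A minimal sketch of serving-time use of the calibrated threshold (the function name and values are hypothetical; in production the threshold would be read from threshold.txt and the probabilities would come from the saved pipeline's `predict_proba`):

```python
def predict_with_threshold(probabilities, threshold):
    """Convert purchase probabilities to 0/1 decisions at a calibrated cutoff."""
    return [1 if p >= threshold else 0 for p in probabilities]

threshold = 0.62                      # would be read from threshold.txt
probs = [0.12, 0.55, 0.83, 0.70]      # would come from the model
print(predict_with_threshold(probs, threshold))  # [0, 0, 1, 1]
```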
Complete Training Flow
The full training process (src/train.py:116-227):
- Load configuration and data
- Split into train/test sets
- Build preprocessor and models
- Cross-validate all models
- Select best model by ROC AUC
- Generate precision-recall curve
- Calibrate threshold to target precision
- Evaluate on test set
- Save model, threshold, and metrics
- Track lineage with hashes
Next Steps
Model Selection
Learn about cross-validation and model comparison
Data Loading
Understand the data pipeline