This guide walks you through installation, training your first model, starting the API server, and making predictions.

Prerequisites

Before you begin, ensure you have:
  • Python 3.8 or higher
  • pip package manager
  • Git (optional, for cloning the repository)

Installation

1. Clone the repository

git clone <repository-url>
cd <repository-name>

2. Install dependencies

Upgrade pip and install all required packages:
python -m pip install --upgrade pip
pip install -r requirements.txt
The project uses these core libraries:
  • FastAPI 0.115.0 - API server framework
  • scikit-learn 1.5.2 - Machine learning models
  • pandas 2.2.2 - Data processing
  • ONNX Runtime 1.19.2 - Optimized inference
  • uvicorn 0.30.6 - ASGI server

3. Verify installation

Check that the configuration file is present:
cat config.yaml
You should see configuration for models, data paths, and artifacts.

Training Your First Model

The training pipeline trains multiple classifiers and selects the best performer based on cross-validation.

Run Training

python -m src.train

Expected Output

The training script will:
  1. Load data from ml_datasource.csv (src/data.py:28)
  2. Train 5 different models: Logistic Regression, KNN, SVM, Decision Tree, and Random Forest (src/train.py:58-97)
  3. Perform 5-fold stratified cross-validation (src/train.py:130-134)
  4. Select the best model based on ROC-AUC score (src/train.py:156)
  5. Calibrate decision threshold to target 90% precision (config.yaml:34)
The training process uses a configurable seed (default: 42) for reproducibility. All random operations are deterministic.
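Conceptually, the selection loop in steps 2–4 looks like the following. This is a minimal sketch on synthetic data with two of the five candidates; the real pipeline in src/train.py adds preprocessing and the remaining models:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

SEED = 42  # the configurable seed keeps every random operation deterministic

# Synthetic stand-in for ml_datasource.csv
X, y = make_classification(n_samples=500, random_state=SEED)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=SEED),
    "Random Forest": RandomForestClassifier(random_state=SEED),
}

# 5-fold stratified cross-validation, scored on ROC-AUC
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    for name, model in candidates.items()
}

# The best performer by mean cross-validated ROC-AUC wins
best_model_name = max(scores, key=scores.get)
print(best_model_name)
```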
Example output:
{
  "run_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "best_model_name": "Random Forest",
  "calibration": {
    "type": "threshold",
    "target_precision": 0.9,
    "threshold": 0.6524
  },
  "accuracy": 0.9234,
  "roc_auc": 0.9567,
  "precision": 0.9012,
  "recall": 0.8345,
  "f1": 0.8665,
  "cv_ranking": [
    {
      "model": "Random Forest",
      "cv_roc_auc_mean": 0.9523,
      "cv_precision_mean": 0.8934,
      "cv_recall_mean": 0.8456,
      "cv_f1_mean": 0.8687
    }
  ]
}
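Step 5's threshold calibration can be sketched with scikit-learn's precision_recall_curve. The labels and probabilities below are made up for illustration; the idea is to pick the lowest threshold whose precision meets the 0.9 target, which preserves as much recall as possible:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Made-up validation labels and predicted probabilities (illustration only)
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 0, 1, 1])
y_prob = np.array([0.1, 0.2, 0.35, 0.4, 0.45, 0.6, 0.7, 0.75, 0.8, 0.9])

TARGET_PRECISION = 0.9  # config.yaml:34

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)

# precision has one more entry than thresholds; precision[:-1] aligns with them.
# The lowest qualifying threshold keeps recall as high as possible.
eligible = [t for p, t in zip(precision[:-1], thresholds) if p >= TARGET_PRECISION]
threshold = float(min(eligible)) if eligible else 0.5
print(round(threshold, 4))
```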

Generated Artifacts

Training creates several files in the artifacts/ directory (src/train.py:172-224):
File                 Description                                 Reference
best_model.joblib    Trained sklearn pipeline with preprocessor  config.yaml:37
threshold.txt        Calibrated decision threshold               config.yaml:38
metrics.json         Performance metrics and CV results          config.yaml:39
lineage.json         SHA-256 hashes for reproducibility          config.yaml:41
drift_baseline.json  Training distribution statistics            config.yaml:40
You must complete training before starting the API server. The API requires best_model.joblib and threshold.txt to load.
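The hashes in lineage.json can be reproduced with the standard library. A sketch (the chunked-read helper and demo file here are illustrative, not the project's actual code):

```python
import hashlib
import json
import tempfile
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large artifacts never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demonstrate on a throwaway file standing in for the training data
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "ml_datasource.csv"
    data.write_text("days_on_platform,minutes_watched\n12,366.7\n")
    lineage = {"ml_datasource.csv": sha256_of(data)}

print(json.dumps(lineage))
```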

Starting the API Server

The FastAPI server provides endpoints for predictions, batch inference, and drift monitoring.

Launch the Server

uvicorn src.api:app --host 0.0.0.0 --port 8000 --reload

Verify Server Health

Check that the model loaded successfully:
curl http://localhost:8000/health
Expected response:
{
  "ready": true,
  "predictor_loaded": true,
  "drift_baseline_loaded": true
}
If ready is false, the model artifacts are missing. Re-run training.

Making Your First Prediction

The API accepts student engagement features and returns purchase probability.

Single Prediction

Send a POST request to /predict with student features:
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{
    "student_country": "US",
    "days_on_platform": 12,
    "minutes_watched": 366.7,
    "courses_started": 5,
    "practice_exams_started": 1,
    "practice_exams_passed": 0,
    "minutes_spent_on_exams": 3.27
  }'
Response:
{
  "predicted_purchase_probability": 0.8234,
  "predicted_purchase": 1
}
  • predicted_purchase_probability: Raw model probability (0.0 to 1.0)
  • predicted_purchase: Binary prediction (0 or 1) using calibrated threshold
The prediction is 1 if probability ≥ threshold (src/api.py:252).
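If you prefer Python to curl, the same payload can be sent with the standard library. A sketch: the predict helper performs the network call, so invoke it only once the server from the previous section is running (here we only build and print the request body):

```python
import json
from urllib import request

API_URL = "http://localhost:8000/predict"

features = {
    "student_country": "US",
    "days_on_platform": 12,
    "minutes_watched": 366.7,
    "courses_started": 5,
    "practice_exams_started": 1,
    "practice_exams_passed": 0,
    "minutes_spent_on_exams": 3.27,
}

def predict(payload: dict) -> dict:
    """POST the feature payload and decode the JSON response."""
    req = request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

# Build (and inspect) the request body without touching the network
body = json.dumps(features)
print(body)
```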

Batch Predictions

Process multiple records in a single request:
curl -X POST "http://localhost:8000/batch_predict" \
  -H "Content-Type: application/json" \
  -d '{
    "records": [
      {
        "student_country": "US",
        "days_on_platform": 12,
        "minutes_watched": 366.7,
        "courses_started": 5,
        "practice_exams_started": 1,
        "practice_exams_passed": 0,
        "minutes_spent_on_exams": 3.27
      },
      {
        "student_country": "IN",
        "days_on_platform": 259,
        "minutes_watched": 118.0,
        "courses_started": 2,
        "practice_exams_started": 2,
        "practice_exams_passed": 1,
        "minutes_spent_on_exams": 16.48
      }
    ]
  }'
Response:
{
  "predictions": [
    {
      "predicted_purchase_probability": 0.8234,
      "predicted_purchase": 1
    },
    {
      "predicted_purchase_probability": 0.2156,
      "predicted_purchase": 0
    }
  ]
}

Input Validation

The API enforces schema validation using Pydantic (src/api.py:27-35):

Required Fields

  • student_country (2-64 chars)
  • days_on_platform (≥0)
  • minutes_watched (≥0)
  • courses_started (≥0)
  • practice_exams_started (≥0)
  • practice_exams_passed (≥0)
  • minutes_spent_on_exams (≥0)

Business Rules

  • practice_exams_passed cannot exceed practice_exams_started
  • Validation occurs at src/api.py:243-248
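Stripped of the framework, the cross-field rule amounts to a few lines of plain Python (a sketch equivalent to the Pydantic check; field names match the API schema):

```python
def validate_exam_counts(record: dict) -> dict:
    """Enforce the cross-field rule before a record reaches the model."""
    if record["practice_exams_passed"] > record["practice_exams_started"]:
        raise ValueError(
            "practice_exams_passed cannot exceed practice_exams_started."
        )
    return record

ok = validate_exam_counts({"practice_exams_started": 2, "practice_exams_passed": 1})

try:
    validate_exam_counts({"practice_exams_started": 1, "practice_exams_passed": 2})
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```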

Handling Validation Errors

Invalid requests return HTTP 422:
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{
    "student_country": "US",
    "days_on_platform": -5,
    "minutes_watched": 100.0,
    "courses_started": 1,
    "practice_exams_started": 1,
    "practice_exams_passed": 2,
    "minutes_spent_on_exams": 10.0
  }'
Error Response:
{
  "detail": "practice_exams_passed cannot exceed practice_exams_started."
}

Monitoring and Drift Detection

The API tracks prediction statistics and feature distributions in real-time.

Check Drift Status

curl http://localhost:8000/monitoring/drift
Response:
{
  "samples_observed": 127,
  "drift_score_max_abs_z": 2.34,
  "drifted_features": ["minutes_watched"],
  "predicted_positive_rate": 0.15,
  "training_positive_rate": 0.12,
  "should_retrain": false,
  "reason": "below_threshold",
  "recommended_action": "continue_monitoring"
}
Drift detection requires at least 50 samples (config.yaml:44). The system compares running feature means to training baselines using z-scores (src/api.py:91-172).
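The z-score comparison can be sketched as follows. The baseline numbers are invented for illustration (they happen to reproduce the 2.34 score from the example response):

```python
# Hypothetical training baseline stats (drift_baseline.json holds the real ones)
baseline = {"minutes_watched": {"mean": 300.0, "std": 50.0}}
running_means = {"minutes_watched": 417.0}  # running mean over observed requests

Z_THRESHOLD = 3.0  # per-feature z-score threshold (config.yaml:45)

def drift_z_scores(baseline: dict, running_means: dict) -> dict:
    """Absolute z-score of each running feature mean vs. the training baseline."""
    return {
        name: abs(running_means[name] - stats["mean"]) / stats["std"]
        for name, stats in baseline.items()
    }

scores = drift_z_scores(baseline, running_means)
drifted = [name for name, z in scores.items() if z > Z_THRESHOLD]
print(scores, drifted)
```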

Retraining Triggers

The system recommends retraining when:
  • ≥2 features drift beyond z-score threshold of 3.0 (config.yaml:45-46)
  • Prediction rate shifts >10% from training rate (config.yaml:47)
Check retraining status:
curl http://localhost:8000/monitoring/retraining_trigger
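The trigger logic amounts to a small decision function (a sketch; the function and parameter names are hypothetical, with defaults taken from the documented thresholds):

```python
def should_retrain(drifted_features: list, predicted_rate: float,
                   training_rate: float, min_drifted: int = 2,
                   max_rate_shift: float = 0.10) -> tuple:
    """Apply the two documented retraining triggers in order."""
    if len(drifted_features) >= min_drifted:
        return True, "feature_drift"
    if abs(predicted_rate - training_rate) > max_rate_shift:
        return True, "prediction_rate_shift"
    return False, "below_threshold"

# Matches the example drift response: no threshold-level drift, small rate shift
print(should_retrain([], 0.15, 0.12))
print(should_retrain(["minutes_watched", "courses_started"], 0.12, 0.12))
```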

Troubleshooting

Error: RuntimeError: Model artifacts are missing
Solution: Run training first:
python -m src.train
Verify artifacts were created:
ls -l artifacts/

Error: OSError: [Errno 98] Address already in use
Solution: Use a different port:
uvicorn src.api:app --port 8001

Error: ModuleNotFoundError: No module named 'src'
Solution: Run from the repository root directory:
cd /path/to/repository
python -m src.train

Issue: Model performance is poor
Investigation steps:
  1. Check metrics.json for CV scores
  2. Verify data quality in ml_datasource.csv
  3. Adjust model hyperparameters in config.yaml
  4. Review feature engineering in src/features.py

Next Steps

System Architecture

Learn how components connect across the ML lifecycle

Configuration Guide

Customize model training, features, and monitoring

Deployment Guide

Export to ONNX and deploy optimized models

Benchmarking

Measure latency, throughput, and accuracy trade-offs
