This guide walks you through installation, training your first model, starting the API server, and making predictions.

Prerequisites

Before you begin, ensure you have:
  • Python 3.8 or higher
  • pip package manager
  • Git (optional, for cloning the repository)

Installation

1. Clone the repository

git clone <repository-url>
cd <repository-name>

2. Install dependencies

Upgrade pip and install all required packages:
python -m pip install --upgrade pip
pip install -r requirements.txt
The project uses these core libraries:
  • FastAPI 0.115.0 - API server framework
  • scikit-learn 1.5.2 - Machine learning models
  • pandas 2.2.2 - Data processing
  • ONNX Runtime 1.19.2 - Optimized inference
  • uvicorn 0.30.6 - ASGI server

3. Verify installation

Check that the configuration file is present:
cat config.yaml
You should see configuration for models, data paths, and artifacts.

Training Your First Model

The training pipeline trains multiple classifiers and selects the best performer based on cross-validation.

Run Training

python -m src.train

Expected Output

The training script will:
  1. Load data from ml_datasource.csv (src/data.py:28)
  2. Train 5 different models: Logistic Regression, KNN, SVM, Decision Tree, and Random Forest (src/train.py:58-97)
  3. Perform 5-fold stratified cross-validation (src/train.py:130-134)
  4. Select the best model based on ROC-AUC score (src/train.py:156)
  5. Calibrate decision threshold to target 90% precision (config.yaml:34)
The training process uses a configurable seed (default: 42) for reproducibility. All random operations are deterministic.
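Conceptually, the selection loop in steps 2–4 looks like the following. This is a minimal sketch on synthetic data with two of the five candidates; the real pipeline in src/train.py adds preprocessing and the remaining models:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

SEED = 42  # the configurable seed keeps every random operation deterministic

# Synthetic stand-in for ml_datasource.csv
X, y = make_classification(n_samples=500, random_state=SEED)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=SEED),
    "Random Forest": RandomForestClassifier(random_state=SEED),
}

# 5-fold stratified cross-validation, scored on ROC-AUC
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    for name, model in candidates.items()
}

# The best performer by mean cross-validated ROC-AUC wins
best_model_name = max(scores, key=scores.get)
print(best_model_name)
```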
Example output:
{
  "run_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "best_model_name": "Random Forest",
  "calibration": {
    "type": "threshold",
    "target_precision": 0.9,
    "threshold": 0.6524
  },
  "accuracy": 0.9234,
  "roc_auc": 0.9567,
  "precision": 0.9012,
  "recall": 0.8345,
  "f1": 0.8665,
  "cv_ranking": [
    {
      "model": "Random Forest",
      "cv_roc_auc_mean": 0.9523,
      "cv_precision_mean": 0.8934,
      "cv_recall_mean": 0.8456,
      "cv_f1_mean": 0.8687
    }
  ]
}
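Step 5's threshold calibration can be sketched with scikit-learn's precision_recall_curve. The labels and probabilities below are made up for illustration; the idea is to pick the lowest threshold whose precision meets the 0.9 target, which preserves as much recall as possible:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Made-up validation labels and predicted probabilities (illustration only)
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 0, 1, 1])
y_prob = np.array([0.1, 0.2, 0.35, 0.4, 0.45, 0.6, 0.7, 0.75, 0.8, 0.9])

TARGET_PRECISION = 0.9  # config.yaml:34

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)

# precision has one more entry than thresholds; precision[:-1] aligns with them.
# The lowest qualifying threshold keeps recall as high as possible.
eligible = [t for p, t in zip(precision[:-1], thresholds) if p >= TARGET_PRECISION]
threshold = float(min(eligible)) if eligible else 0.5
print(round(threshold, 4))
```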

Generated Artifacts

Training creates several files in the artifacts/ directory (src/train.py:172-224):
File                 Description                                 Reference
best_model.joblib    Trained sklearn pipeline with preprocessor  config.yaml:37
threshold.txt        Calibrated decision threshold               config.yaml:38
metrics.json         Performance metrics and CV results          config.yaml:39
lineage.json         SHA-256 hashes for reproducibility          config.yaml:41
drift_baseline.json  Training distribution statistics            config.yaml:40
You must complete training before starting the API server. The API requires best_model.joblib and threshold.txt to load.
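The hashes in lineage.json can be reproduced with the standard library. A sketch (the chunked-read helper and demo file here are illustrative, not the project's actual code):

```python
import hashlib
import json
import tempfile
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large artifacts never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demonstrate on a throwaway file standing in for the training data
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "ml_datasource.csv"
    data.write_text("days_on_platform,minutes_watched\n12,366.7\n")
    lineage = {"ml_datasource.csv": sha256_of(data)}

print(json.dumps(lineage))
```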

Starting the API Server

The FastAPI server provides endpoints for predictions, batch inference, and drift monitoring.

Launch the Server

uvicorn src.api:app --host 0.0.0.0 --port 8000 --reload

Verify Server Health

Check that the model loaded successfully:
curl http://localhost:8000/health
Expected response:
{
  "ready": true,
  "predictor_loaded": true,
  "drift_baseline_loaded": true
}
If ready is false, the model artifacts are missing. Re-run training.

Making Your First Prediction

The API accepts student engagement features and returns purchase probability.

Single Prediction

Send a POST request to /predict with student features:
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{
    "student_country": "US",
    "days_on_platform": 12,
    "minutes_watched": 366.7,
    "courses_started": 5,
    "practice_exams_started": 1,
    "practice_exams_passed": 0,
    "minutes_spent_on_exams": 3.27
  }'
Response:
{
  "predicted_purchase_probability": 0.8234,
  "predicted_purchase": 1
}
  • predicted_purchase_probability: Raw model probability (0.0 to 1.0)
  • predicted_purchase: Binary prediction (0 or 1) using calibrated threshold
The prediction is 1 if probability ≥ threshold (src/api.py:252).
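If you prefer Python to curl, the same payload can be sent with the standard library. A sketch: the predict helper performs the network call, so invoke it only once the server from the previous section is running (here we only build and print the request body):

```python
import json
from urllib import request

API_URL = "http://localhost:8000/predict"

features = {
    "student_country": "US",
    "days_on_platform": 12,
    "minutes_watched": 366.7,
    "courses_started": 5,
    "practice_exams_started": 1,
    "practice_exams_passed": 0,
    "minutes_spent_on_exams": 3.27,
}

def predict(payload: dict) -> dict:
    """POST the feature payload and decode the JSON response."""
    req = request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

# Build (and inspect) the request body without touching the network
body = json.dumps(features)
print(body)
```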

Batch Predictions

Process multiple records in a single request:
curl -X POST "http://localhost:8000/batch_predict" \
  -H "Content-Type: application/json" \
  -d '{
    "records": [
      {
        "student_country": "US",
        "days_on_platform": 12,
        "minutes_watched": 366.7,
        "courses_started": 5,
        "practice_exams_started": 1,
        "practice_exams_passed": 0,
        "minutes_spent_on_exams": 3.27
      },
      {
        "student_country": "IN",
        "days_on_platform": 259,
        "minutes_watched": 118.0,
        "courses_started": 2,
        "practice_exams_started": 2,
        "practice_exams_passed": 1,
        "minutes_spent_on_exams": 16.48
      }
    ]
  }'
Response:
{
  "predictions": [
    {
      "predicted_purchase_probability": 0.8234,
      "predicted_purchase": 1
    },
    {
      "predicted_purchase_probability": 0.2156,
      "predicted_purchase": 0
    }
  ]
}

Input Validation

The API enforces schema validation using Pydantic (src/api.py:27-35):

Required Fields

  • student_country (2-64 chars)
  • days_on_platform (≥0)
  • minutes_watched (≥0)
  • courses_started (≥0)
  • practice_exams_started (≥0)
  • practice_exams_passed (≥0)
  • minutes_spent_on_exams (≥0)

Business Rules

  • practice_exams_passed cannot exceed practice_exams_started
  • Validation occurs at src/api.py:243-248
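Stripped of the framework, the cross-field rule amounts to a few lines of plain Python (a sketch equivalent to the Pydantic check; field names match the API schema):

```python
def validate_exam_counts(record: dict) -> dict:
    """Enforce the cross-field rule before a record reaches the model."""
    if record["practice_exams_passed"] > record["practice_exams_started"]:
        raise ValueError(
            "practice_exams_passed cannot exceed practice_exams_started."
        )
    return record

ok = validate_exam_counts({"practice_exams_started": 2, "practice_exams_passed": 1})

try:
    validate_exam_counts({"practice_exams_started": 1, "practice_exams_passed": 2})
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```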

Handling Validation Errors

Invalid requests return HTTP 422:
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{
    "student_country": "US",
    "days_on_platform": -5,
    "minutes_watched": 100.0,
    "courses_started": 1,
    "practice_exams_started": 1,
    "practice_exams_passed": 2,
    "minutes_spent_on_exams": 10.0
  }'
Error Response:
{
  "detail": "practice_exams_passed cannot exceed practice_exams_started."
}

Monitoring and Drift Detection

The API tracks prediction statistics and feature distributions in real-time.

Check Drift Status

curl http://localhost:8000/monitoring/drift
Response:
{
  "samples_observed": 127,
  "drift_score_max_abs_z": 2.34,
  "drifted_features": ["minutes_watched"],
  "predicted_positive_rate": 0.15,
  "training_positive_rate": 0.12,
  "should_retrain": false,
  "reason": "below_threshold",
  "recommended_action": "continue_monitoring"
}
Drift detection requires at least 50 samples (config.yaml:44). The system compares running feature means to training baselines using z-scores (src/api.py:91-172).
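The z-score comparison can be sketched as follows. The baseline numbers are invented for illustration (they happen to reproduce the 2.34 score from the example response):

```python
# Hypothetical training baseline stats (drift_baseline.json holds the real ones)
baseline = {"minutes_watched": {"mean": 300.0, "std": 50.0}}
running_means = {"minutes_watched": 417.0}  # running mean over observed requests

Z_THRESHOLD = 3.0  # per-feature z-score threshold (config.yaml:45)

def drift_z_scores(baseline: dict, running_means: dict) -> dict:
    """Absolute z-score of each running feature mean vs. the training baseline."""
    return {
        name: abs(running_means[name] - stats["mean"]) / stats["std"]
        for name, stats in baseline.items()
    }

scores = drift_z_scores(baseline, running_means)
drifted = [name for name, z in scores.items() if z > Z_THRESHOLD]
print(scores, drifted)
```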

Retraining Triggers

The system recommends retraining when:
  • ≥2 features drift beyond z-score threshold of 3.0 (config.yaml:45-46)
  • Prediction rate shifts >10% from training rate (config.yaml:47)
Check retraining status:
curl http://localhost:8000/monitoring/retraining_trigger
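The trigger logic amounts to a small decision function (a sketch; the function and parameter names are hypothetical, with defaults taken from the documented thresholds):

```python
def should_retrain(drifted_features: list, predicted_rate: float,
                   training_rate: float, min_drifted: int = 2,
                   max_rate_shift: float = 0.10) -> tuple:
    """Apply the two documented retraining triggers in order."""
    if len(drifted_features) >= min_drifted:
        return True, "feature_drift"
    if abs(predicted_rate - training_rate) > max_rate_shift:
        return True, "prediction_rate_shift"
    return False, "below_threshold"

# Matches the example drift response: no threshold-level drift, small rate shift
print(should_retrain([], 0.15, 0.12))
print(should_retrain(["minutes_watched", "courses_started"], 0.12, 0.12))
```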

Troubleshooting

Error: RuntimeError: Model artifacts are missing
Solution: Run training first:
python -m src.train
Verify artifacts were created:
ls -l artifacts/

Error: OSError: [Errno 98] Address already in use
Solution: Use a different port:
uvicorn src.api:app --port 8001

Error: ModuleNotFoundError: No module named 'src'
Solution: Run from the repository root directory:
cd /path/to/repository
python -m src.train

Issue: Model performance is poor
Investigation steps:
  1. Check metrics.json for CV scores
  2. Verify data quality in ml_datasource.csv
  3. Adjust model hyperparameters in config.yaml
  4. Review feature engineering in src/features.py

Next Steps

System Architecture

Learn how components connect across the ML lifecycle

Configuration Guide

Customize model training, features, and monitoring

Deployment Guide

Export to ONNX and deploy optimized models

Benchmarking

Measure latency, throughput, and accuracy trade-offs
