Overview
The Hospital Data Analysis Platform implements a five-stage pipeline design that processes data through distinct phases. Each stage has clear inputs, outputs, and failure modes, enabling better debugging and maintainability.
Stage Flow
Stage 1: Ingestion
Purpose
Load raw hospital data from multiple CSV sources and create a unified dataset manifest.
Location
task/ingestion/
Key Functions
Inputs
- general.csv - General hospital department data
- prenatal.csv - Prenatal department data
- sports.csv - Sports medicine department data
Outputs
- Merged DataFrame with unified schema
- Dataset manifest JSON (for versioning)
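A minimal sketch of the merge-and-manifest step, using only the standard library (the file names come from the Inputs list above; the manifest fields and the `source` tagging column are assumptions, not the project's actual schema):

```python
import csv
import json
from pathlib import Path

SOURCES = ["general.csv", "prenatal.csv", "sports.csv"]

def ingest(data_dir: str) -> tuple[list[dict], dict]:
    """Merge the three department CSVs and build a versioning manifest."""
    rows = []
    for name in SOURCES:
        path = Path(data_dir) / name
        with path.open(newline="") as f:
            for row in csv.DictReader(f):
                row["source"] = name.removesuffix(".csv")  # tag origin department
                rows.append(row)
    manifest = {
        "sources": SOURCES,  # assumed manifest fields
        "row_count": len(rows),
        "columns": sorted({k for r in rows for k in r}),
    }
    return rows, manifest

# Example: rows, manifest = ingest("task/data")
#          Path("manifest.json").write_text(json.dumps(manifest, indent=2))
```

Tagging each row with its source department keeps the unified schema debuggable: schema drift between hospitals (a failure mode listed below) can be traced back to the offending file.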
Failure Modes
- Missing or malformed CSV files
- Schema drift between hospital sources
- Missing expected columns
CLI Command
Stage 2: Preprocessing
Purpose
Clean raw data, normalize categorical values, and handle missing data according to domain-specific rules.
Location
task/preprocessing/cleaning.py
Key Operations
Data Quality Rules
| Column Type | Missing Value Strategy |
|---|---|
| Numeric (bmi, children, months) | Fill with 0 |
| Test results (blood_test, ecg, etc.) | Fill with "unknown" |
| Diagnosis | Fill with "unknown" |
| Gender | Fill with "f" (default) |
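The table's fill rules can be sketched in plain Python over row dicts (the exact column set behind "etc." is not listed in the source, so `TEST_COLS` here is an assumption):

```python
NUMERIC_COLS = {"bmi", "children", "months"}
TEST_COLS = {"blood_test", "ecg"}  # "etc." in the table; full set assumed

def clean_row(row: dict) -> dict:
    """Apply the missing-value strategy from the data-quality table."""
    out = dict(row)
    for col, val in out.items():
        if val not in (None, ""):
            continue  # value present; leave it alone
        if col in NUMERIC_COLS:
            out[col] = 0
        elif col in TEST_COLS or col == "diagnosis":
            out[col] = "unknown"
        elif col == "gender":
            out[col] = "f"  # documented default
    return out
```

For example, `clean_row({"bmi": None, "gender": ""})` fills `bmi` with `0` and `gender` with `"f"`, while leaving any populated column untouched.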
Inputs
- Raw merged DataFrame from ingestion
Outputs
- Cleaned DataFrame with consistent types and no missing values
Failure Modes
- Unexpected categorical values
- Extreme outliers in numeric columns
- Column type mismatches
Stage 3: Feature Engineering
Purpose
Construct derived features from raw columns to improve predictive power and enable risk stratification.
Location
task/feature_engineering/features.py
Feature Derivations
- Age Features
- BMI Risk
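A sketch of the two derivations named above. The age bin edges are illustrative assumptions, and the BMI cutoffs follow the standard WHO-style bands rather than anything stated in the source:

```python
def age_group(age: float) -> str:
    """Bin age into coarse groups (bin edges are illustrative)."""
    if age < 18:
        return "child"
    if age < 40:
        return "adult"
    if age < 65:
        return "middle_aged"
    return "senior"

def bmi_risk(bmi: float) -> str:
    """Map BMI to a risk band using WHO-style cutoffs (assumed here)."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "normal"
    if bmi < 30:
        return "overweight"
    return "obese"
```

Extreme or missing values should be validated before binning, since binning errors from extreme values are a listed failure mode of this stage.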
Inputs
- Cleaned DataFrame from preprocessing
Outputs
- Enhanced DataFrame with derived features
Failure Modes
- Binning errors from extreme values
- Type conversion failures
Stage 4: Modeling
Purpose
Train predictive models for risk and outcome prediction, evaluate performance, and detect anomalies.
Location
task/modeling/ and task/anomaly_detection/
Model Architecture
The platform uses a custom SimpleLogisticModel optimized for CPU execution:
modeling/predictive.py
Training Pipeline
- Feature preparation: Normalize numeric features, one-hot encode categoricals
- Target construction:
- Risk target: Binary indicator for high-risk diagnoses (appendicitis, pregnancy)
- Outcome target: Positive blood test results
- Train/test split: 75/25 with random shuffling (seed=42)
- Model training: Separate models for risk and outcome prediction
- Evaluation: Accuracy, F1 score, and AUC-ROC for both models
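The source does not show SimpleLogisticModel itself; the sketch below is a plain-Python stand-in for steps 3 and 4 of the pipeline above, a CPU-only logistic regression trained by batch gradient descent plus the 75/25 seed-42 shuffled split. The class name matches, but the API is assumed:

```python
import math
import random

class SimpleLogisticModel:
    """Plain-Python logistic regression via batch gradient descent.

    A CPU-friendly sketch; the real SimpleLogisticModel API is not
    shown in the source.
    """

    def __init__(self, n_features: int, lr: float = 0.1, epochs: int = 200):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr
        self.epochs = epochs

    def _sigmoid(self, z: float) -> float:
        # Clamp to avoid overflow in exp for extreme logits
        return 1.0 / (1.0 + math.exp(-max(-60.0, min(60.0, z))))

    def predict_proba(self, x: list[float]) -> float:
        return self._sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)) + self.b)

    def fit(self, X: list[list[float]], y: list[int]) -> None:
        n = len(X)
        for _ in range(self.epochs):
            grad_w = [0.0] * len(self.w)
            grad_b = 0.0
            for x, target in zip(X, y):
                err = self.predict_proba(x) - target
                for j, xj in enumerate(x):
                    grad_w[j] += err * xj
                grad_b += err
            self.w = [wj - self.lr * gj / n for wj, gj in zip(self.w, grad_w)]
            self.b -= self.lr * grad_b / n

def train_test_split(X, y, test_size=0.25, seed=42):
    """75/25 split with seeded shuffling, as described above."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_size))
    train, test = idx[:cut], idx[cut:]
    return ([X[i] for i in train], [X[i] for i in test],
            [y[i] for i in train], [y[i] for i in test])
```

Under the design described above, two instances would be trained: one on the risk target (high-risk diagnoses) and one on the outcome target (positive blood tests).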
Anomaly Detection
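The anomaly-detection method is not specified here; one common CPU-cheap approach consistent with the early-warning alerts mentioned below is a z-score threshold, sketched with the standard library (the threshold of 3 is an assumption):

```python
import statistics

def zscore_anomalies(values: list[float], threshold: float = 3.0) -> list[int]:
    """Return indices of values whose |z-score| exceeds the threshold."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # constant series: nothing stands out
    return [i for i, v in enumerate(values)
            if abs((v - mean) / stdev) > threshold]
```

A fixed threshold trades sensitivity for false positives, which is why high false-positive rates appear in the failure modes for this stage.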
Inputs
- Feature matrix from feature engineering
- Configuration parameters (seed, test_size, feature_columns)
Outputs
ModelArtifacts containing:
- Trained risk and outcome models
- Test set splits (X_test, y_risk_test, y_outcome_test)
- Performance metrics (accuracy, F1, AUC)
- Anomaly scores and early warning alerts
Failure Modes
- Convergence issues with extreme learning rates
- Class imbalance leading to poor F1 scores
- High false-positive rates in anomaly detection
Stage 5: Deployment
Purpose
Export models for production inference, benchmark performance, and establish monitoring infrastructure.
Location
task/deployment/ and task/evaluation/
Key Components
CPU Inference
deployment/cpu_inference.py
ONNX Export
Monitoring Summary
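The monitoring summary's exact schema is not shown; a minimal sketch of writing one as JSON, with hypothetical field names tied to the Stage 4 outputs:

```python
import json
import time
from pathlib import Path

def write_monitoring_summary(path: str, metrics: dict, alerts: list[str]) -> None:
    """Persist a monitoring snapshot as JSON (field names are assumed)."""
    summary = {
        "generated_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "metrics": metrics,  # e.g. {"accuracy": ..., "f1": ..., "auc": ...}
        "alerts": alerts,    # early-warning alerts from Stage 4
    }
    Path(path).write_text(json.dumps(summary, indent=2))
```

Keeping the summary as plain JSON makes it easy to validate at the stage boundary, per the incremental-validation benefit described below.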
Benchmarking
The platform runs repeated benchmarks to account for system noise.
Inputs
- Trained model artifacts
- Test dataset
- Benchmark configuration
Outputs
- ONNX model file
- Inference latency statistics
- Monitoring summary JSON
- Benchmark results with confidence intervals
- Hardware profile tables
Failure Modes
- ONNX export failures for unsupported operators
- Latency spikes from CPU contention
- Serialization format mismatches
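The repeated-benchmark idea from the Benchmarking section can be sketched as follows; the repeat count and the normal-approximation 95% confidence interval are illustrative choices, not the platform's documented configuration:

```python
import statistics
import time

def benchmark(fn, repeats: int = 30) -> dict:
    """Time fn repeatedly and summarize latency with a ~95% CI."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)  # ms
    mean = statistics.fmean(samples)
    stdev = statistics.stdev(samples)
    half_width = 1.96 * stdev / (len(samples) ** 0.5)  # normal approximation
    return {"mean_ms": mean, "stdev_ms": stdev,
            "ci95_ms": (mean - half_width, mean + half_width)}
```

Reporting an interval rather than a single number is what absorbs the CPU-contention latency spikes listed in the failure modes above.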
Pipeline Execution
Full Pipeline
Run all stages sequentially.
Early Warning Experiment
Run hardware-constrained early warning experiments.
Stage Isolation Benefits
Debuggability
Each stage can be tested independently with known inputs
Restartability
Failed stages can be rerun without repeating earlier work
Failure Isolation
Data quality issues don’t cascade into model quality issues
Incremental Validation
Outputs can be validated at each stage boundary
Next Steps
Hardware Constraints
Learn how the pipeline adapts to memory and compute limitations