Overview
The Hospital Data Analysis Platform provides a command-line interface for running the complete analysis pipeline, generating dataset manifests, and executing early warning experiments.

Commands
run
Executes the complete hospital data analysis pipeline, including data ingestion, preprocessing, feature engineering, model training, anomaly detection, and deployment monitoring. Pipeline results include:
- Reproducibility context
- Predictive model metrics
- Anomaly detection alerts
- Detection latency statistics
- Streaming performance metrics
- Hardware utilization
- CPU inference statistics
- ONNX export status
- Benchmark results
- Latency-accuracy tradeoff
- Energy consumption metrics
- Hardware profile
- Risk modeling summary
- Deployment monitoring
- Early warning experiment results
- Dataset manifest information
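The result sections above are ultimately saved to an experiment log. A minimal sketch of how such a log could be assembled, assuming a JSON log file and hypothetical helper names (build_reproducibility_context, write_experiment_log) not taken from the source:

```python
import json
import platform
import sys
from datetime import datetime, timezone

def build_reproducibility_context(seed: int) -> dict:
    """Capture the context needed to reproduce a pipeline run."""
    return {
        "random_seed": seed,
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

def write_experiment_log(path: str, sections: dict) -> None:
    """Persist all result sections to a single JSON experiment log."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(sections, fh, indent=2)
```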
manifest
Generates a versioned manifest of all CSV files in the data directory with SHA-256 checksums and file sizes.

early-warning-experiment
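The core of manifest generation (SHA-256 checksums plus file sizes for each CSV) can be sketched as follows; the function name and manifest layout are illustrative assumptions, not the platform's actual schema:

```python
import hashlib
from pathlib import Path

def build_manifest(data_dir: str) -> dict:
    """Build a versioned manifest of CSV files with SHA-256 checksums and sizes."""
    entries = {}
    for csv_path in sorted(Path(data_dir).glob("*.csv")):
        # Hash the full file contents; fine for modestly sized CSVs.
        digest = hashlib.sha256(csv_path.read_bytes()).hexdigest()
        entries[csv_path.name] = {
            "sha256": digest,
            "size_bytes": csv_path.stat().st_size,
        }
    return {"version": 1, "files": entries}
```

Sorting the paths keeps the manifest deterministic across runs, which matters when the manifest itself is checked into version control.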
Runs a comprehensive early warning system experiment across multiple hardware constraint scenarios (memory limits, compute budgets, and streaming intervals).

Pipeline Workflow
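Testing across multiple constraint scenarios typically means enumerating every combination of the configured limits. A minimal sketch of that grid, assuming the experiment simply crosses the three configured lists (the function and key names are hypothetical):

```python
from itertools import product

def constraint_scenarios(memory_limits_mb, compute_budgets, stream_speeds_ms):
    """Enumerate every combination of hardware constraints to test."""
    return [
        {"memory_limit_mb": mem, "compute_budget": cpu, "stream_interval_ms": ms}
        for mem, cpu, ms in product(memory_limits_mb, compute_budgets, stream_speeds_ms)
    ]
```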
When running the run command, the following steps are executed:
- Data Ingestion: Load hospital data from CSV files (general, prenatal, sports)
- Data Merging: Align and merge datasets with consistent column schemas
- Data Cleaning: Handle missing values, standardize formats, convert data types
- Feature Engineering: Build age ranges, adult indicators, and BMI risk categories
- Model Training: Train predictive models for risk and outcome prediction
- Model Evaluation: Compute accuracy, precision, recall, and other metrics
- Anomaly Detection: Detect outliers and anomalies in patient data
- Early Warning Simulation: Generate early warning alerts based on anomaly scores
- Detection Latency Evaluation: Measure time to detect synthetic events
- Batch vs. Streaming Comparison: Analyze performance differences
- Hardware Profiling: Auto-adjust batch sizes and compute utilization
- CPU Inference: Measure inference latency on CPU
- ONNX Export: Export trained model to ONNX format for deployment
- Risk Stratification: Categorize patients into risk bands
- Streaming Inference: Score records in streaming mode
- Deployment Monitoring: Build monitoring summary with alerts
- Benchmarking: Run repeated benchmarks with confidence intervals
- Tradeoff Analysis: Compute latency-accuracy tradeoffs
- Energy Analysis: Compare energy consumption across precision levels
- Hardware Experiments: Test early warning under various constraints
- Manifest Creation: Generate dataset version manifest
- Logging: Save all results to experiment log
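The ordered steps above can be sketched as a simple pipeline runner that threads a shared context through each stage; the runner shape and the stand-in stage names here are illustrative assumptions, not the platform's actual implementation:

```python
def run_pipeline(steps, context=None):
    """Run ordered pipeline steps, passing a shared context dict through each.

    Each step receives the context and its return value is stored under the
    step's name, so later stages can consume earlier stages' outputs.
    """
    context = {} if context is None else context
    for name, step in steps:
        context[name] = step(context)
    return context

# Hypothetical stand-ins for the first two real stages:
steps = [
    ("ingest", lambda ctx: ["general.csv", "prenatal.csv", "sports.csv"]),
    ("merge", lambda ctx: f"merged {len(ctx['ingest'])} datasets"),
]
```

Keeping stages as (name, callable) pairs makes the workflow easy to reorder or truncate when debugging a single step.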
Configuration
All commands use settings from the config.py module, which provides:
- data_dir: Directory containing input CSV files
- output_dir: Directory for output artifacts
- random_seed: Seed for reproducibility
- feature_columns: List of feature column names
- target_risk: Target column for risk prediction
- target_outcome: Target column for outcome prediction
- stream_chunk_size: Chunk size for streaming processing
- benchmark_runs: Number of benchmark iterations
- confidence_level: Confidence level for statistics
- experiment_memory_limits_mb: Memory limits for experiments
- experiment_compute_budgets: Compute budgets for experiments
- experiment_stream_speeds_ms: Streaming interval speeds for experiments
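One common way to hold such settings is a dataclass; this sketch mirrors the field names above, but every default value is an invented placeholder, not the platform's actual configuration:

```python
from dataclasses import dataclass, field

@dataclass
class Config:
    """Illustrative settings container; all defaults are placeholders."""
    data_dir: str = "data"
    output_dir: str = "output"
    random_seed: int = 42
    feature_columns: list = field(default_factory=lambda: ["age", "bmi"])
    target_risk: str = "risk"
    target_outcome: str = "outcome"
    stream_chunk_size: int = 100
    benchmark_runs: int = 5
    confidence_level: float = 0.95
    experiment_memory_limits_mb: list = field(default_factory=lambda: [256, 512])
    experiment_compute_budgets: list = field(default_factory=lambda: [0.5, 1.0])
    experiment_stream_speeds_ms: list = field(default_factory=lambda: [10, 100])
```

Mutable fields use default_factory so each Config instance gets its own lists rather than sharing one.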