Overview
The Hospital Data Analysis Platform is a production-oriented analytics pipeline designed for CPU-constrained environments. The architecture prioritizes deterministic behavior, auditability, and incremental validation over one-off model results.
Design Principles
CPU-First Execution
The platform is intentionally optimized for CPU execution rather than GPU acceleration. This design choice:
- Prioritizes compatibility with common deployment targets
- Reduces hardware variance in production environments
- Enables deployment on resource-constrained edge devices
- Simplifies infrastructure requirements
Explicit Hardware Awareness
Hardware constraints are treated as first-class experiment parameters rather than afterthoughts. The system explicitly models:
- Memory limits (MB)
- Compute budgets (operation counts)
- Batch size constraints
- Stream processing intervals
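Treating constraints as first-class parameters means they can be captured in an explicit parameter object and passed through each experiment. The sketch below is illustrative only: the `HardwareProfile` name, its fields, and the default values are assumptions, not taken from the codebase.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HardwareProfile:
    """Hypothetical hardware-constraint parameters, treated as experiment inputs."""
    memory_limit_mb: int = 512             # hard ceiling on resident memory
    compute_budget_ops: int = 10_000_000   # rough per-run operation budget
    max_batch_size: int = 64               # upper bound on inference batch size
    stream_interval_ms: int = 250          # interval between stream-processing ticks

# An experiment can then be parameterized explicitly for a constrained target:
edge_profile = HardwareProfile(memory_limit_mb=256, max_batch_size=16)
```

Making the profile frozen keeps a run's hardware assumptions immutable and auditable alongside its results.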
Staged Pipeline Architecture
The pipeline uses explicit stage boundaries to isolate failure domains. This prevents data-quality errors from being conflated with model-quality regressions.
Core Components
Data Layer
Located in task/ingestion/ and task/preprocessing/:
- Ingestion: CSV loading and dataset manifest generation
- Preprocessing: Data cleaning, schema normalization, and consistency checks
- Versioning: Dataset versioning and change tracking
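As a minimal sketch of how manifest generation and change tracking might fit together, the helper below records a row count, column list, and content hash; the `build_manifest` function and its output fields are hypothetical, not the platform's actual manifest schema.

```python
import csv
import hashlib
import io

def build_manifest(csv_text: str, name: str) -> dict:
    """Sketch of dataset manifest generation: row count, columns,
    and a content hash usable as a version/change-tracking key."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    digest = hashlib.sha256(csv_text.encode("utf-8")).hexdigest()
    return {
        "dataset": name,
        "rows": len(rows),
        "columns": sorted(rows[0].keys()) if rows else [],
        "sha256": digest,  # any change to the file changes this key
    }

manifest = build_manifest("age,bmi\n54,27.1\n61,31.4\n", "patients_v1")
```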
Feature Engineering
Located in task/feature_engineering/:
- Derived feature construction (age ranges, BMI risk categories)
- Feature normalization and standardization
- Categorical encoding for hospital and demographic features
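For illustration, derived-feature construction along these lines might look like the sketch below. The band thresholds (WHO-style BMI cut-offs) and helper names are assumptions, not the platform's actual definitions.

```python
def bmi_risk_band(bmi: float) -> str:
    """Map a BMI value to a coarse risk category (illustrative thresholds)."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25.0:
        return "normal"
    if bmi < 30.0:
        return "overweight"
    return "obese"

def age_range(age: int) -> str:
    """Bucket an age into a decade range, e.g. 54 -> '50-59'."""
    lo = (age // 10) * 10
    return f"{lo}-{lo + 9}"
```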
See feature_engineering/features.py for the feature construction code.
Modeling Layer
Located in task/modeling/:
- Predictive Models: Custom logistic regression implementation optimized for CPU
- Risk Stratification: Multi-band risk classification (high/medium/low)
- Model Evaluation: Accuracy, F1, and AUC metrics
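A minimal, pure-Python sketch of CPU-only logistic regression paired with multi-band stratification follows. The training loop, the 0.3/0.7 band thresholds, and all function names are illustrative assumptions, not the platform's implementation.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.1, epochs=200):
    """Minimal stochastic-gradient-descent logistic regression (CPU-only)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi                      # gradient of log-loss w.r.t. logit
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def risk_band(p: float) -> str:
    """Hypothetical high/medium/low stratification thresholds."""
    return "high" if p >= 0.7 else "medium" if p >= 0.3 else "low"

# Toy 1-D data: label is 1 when the feature exceeds ~0.5.
X = [[0.1], [0.2], [0.8], [0.9]]
y = [0, 0, 1, 1]
w, b = train_logistic(X, y)
p_hi = sigmoid(w[0] * 0.9 + b)
p_lo = sigmoid(w[0] * 0.1 + b)
```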
Anomaly Detection
Located in task/anomaly_detection/:
- Outlier detection for early warning systems
- Detection latency evaluation
- Alert threshold calibration
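As a sketch of threshold-based outlier detection, the function below flags readings more than a configurable number of standard deviations from the mean; the z-score method and the thresholds shown are illustrative assumptions, not the platform's calibrated values.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Flag indices whose z-score exceeds `threshold` (illustrative default)."""
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sd > threshold]

# One anomalous vital-sign reading; a looser threshold is used for this tiny sample.
readings = [98.6, 98.4, 98.7, 98.5, 104.9]
flagged = zscore_outliers(readings, threshold=1.5)
```

Alert threshold calibration amounts to tuning `threshold` against a labeled history so that detection latency and false-alarm rate stay within budget.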
Real-Time Processing
Located in task/real_time/:
- Streaming inference with configurable chunk sizes
- Batch vs. streaming performance comparison
- Online scoring with latency tracking
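Chunked streaming inference with per-chunk latency tracking might be sketched as follows; `stream_score`, its signature, and the stand-in scoring function are hypothetical.

```python
from time import perf_counter
from typing import Callable, Iterable, Iterator, List, Tuple

def stream_score(records: Iterable[dict],
                 score: Callable[[List[dict]], List[float]],
                 chunk_size: int = 32) -> Iterator[Tuple[List[float], float]]:
    """Buffer records into fixed-size chunks, score each chunk,
    and yield (scores, latency_seconds) per chunk."""
    buf: List[dict] = []
    for rec in records:
        buf.append(rec)
        if len(buf) == chunk_size:
            t0 = perf_counter()
            yield score(buf), perf_counter() - t0
            buf = []
    if buf:  # flush the final partial chunk
        t0 = perf_counter()
        yield score(buf), perf_counter() - t0

dummy = lambda chunk: [0.5] * len(chunk)  # stand-in scoring function
results = list(stream_score(({"id": i} for i in range(10)), dummy, chunk_size=4))
```

Smaller chunks improve responsiveness but pay more per-chunk overhead, which is exactly the stream-chunking trade-off noted below.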
Deployment Layer
Located in task/deployment/:
- CPU Inference: Optimized prediction runtime
- ONNX Export: Model serialization for cross-platform deployment
- Monitoring: Alert tracking, latency profiling, and reliability metrics
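A sketch of what monitoring-side latency profiling and alert tracking could look like; the `LatencyMonitor` class and its 50 ms alert threshold are invented for illustration, not taken from the monitoring module.

```python
import statistics
from typing import List

class LatencyMonitor:
    """Hypothetical latency/alert tracker; the threshold is an assumed SLO."""
    def __init__(self, alert_threshold_ms: float = 100.0):
        self.alert_threshold_ms = alert_threshold_ms
        self.samples: List[float] = []
        self.alerts = 0

    def record(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)
        if latency_ms > self.alert_threshold_ms:
            self.alerts += 1

    def summary(self) -> dict:
        qs = statistics.quantiles(self.samples, n=100)
        return {"p50": statistics.median(self.samples),
                "p99": qs[98],
                "alerts": self.alerts}

mon = LatencyMonitor(alert_threshold_ms=50.0)
for ms in [12.0, 18.0, 22.0, 75.0]:
    mon.record(ms)
```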
Architectural Trade-offs
Stage Boundaries
Cost: Stage boundaries add I/O and serialization overhead.
Benefit: Improved debuggability, restartability, and isolation of failure domains.
Conservative Batch Sizes
Cost: Longer total runtime for large datasets.
Benefit: Stability under memory pressure and predictable resource usage.
Model Simplicity
Cost: Simpler models may underfit rare patterns.
Benefit: Reduced inference latency and lower computational requirements.
Stream Chunking
Cost: Increased overhead and potential jitter at very small chunk sizes.
Benefit: Improved online responsiveness and incremental result availability.
Repository Structure
Configuration Management
The system uses a centralized configuration dataclass in config.py:
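A plausible shape for such a dataclass is sketched below; the field names and default values are assumptions, not the actual contents of config.py.

```python
from dataclasses import dataclass

@dataclass
class PipelineConfig:
    """Hypothetical sketch of a centralized pipeline configuration."""
    artifacts_dir: str = "task/artifacts"
    batch_size: int = 32
    memory_limit_mb: int = 512
    stream_chunk_size: int = 64
    risk_bands: tuple = ("low", "medium", "high")

# Overrides stay explicit and auditable at the call site:
cfg = PipelineConfig(batch_size=16)
```

Centralizing these values in one dataclass keeps every stage reading from the same, versionable source of truth.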
Artifacts and Outputs
All pipeline outputs are written to task/artifacts/:
- Model checkpoints and serialized estimators
- Experiment logs and benchmark results
- Hardware profile tables
- Dataset manifests and versioning metadata
- Monitoring summaries and alert logs
Next Steps
Pipeline Stages
Detailed breakdown of each processing stage
Hardware Constraints
Learn about hardware-aware optimization