Get started with the Hospital Data Analysis Platform by running your first analytics pipeline. This guide walks you through the three main CLI commands and explains their output.

Prerequisites

Before you begin, ensure you have:
  • Python 3.10 or higher installed
  • Completed the installation steps
  • Hospital data CSV files in the test directory

Running Your First Pipeline

Step 1: Generate Dataset Manifest

Create a manifest of your hospital data files to validate schema and track data versions.
```shell
cd "Data Analysis for Hospitals/task"
python cli.py manifest
```

Example output:

```json
{
  "files": ["general.csv", "prenatal.csv", "sports.csv"],
  "total_records": 1500,
  "schema_version": "1.0",
  "checksums": {
    "general.csv": "a3b2c1d4...",
    "prenatal.csv": "e5f6a7b8...",
    "sports.csv": "c9d0e1f2..."
  }
}
```
This command validates your data files and generates version tracking information.
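The manifest logic can be approximated as follows. This is a minimal sketch, not the platform's actual implementation: the choice of SHA-256 as the checksum algorithm and the `build_manifest` name are assumptions; only the output shape follows the sample above.

```python
import csv
import hashlib
from pathlib import Path

def build_manifest(data_dir: str, schema_version: str = "1.0") -> dict:
    """Sketch of a manifest builder: hash each CSV and count its data rows."""
    files = sorted(p.name for p in Path(data_dir).glob("*.csv"))
    checksums, total_records = {}, 0
    for name in files:
        path = Path(data_dir) / name
        # Checksum over raw bytes so any edit to the file changes the manifest.
        checksums[name] = hashlib.sha256(path.read_bytes()).hexdigest()
        with path.open(newline="") as f:
            # Subtract one for the header row.
            total_records += max(sum(1 for _ in csv.reader(f)) - 1, 0)
    return {
        "files": files,
        "total_records": total_records,
        "schema_version": schema_version,
        "checksums": checksums,
    }
```

Serializing the returned dict with `json.dump` would yield a file with the same shape as `dataset_manifest.json` above.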
Step 2: Execute Full Analytics Pipeline

Run the complete pipeline including ingestion, preprocessing, feature engineering, modeling, and deployment monitoring.
```shell
python cli.py run
```
The pipeline executes these stages:
  1. Data Ingestion - Loads and merges hospital CSV files
  2. Preprocessing - Cleans and normalizes data
  3. Feature Engineering - Creates derived features (age_range, is_adult, bmi_risk)
  4. Model Training - Trains risk and outcome prediction models
  5. Anomaly Detection - Identifies outliers and generates early warnings
  6. Streaming Inference - Compares batch vs streaming performance
  7. Hardware Profiling - Adjusts batch sizes and tracks resource utilization
  8. CPU Inference - Measures inference latency and throughput
  9. ONNX Export - Serializes models for cross-platform deployment
  10. Monitoring - Generates deployment metrics and alert summaries
  11. Benchmarking - Runs repeated experiments with confidence intervals
  12. Hardware Experiments - Tests performance under different constraints
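One way to read the stage list above is as an ordered sequence of callables that each enrich a shared results dictionary. The control-flow sketch below is illustrative only; the stage functions are hypothetical stand-ins, not the pipeline's real internals.

```python
def run_pipeline(stages, context=None):
    """Run each (name, fn) stage in order, merging its metrics into context."""
    context = dict(context or {})
    for name, fn in stages:
        context[name] = fn(context)  # each stage sees every stage before it
    return context

# Hypothetical stages standing in for ingestion, preprocessing, and so on.
stages = [
    ("ingestion", lambda ctx: {"rows": 1500}),
    ("preprocessing", lambda ctx: {"rows_clean": ctx["ingestion"]["rows"] - 12}),
]
result = run_pipeline(stages)
```

Later stages (modeling, monitoring, benchmarking) would read their inputs out of the same accumulated context.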
Example output:

```json
{
  "reproducibility": {
    "random_seed": 42,
    "python_version": "3.10.12",
    "numpy_version": "1.26.4"
  },
  "predictive_metrics": {
    "risk_accuracy": 0.847,
    "risk_f1": 0.723,
    "risk_auc": 0.891,
    "outcome_accuracy": 0.812,
    "outcome_f1": 0.689,
    "outcome_auc": 0.856
  },
  "anomaly_alerts": {
    "total_alerts": 45,
    "alert_rate": 0.03
  },
  "detection_latency_s": 2.4,
  "streaming": {
    "batch_time_s": 0.124,
    "stream_time_s": 0.156,
    "stream_latency_ms_per_row": 0.104,
    "stream_throughput_rows_per_s": 9615.38
  },
  "hardware": {
    "adjusted_batch_size": 64,
    "compute_utilization": 0.73
  },
  "cpu_inference": {
    "inference_latency_ms": 12.5,
    "mean_probability": 0.342,
    "std_probability": 0.187
  },
  "onnx_exported": true,
  "deployment_monitoring": {
    "alert_count": 12,
    "alert_rate": 0.032,
    "high_risk_ratio": 0.15
  }
}
```
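Once `python cli.py run` finishes, metrics like these can be checked programmatically, for example to gate a CI job. A sketch, assuming the log has the shape shown above; the function name and the specific thresholds are illustrative choices, not part of the platform:

```python
import json

# Mirrors the relevant slice of the sample output above.
SAMPLE = """{
  "predictive_metrics": {"risk_auc": 0.891, "outcome_auc": 0.856},
  "anomaly_alerts": {"alert_rate": 0.03}
}"""

def passes_quality_gate(log: dict, min_auc: float = 0.85) -> bool:
    """True when both models clear the AUC bar and anomaly alerts stay rare."""
    metrics = log["predictive_metrics"]
    return (
        metrics["risk_auc"] >= min_auc
        and metrics["outcome_auc"] >= min_auc
        and log["anomaly_alerts"]["alert_rate"] < 0.05
    )

log = json.loads(SAMPLE)
# In practice, load the real artifact instead of SAMPLE, e.g. from
# artifacts/experiment_log.json after a pipeline run.
```

With the sample values, the gate passes at the default threshold but fails if `min_auc` is raised to 0.9.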
Step 3: Run Hardware-Constrained Experiments

Evaluate early warning system performance under different memory, compute, and latency constraints.
```shell
python cli.py early-warning-experiment
```
This command runs 27 experiment scenarios (3 memory limits × 3 compute budgets × 3 stream speeds) to evaluate:
  • Detection latency under resource constraints
  • Prediction accuracy vs hardware limits
  • False positive rates
  • Detection quality scores
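The 3 × 3 × 3 grid of scenarios can be generated with `itertools.product`. The specific constraint values below are illustrative, not the platform's actual defaults (those live in `config.py`):

```python
from itertools import product

# Illustrative constraint levels; three values per axis gives 27 scenarios.
memory_limits_mb = [128, 256, 512]
compute_budgets = [1_000, 10_000, 100_000]
stream_intervals_ms = [10, 50, 100]

scenarios = [
    {"memory_mb": m, "compute_budget": c, "stream_interval_ms": s}
    for m, c, s in product(memory_limits_mb, compute_budgets, stream_intervals_ms)
]
```

Each scenario dict then parameterizes one run of the early warning system.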
Example output:

```json
{
  "summary": {
    "total_scenarios": 27,
    "mean_detection_latency_s": 3.2,
    "mean_accuracy": 0.834,
    "mean_false_positive_rate": 0.042,
    "best_scenario": {
      "memory_mb": 256,
      "compute_budget": 10000,
      "stream_interval_ms": 10
    }
  },
  "benchmark": {
    "detection_latency_s": {
      "mean": 3.2,
      "std": 0.8,
      "ci_lower": 2.8,
      "ci_upper": 3.6
    },
    "prediction_accuracy": {
      "mean": 0.834,
      "std": 0.023,
      "ci_lower": 0.820,
      "ci_upper": 0.848
    }
  },
  "artifacts": {
    "results_csv": "artifacts/early_warning_experiment_results.csv",
    "plots": "artifacts/early_warning_plots.png"
  }
}
```
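The `ci_lower`/`ci_upper` fields in a benchmark block like the one above are consistent with a confidence interval computed over repeated runs. A sketch of how such a summary could be produced; the normal-approximation z-value of 1.96 (a 95% interval) is an assumption about the implementation, and the sample latencies are made up:

```python
from math import sqrt
from statistics import mean, stdev

def summarize(samples, z=1.96):
    """Mean, sample std, and normal-approximation confidence interval."""
    m, s = mean(samples), stdev(samples)
    half = z * s / sqrt(len(samples))  # half-width of the interval
    return {"mean": m, "std": s, "ci_lower": m - half, "ci_upper": m + half}

# e.g. detection latency measured once per repeated experiment run
latencies_s = [3.1, 3.4, 2.9, 3.5, 3.1]
summary = summarize(latencies_s)
```

A larger number of repeats shrinks the interval at a rate of one over the square root of the sample count.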

Understanding Artifacts

After running the pipeline, artifacts are written to `Data Analysis for Hospitals/task/artifacts/`:
| File | Description |
|------|-------------|
| experiment_log.json | Complete pipeline execution log with all metrics |
| dataset_manifest.json | Dataset version manifest with checksums |
| risk_model.onnx | Exported ONNX model for risk prediction |
| hardware_profile.csv | Hardware profiling results with operator-level metrics |
| early_warning_experiment_results.csv | Detailed results from constraint experiments |
| early_warning_plots.png | Visualization of performance across scenarios |

Common Workflows

Iterative Development

```shell
# 1. Validate data schema
python cli.py manifest

# 2. Run full pipeline
python cli.py run

# 3. Review artifacts and adjust config.py as needed

# 4. Re-run experiments
python cli.py early-warning-experiment
```

Production Deployment

```shell
# 1. Run pipeline with production config
python cli.py run

# 2. Export model to ONNX (automatically done in pipeline)

# 3. Validate ONNX model
python -c "import onnx; model = onnx.load('artifacts/risk_model.onnx'); onnx.checker.check_model(model)"

# 4. Deploy ONNX model to target runtime
# (Use ONNX Runtime, TensorRT, or other compatible inference engines)
```

Troubleshooting

Error: KeyError: 'column_name'

Solution: Verify your CSV files match the expected schema. Run `python cli.py manifest` to see which files are being loaded.

Required columns: age, height, weight, bmi, children, months, hospital, gender, diagnosis, blood_test
Error: MemoryError or slow execution

Solution: Reduce memory limits in config.py:

```python
hardware_memory_limit_mb: int = 128  # Lower from 256
```

The pipeline will automatically adjust batch sizes using auto_adjust_batch_size().
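The exact implementation of `auto_adjust_batch_size()` is not shown here, but the idea can be sketched as halving the batch until its estimated footprint fits the limit. The per-row memory cost below is a made-up figure for illustration, not the platform's real cost model:

```python
def auto_adjust_batch_size(requested: int, memory_limit_mb: int,
                           row_cost_mb: float = 0.5) -> int:
    """Halve the batch size until the estimated memory footprint fits.

    row_cost_mb is a hypothetical per-row estimate used only for this sketch.
    """
    batch = requested
    while batch > 1 and batch * row_cost_mb > memory_limit_mb:
        batch //= 2
    return batch

# With a 128 MB limit and 0.5 MB/row, a requested batch of 512 shrinks to 256.
```

Lowering `hardware_memory_limit_mb` therefore trades throughput (smaller batches) for a bounded memory footprint.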
Issue: Streaming latency exceeds requirements

Solution: Adjust chunk size in config.py:

```python
stream_chunk_size: int = 8   # Reduce from 16 for lower latency
stream_interval_ms: int = 5  # Reduce from 10 for faster updates
```

Note: Smaller chunks reduce latency but may decrease throughput.
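The latency side of that trade-off can be made concrete with a toy model: if rows arrive one per interval and inference runs once a chunk fills, the first row of a chunk waits for the rest of it. Both helpers below are illustrative, not platform code:

```python
def chunked(rows, chunk_size):
    """Yield successive fixed-size chunks of a row list (last may be short)."""
    for i in range(0, len(rows), chunk_size):
        yield rows[i:i + chunk_size]

def worst_case_wait_ms(chunk_size: int, interval_ms: int) -> int:
    """Queueing delay for the first row of a chunk: it waits for the
    remaining chunk_size - 1 rows to arrive, one per interval."""
    return (chunk_size - 1) * interval_ms
```

Under this model, the defaults (chunk 16, interval 10 ms) give a worst-case queueing delay of 150 ms, while the reduced settings (chunk 8, interval 5 ms) cut it to 35 ms, at the cost of more per-chunk overhead.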

Next Steps

  • Core Concepts: Understand the pipeline architecture and design philosophy
  • Configuration: Learn about all configuration options and tuning parameters
  • Modeling: Deep dive into predictive models and risk stratification
  • CLI Reference: Complete CLI command reference with all options
