
System Requirements

Python Version

Python 3.10 or higher is required. The platform uses modern Python features and type hints that depend on 3.10+.
Verify your Python version:
python --version
# or
python3 --version
Expected output: Python 3.10.x or higher
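The same check can be done programmatically, which is convenient in setup scripts; a minimal sketch:

```python
import sys

# The platform requires Python 3.10+ for its use of modern syntax and typing.
ok = sys.version_info >= (3, 10)
print(f"Python {sys.version.split()[0]}: {'OK' if ok else 'too old, need 3.10+'}")
```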

Hardware Requirements

Minimum

  • 2 CPU cores
  • 2GB RAM
  • 1GB disk space

Recommended

  • 4+ CPU cores
  • 4GB+ RAM
  • 5GB disk space
The platform is designed for CPU-first execution. GPU acceleration is intentionally out of scope to ensure broad deployment compatibility.
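You can sanity-check a machine against the minimum tier with the standard library alone; this sketch covers cores and disk (once dependencies are installed, psutil, which the platform pins for hardware profiling, can also report RAM via `psutil.virtual_memory()`):

```python
import os
import shutil

# Thresholds from the "Minimum" tier above.
MIN_CORES = 2
MIN_DISK_GB = 1

cores = os.cpu_count() or 1
free_gb = shutil.disk_usage(".").free / 1024**3

print(f"CPU cores: {cores}, free disk: {free_gb:.1f} GB")
if cores < MIN_CORES:
    print("WARNING: fewer CPU cores than the documented minimum")
if free_gb < MIN_DISK_GB:
    print("WARNING: less free disk space than the documented minimum")
```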

Operating System

The platform is compatible with:
  • Linux (Ubuntu 20.04+, CentOS 8+, etc.)
  • macOS 11+
  • Windows 10/11 with WSL2

Installation Steps

Step 1: Access the project directory

Navigate to the task directory:
cd "Data Analysis for Hospitals/task"
Step 2: Create a virtual environment (recommended)

Isolate dependencies using a virtual environment:
python3 -m venv venv
source venv/bin/activate
Step 3: Install dependencies

Install all required packages from requirements.txt:
pip install -r requirements.txt
This may take 2-5 minutes depending on your internet connection and system performance.
Step 4: Verify installation

Confirm all packages are installed correctly:
python -c "import numpy, pandas, sklearn, matplotlib; print('All core dependencies installed')"

Dependencies

The platform requires the following Python packages with pinned versions for reproducibility:

Core Dependencies

Package        Version   Purpose
numpy          1.26.4    Numerical computing and array operations
pandas         2.2.2     Data manipulation and CSV processing
scikit-learn   1.5.1     Machine learning models and preprocessing

Visualization & Monitoring

Package        Version   Purpose
matplotlib     3.9.2     Plotting and visualization
seaborn        0.13.2    Statistical graphics
psutil         6.0.0     Hardware profiling and resource monitoring

Deployment & Testing

Package        Version   Purpose
skl2onnx       1.17.0    ONNX model export for production inference
pytest         8.3.2     Unit testing and validation
Version compatibility is critical. The CI pipeline uses pinned dependencies for deterministic execution. Version mismatches may cause serialization errors or numerical inconsistencies.
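Taken together, the pins above correspond to a requirements.txt along these lines (ordering illustrative; the file shipped with the project is authoritative):

```text
numpy==1.26.4
pandas==2.2.2
scikit-learn==1.5.1
matplotlib==3.9.2
seaborn==0.13.2
psutil==6.0.0
skl2onnx==1.17.0
pytest==8.3.2
```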

Configuration

Directory Structure

After installation, ensure the following structure exists:
Data Analysis for Hospitals/task/
├── cli.py                    # Main command-line interface
├── config.py                 # Configuration parameters
├── requirements.txt          # Dependency specifications
├── test/                     # Input data directory
│   ├── general.csv          # Hospital general data
│   ├── prenatal.csv         # Prenatal care data
│   └── sports.csv           # Sports medicine data
├── artifacts/               # Output directory (auto-created)
├── ingestion/
├── preprocessing/
├── feature_engineering/
├── modeling/
├── anomaly_detection/
├── real_time/
├── deployment/
├── evaluation/
└── utils/
The artifacts/ directory is automatically created when you run the pipeline. You don’t need to create it manually.
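A quick way to confirm the structure is in place is to check for the required files from Python; a minimal sketch, run from the task directory (artifacts/ is deliberately excluded since the pipeline creates it):

```python
from pathlib import Path

# Files the guide says must exist after installation.
REQUIRED = [
    "cli.py",
    "config.py",
    "requirements.txt",
    "test/general.csv",
    "test/prenatal.csv",
    "test/sports.csv",
]

root = Path(".")  # run from Data Analysis for Hospitals/task/
missing = [p for p in REQUIRED if not (root / p).exists()]
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("Directory structure looks complete")
```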

Data Directory Setup

Place your hospital CSV files in the test/ directory:
ls test/
# Expected output:
# general.csv  prenatal.csv  sports.csv
CSV Schema Requirements:
  • Files must follow the expected column schema used in feature generation
  • Column names should match the feature engineering expectations
  • Missing columns will trigger schema drift errors
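A lightweight pre-flight check can catch schema drift before the pipeline reports it. The column names below are placeholders, not the platform's real schema; substitute the columns your feature engineering actually expects:

```python
import pandas as pd

# Hypothetical expected columns -- replace with your real schema.
EXPECTED = {"hospital", "gender", "age", "diagnosis"}

def check_schema(path: str) -> set:
    """Return the set of expected columns missing from the CSV header."""
    # nrows=0 reads only the header row, so this is cheap even for large files.
    cols = set(pd.read_csv(path, nrows=0).columns)
    return EXPECTED - cols

# for f in ("test/general.csv", "test/prenatal.csv", "test/sports.csv"):
#     missing = check_schema(f)
#     print(f, "OK" if not missing else f"missing columns: {sorted(missing)}")
```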

Permissions

File System Access

Ensure your runtime has sufficient permissions:
# Check write permissions for output directory
touch artifacts/test_write.txt && rm artifacts/test_write.txt
If this fails, adjust permissions:
chmod -R u+w "Data Analysis for Hospitals/task/"

Python Package Installation

If you encounter permission errors during pip install, install into your user site-packages instead:
pip install --user -r requirements.txt

Validation

Run the Test Suite

Verify your installation with the test suite:
pytest
All tests should pass with deterministic behavior via explicit seed control.
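Determinism here means the same seed always yields the same results. A minimal illustration of the pattern (not the platform's actual test code):

```python
import numpy as np

def sample(seed: int) -> np.ndarray:
    """Draw from an explicitly seeded generator, never from global state."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=5)

def test_same_seed_same_output():
    # Two runs with the same seed must agree bit-for-bit.
    assert np.array_equal(sample(42), sample(42))

def test_different_seed_different_output():
    assert not np.array_equal(sample(42), sample(43))
```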

Generate a Dataset Manifest

Test data loading and validation:
python cli.py manifest
Expected output: JSON manifest with file metadata, row counts, and checksums.
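To see what goes into such a manifest, the core of the command can be sketched as follows (field names here are hypothetical; the real output of `python cli.py manifest` is authoritative):

```python
import csv
import hashlib
import json
from pathlib import Path

def manifest_entry(path: Path) -> dict:
    """File metadata, row count, and SHA-256 checksum for one CSV."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    with path.open(newline="") as f:
        rows = sum(1 for _ in csv.reader(f)) - 1  # exclude the header row
    return {"file": path.name, "bytes": path.stat().st_size,
            "rows": rows, "sha256": digest}

# print(json.dumps([manifest_entry(p) for p in Path("test").glob("*.csv")], indent=2))
```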

Run a Quick Pipeline Test

Execute the full pipeline to confirm everything works:
python cli.py run
As the pipeline runs, watch for three phases:
  • Data loading: messages about loading hospital data from CSV files
  • Model training: predictive model training and evaluation progress
  • Artifact generation: outputs written to the artifacts/ directory

Troubleshooting

ImportError: No module named ‘xxx’

Cause: Dependency not installed, or the wrong Python environment is active.
Solution:
# Ensure virtual environment is activated
source venv/bin/activate  # Linux/macOS
# or
.\venv\Scripts\Activate.ps1  # Windows

# Reinstall dependencies
pip install -r requirements.txt

Version Conflicts

Cause: Existing packages conflict with the pinned versions.
Solution:
# Create fresh virtual environment
rm -rf venv
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Permission Denied on artifacts/

Cause: Insufficient write permissions.
Solution:
mkdir -p "Data Analysis for Hospitals/task/artifacts"
chmod -R u+w "Data Analysis for Hospitals/task/artifacts"

Python 3.10+ Not Available

Solution (Debian/Ubuntu; on other systems, install Python 3.10+ with your platform's package manager):
sudo apt update
sudo apt install python3.10 python3.10-venv

Environment Variables (Optional)

Customize behavior with environment variables:
# Set custom random seed for reproducibility
export RANDOM_SEED=42

# Adjust hardware constraints
export MEMORY_LIMIT_MB=512
export COMPUTE_BUDGET=0.8

# Configure output directory
export OUTPUT_DIR=./custom_artifacts
Most users don’t need to set environment variables. The platform uses sensible defaults from config.py.
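The usual pattern behind such overrides is an environment lookup with a fallback default; a generic sketch using the variable names from the examples above (the default values here are illustrative, not those in config.py):

```python
import os

# Defaults are illustrative -- config.py holds the platform's real values.
RANDOM_SEED = int(os.environ.get("RANDOM_SEED", "42"))
MEMORY_LIMIT_MB = int(os.environ.get("MEMORY_LIMIT_MB", "1024"))
COMPUTE_BUDGET = float(os.environ.get("COMPUTE_BUDGET", "1.0"))
OUTPUT_DIR = os.environ.get("OUTPUT_DIR", "./artifacts")

print(RANDOM_SEED, MEMORY_LIMIT_MB, COMPUTE_BUDGET, OUTPUT_DIR)
```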

Continuous Integration

The repository uses standard Python tooling:
  • Testing Framework: pytest with unittest
  • CI Target: Python 3.10 with pinned dependencies
  • Deterministic Behavior: Explicit seed and threading environment controls
For CI/CD integration, ensure your pipeline uses the same pinned dependency versions to maintain reproducibility.
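One way to confirm that a CI environment (or your local one) matches the pins is to compare installed versions against the tables above; a sketch using only the standard library:

```python
from importlib.metadata import PackageNotFoundError, version

# Pinned versions from the dependency tables in this guide.
PINS = {
    "numpy": "1.26.4",
    "pandas": "2.2.2",
    "scikit-learn": "1.5.1",
    "matplotlib": "3.9.2",
    "seaborn": "0.13.2",
    "psutil": "6.0.0",
    "skl2onnx": "1.17.0",
    "pytest": "8.3.2",
}

for name, pinned in PINS.items():
    try:
        installed = version(name)
    except PackageNotFoundError:
        print(f"{name}: NOT INSTALLED (expected {pinned})")
        continue
    status = "ok" if installed == pinned else f"MISMATCH (expected {pinned})"
    print(f"{name}: {installed} {status}")
```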

Next Steps

Quick Start

Run your first pipeline in under 5 minutes

Configuration Guide

Customize pipeline parameters and constraints

CLI Reference

Detailed command documentation

Operations Guide

Production deployment best practices

Getting Help

Before deploying to production:
  • Review docs/OPERATIONS.md for deployment considerations
  • Expand default benchmarks for production sign-off
  • Validate hardware estimates with device-calibrated measurements
For issues or questions:
  • Check troubleshooting sections in this guide
  • Review error messages for schema drift or missing dependencies
  • Verify your CSV files match expected schemas
