The data ingestion module handles loading CSV files from different hospital departments and merging them into a unified dataset. It also provides dataset versioning through file hashing and manifest generation.

Loading Hospital Data

The load_hospital_data() function loads CSV files from three hospital departments: general, prenatal, and sports.

Function Signature

def load_hospital_data(data_dir: Path) -> dict[str, pd.DataFrame]
Parameters:
  • data_dir (Path): Directory containing the hospital CSV files
Returns:
  • Dictionary mapping hospital names to DataFrames

Supported Hospital Files

The loader expects these CSV files in the data directory:
  • general.csv - General hospital data
  • prenatal.csv - Prenatal care data
  • sports.csv - Sports medicine data
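A minimal sketch of what such a loader might look like. The file-naming convention (`<hospital>.csv`) comes from the list above; the hard-coded `HOSPITALS` tuple and the dict comprehension are illustrative assumptions, not the actual source.

```python
from pathlib import Path

import pandas as pd

# Hypothetical constant: the three department names listed above
HOSPITALS = ("general", "prenatal", "sports")


def load_hospital_data(data_dir: Path) -> dict[str, pd.DataFrame]:
    """Load each department's CSV into a DataFrame keyed by hospital name."""
    return {name: pd.read_csv(data_dir / f"{name}.csv") for name in HOSPITALS}
```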

Usage Example

from pathlib import Path
from ingestion.loader import load_hospital_data

data_dir = Path("data/hospital")
datasets = load_hospital_data(data_dir)

print(datasets.keys())  # dict_keys(['general', 'prenatal', 'sports'])
print(datasets["general"].shape)
From cli.py:50:
datasets = load_hospital_data(CONFIG.data_dir)

Merging Datasets

The merge_hospital_data() function combines multiple hospital datasets into a single DataFrame by aligning column names and concatenating rows.

Function Signature

def merge_hospital_data(datasets: dict[str, pd.DataFrame]) -> pd.DataFrame
Parameters:
  • datasets (dict[str, pd.DataFrame]): Dictionary of hospital DataFrames from load_hospital_data()
Returns:
  • Merged DataFrame with aligned columns and reset index

Merging Behavior

  1. Uses the general hospital’s columns as the reference schema
  2. Renames all other datasets’ columns to match the general schema
  3. Concatenates all datasets vertically with ignore_index=True
  4. Removes the Unnamed: 0 column if present (common pandas artifact)
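The four steps above can be sketched as follows. This is an illustrative implementation under the assumption that all datasets have the same number of columns in the same order (so positional renaming is safe); the real function may align columns differently.

```python
import pandas as pd


def merge_hospital_data(datasets: dict[str, pd.DataFrame]) -> pd.DataFrame:
    # Step 1: use the general hospital's columns as the reference schema
    reference_cols = list(datasets["general"].columns)
    aligned = []
    for df in datasets.values():
        df = df.copy()
        # Step 2: rename positionally to match the general schema
        # (assumes every dataset has the same column count and order)
        df.columns = reference_cols
        aligned.append(df)
    # Step 3: concatenate rows and reset the index
    merged = pd.concat(aligned, ignore_index=True)
    # Step 4: drop the index artifact pandas writes when a CSV was
    # saved with to_csv(index=True); ignore it if absent
    return merged.drop(columns=["Unnamed: 0"], errors="ignore")
```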

Usage Example

from pathlib import Path
from ingestion.loader import load_hospital_data, merge_hospital_data

data_dir = Path("data/hospital")
datasets = load_hospital_data(data_dir)
merged = merge_hospital_data(datasets)

print(f"Total records: {len(merged)}")
print(f"Columns: {list(merged.columns)}")
From cli.py:50-51:
datasets = load_hospital_data(CONFIG.data_dir)
merged = merge_hospital_data(datasets)

Dataset Versioning

The versioning module provides dataset integrity tracking through SHA-256 file hashing and JSON manifests.

Creating a Dataset Manifest

def create_dataset_manifest(data_dir: Path, output_path: Path) -> dict
Parameters:
  • data_dir (Path): Directory containing CSV files to track
  • output_path (Path): Path where the manifest JSON will be saved
Returns:
  • Dictionary containing dataset metadata

Manifest Structure

The generated manifest includes:
{
  "dataset_dir": "data/hospital",
  "files": [
    {
      "name": "general.csv",
      "sha256": "a1b2c3...",
      "size": 524288
    },
    {
      "name": "prenatal.csv",
      "sha256": "d4e5f6...",
      "size": 393216
    },
    {
      "name": "sports.csv",
      "sha256": "e7f8a9...",
      "size": 458752
    }
  ]
}
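A sketch of a manifest builder that produces the structure shown above. The field names follow the example JSON; the inline SHA-256 helper, the `*.csv` glob, and the sorted file order are assumptions made so the example is self-contained and deterministic.

```python
import hashlib
import json
from pathlib import Path


def _sha256_of(path: Path) -> str:
    # Inline stand-in for the hash_file() helper described below
    return hashlib.sha256(path.read_bytes()).hexdigest()


def create_dataset_manifest(data_dir: Path, output_path: Path) -> dict:
    """Record name, SHA-256, and size for every CSV in data_dir."""
    manifest = {
        "dataset_dir": str(data_dir),
        "files": [
            {
                "name": p.name,
                "sha256": _sha256_of(p),
                "size": p.stat().st_size,
            }
            for p in sorted(data_dir.glob("*.csv"))
        ],
    }
    # Write the manifest JSON, creating the output directory if needed
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(json.dumps(manifest, indent=2))
    return manifest
```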

File Hashing

The hash_file() function computes SHA-256 checksums:
def hash_file(path: Path) -> str
Parameters:
  • path (Path): Path to the file to hash
Returns:
  • Hexadecimal SHA-256 hash string
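A plausible chunked implementation of this signature, reading the file in fixed-size blocks so large CSVs are not loaded into memory at once. The 64 KiB chunk size is an assumption, not a documented detail.

```python
import hashlib
from pathlib import Path


def hash_file(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, hashed in 64 KiB chunks."""
    sha = hashlib.sha256()
    with path.open("rb") as fh:
        # iter(...) yields chunks until read() returns b"" at EOF
        for chunk in iter(lambda: fh.read(65536), b""):
            sha.update(chunk)
    return sha.hexdigest()
```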

Usage Example

from pathlib import Path
from ingestion.versioning import create_dataset_manifest, hash_file

# Create a manifest for all CSV files
manifest = create_dataset_manifest(
    data_dir=Path("data/hospital"),
    output_path=Path("output/dataset_manifest.json")
)

print(f"Tracked {len(manifest['files'])} files")

# Hash a single file
file_hash = hash_file(Path("data/hospital/general.csv"))
print(f"File hash: {file_hash}")
From cli.py:115 and cli.py:158:
# In the main pipeline
manifest = create_dataset_manifest(
    CONFIG.data_dir, 
    CONFIG.output_dir / "dataset_manifest.json"
)

# As a standalone command
if args.command == "manifest":
    manifest = create_dataset_manifest(
        CONFIG.data_dir, 
        CONFIG.output_dir / "dataset_manifest.json"
    )
    print(json.dumps(manifest, indent=2))

Complete Pipeline Example

Here’s how ingestion is used in the full pipeline:
from pathlib import Path
from ingestion.loader import load_hospital_data, merge_hospital_data
from ingestion.versioning import create_dataset_manifest

# Configuration
data_dir = Path("data/hospital")
output_dir = Path("output")

# Load individual hospital datasets
datasets = load_hospital_data(data_dir)
print(f"Loaded {len(datasets)} hospital datasets")

# Merge into unified dataset
merged = merge_hospital_data(datasets)
print(f"Merged dataset shape: {merged.shape}")

# Create versioning manifest
manifest = create_dataset_manifest(
    data_dir, 
    output_dir / "dataset_manifest.json"
)
print(f"Created manifest tracking {len(manifest['files'])} files")
This produces a clean, unified dataset ready for preprocessing and feature engineering.
