Overview
The data ingestion stage is the entry point of the NBA preprocessing pipeline. The `DataIngestor` class handles loading data from multiple sources (CSV files, paths, or in-memory DataFrames) and provides streaming capabilities for large datasets.
DataIngestor Class
Location: `~/workspace/source/NBA Data Preprocessing/task/pipeline/ingestion/loader.py:19`
Initialization
Accepts a random seed for reproducibility across data loading operations.
Core Methods
load()
Loads data from various source types into a pandas DataFrame.
- Data source: can be a file path (string), `Path` object, or existing DataFrame
When loading from a DataFrame, the method returns a copy to prevent unintended mutations of the original data.
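A minimal sketch of this dispatch logic, assuming a standalone `load` helper (the real method lives on `DataIngestor`, so the function name and signature here are illustrative):

```python
import pandas as pd
from pathlib import Path

def load(source):
    """Illustrative sketch of load(): accepts a str/Path or a DataFrame."""
    if isinstance(source, pd.DataFrame):
        # Return a copy so callers cannot mutate the original frame.
        return source.copy()
    # Strings and Path objects are treated as CSV file paths.
    return pd.read_csv(Path(source))
```

The copy on the DataFrame branch is what guarantees downstream stages never mutate the caller's data in place.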
stream_chunks()
Streams data in configurable chunk sizes for memory-efficient processing of large datasets.
- Data source to stream from
- Number of rows per chunk

Behavior:
- DataFrame source: splits the DataFrame using `iloc` with the specified chunk size
- File source: uses pandas `read_csv` with the `chunksize` parameter for efficient streaming
- Each chunk is returned as a copy to ensure isolation
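The two branches above can be sketched as a generator; this is an illustrative stand-in, not the actual `DataIngestor` implementation:

```python
import pandas as pd

def stream_chunks(source, chunk_size):
    """Illustrative generator: yields DataFrame chunks of chunk_size rows."""
    if isinstance(source, pd.DataFrame):
        for start in range(0, len(source), chunk_size):
            # Copy each slice so chunks are isolated from the source frame.
            yield source.iloc[start:start + chunk_size].copy()
    else:
        # read_csv with chunksize returns an iterator of DataFrames.
        yield from pd.read_csv(source, chunksize=chunk_size)
```

Because it is a generator, only one chunk is materialized at a time regardless of total dataset size.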
fingerprint()
Generates a cryptographic fingerprint of the dataset for versioning and reproducibility tracking.
- Data source to fingerprint

Returns a DatasetFingerprint object containing:
- `path` (str): Source path or `'<in-memory>'` for DataFrames
- `sha256` (str): SHA-256 hash of the CSV representation
- `rows` (int): Number of rows in the dataset
- `columns` (int): Number of columns in the dataset
- Version Control: Track dataset changes across pipeline runs
- Reproducibility: Verify that the same input data is used
- Data Integrity: Detect accidental modifications or corruption
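The core of such a fingerprint can be sketched as follows. This is an assumption about the approach, not the pipeline's exact code: in particular, the real CSV encoding options (index handling, separators) may differ, so hashes produced here would not necessarily match the pipeline's.

```python
import hashlib
import pandas as pd

def fingerprint(df, path="<in-memory>"):
    """Illustrative sketch: SHA-256 over the CSV encoding of a DataFrame."""
    digest = hashlib.sha256(df.to_csv(index=False).encode("utf-8")).hexdigest()
    return {"path": path, "sha256": digest,
            "rows": len(df), "columns": df.shape[1]}
```

Hashing the serialized bytes means any change to a single value produces a different digest, which is what makes the fingerprint useful for detecting accidental modification.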
DatasetFingerprint
Location: `~/workspace/source/NBA Data Preprocessing/task/pipeline/ingestion/loader.py:11`
- `path`: File path or `'<in-memory>'` for DataFrames
- `sha256`: SHA-256 hash of the CSV-encoded dataset
- `rows`: Total number of rows
- `columns`: Total number of columns
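Given the four fields above, a plausible shape for this record is a frozen dataclass; the exact definition in `loader.py` may differ, so treat this as a sketch:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DatasetFingerprint:
    """Immutable record of a dataset's identity for reproducibility tracking."""
    path: str     # source path, or "<in-memory>" for DataFrames
    sha256: str   # SHA-256 hash of the CSV-encoded dataset
    rows: int     # total number of rows
    columns: int  # total number of columns
```

Freezing the dataclass keeps fingerprints immutable, so a stored fingerprint cannot drift after it is recorded.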
Data Flow
The ingestion stage follows this flow:

Integration with Pipeline
The ingestion stage integrates with the streaming engine.

Location: `~/workspace/source/NBA Data Preprocessing/task/pipeline/streaming/engine.py:36`
The ingestion stage is source-agnostic - the same API works for files, paths, and DataFrames.
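This source-agnosticism rests on pandas itself accepting any file-like source. The demo below illustrates the idea at the pandas level (using an in-memory buffer in place of a file path), independent of the `DataIngestor` API:

```python
import io
import pandas as pd

# The same data loads identically whether it arrives as CSV text
# (via a file-like buffer, standing in for a file path) or as an
# already-constructed DataFrame.
csv_text = "player,pts\nA,10\nB,20\n"
from_csv = pd.read_csv(io.StringIO(csv_text))
from_frame = pd.DataFrame({"player": ["A", "B"], "pts": [10, 20]})
assert from_csv.equals(from_frame)
```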
Performance Considerations
Memory Efficiency
- Batch mode: Loads entire dataset into memory
- Streaming mode: Processes data in chunks, keeping only one chunk in memory at a time
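The two modes produce identical results for chunk-compatible computations; only the peak memory differs. A self-contained comparison (plain pandas, not the pipeline API):

```python
import pandas as pd

df = pd.DataFrame({"pts": range(100)})

# Batch mode: the whole dataset is in memory at once.
batch_total = df["pts"].sum()

# Streaming mode: accumulate over fixed-size chunks; only one
# 25-row slice needs to exist at a time.
stream_total = 0
for start in range(0, len(df), 25):
    stream_total += df.iloc[start:start + 25]["pts"].sum()

assert batch_total == stream_total
```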
When to Use Each Mode
| Mode | Best For | Memory Usage |
|---|---|---|
| Batch (`load`) | Small to medium datasets (<1GB) | High |
| Streaming (`stream_chunks`) | Large datasets or memory-constrained environments | Low |
Next Steps
Preprocessing
Clean and transform the ingested data
Streaming Engine
Learn about real-time pipeline execution