
System Requirements

Before installing, verify your system meets these requirements:

Python Version

Python 3.8 or higher

Memory

Minimum 512MB RAM (1GB+ recommended)

Storage

500MB+ available disk space

Operating System

Linux, macOS, or Windows with WSL
Windows users should use WSL (Windows Subsystem for Linux) for optimal compatibility. Native Windows support may have limitations.

Installation Steps

Step 1: Verify Python Installation

Check your Python version:
python --version
# or
python3 --version
Expected output: Python 3.8.x or higher
If Python is not installed, download it from python.org or use your system’s package manager.
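If you prefer checking from Python itself rather than the shell, a minimal sketch (the version-tuple comparison mirrors the 3.8 minimum stated above; the helper name is illustrative):

```python
# Check that the running interpreter meets the documented minimum (3.8).
import sys

MIN_VERSION = (3, 8)

def python_ok(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True when the interpreter satisfies the minimum version."""
    return tuple(version_info[:2]) >= minimum

print(f"Python {sys.version_info.major}.{sys.version_info.minor}: "
      f"{'OK' if python_ok() else 'too old'}")
```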
Step 2: Navigate to Project Directory

Change to the NBA Data Preprocessing directory:
cd "NBA Data Preprocessing/task"
Step 3: Create Virtual Environment (Recommended)

Isolate dependencies using a virtual environment:
python -m venv venv
Activate the environment:
source venv/bin/activate
Your terminal prompt should change to indicate the virtual environment is active (e.g., (venv)).
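If the prompt is ambiguous, you can also confirm from Python: inside a venv, `sys.prefix` points at the environment while `sys.base_prefix` points at the system interpreter.

```python
# Detect whether the current interpreter runs inside a virtual environment.
import sys

def in_virtualenv():
    # These prefixes differ only when a venv (or virtualenv) is active.
    return sys.prefix != sys.base_prefix

print("virtual environment active:", in_virtualenv())
```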
Step 4: Install Dependencies

Install all required packages from requirements.txt:
pip install -r ../../requirements.txt
This installs:
  • numpy==1.26.4 - Numerical computing
  • pandas==2.2.2 - Data manipulation
  • scikit-learn==1.5.1 - Machine learning
  • matplotlib==3.9.2 - Visualization
  • psutil==6.0.0 - System monitoring
  • joblib==1.4.2 - Parallel processing
  • requests==2.32.3 - HTTP utilities
Step 5: Verify Installation

Run the test suite to confirm everything is set up correctly:
python -m unittest discover -s test -p 'test_*.py'
Expected output:
----------------------------------------------------------------------
Ran X tests in Y.ZZZs

OK

Dependency Details

Core Dependencies

numpy

Provides high-performance array operations and numerical computing primitives. Used throughout the pipeline for efficient data manipulation.
import numpy as np
# Used for vectorized operations and memory-efficient processing
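As a quick illustration of the vectorized style this enables, a hedged sketch with made-up player measurements (not values from the real dataset):

```python
import numpy as np

heights_m = np.array([1.98, 2.06, 1.91])    # hypothetical player heights
weights_kg = np.array([98.0, 109.0, 93.0])  # hypothetical player weights

# One vectorized expression replaces an explicit Python loop over rows.
bmi = weights_kg / heights_m ** 2
print(np.round(bmi, 1))
```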
pandas

DataFrame library for structured data processing. Handles CSV ingestion, chunk processing, and data transformations.
import pandas as pd
# Powers the streaming chunk reader and preprocessing stages
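A minimal sketch of the chunked-reading pattern pandas provides (the in-memory CSV and chunk size are illustrative, not the pipeline's actual reader):

```python
import io
import pandas as pd

# Stand-in for a CSV on disk; the pipeline would read nba2k-full.csv instead.
csv_data = io.StringIO("salary,weight\n100,98\n200,109\n300,93\n")

total = 0
# chunksize yields DataFrames of at most that many rows,
# keeping peak memory bounded regardless of file size.
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk["salary"].sum()

print(total)
```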
scikit-learn

Machine learning framework providing:
  • Linear regression models (baseline)
  • Preprocessing utilities (scaling, encoding)
  • Model evaluation metrics
  • Incremental learning with partial_fit
from sklearn.linear_model import SGDRegressor
# Used for online learning in streaming mode
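A hedged sketch of incremental learning with `partial_fit` on synthetic data (the batch size, learning-rate settings, and the `y = 3x` relation are all illustrative, not the pipeline's configuration):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
# A constant learning rate keeps this toy example stable and quick to converge.
model = SGDRegressor(learning_rate="constant", eta0=0.05, random_state=0)

# Feed synthetic y = 3x data in small batches, as a streaming pipeline would.
for _ in range(500):
    X = rng.uniform(0.0, 1.0, size=(8, 1))
    y = 3.0 * X.ravel()
    model.partial_fit(X, y)  # incremental update; no full-dataset refit

print("learned slope:", float(model.coef_[0]))
```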
matplotlib

Visualization library for generating benchmark plots:
  • Latency vs accuracy charts
  • Memory vs accuracy analysis
  • 3D resource-accuracy visualizations
import matplotlib.pyplot as plt
# Creates PNG artifacts in benchmarks/ directory
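A minimal sketch of producing one such PNG artifact (the benchmark numbers and output path are made up; the `Agg` backend renders off-screen so no display is needed):

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # render off-screen; no display required
import matplotlib.pyplot as plt

# Hypothetical benchmark numbers, purely for illustration.
latency_ms = [5, 12, 30, 80]
accuracy = [0.71, 0.78, 0.82, 0.84]

fig, ax = plt.subplots()
ax.plot(latency_ms, accuracy, marker="o")
ax.set_xlabel("latency (ms)")
ax.set_ylabel("accuracy")
ax.set_title("Latency vs accuracy (illustrative)")

out = os.path.join(tempfile.mkdtemp(), "latency_vs_accuracy.png")
fig.savefig(out)
print("wrote", out)
```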

System Monitoring Dependencies

psutil

Cross-platform system monitoring for:
  • Real-time memory tracking
  • CPU utilization measurement
  • Process resource profiling
import psutil
# Enables adaptive chunk resizing based on available memory
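A hedged sketch of what memory-aware chunk sizing could look like (the budget fraction, row bounds, and function name are illustrative, not the pipeline's actual policy):

```python
import psutil

def adaptive_chunk_size(bytes_per_row, min_rows=64, max_rows=4096, budget=0.05):
    """Size chunks to a small fraction of currently available memory.

    Illustrative policy: spend at most `budget` of free RAM per chunk,
    clamped to a sane row range.
    """
    available = psutil.virtual_memory().available
    rows = int(available * budget / bytes_per_row)
    return max(min_rows, min(max_rows, rows))

print(adaptive_chunk_size(bytes_per_row=512))
```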
joblib

Provides parallel execution capabilities:
  • Multi-process benchmark sweeps
  • Efficient serialization
  • Progress tracking
from joblib import Parallel, delayed
# Used when n_jobs > 1 for parallel constraint experiments
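A minimal sketch of the `Parallel`/`delayed` pattern (the `run_experiment` function is a hypothetical stand-in for one constraint experiment):

```python
from joblib import Parallel, delayed

def run_experiment(chunk_size):
    # Stand-in for one benchmark run; returns a (config, score) pair.
    return chunk_size, 1.0 / chunk_size

# n_jobs=2 fans the experiments out across worker processes;
# results come back in input order.
results = Parallel(n_jobs=2)(
    delayed(run_experiment)(cs) for cs in [64, 128, 256, 512]
)
print(results)
```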

Optional: Energy Monitoring

Energy telemetry requires Intel RAPL (Running Average Power Limit) support. This is optional and the pipeline works without it.

Check RAPL Availability

On Linux systems with Intel CPUs:
ls /sys/class/powercap/intel-rapl
If the directory exists, RAPL is available. Otherwise, the pipeline uses fallback estimation.
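The same check can be done from Python, which is how a pipeline might choose between RAPL readings and fallback estimation (the function name is illustrative):

```python
import os

RAPL_PATH = "/sys/class/powercap/intel-rapl"

def rapl_available(path=RAPL_PATH):
    """True when the kernel exposes Intel RAPL energy counters."""
    return os.path.isdir(path)

print("RAPL available:", rapl_available())
```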

Enable RAPL Access

For accurate energy measurements, you may need to enable access:
sudo chmod -R a+r /sys/class/powercap/intel-rapl
RAPL is not available in:
  • Docker containers (without privileged mode)
  • AMD or ARM processors
  • Virtual machines (most configurations)
The pipeline automatically falls back to estimation in these cases.

Configuration

Directory Structure

After installation, your project should have this structure:
NBA Data Preprocessing/
├── data/
│   └── nba2k-full.csv
├── task/
│   ├── pipeline/
│   │   ├── config.py
│   │   ├── streaming/
│   │   ├── preprocessing/
│   │   ├── feature_engineering/
│   │   ├── ingestion/
│   │   ├── validation/
│   │   └── hardware/
│   ├── test/
│   ├── run_pipeline.py
│   └── preprocess.py
└── configs/
    ├── pipeline.edge.template.json
    └── pipeline.server.template.json

Verify Data File

Ensure the NBA2K dataset is present:
ls -lh "../data/nba2k-full.csv"
The file should contain columns: version, salary, b_day, draft_year, height, weight
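A small sketch of validating those columns programmatically (the inline CSV row is fabricated for illustration; in practice you would read the real file at ../data/nba2k-full.csv):

```python
import io
import pandas as pd

REQUIRED = {"version", "salary", "b_day", "draft_year", "height", "weight"}

def missing_columns(df):
    """Return the set of required columns absent from the DataFrame."""
    return REQUIRED - set(df.columns)

# Tiny fabricated stand-in for the dataset's header and one row.
df = pd.read_csv(io.StringIO(
    "version,salary,b_day,draft_year,height,weight\n"
    "NBA2k20,$1000000,01/01/95,2015,6-7 / 2.01,200 lbs. / 90.7 kg.\n"
))
print("missing:", missing_columns(df) or "none")
```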

Platform-Specific Notes

Linux

Most straightforward installation. All features supported:
# Install system dependencies (Ubuntu/Debian)
sudo apt-get update
sudo apt-get install python3-pip python3-venv

# Proceed with standard installation
pip install -r requirements.txt

macOS

# Install Python via Homebrew
brew install [email protected]

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt
macOS does not support RAPL energy monitoring. The pipeline uses CPU-based estimation instead.
Windows

Install WSL2 and Ubuntu:
# In PowerShell (Administrator)
wsl --install
Then follow Linux installation steps inside WSL.
For native Windows support without WSL, ensure you have Microsoft Visual C++ 14.0+ (required for some numpy/pandas builds).

Troubleshooting

Package installation fails

Upgrade pip and try again:
pip install --upgrade pip
pip install -r requirements.txt
Or use a different index:
pip install -r requirements.txt --index-url https://pypi.org/simple
ModuleNotFoundError after installation

Verify the virtual environment is activated:
which python
# Should show path to venv/bin/python
If not activated:
source venv/bin/activate  # Linux/macOS
.\venv\Scripts\Activate.ps1  # Windows PowerShell
Permission errors with pip

Don’t use sudo with pip. Instead:
  1. Use a virtual environment (recommended)
  2. Or install with --user flag: pip install --user -r requirements.txt
numpy or pandas fails to build from source

These packages require C compilers. Install build tools:

Ubuntu/Debian:
sudo apt-get install build-essential python3-dev
macOS:
xcode-select --install
Windows: Download Microsoft C++ Build Tools from visualstudio.microsoft.com
Tests are not discovered

Ensure you’re running tests from the correct directory:
cd "NBA Data Preprocessing/task"
python -m unittest discover -s test -p 'test_*.py'

Upgrading Dependencies

To upgrade to newer package versions:
# Upgrade all packages
pip install --upgrade -r requirements.txt

# Or upgrade specific packages
pip install --upgrade pandas scikit-learn
Upgrading dependencies may introduce breaking changes. Test thoroughly after upgrades, especially for scikit-learn (model API changes) and pandas (DataFrame behavior changes).

Verifying Installation

Quick Verification Script

Create a file verify_install.py:
import sys
import importlib

required_packages = [
    ('numpy', '1.26.4'),
    ('pandas', '2.2.2'),
    ('sklearn', '1.5.1'),
    ('matplotlib', '3.9.2'),
    ('psutil', '6.0.0'),
    ('joblib', '1.4.2'),
    ('requests', '2.32.3'),
]

print(f"Python version: {sys.version}")
print("\nChecking packages...\n")

for package, expected_version in required_packages:
    try:
        mod = importlib.import_module(package)
        version = getattr(mod, '__version__', 'unknown')
        status = "✓" if version.startswith(expected_version.split('.')[0]) else "⚠"
        print(f"{status} {package:15} {version:15} (expected {expected_version})")
    except ImportError:
        print(f"✗ {package:15} NOT INSTALLED")

print("\nInstallation check complete!")
Run it:
python verify_install.py

Full Pipeline Test

Run a minimal pipeline test:
python run_pipeline.py \
  --input ../data/nba2k-full.csv \
  --chunk-size 64 \
  --benchmark-runs 1 \
  --output-dir test_artifacts
If successful, you’ll see artifact files in test_artifacts/.

Next Steps

Quickstart Guide

Run your first pipeline in minutes

Configuration

Learn about all configuration options

Architecture

Understand the pipeline design

API Reference

Explore the Python API
