
System Requirements

Before installing, verify your system meets these requirements:

Python Version

Python 3.8 or higher

Memory

Minimum 512MB RAM (1GB+ recommended)

Storage

500MB+ available disk space

Operating System

Linux, macOS, or Windows with WSL
Windows users should use WSL (Windows Subsystem for Linux) for optimal compatibility. Native Windows support may have limitations.

Installation Steps

Step 1: Verify Python Installation

Check your Python version:
python --version
# or
python3 --version
Expected output: Python 3.8.x or higher
If Python is not installed, download it from python.org or use your system’s package manager.
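If you prefer checking from Python itself rather than the shell, a minimal sketch (the version-tuple comparison mirrors the 3.8 minimum stated above; the helper name is illustrative):

```python
# Check that the running interpreter meets the documented minimum (3.8).
import sys

MIN_VERSION = (3, 8)

def python_ok(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True when the interpreter satisfies the minimum version."""
    return tuple(version_info[:2]) >= minimum

print(f"Python {sys.version_info.major}.{sys.version_info.minor}: "
      f"{'OK' if python_ok() else 'too old'}")
```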
Step 2: Navigate to Project Directory

Change to the NBA Data Preprocessing directory:
cd "NBA Data Preprocessing/task"
Step 3: Create Virtual Environment (Recommended)

Isolate dependencies using a virtual environment:
python -m venv venv
Activate the environment:
source venv/bin/activate
Your terminal prompt should change to indicate the virtual environment is active (e.g., (venv)).
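If the prompt is ambiguous, you can also confirm from Python: inside a venv, `sys.prefix` points at the environment while `sys.base_prefix` points at the system interpreter.

```python
# Detect whether the current interpreter runs inside a virtual environment.
import sys

def in_virtualenv():
    # These prefixes differ only when a venv (or virtualenv) is active.
    return sys.prefix != sys.base_prefix

print("virtual environment active:", in_virtualenv())
```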
Step 4: Install Dependencies

Install all required packages from requirements.txt:
pip install -r ../../requirements.txt
This installs:
  • numpy==1.26.4 - Numerical computing
  • pandas==2.2.2 - Data manipulation
  • scikit-learn==1.5.1 - Machine learning
  • matplotlib==3.9.2 - Visualization
  • psutil==6.0.0 - System monitoring
  • joblib==1.4.2 - Parallel processing
  • requests==2.32.3 - HTTP utilities
Step 5: Verify Installation

Run the test suite to confirm everything is set up correctly:
python -m unittest discover -s test -p 'test_*.py'
Expected output:
----------------------------------------------------------------------
Ran X tests in Y.ZZZs

OK

Dependency Details

Core Dependencies

numpy

Provides high-performance array operations and numerical computing primitives. Used throughout the pipeline for efficient data manipulation.
import numpy as np
# Used for vectorized operations and memory-efficient processing
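As a quick illustration of the vectorized style this enables, a hedged sketch with made-up player measurements (not values from the real dataset):

```python
import numpy as np

heights_m = np.array([1.98, 2.06, 1.91])    # hypothetical player heights
weights_kg = np.array([98.0, 109.0, 93.0])  # hypothetical player weights

# One vectorized expression replaces an explicit Python loop over rows.
bmi = weights_kg / heights_m ** 2
print(np.round(bmi, 1))
```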
pandas

DataFrame library for structured data processing. Handles CSV ingestion, chunk processing, and data transformations.
import pandas as pd
# Powers the streaming chunk reader and preprocessing stages
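A minimal sketch of the chunked-reading pattern pandas provides (the in-memory CSV and chunk size are illustrative, not the pipeline's actual reader):

```python
import io
import pandas as pd

# Stand-in for a CSV on disk; the pipeline would read nba2k-full.csv instead.
csv_data = io.StringIO("salary,weight\n100,98\n200,109\n300,93\n")

total = 0
# chunksize yields DataFrames of at most that many rows,
# keeping peak memory bounded regardless of file size.
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk["salary"].sum()

print(total)
```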
scikit-learn

Machine learning framework providing:
  • Linear regression models (baseline)
  • Preprocessing utilities (scaling, encoding)
  • Model evaluation metrics
  • Incremental learning with partial_fit
from sklearn.linear_model import SGDRegressor
# Used for online learning in streaming mode
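A hedged sketch of incremental learning with `partial_fit` on synthetic data (the batch size, learning-rate settings, and the `y = 3x` relation are all illustrative, not the pipeline's configuration):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
# A constant learning rate keeps this toy example stable and quick to converge.
model = SGDRegressor(learning_rate="constant", eta0=0.05, random_state=0)

# Feed synthetic y = 3x data in small batches, as a streaming pipeline would.
for _ in range(500):
    X = rng.uniform(0.0, 1.0, size=(8, 1))
    y = 3.0 * X.ravel()
    model.partial_fit(X, y)  # incremental update; no full-dataset refit

print("learned slope:", float(model.coef_[0]))
```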
matplotlib

Visualization library for generating benchmark plots:
  • Latency vs accuracy charts
  • Memory vs accuracy analysis
  • 3D resource-accuracy visualizations
import matplotlib.pyplot as plt
# Creates PNG artifacts in benchmarks/ directory
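A minimal sketch of producing one such PNG artifact (the benchmark numbers and output path are made up; the `Agg` backend renders off-screen so no display is needed):

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # render off-screen; no display required
import matplotlib.pyplot as plt

# Hypothetical benchmark numbers, purely for illustration.
latency_ms = [5, 12, 30, 80]
accuracy = [0.71, 0.78, 0.82, 0.84]

fig, ax = plt.subplots()
ax.plot(latency_ms, accuracy, marker="o")
ax.set_xlabel("latency (ms)")
ax.set_ylabel("accuracy")
ax.set_title("Latency vs accuracy (illustrative)")

out = os.path.join(tempfile.mkdtemp(), "latency_vs_accuracy.png")
fig.savefig(out)
print("wrote", out)
```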

System Monitoring Dependencies

psutil

Cross-platform system monitoring for:
  • Real-time memory tracking
  • CPU utilization measurement
  • Process resource profiling
import psutil
# Enables adaptive chunk resizing based on available memory
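A hedged sketch of what memory-aware chunk sizing could look like (the budget fraction, row bounds, and function name are illustrative, not the pipeline's actual policy):

```python
import psutil

def adaptive_chunk_size(bytes_per_row, min_rows=64, max_rows=4096, budget=0.05):
    """Size chunks to a small fraction of currently available memory.

    Illustrative policy: spend at most `budget` of free RAM per chunk,
    clamped to a sane row range.
    """
    available = psutil.virtual_memory().available
    rows = int(available * budget / bytes_per_row)
    return max(min_rows, min(max_rows, rows))

print(adaptive_chunk_size(bytes_per_row=512))
```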
joblib

Provides parallel execution capabilities:
  • Multi-process benchmark sweeps
  • Efficient serialization
  • Progress tracking
from joblib import Parallel, delayed
# Used when n_jobs > 1 for parallel constraint experiments
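A minimal sketch of the `Parallel`/`delayed` pattern (the `run_experiment` function is a hypothetical stand-in for one constraint experiment):

```python
from joblib import Parallel, delayed

def run_experiment(chunk_size):
    # Stand-in for one benchmark run; returns a (config, score) pair.
    return chunk_size, 1.0 / chunk_size

# n_jobs=2 fans the experiments out across worker processes;
# results come back in input order.
results = Parallel(n_jobs=2)(
    delayed(run_experiment)(cs) for cs in [64, 128, 256, 512]
)
print(results)
```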

Optional: Energy Monitoring

Energy telemetry requires Intel RAPL (Running Average Power Limit) support. This is optional and the pipeline works without it.

Check RAPL Availability

On Linux systems with Intel CPUs:
ls /sys/class/powercap/intel-rapl
If the directory exists, RAPL is available. Otherwise, the pipeline uses fallback estimation.
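The same check can be done from Python, which is how a pipeline might choose between RAPL readings and fallback estimation (the function name is illustrative):

```python
import os

RAPL_PATH = "/sys/class/powercap/intel-rapl"

def rapl_available(path=RAPL_PATH):
    """True when the kernel exposes Intel RAPL energy counters."""
    return os.path.isdir(path)

print("RAPL available:", rapl_available())
```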

Enable RAPL Access

For accurate energy measurements, you may need to enable access:
sudo chmod -R a+r /sys/class/powercap/intel-rapl
RAPL is not available in:
  • Docker containers (without privileged mode)
  • AMD or ARM processors
  • Virtual machines (most configurations)
The pipeline automatically falls back to estimation in these cases.

Configuration

Directory Structure

After installation, your project should have this structure:
NBA Data Preprocessing/
├── data/
│   └── nba2k-full.csv
├── task/
│   ├── pipeline/
│   │   ├── config.py
│   │   ├── streaming/
│   │   ├── preprocessing/
│   │   ├── feature_engineering/
│   │   ├── ingestion/
│   │   ├── validation/
│   │   └── hardware/
│   ├── test/
│   ├── run_pipeline.py
│   └── preprocess.py
└── configs/
    ├── pipeline.edge.template.json
    └── pipeline.server.template.json

Verify Data File

Ensure the NBA2K dataset is present:
ls -lh "../data/nba2k-full.csv"
The file should contain columns: version, salary, b_day, draft_year, height, weight
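A small sketch of validating those columns programmatically (the inline CSV row is fabricated for illustration; in practice you would read the real file at ../data/nba2k-full.csv):

```python
import io
import pandas as pd

REQUIRED = {"version", "salary", "b_day", "draft_year", "height", "weight"}

def missing_columns(df):
    """Return the set of required columns absent from the DataFrame."""
    return REQUIRED - set(df.columns)

# Tiny fabricated stand-in for the dataset's header and one row.
df = pd.read_csv(io.StringIO(
    "version,salary,b_day,draft_year,height,weight\n"
    "NBA2k20,$1000000,01/01/95,2015,6-7 / 2.01,200 lbs. / 90.7 kg.\n"
))
print("missing:", missing_columns(df) or "none")
```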

Platform-Specific Notes

Linux

Most straightforward installation. All features supported:
# Install system dependencies (Ubuntu/Debian)
sudo apt-get update
sudo apt-get install python3-pip python3-venv

# Proceed with standard installation
pip install -r requirements.txt

macOS

# Install Python via Homebrew
brew install [email protected]

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt
macOS does not support RAPL energy monitoring. The pipeline uses CPU-based estimation instead.
Windows

Install WSL2 and Ubuntu:
# In PowerShell (Administrator)
wsl --install
Then follow Linux installation steps inside WSL.
For native Windows support without WSL, ensure you have Microsoft Visual C++ 14.0+ (required for some numpy/pandas builds).

Troubleshooting

Package installation fails

Upgrade pip and try again:
pip install --upgrade pip
pip install -r requirements.txt
Or use a different index:
pip install -r requirements.txt --index-url https://pypi.org/simple
ModuleNotFoundError after installation

Verify the virtual environment is activated:
which python
# Should show path to venv/bin/python
If not activated:
source venv/bin/activate  # Linux/macOS
.\venv\Scripts\Activate.ps1  # Windows PowerShell
Permission errors with pip

Don’t use sudo with pip. Instead:
  1. Use a virtual environment (recommended)
  2. Or install with --user flag: pip install --user -r requirements.txt
numpy or pandas fails to build from source

These packages require C compilers. Install build tools:

Ubuntu/Debian:
sudo apt-get install build-essential python3-dev
macOS:
xcode-select --install
Windows: Download Microsoft C++ Build Tools from visualstudio.microsoft.com
Tests are not discovered

Ensure you’re running tests from the correct directory:
cd "NBA Data Preprocessing/task"
python -m unittest discover -s test -p 'test_*.py'

Upgrading Dependencies

To upgrade to newer package versions:
# Upgrade all packages
pip install --upgrade -r requirements.txt

# Or upgrade specific packages
pip install --upgrade pandas scikit-learn
Upgrading dependencies may introduce breaking changes. Test thoroughly after upgrades, especially for scikit-learn (model API changes) and pandas (DataFrame behavior changes).

Verifying Installation

Quick Verification Script

Create a file verify_install.py:
import sys
import importlib

required_packages = [
    ('numpy', '1.26.4'),
    ('pandas', '2.2.2'),
    ('sklearn', '1.5.1'),
    ('matplotlib', '3.9.2'),
    ('psutil', '6.0.0'),
    ('joblib', '1.4.2'),
    ('requests', '2.32.3'),
]

print(f"Python version: {sys.version}")
print("\nChecking packages...\n")

for package, expected_version in required_packages:
    try:
        mod = importlib.import_module(package)
        version = getattr(mod, '__version__', 'unknown')
        status = "✓" if version.startswith(expected_version.split('.')[0]) else "⚠"
        print(f"{status} {package:15} {version:15} (expected {expected_version})")
    except ImportError:
        print(f"✗ {package:15} NOT INSTALLED")

print("\nInstallation check complete!")
Run it:
python verify_install.py

Full Pipeline Test

Run a minimal pipeline test:
python run_pipeline.py \
  --input ../data/nba2k-full.csv \
  --chunk-size 64 \
  --benchmark-runs 1 \
  --output-dir test_artifacts
If successful, you’ll see artifact files in test_artifacts/.

Next Steps

Quickstart Guide

Run your first pipeline in minutes

Configuration

Learn about all configuration options

Architecture

Understand the pipeline design

API Reference

Explore the Python API
