
Prerequisites

Before starting, ensure you have:
  • Python 3.8 or higher installed
  • pip package manager
  • 500MB+ available disk space
  • NBA2K dataset CSV file
If you haven’t installed the pipeline yet, see the Installation guide.

Quick Start

1. Clone and Navigate

Navigate to the pipeline directory:
cd "NBA Data Preprocessing/task"
2. Run Your First Pipeline

Execute the pipeline with default settings:
python run_pipeline.py \
  --input ../data/nba2k-full.csv \
  --output-dir artifacts
This runs the pipeline with default configuration:
  • Chunk size: 128 rows
  • Batch size: 256 rows
  • Max memory: 1024 MB
  • Single-threaded execution
3. View Results

Check the generated artifacts:
ls -R artifacts/
Expected output structure:
artifacts/
├── reports/
│   ├── pipeline_report.json
│   └── streaming_chunks.jsonl
├── benchmarks/
│   ├── constraint_experiment.csv
│   ├── significance_tests.csv
│   └── *.png (visualization plots)
└── profiles/
    └── operator_profile.csv
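As a quick sanity check, you can verify that all expected artifacts were written from Python. This is a convenience sketch, not part of the pipeline; adjust the directory to match your --output-dir:

```python
from pathlib import Path

# Expected artifact paths, relative to the output directory
EXPECTED = [
    "reports/pipeline_report.json",
    "reports/streaming_chunks.jsonl",
    "benchmarks/constraint_experiment.csv",
    "benchmarks/significance_tests.csv",
    "profiles/operator_profile.csv",
]

def missing_artifacts(output_dir="artifacts"):
    """Return the expected artifacts that are not present on disk."""
    root = Path(output_dir)
    return [rel for rel in EXPECTED if not (root / rel).exists()]

if __name__ == "__main__":
    gaps = missing_artifacts()
    print("All artifacts present" if not gaps else f"Missing: {gaps}")
```

Note that the benchmark PNG plots are not listed: their filenames vary per run, so only the stable CSV/JSON outputs are checked.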

Configuration Templates

The pipeline includes pre-configured templates for common deployment scenarios.
cd "NBA Data Preprocessing/task"
python run_pipeline.py \
  --input ../data/nba2k-full.csv \
  --config-template ../../configs/pipeline.edge.template.json

Edge Template Configuration

Optimized for resource-constrained devices:
{
  "chunk_size": 64,
  "batch_size": 96,
  "max_memory_mb": 256,
  "max_compute_units": 0.4,
  "spill_to_disk": true,
  "n_jobs": 1
}

Server Template Configuration

Optimized for high-performance processing:
{
  "chunk_size": 256,
  "batch_size": 512,
  "max_memory_mb": 4096,
  "max_compute_units": 1.0,
  "spill_to_disk": false,
  "n_jobs": 4
}
CLI arguments override template values, allowing you to customize specific parameters while using a template baseline.
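Conceptually, this merge layers dicts: template values form the baseline, and only the arguments you explicitly pass on the command line win. The helper below is a simplified sketch of that behavior, not the actual argument handling in run_pipeline.py:

```python
def merge_config(template, cli_overrides):
    """Layer explicit CLI overrides on top of template values.

    Only keys the user actually passed on the command line
    (i.e. values that are not None) override the template.
    """
    merged = dict(template)
    merged.update({k: v for k, v in cli_overrides.items() if v is not None})
    return merged

# Example: edge template baseline with a larger chunk size from the CLI
template = {"chunk_size": 64, "max_memory_mb": 256, "spill_to_disk": True}
cli = {"chunk_size": 128, "max_memory_mb": None}  # only --chunk-size passed
print(merge_config(template, cli))
# {'chunk_size': 128, 'max_memory_mb': 256, 'spill_to_disk': True}
```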

Common Workflows

Override Template Settings

Use a template but adjust specific parameters:
python run_pipeline.py \
  --input ../data/nba2k-full.csv \
  --config-template ../../configs/pipeline.edge.template.json \
  --chunk-size 128 \
  --max-memory-mb 512

Reproducible Research Runs

Ensure deterministic results with fixed seed:
python run_pipeline.py \
  --input ../data/nba2k-full.csv \
  --random-seed 42 \
  --benchmark-runs 5 \
  --output-dir experiment_001
Keep benchmark_runs fixed when comparing results across experiments. Changing this value affects statistical significance tests.

Memory-Constrained Execution

Enable aggressive memory management:
python run_pipeline.py \
  --input ../data/nba2k-full.csv \
  --chunk-size 32 \
  --max-memory-mb 256 \
  --spill-to-disk \
  --max-compute-units 0.3

Parallel Benchmark Sweeps

Accelerate constraint experiments with parallel execution:
python run_pipeline.py \
  --input ../data/nba2k-full.csv \
  --n-jobs 4 \
  --benchmark-runs 10 \
  --max-memory-mb 4096
Using n_jobs > 1 increases throughput but introduces timing variance in benchmarks. Use single-threaded execution for strict timing reproducibility.
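To quantify how much variance your setup introduces, you can summarize per-run wall-clock times with the standard library. The durations below are made-up illustrations, not measured values:

```python
import statistics

def timing_spread(durations_s):
    """Summarize run-to-run timing variability for a benchmark."""
    mean = statistics.mean(durations_s)
    stdev = statistics.stdev(durations_s) if len(durations_s) > 1 else 0.0
    return {"mean_s": mean, "stdev_s": stdev, "cv_pct": 100 * stdev / mean}

# Hypothetical wall-clock times from 5 benchmark runs
single_threaded = [45.1, 45.3, 45.2, 45.4, 45.2]
parallel = [14.8, 16.9, 13.2, 17.5, 15.1]

print(timing_spread(single_threaded))  # low coefficient of variation
print(timing_spread(parallel))         # faster, but noticeably higher spread
```

A high coefficient of variation (cv_pct) in parallel runs is the variance the note above warns about; if it matters for your comparison, fall back to --n-jobs 1.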

Programmatic Usage

For integration into existing Python applications:
from pathlib import Path
from pipeline.config import PipelineConfig
from pipeline.streaming import RealTimePipelineRunner

# Create configuration
config = PipelineConfig(
    random_seed=42,
    chunk_size=128,
    batch_size=256,
    max_memory_mb=1024,
    max_compute_units=1.0,
    benchmark_runs=5,
    n_jobs=1,
    spill_to_disk=False,
    adaptive_chunk_resize=True,
    output_dir=Path('artifacts')
)

# Run pipeline
runner = RealTimePipelineRunner(config)
report = runner.run_all('data/nba2k-full.csv')

# Access results
print(f"Pipeline completed in {report['total_runtime_seconds']}s")
print(f"Model RMSE: {report['model_metrics']['rmse']}")

Load Configuration from Template

import json
from pathlib import Path
from pipeline.config import PipelineConfig

# Load template
template_path = Path('configs/pipeline.edge.template.json')
template_values = json.loads(template_path.read_text())

# Override specific values
template_values['random_seed'] = 123
template_values['output_dir'] = Path('custom_output')

# Create config
config = PipelineConfig(**template_values)

Verify Installation

Run the test suite to verify everything works:
cd "NBA Data Preprocessing/task"
python -m unittest discover -s test -p 'test_*.py'
Expected output:
----------------------------------------------------------------------
Ran X tests in Y.ZZZs

OK

Understanding the Output

Pipeline Report

The pipeline_report.json contains comprehensive run statistics:
{
  "dataset_fingerprint": "sha256:abc123...",
  "total_runtime_seconds": 45.2,
  "peak_memory_mb": 847,
  "total_energy_joules": 1234.5,
  "model_metrics": {
    "rmse": 0.123,
    "r2_score": 0.876
  },
  "stage_timings": {
    "ingestion": 2.1,
    "preprocessing": 15.3,
    "feature_engineering": 18.7,
    "validation": 3.2,
    "evaluation": 5.9
  }
}
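A small script can turn the stage timings into a readable breakdown. This sketch assumes the report schema shown above:

```python
import json
from pathlib import Path

def stage_breakdown(report):
    """Return (stage, seconds, percent-of-stage-total) rows, slowest first."""
    timings = report["stage_timings"]
    total = sum(timings.values())
    rows = [(stage, secs, 100 * secs / total) for stage, secs in timings.items()]
    return sorted(rows, key=lambda r: r[1], reverse=True)

report_path = Path("artifacts/reports/pipeline_report.json")
if report_path.exists():
    report = json.loads(report_path.read_text())
    for stage, secs, pct in stage_breakdown(report):
        print(f"{stage:<20} {secs:6.1f}s  {pct:5.1f}%")
```

For the example report above, feature_engineering dominates at roughly 41% of stage time, which tells you where tuning effort pays off first.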

Streaming Chunks

The streaming_chunks.jsonl logs per-chunk telemetry:
{"chunk_id": 0, "rows": 128, "memory_mb": 156, "duration_ms": 234}
{"chunk_id": 1, "rows": 128, "memory_mb": 159, "duration_ms": 241}
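You can aggregate this telemetry with a few lines of standard-library Python. The sketch assumes the JSONL fields shown above:

```python
import json
from pathlib import Path

def summarize_chunks(lines):
    """Aggregate per-chunk telemetry into peak memory and throughput."""
    chunks = [json.loads(line) for line in lines if line.strip()]
    total_rows = sum(c["rows"] for c in chunks)
    total_ms = sum(c["duration_ms"] for c in chunks)
    return {
        "chunks": len(chunks),
        "peak_memory_mb": max(c["memory_mb"] for c in chunks),
        "rows_per_second": 1000 * total_rows / total_ms,
    }

log = Path("artifacts/reports/streaming_chunks.jsonl")
if log.exists():
    print(summarize_chunks(log.read_text().splitlines()))
```

Watching peak_memory_mb across chunks is a quick way to confirm the pipeline stays under your configured max_memory_mb.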

Troubleshooting

Out of Memory Errors

Reduce chunk_size and batch_size, or enable --spill-to-disk:
python run_pipeline.py \
  --input ../data/nba2k-full.csv \
  --chunk-size 32 \
  --batch-size 64 \
  --spill-to-disk
Slow Performance

For faster processing on capable hardware:
  • Increase chunk_size and batch_size
  • Increase max_memory_mb
  • Enable parallel execution with --n-jobs 4
  • Use the server template as baseline
Missing Energy Telemetry

Energy telemetry requires Intel RAPL. If RAPL is unavailable (e.g., in containers or on AMD/ARM hardware):
  • Pipeline uses coarse fallback estimation
  • All other metrics remain accurate
  • This is expected behavior, not an error
Non-Deterministic Results

Ensure reproducibility by:
  • Setting --random-seed to a fixed value
  • Using --n-jobs 1 (parallel execution adds variance)
  • Keeping --benchmark-runs constant across comparisons
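The fixed-seed guarantee works the same way in plain Python: identical seeds yield identical draws. This is a general illustration of seeded determinism, not pipeline-specific code:

```python
import random

def sample_with_seed(seed, n=5):
    """Draw n pseudo-random ints deterministically from a fixed seed."""
    rng = random.Random(seed)  # isolated RNG, unaffected by global state
    return [rng.randint(0, 99) for _ in range(n)]

# Same seed, same sequence, every run
assert sample_with_seed(42) == sample_with_seed(42)
print(sample_with_seed(42))
```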

Next Steps

Configuration Reference

Complete guide to all configuration options

Architecture

Deep dive into pipeline internals

Benchmarking

Understanding performance artifacts

Deployment

Production deployment patterns
