
Prerequisites

Before starting, ensure you have:
  • Python 3.8 or higher installed
  • pip package manager
  • 500MB+ available disk space
  • NBA2K dataset CSV file
If you haven’t installed the pipeline yet, see the Installation guide.

Quick Start

1. Clone and Navigate

Navigate to the pipeline directory:
cd "NBA Data Preprocessing/task"
2. Run Your First Pipeline

Execute the pipeline with default settings:
python run_pipeline.py \
  --input ../data/nba2k-full.csv \
  --output-dir artifacts
This runs the pipeline with default configuration:
  • Chunk size: 128 rows
  • Batch size: 256 rows
  • Max memory: 1024 MB
  • Single-threaded execution
3. View Results

Check the generated artifacts:
ls -R artifacts/
Expected output structure:
artifacts/
├── reports/
│   ├── pipeline_report.json
│   └── streaming_chunks.jsonl
├── benchmarks/
│   ├── constraint_experiment.csv
│   ├── significance_tests.csv
│   └── *.png (visualization plots)
└── profiles/
    └── operator_profile.csv
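As a quick sanity check, you can verify that all expected artifacts were written from Python. This is a convenience sketch, not part of the pipeline; adjust the directory to match your --output-dir:

```python
from pathlib import Path

# Expected artifact paths, relative to the output directory
EXPECTED = [
    "reports/pipeline_report.json",
    "reports/streaming_chunks.jsonl",
    "benchmarks/constraint_experiment.csv",
    "benchmarks/significance_tests.csv",
    "profiles/operator_profile.csv",
]

def missing_artifacts(output_dir="artifacts"):
    """Return the expected artifacts that are not present on disk."""
    root = Path(output_dir)
    return [rel for rel in EXPECTED if not (root / rel).exists()]

if __name__ == "__main__":
    gaps = missing_artifacts()
    print("All artifacts present" if not gaps else f"Missing: {gaps}")
```

Note that the benchmark PNG plots are not listed: their filenames vary per run, so only the stable CSV/JSON outputs are checked.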

Configuration Templates

The pipeline includes pre-configured templates for common deployment scenarios.
cd "NBA Data Preprocessing/task"
python run_pipeline.py \
  --input ../data/nba2k-full.csv \
  --config-template ../../configs/pipeline.edge.template.json

Edge Template Configuration

Optimized for resource-constrained devices:
{
  "chunk_size": 64,
  "batch_size": 96,
  "max_memory_mb": 256,
  "max_compute_units": 0.4,
  "spill_to_disk": true,
  "n_jobs": 1
}

Server Template Configuration

Optimized for high-performance processing:
{
  "chunk_size": 256,
  "batch_size": 512,
  "max_memory_mb": 4096,
  "max_compute_units": 1.0,
  "spill_to_disk": false,
  "n_jobs": 4
}
CLI arguments override template values, allowing you to customize specific parameters while using a template baseline.
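Conceptually, this merge layers dicts: template values form the baseline, and only the arguments you explicitly pass on the command line win. The helper below is a simplified sketch of that behavior, not the actual argument handling in run_pipeline.py:

```python
def merge_config(template, cli_overrides):
    """Layer explicit CLI overrides on top of template values.

    Only keys the user actually passed on the command line
    (i.e. values that are not None) override the template.
    """
    merged = dict(template)
    merged.update({k: v for k, v in cli_overrides.items() if v is not None})
    return merged

# Example: edge template baseline with a larger chunk size from the CLI
template = {"chunk_size": 64, "max_memory_mb": 256, "spill_to_disk": True}
cli = {"chunk_size": 128, "max_memory_mb": None}  # only --chunk-size passed
print(merge_config(template, cli))
# {'chunk_size': 128, 'max_memory_mb': 256, 'spill_to_disk': True}
```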

Common Workflows

Override Template Settings

Use a template but adjust specific parameters:
python run_pipeline.py \
  --input ../data/nba2k-full.csv \
  --config-template ../../configs/pipeline.edge.template.json \
  --chunk-size 128 \
  --max-memory-mb 512

Reproducible Research Runs

Ensure deterministic results with fixed seed:
python run_pipeline.py \
  --input ../data/nba2k-full.csv \
  --random-seed 42 \
  --benchmark-runs 5 \
  --output-dir experiment_001
Keep benchmark_runs fixed when comparing results across experiments. Changing this value affects statistical significance tests.

Memory-Constrained Execution

Enable aggressive memory management:
python run_pipeline.py \
  --input ../data/nba2k-full.csv \
  --chunk-size 32 \
  --max-memory-mb 256 \
  --spill-to-disk \
  --max-compute-units 0.3

Parallel Benchmark Sweeps

Accelerate constraint experiments with parallel execution:
python run_pipeline.py \
  --input ../data/nba2k-full.csv \
  --n-jobs 4 \
  --benchmark-runs 10 \
  --max-memory-mb 4096
Using n_jobs > 1 increases throughput but introduces timing variance in benchmarks. Use single-threaded execution for strict timing reproducibility.
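To quantify how much variance your setup introduces, you can summarize per-run wall-clock times with the standard library. The durations below are made-up illustrations, not measured values:

```python
import statistics

def timing_spread(durations_s):
    """Summarize run-to-run timing variability for a benchmark."""
    mean = statistics.mean(durations_s)
    stdev = statistics.stdev(durations_s) if len(durations_s) > 1 else 0.0
    return {"mean_s": mean, "stdev_s": stdev, "cv_pct": 100 * stdev / mean}

# Hypothetical wall-clock times from 5 benchmark runs
single_threaded = [45.1, 45.3, 45.2, 45.4, 45.2]
parallel = [14.8, 16.9, 13.2, 17.5, 15.1]

print(timing_spread(single_threaded))  # low coefficient of variation
print(timing_spread(parallel))         # faster, but noticeably higher spread
```

A high coefficient of variation (cv_pct) in parallel runs is the variance the note above warns about; if it matters for your comparison, fall back to --n-jobs 1.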

Programmatic Usage

For integration into existing Python applications:
from pathlib import Path
from pipeline.config import PipelineConfig
from pipeline.streaming import RealTimePipelineRunner

# Create configuration
config = PipelineConfig(
    random_seed=42,
    chunk_size=128,
    batch_size=256,
    max_memory_mb=1024,
    max_compute_units=1.0,
    benchmark_runs=5,
    n_jobs=1,
    spill_to_disk=False,
    adaptive_chunk_resize=True,
    output_dir=Path('artifacts')
)

# Run pipeline
runner = RealTimePipelineRunner(config)
report = runner.run_all('data/nba2k-full.csv')

# Access results
print(f"Pipeline completed in {report['total_runtime_seconds']}s")
print(f"Model RMSE: {report['model_metrics']['rmse']}")

Load Configuration from Template

import json
from pathlib import Path
from pipeline.config import PipelineConfig

# Load template
template_path = Path('configs/pipeline.edge.template.json')
template_values = json.loads(template_path.read_text())

# Override specific values
template_values['random_seed'] = 123
template_values['output_dir'] = Path('custom_output')

# Create config
config = PipelineConfig(**template_values)

Verify Installation

Run the test suite to verify everything works:
cd "NBA Data Preprocessing/task"
python -m unittest discover -s test -p 'test_*.py'
Expected output:
----------------------------------------------------------------------
Ran X tests in Y.ZZZs

OK

Understanding the Output

Pipeline Report

The pipeline_report.json contains comprehensive run statistics:
{
  "dataset_fingerprint": "sha256:abc123...",
  "total_runtime_seconds": 45.2,
  "peak_memory_mb": 847,
  "total_energy_joules": 1234.5,
  "model_metrics": {
    "rmse": 0.123,
    "r2_score": 0.876
  },
  "stage_timings": {
    "ingestion": 2.1,
    "preprocessing": 15.3,
    "feature_engineering": 18.7,
    "validation": 3.2,
    "evaluation": 5.9
  }
}
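A small script can turn the stage timings into a readable breakdown. This sketch assumes the report schema shown above:

```python
import json
from pathlib import Path

def stage_breakdown(report):
    """Return (stage, seconds, percent-of-stage-total) rows, slowest first."""
    timings = report["stage_timings"]
    total = sum(timings.values())
    rows = [(stage, secs, 100 * secs / total) for stage, secs in timings.items()]
    return sorted(rows, key=lambda r: r[1], reverse=True)

report_path = Path("artifacts/reports/pipeline_report.json")
if report_path.exists():
    report = json.loads(report_path.read_text())
    for stage, secs, pct in stage_breakdown(report):
        print(f"{stage:<20} {secs:6.1f}s  {pct:5.1f}%")
```

For the example report above, feature_engineering dominates at roughly 41% of stage time, which tells you where tuning effort pays off first.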

Streaming Chunks

The streaming_chunks.jsonl logs per-chunk telemetry:
{"chunk_id": 0, "rows": 128, "memory_mb": 156, "duration_ms": 234}
{"chunk_id": 1, "rows": 128, "memory_mb": 159, "duration_ms": 241}
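You can aggregate this telemetry with a few lines of standard-library Python. The sketch assumes the JSONL fields shown above:

```python
import json
from pathlib import Path

def summarize_chunks(lines):
    """Aggregate per-chunk telemetry into peak memory and throughput."""
    chunks = [json.loads(line) for line in lines if line.strip()]
    total_rows = sum(c["rows"] for c in chunks)
    total_ms = sum(c["duration_ms"] for c in chunks)
    return {
        "chunks": len(chunks),
        "peak_memory_mb": max(c["memory_mb"] for c in chunks),
        "rows_per_second": 1000 * total_rows / total_ms,
    }

log = Path("artifacts/reports/streaming_chunks.jsonl")
if log.exists():
    print(summarize_chunks(log.read_text().splitlines()))
```

Watching peak_memory_mb across chunks is a quick way to confirm the pipeline stays under your configured max_memory_mb.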

Troubleshooting

Out of Memory Errors

Reduce chunk_size and batch_size, or enable --spill-to-disk:
python run_pipeline.py \
  --input ../data/nba2k-full.csv \
  --chunk-size 32 \
  --batch-size 64 \
  --spill-to-disk
Slow Performance

For faster processing on capable hardware:
  • Increase chunk_size and batch_size
  • Increase max_memory_mb
  • Enable parallel execution with --n-jobs 4
  • Use the server template as baseline
Missing Energy Telemetry

Energy telemetry requires Intel RAPL. If RAPL is unavailable (e.g., in containers or on AMD/ARM hardware):
  • Pipeline uses coarse fallback estimation
  • All other metrics remain accurate
  • This is expected behavior, not an error
Non-Deterministic Results

Ensure reproducibility by:
  • Setting --random-seed to a fixed value
  • Using --n-jobs 1 (parallel execution adds variance)
  • Keeping --benchmark-runs constant across comparisons
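The fixed-seed guarantee works the same way in plain Python: identical seeds yield identical draws. This is a general illustration of seeded determinism, not pipeline-specific code:

```python
import random

def sample_with_seed(seed, n=5):
    """Draw n pseudo-random ints deterministically from a fixed seed."""
    rng = random.Random(seed)  # isolated RNG, unaffected by global state
    return [rng.randint(0, 99) for _ in range(n)]

# Same seed, same sequence, every run
assert sample_with_seed(42) == sample_with_seed(42)
print(sample_with_seed(42))
```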

Next Steps

Configuration Reference

Complete guide to all configuration options

Architecture

Deep dive into pipeline internals

Benchmarking

Understanding performance artifacts

Deployment

Production deployment patterns
