Prerequisites
Before starting, ensure you have:
- Python 3.8 or higher installed
- pip package manager
- 500MB+ available disk space
- NBA2K dataset CSV file
If you haven’t installed the pipeline yet, see the Installation guide.
Quick Start
Run Your First Pipeline
Execute the pipeline with default settings. This runs it with the default configuration:
- Chunk size: 128 rows
- Batch size: 256 rows
- Max memory: 1024 MB
- Single-threaded execution
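As a rough sketch, the defaults listed above can be written down as a small Python dataclass. The class name PipelineConfig is hypothetical and only for illustration; the field names chunk_size, batch_size, max_memory_mb and n_jobs follow the parameter names used elsewhere in this guide, and n_jobs = 1 is assumed to mean single-threaded execution.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical container mirroring the documented default settings."""

    chunk_size: int = 128      # rows per chunk
    batch_size: int = 256      # rows per batch
    max_memory_mb: int = 1024  # memory ceiling in MB
    n_jobs: int = 1            # assumed: 1 means single-threaded execution


defaults = PipelineConfig()
print(asdict(defaults))
```

Freezing the dataclass keeps a run's settings from being changed by accident after the configuration has been built.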
Configuration Templates
The pipeline includes pre-configured templates for common deployment scenarios.
Edge Template Configuration
Optimized for resource-constrained devices.
Server Template Configuration
Optimized for high-performance processing.
Common Workflows
Override Template Settings
Use a template but adjust specific parameters.
Reproducible Research Runs
Ensure deterministic results with a fixed seed.
Memory-Constrained Execution
Enable aggressive memory management.
Parallel Benchmark Sweeps
Accelerate constraint experiments with parallel execution.
Using n_jobs > 1 increases throughput but introduces timing variance in benchmarks. Use single-threaded execution for strict timing reproducibility.
Programmatic Usage
The pipeline can be integrated into existing Python applications.
Load Configuration from Template
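The pipeline's actual Python API is not reproduced here, so the following is only a minimal sketch of the pattern: start from a named template and override individual parameters. PipelineConfig, TEMPLATES and load_template are hypothetical names, and the edge and server numbers are placeholders chosen for illustration, not the values shipped with the pipeline.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical settings container; defaults match the documented ones."""

    chunk_size: int = 128
    batch_size: int = 256
    max_memory_mb: int = 1024
    n_jobs: int = 1


# Placeholder values for illustration only; the shipped edge and server
# templates may use different numbers.
TEMPLATES = {
    "edge": PipelineConfig(chunk_size=64, batch_size=128, max_memory_mb=256),
    "server": PipelineConfig(
        chunk_size=1024, batch_size=2048, max_memory_mb=8192, n_jobs=4
    ),
}


def load_template(name, **overrides):
    """Return the named template with any keyword overrides applied."""
    try:
        base = TEMPLATES[name]
    except KeyError:
        raise ValueError(f"unknown template: {name!r}") from None
    # replace() raises TypeError if an override names a nonexistent field.
    return replace(base, **overrides)


config = load_template("edge", max_memory_mb=512)
print(config)
```

Because replace() returns a new object, the template itself is never mutated, so several runs can share one template and differ only in their overrides.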
Verify Installation
Run the test suite to verify everything works.
Understanding the Output
Pipeline Report
The pipeline_report.json file contains comprehensive run statistics.
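The report schema is not listed here, so the sketch below makes only one assumption: that the file holds a single JSON object at the top level. It loads the report and lists the top-level keys with their value types, which is a reasonable first step before relying on any particular field. The helper names load_report and summarize are illustrative.

```python
import json
from pathlib import Path


def load_report(path):
    """Parse a JSON report file and return its contents."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def summarize(report):
    """Map each top-level key to the type name of its value."""
    return {key: type(value).__name__ for key, value in report.items()}


report_path = Path("pipeline_report.json")
if report_path.exists():
    print(summarize(load_report(report_path)))
```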
Streaming Chunks
The streaming_chunks.jsonl file logs per-chunk telemetry.
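A .jsonl file conventionally stores one JSON document per line, so the chunk log can be read incrementally rather than loaded in one piece. The sketch below assumes that convention and nothing about the fields inside each record; parse_jsonl and read_chunk_log are illustrative helper names.

```python
import json


def parse_jsonl(lines):
    """Yield one parsed record per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def read_chunk_log(path):
    """Read a JSON Lines file into a list of records."""
    with open(path, encoding="utf-8") as handle:
        return list(parse_jsonl(handle))


# Stand-in lines for demonstration; real records will have other fields.
records = list(parse_jsonl(['{"x": 1}', "", '{"x": 2}']))
print(len(records))
```

For large logs, iterating over parse_jsonl directly avoids holding every record in memory at once.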
Troubleshooting
Out of Memory Errors
Reduce chunk_size and batch_size, or enable --spill-to-disk.
Slow Performance
For faster processing on capable hardware:
- Increase chunk_size and batch_size
- Increase max_memory_mb
- Enable parallel execution with --n-jobs 4
- Use the server template as baseline
Missing RAPL Energy Metrics
Energy telemetry requires Intel RAPL. If RAPL is unavailable (for example in containers or on AMD/ARM hardware):
- The pipeline falls back to a coarse energy estimate
- All other metrics remain accurate
- This is expected behavior, not an error
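To see in advance whether RAPL counters are readable on a Linux host, one option is to probe the kernel's powercap interface. This sketch is an independent check and not necessarily how the pipeline itself detects RAPL. The path is the standard powercap location for the package-0 counter; on some systems the file exists but is readable only by root.

```python
from pathlib import Path

# Standard Linux powercap location for the package-0 energy counter.
RAPL_ENERGY_FILE = Path("/sys/class/powercap/intel-rapl:0/energy_uj")


def read_rapl_energy_uj(counter=RAPL_ENERGY_FILE):
    """Return the cumulative energy counter in microjoules, or None."""
    try:
        return int(Path(counter).read_text().strip())
    except (OSError, ValueError):
        # Missing file, permission denied, or unexpected contents.
        return None


energy = read_rapl_energy_uj()
if energy is None:
    print("RAPL counter not readable; expect fallback energy estimates")
else:
    print(f"RAPL counter readable: {energy} uJ")
```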
Non-Deterministic Results
Ensure reproducibility by:
- Setting --random-seed to a fixed value
- Using --n-jobs 1 (parallel execution adds variance)
- Keeping --benchmark-runs constant across comparisons
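The effect of a fixed seed can be shown with a short standalone sketch. How the pipeline consumes --random-seed internally is not covered here; sample_rows is a hypothetical helper that uses Python's standard random module to show that the same seed reproduces the same selection.

```python
import random


def sample_rows(n_rows, k, seed):
    """Pick k distinct row indices from range(n_rows) with a seeded RNG."""
    rng = random.Random(seed)  # dedicated generator, no global state
    return rng.sample(range(n_rows), k)


first = sample_rows(1000, 5, seed=42)
second = sample_rows(1000, 5, seed=42)
print(first == second)  # True: the same seed yields the same indices
```

Using a dedicated random.Random instance rather than the module-level functions keeps the result independent of any other code that draws random numbers in the same process.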
Next Steps
Configuration Reference
Complete guide to all configuration options
Architecture
Deep dive into pipeline internals
Benchmarking
Understanding performance artifacts
Deployment
Production deployment patterns