Overview
This guide provides optimization strategies based on profiling data and hardware constraints. Use the hardware profiling outputs to identify bottlenecks, then apply these strategies to improve performance.

Hardware-Adjusted Sizing
The pipeline automatically adjusts chunk and batch sizes based on available resources.

Implementation
From engine.py:42-60:
Sizing Formulas
Memory factor: min(1.0, max_memory_mb / 1024)
- 512 MB → factor = 0.5
- 1024 MB → factor = 1.0
- 2048 MB → factor = 1.0 (capped)
Compute factor: max_compute_units (0.0 to 1.0)
- 0.5 → factor = 0.5 (half available cores)
- 1.0 → factor = 1.0 (all cores)
Combined scale: memory_factor × compute_factor
Adjusted sizes:
chunk_size = max(16, base_chunk_size × scale)
batch_size = max(16, base_batch_size × scale)
Example
With chunk_size=128, max_memory_mb=512, max_compute_units=0.5:
- memory_factor = 0.5, compute_factor = 0.5, so scale = 0.25
- chunk_size = max(16, 128 × 0.25) = 32
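These sizing rules can be sketched in Python; the function name adjust_sizes and its signature are illustrative, not the actual engine.py API:

```python
def adjust_sizes(base_chunk_size: int, base_batch_size: int,
                 max_memory_mb: int, max_compute_units: float) -> tuple:
    """Scale chunk/batch sizes by memory and compute factors, with a floor of 16."""
    memory_factor = min(1.0, max_memory_mb / 1024)  # capped at 1.0 above 1024 MB
    compute_factor = max_compute_units              # already in [0.0, 1.0]
    scale = memory_factor * compute_factor
    chunk_size = max(16, int(base_chunk_size * scale))
    batch_size = max(16, int(base_batch_size * scale))
    return chunk_size, batch_size

# Worked example from the text: 512 MB and half the cores give scale 0.25,
# so a base chunk size of 128 becomes 32.
```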
Optimization by Scenario
Low-Memory Systems
Indicators:
- Frequent memory_exceeded=true in streaming_chunks.csv
- High retries count
- Process killed by OOM (Out of Memory)
Strategies:
- Enable disk spilling: from engine.py:260-266, this saves intermediate results to disk.
- Reduce chunk size.
- Enable adaptive chunk resizing: from engine.py:251-258, chunks are automatically split when memory is exceeded.
- Use streaming mode exclusively.
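As an illustration of the disk-spilling idea, here is a minimal sketch; spill_chunk and load_chunk are hypothetical names, and the actual engine.py implementation (lines 260-266) may differ:

```python
import pickle
from pathlib import Path

def spill_chunk(chunk, spill_dir: Path, chunk_id: int) -> Path:
    """Serialize an intermediate result to disk instead of holding it in memory."""
    path = spill_dir / f"chunk_{chunk_id}.pkl"
    with open(path, "wb") as f:
        pickle.dump(chunk, f)
    return path

def load_chunk(path: Path):
    """Read a spilled chunk back when a downstream stage needs it."""
    with open(path, "rb") as f:
        return pickle.load(f)
```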
CPU-Constrained Systems
Indicators:
- High cpu_percent in telemetry
- Low throughput despite adequate memory
- Long feature_engineering_s or encode_scale_s times
Strategies:
- Reduce compute allocation.
- Use smaller batch sizes.
- Disable parallel processing: from engine.py:397-402, the constraint experiment uses this setting.
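A minimal sketch of how a fractional compute allocation could map to a worker count; compute_workers is a hypothetical helper, not part of engine.py:

```python
import os

def compute_workers(max_compute_units: float) -> int:
    """Map a compute allocation in [0.0, 1.0] to at least one worker process."""
    cores = os.cpu_count() or 1
    return max(1, int(cores * max_compute_units))

# compute_workers(0.5) uses roughly half the cores; compute_workers(0.0)
# still returns 1 so the pipeline can make progress.
```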
Encode Stage Bottleneck
Indicators:
- encode_scale_s dominates in operator_profile.csv
- Time increases non-linearly with chunk size
Strategies:
- Reduce chunk size to improve cache locality.
- Profile chunk size impact.
- Monitor cache pressure: increasing encode_scale_s with larger chunks suggests cache thrashing. See Hardware Profiling - Cache Effects.
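A rough way to profile chunk-size impact on the encode stage; profile_chunk_sizes and the stand-in encode callable are illustrative. If per-row time grows with chunk size, the working set is likely exceeding cache:

```python
import time

def profile_chunk_sizes(rows, encode, sizes=(64, 128, 256, 512)):
    """Return {chunk_size: seconds per row} for a given encode function."""
    results = {}
    for size in sizes:
        start = time.perf_counter()
        for i in range(0, len(rows), size):
            encode(rows[i:i + size])          # run the stage chunk by chunk
        elapsed = time.perf_counter() - start
        results[size] = elapsed / max(1, len(rows))
    return results
```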
Feature Engineering Bottleneck
Indicators:
- feature_engineering_s dominates in operator_profile.csv
- High CPU usage during this stage
Strategies:
- Use simpler rolling aggregations: the build_features_streaming() method computes rolling statistics; consider reducing the window size or number of features.
- Pre-compute features offline: for batch processing, compute features once and cache them.
- Parallelize feature computation: increase n_jobs if memory allows.
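As an example of a cheaper rolling aggregation, a running-sum rolling mean avoids recomputing each window from scratch; rolling_mean is a sketch, not the build_features_streaming() implementation:

```python
from collections import deque

def rolling_mean(values, window: int):
    """O(n) rolling mean: maintain a running sum over a fixed-size window."""
    buf = deque(maxlen=window)
    total = 0.0
    out = []
    for v in values:
        if len(buf) == window:
            total -= buf[0]   # evict the oldest value from the running sum
        buf.append(v)         # deque with maxlen drops the oldest element
        total += v
        out.append(total / len(buf))
    return out
```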
I/O Bottleneck
Indicators:
- Low estimated_input_bandwidth_mb_s (< 100 MB/s)
- High latency despite low CPU and memory usage
- Divergence between bandwidth estimate and throughput
Strategies:
- Increase chunk size to amortize I/O overhead.
- Use faster storage: SSD instead of HDD; local storage instead of network drives.
- Pre-load data into memory.
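A minimal sketch of pre-loading input into memory so downstream reads hit RAM instead of disk; preload is a hypothetical helper:

```python
import io

def preload(path: str) -> io.BytesIO:
    """Read the whole file once and return an in-memory file object."""
    with open(path, "rb") as f:
        return io.BytesIO(f.read())
```

This trades memory for I/O: it only helps when the dataset fits comfortably under max_memory_mb.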
Multi-Objective Optimization
Latency vs. Accuracy
From the benchmark visualizations (engine.py:512-525):
- Smaller chunks → faster iteration but potential accuracy loss
- Larger chunks → better feature context but higher latency
Memory vs. Accuracy
From the benchmark visualizations (engine.py:527-540):
- Lower memory limits require smaller chunks
- May reduce model quality for streaming models
- Batch models maintain accuracy but can't run under tight memory limits
Tuning Workflow
Step 1: Establish Baseline
Step 2: Identify Bottleneck
Step 3: Apply Targeted Optimization
Based on the bottleneck:
- preprocess_s: reduce chunk size or optimize cleaning logic
- feature_engineering_s: simplify features or increase parallelism
- feature_selection_s: reduce the correlation threshold
- encode_scale_s: reduce chunk size (cache pressure)
Step 4: Validate Improvement
Step 5: Compare Results
Advanced Techniques
Dynamic Chunk Sizing
The adaptive chunk resize feature (engine.py:251-258) automatically reduces chunk size when memory is exceeded:
- Chunk exceeds memory limit
- Split chunk in half
- Process first half
- Add second half back to queue
- Retry up to max_chunk_retries times
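The split-and-retry loop can be sketched as follows; process_adaptively and its callbacks are illustrative, and the actual engine.py logic (lines 251-258) may differ in detail:

```python
from collections import deque

def process_adaptively(chunk, fits_in_memory, process, max_chunk_retries=3):
    """Process a chunk, halving it on memory pressure up to the retry limit."""
    queue = deque([(chunk, 0)])
    processed = []
    while queue:
        current, retries = queue.popleft()
        if fits_in_memory(current):
            processed.append(process(current))
        elif retries < max_chunk_retries and len(current) > 1:
            mid = len(current) // 2
            queue.appendleft((current[mid:], retries + 1))  # second half back on the queue
            queue.appendleft((current[:mid], retries + 1))  # first half is processed next
        else:
            raise MemoryError("chunk still too large after max_chunk_retries splits")
    return processed
```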
Online Learning for Streaming
The streaming mode uses SGDRegressor for incremental learning (engine.py:214), updating the model with partial_fit() on each chunk (engine.py:245).
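The partial_fit() pattern looks roughly like this with scikit-learn's SGDRegressor; the chunk data here is synthetic:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
model = SGDRegressor(random_state=0)

for _ in range(50):                      # one partial_fit call per streamed chunk
    X = rng.normal(size=(32, 3))         # stand-in chunk of features
    y = X @ np.array([1.0, -2.0, 0.5])   # stand-in target values
    model.partial_fit(X, y)              # incremental update; no full refit
```

Because each call only sees one chunk, memory stays bounded regardless of total dataset size, at the cost of some accuracy versus a full batch fit.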
Monitoring in Production
Key Metrics
- Per-chunk latency: should remain stable
- Memory stability: no increasing trend
- Retry rate: should be near zero
Alerting Thresholds
- Latency P95 > 2× median: Performance degradation
- Memory exceeded > 10%: Undersized configuration
- Retry rate > 5%: Frequent memory pressure
- Bandwidth < 50 MB/s: I/O bottleneck
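These thresholds can be expressed as a simple telemetry check; check_alerts and the exact telemetry field semantics are assumptions:

```python
import statistics

def check_alerts(latencies_s, memory_exceeded_frac, retry_frac, bandwidth_mb_s):
    """Return the list of triggered alert names for one telemetry window."""
    alerts = []
    lat = sorted(latencies_s)
    p95 = lat[min(len(lat) - 1, int(0.95 * len(lat)))]  # simple P95 estimate
    if p95 > 2 * statistics.median(lat):
        alerts.append("latency_degradation")
    if memory_exceeded_frac > 0.10:
        alerts.append("undersized_configuration")
    if retry_frac > 0.05:
        alerts.append("memory_pressure")
    if bandwidth_mb_s < 50:
        alerts.append("io_bottleneck")
    return alerts
```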
Best Practices
- Always profile first: Use operator profiling to guide optimization
- Optimize the bottleneck: Focus on the dominant stage
- Test on representative data: Use production-scale samples
- Validate accuracy: Ensure optimizations don’t harm model quality
- Document baselines: Save reports before and after optimization
- Monitor continuously: Track metrics over time in production
Next Steps
- Benchmarking - Measure performance improvements
- Hardware Profiling - Identify new bottlenecks
- Constraint Experiments - Test edge cases