Overview
The pipeline provides built-in benchmarking capabilities to measure performance across batch and streaming modes. The benchmark() method automatically runs multiple trials and generates statistical reports with confidence intervals.
Running Benchmarks
Basic Benchmark Execution
Command-Line Benchmarking
Benchmark Methodology
Statistical Analysis
The benchmark method performs the following analysis:
- Multiple Runs: Executes both batch and streaming modes for the configured number of runs
- Bootstrap Confidence Intervals: Computes 95% CI using 400 bootstrap resamples
- Permutation Testing: Tests statistical significance between batch and streaming performance
- Scalability Analysis: Measures latency and throughput across different data sizes
Implementation Details
From engine.py:324-374:
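The benchmark loop described above can be sketched as follows. This is an illustrative outline, not the code from engine.py: the callables `run_batch` and `run_streaming` stand in for the pipeline's two execution modes.

```python
import random
import statistics
import time

def benchmark(run_batch, run_streaming, runs=5, seed=42):
    """Minimal sketch of a batch-vs-streaming benchmark loop.

    `run_batch` and `run_streaming` are callables that execute one full
    pipeline pass; the names and signature here are illustrative.
    """
    random.seed(seed)
    results = {"batch": [], "streaming": []}
    for mode, fn in (("batch", run_batch), ("streaming", run_streaming)):
        for _ in range(runs):
            start = time.perf_counter()
            fn()  # one full pipeline pass in this mode
            results[mode].append(time.perf_counter() - start)
    # Summarise latency per mode; the real method also derives CIs and p-values
    return {
        mode: {"mean_s": statistics.mean(v), "stdev_s": statistics.stdev(v)}
        for mode, v in results.items()
    }
```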
Bootstrap Confidence Intervals
The _bootstrap_ci() method (engine.py:108-132) provides robust statistics:
- Sample size: Number of observations
- Mean: Average value across runs
- Standard deviation: Measure of variability
- Median: 50th percentile
- P95: 95th percentile (tail latency)
- CI95 low/high: 95% confidence interval bounds
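A percentile bootstrap with the 400-resample / 95% CI setup described above can be sketched like this. The function below is a generic illustration, not the code in engine.py:

```python
import random
import statistics

def bootstrap_ci(samples, n_resamples=400, alpha=0.05, seed=0):
    """Sketch of a percentile bootstrap CI for the mean.

    Resamples the observations with replacement, computes the mean of each
    resample, and takes the alpha/2 and 1 - alpha/2 percentiles as bounds.
    """
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(samples, k=len(samples)))
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```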
Permutation Testing
The _permutation_pvalue() method (engine.py:134-147) determines if performance differences are statistically significant:
- Uses 1000 permutations by default
- Computes p-value for mean difference
- P-value < 0.05 indicates significant difference
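A two-sample permutation test on the difference of means, matching the 1000-permutation default described above, can be sketched as follows (an illustrative implementation, not engine.py's):

```python
import random
import statistics

def permutation_pvalue(a, b, n_permutations=1000, seed=0):
    """Sketch of a two-sample permutation test on the mean difference.

    Pools both samples, repeatedly shuffles the pool, and counts how often
    a random split produces a mean difference at least as extreme as the
    observed one.
    """
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[: len(a)], pooled[len(a):]
        if abs(statistics.mean(perm_a) - statistics.mean(perm_b)) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)  # add-one smoothing
```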
Generated Artifacts
Benchmark CSV Files
After running run_all(), several benchmark files are created in output_dir/benchmarks/:
streaming_chunks.csv
Per-chunk metrics for streaming mode:
- chunk_id: Sequential chunk identifier
- latency_s: Time to process the chunk
- throughput_rows_s: Rows processed per second
- memory_exceeded: Whether the chunk exceeded the memory limit
- retries: Number of retry attempts due to memory pressure
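Summarising these per-chunk metrics takes only a few lines. The CSV snippet below is synthetic sample data in the column layout described above, not real benchmark output:

```python
import csv
import io
import statistics

# Synthetic sample data matching the streaming_chunks.csv columns
csv_text = """chunk_id,latency_s,throughput_rows_s,memory_exceeded,retries
0,0.12,8300,False,0
1,0.15,6700,False,0
2,0.31,3200,True,1
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))
latencies = [float(r["latency_s"]) for r in rows]
mean_latency = statistics.mean(latencies)
# Fraction of chunks that needed at least one retry under memory pressure
retry_rate = sum(int(r["retries"]) > 0 for r in rows) / len(rows)
```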
latency_vs_data_size.csv
Scalability analysis across different data sizes:
throughput_vs_memory.csv
Memory vs. performance trade-offs:
resource_vs_accuracy.csv
Resource consumption vs. model accuracy:
significance_tests.csv
Statistical significance results:
- P-values < 0.05 indicate statistically significant differences
- Negative throughput_mean_delta means streaming is faster
- Positive latency_mean_delta means streaming has higher latency
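Applying these rules programmatically might look like the following. The exact header layout of significance_tests.csv is an assumption here; the column names mirror the fields described above:

```python
import csv
import io

# Synthetic row in an assumed significance_tests.csv layout
csv_text = """metric,p_value,throughput_mean_delta,latency_mean_delta
end_to_end,0.012,-150.0,0.8
"""

for row in csv.DictReader(io.StringIO(csv_text)):
    # p-value < 0.05: the batch/streaming difference is significant
    significant = float(row["p_value"]) < 0.05
    # negative throughput delta: streaming processed more rows per second
    streaming_faster = float(row["throughput_mean_delta"]) < 0
```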
Visualization Artifacts
The benchmark process also generates PNG visualizations (see _plot_experiment_results() in engine.py:511-555):
- latency_vs_accuracy.png: Trade-off between speed and model quality
- memory_vs_accuracy.png: Memory consumption impact on accuracy
- latency_memory_accuracy.png: Three-way relationship visualization
Scalability Analysis
The benchmark method tests scalability by running the pipeline on progressively larger subsets of the data. This helps to:
- Identify performance bottlenecks at different scales
- Estimate resource requirements for production data sizes
- Validate linear vs. non-linear scaling behavior
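The subset sweep can be sketched as below. `run_on(n_rows)` is an illustrative callable standing in for a pipeline pass over the first n rows; it is not engine.py's API:

```python
import time

def scalability_sweep(run_on, total_rows, fractions=(0.1, 0.25, 0.5, 1.0)):
    """Sketch of a progressively-larger-subset scalability sweep.

    Runs the workload on growing fractions of the data and records latency
    and throughput at each size, so scaling behavior can be inspected.
    """
    results = []
    for frac in fractions:
        n = max(1, int(total_rows * frac))
        start = time.perf_counter()
        run_on(n)  # one pass over the first n rows
        latency = time.perf_counter() - start
        results.append({
            "rows": n,
            "latency_s": latency,
            "throughput_rows_s": n / latency if latency > 0 else float("inf"),
        })
    return results
```

Plotting rows against latency_s from the result makes non-linear scaling (e.g. a super-linear jump once data no longer fits in memory) easy to spot.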
Reproducibility
All benchmarks include a reproducibility manifest (engine.py:149-170). The manifest is written to output_dir/metadata/run_manifest.json and enables:
- Exact reproduction of benchmark results
- Environment comparison across systems
- Version-specific regression testing
Interpreting Results
When Batch Mode is Faster
- Small datasets that fit comfortably in memory
- No memory constraints
- Simpler model training without streaming overhead
When Streaming Mode is Faster
- Large datasets exceeding available memory
- Memory-constrained environments
- Real-time or incremental processing requirements
Key Metrics to Monitor
- Latency: Total time to complete processing
- Throughput: Rows processed per second
- Peak memory: Maximum memory usage during execution
- Model accuracy: R² score for regression quality
- P-values: Statistical significance of performance differences
Best Practices
- Run multiple trials: Set benchmark_runs >= 3 for statistical validity
- Control for variability: Use a fixed random_seed for reproducibility
- Warm up system: Discard the first run if the system is cold
- Isolate workload: Close other applications during benchmarking
- Monitor system: Check for background processes affecting results
- Document environment: Save reproducibility manifest with results
Next Steps
- Hardware Profiling - Detailed operator-level profiling
- Optimization Strategies - Performance tuning techniques
- Constraint Experiments - Testing under resource limits