Overview
The constraint experiment systematically tests pipeline performance across combinations of chunk sizes, memory limits, and compute constraints. This enables:
- Finding optimal configurations for constrained environments
- Understanding performance trade-offs
- Validating edge and low-resource scenarios
- Identifying Pareto-optimal points
Running Constraint Experiments
Programmatic Usage
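The engine's real signatures were not preserved on this page, so the sketch below is illustrative only: PipelineConfig and run_constraint_experiment are hypothetical names standing in for the actual engine.py API.

```python
from dataclasses import dataclass

@dataclass
class PipelineConfig:
    # Field names mirror the configuration values the parameter grid reads.
    chunk_size: int = 1024
    max_memory_mb: int = 2048
    max_compute_units: float = 1.0

def run_constraint_experiment(config: PipelineConfig) -> list:
    """Stand-in for the real engine.py entry point (hypothetical name)."""
    # The real engine sweeps the full constraint grid and benchmarks each
    # combination; this stub only echoes the configured upper bounds.
    return [{"chunk_size": config.chunk_size,
             "memory_limit_mb": config.max_memory_mb,
             "compute_limit": config.max_compute_units}]

results = run_constraint_experiment(PipelineConfig(chunk_size=512))
```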
Command-Line Usage
The constraint experiment runs automatically as part of run_all():
Experiment Methodology
Implementation
From engine.py:389-413:
Parameter Grid
The experiment tests all combinations of:
- Chunk sizes: [64, config.chunk_size]
  - Minimum: 64 rows
  - Maximum: Configured chunk size
  - Duplicates removed
- Memory limits: [256, config.max_memory_mb]
  - Low-memory scenario: 256 MB
  - Configured limit: User-specified
- Compute limits: [0.5, config.max_compute_units]
  - CPU-constrained: 50% utilization
  - Full utilization: User-specified
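The grid above is the full cross product of the three pairs. A minimal sketch of how it can be built (the 1024/2048 upper-bound values are illustrative defaults, not confirmed configuration values):

```python
from itertools import product

# Grid values mirror the documented pairs: [64, config.chunk_size],
# [256, config.max_memory_mb], and [0.5, config.max_compute_units],
# with duplicates removed (e.g. when config.chunk_size == 64).
chunk_sizes = sorted({64, 1024})
memory_limits = sorted({256, 2048})
compute_limits = sorted({0.5, 1.0})

# Full cross product: 2 x 2 x 2 = 8 runs with the values above.
grid = list(product(chunk_sizes, memory_limits, compute_limits))
```

Each tuple in the grid is then passed to the single-run benchmark described next.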
Single Run Implementation
From engine.py:376-387:
Generated Artifacts
constraint_experiment.csv
Complete results matrix in output_dir/benchmarks/:
- chunk_size: Streaming chunk size (rows)
- memory_limit_mb: Maximum memory constraint (MB)
- compute_limit: CPU constraint factor (0.0-1.0)
- preprocessing_latency_s: Total preprocessing time (seconds)
- peak_memory_mb: Maximum memory usage observed (MB)
- training_time_s: Model training time, including preprocessing (seconds)
- model_accuracy_r2: Regression R² score
- model_rmse: Root mean squared error
constraint_experiment_log.jsonl
JSON Lines format for programmatic analysis in output_dir/reports/:
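Because the log is JSON Lines (one JSON object per line), it can be consumed without any extra dependencies. A minimal loader, with a synthetic demo record using field names from constraint_experiment.csv:

```python
import json

def load_experiment_log(path):
    """Read a JSON Lines file: one JSON object per non-empty line."""
    records = []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

# Synthetic single-record log for demonstration purposes only.
with open("demo_constraint_log.jsonl", "w") as fh:
    fh.write('{"chunk_size": 64, "memory_limit_mb": 256, '
             '"model_accuracy_r2": 0.91}\n')

records = load_experiment_log("demo_constraint_log.jsonl")
```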
Visualization Plots
Generated in output_dir/benchmarks/ (see engine.py:509-555):
latency_vs_accuracy.png
Scatter plot showing trade-off between preprocessing speed and model quality:
- X-axis: Preprocessing latency (seconds)
- Y-axis: Model accuracy (R²)
- Color: Compute constraint level
memory_vs_accuracy.png
Memory consumption vs. model quality:
- X-axis: Peak memory (MB)
- Y-axis: Model accuracy (R²)
- Color: Memory limit setting
latency_memory_accuracy.png
Three-way relationship visualization:
- X-axis: Peak memory (MB)
- Y-axis: Preprocessing latency (seconds)
- Color: Model accuracy (R²)
Analyzing Results
Finding Optimal Configuration
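One common reading of "optimal" is the best model quality subject to a resource budget. A sketch over synthetic rows (in practice, load constraint_experiment.csv; values here are illustrative):

```python
# Synthetic rows standing in for constraint_experiment.csv contents.
results = [
    {"chunk_size": 64,   "preprocessing_latency_s": 1.8,
     "peak_memory_mb": 210, "model_accuracy_r2": 0.89},
    {"chunk_size": 1024, "preprocessing_latency_s": 0.9,
     "peak_memory_mb": 780, "model_accuracy_r2": 0.93},
]

# Best R^2 among configurations that fit the memory budget.
budget_mb = 512
feasible = [r for r in results if r["peak_memory_mb"] <= budget_mb]
best = max(feasible, key=lambda r: r["model_accuracy_r2"])
```

Under a 512 MB budget the smaller chunk size wins despite its lower R², because the faster configuration exceeds the budget.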
Pareto Frontier Analysis
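A configuration is Pareto-optimal if no other run is at least as good on latency, memory, and R² while strictly better on one of them. A plain-Python frontier helper over synthetic rows (field names from constraint_experiment.csv; the helper is illustrative, not from engine.py):

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and better on one.
    Objectives: minimize latency and memory, maximize R^2."""
    no_worse = (a["preprocessing_latency_s"] <= b["preprocessing_latency_s"]
                and a["peak_memory_mb"] <= b["peak_memory_mb"]
                and a["model_accuracy_r2"] >= b["model_accuracy_r2"])
    better = (a["preprocessing_latency_s"] < b["preprocessing_latency_s"]
              or a["peak_memory_mb"] < b["peak_memory_mb"]
              or a["model_accuracy_r2"] > b["model_accuracy_r2"])
    return no_worse and better

def pareto_frontier(results):
    # Keep every configuration that no other configuration dominates.
    return [r for r in results
            if not any(dominates(other, r) for other in results)]

results = [
    {"preprocessing_latency_s": 0.9, "peak_memory_mb": 780,
     "model_accuracy_r2": 0.93},
    {"preprocessing_latency_s": 1.8, "peak_memory_mb": 210,
     "model_accuracy_r2": 0.89},
    {"preprocessing_latency_s": 2.0, "peak_memory_mb": 800,
     "model_accuracy_r2": 0.88},  # dominated by the first row
]
front = pareto_frontier(results)
```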
Identify configurations that aren’t strictly dominated:
Memory-Constrained Scenarios
Filter results by memory availability:
CPU-Constrained Scenarios
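Both constrained subsets can be selected with simple filters over the results (column names from constraint_experiment.csv; rows are synthetic for illustration):

```python
# Synthetic rows with the constraint columns from constraint_experiment.csv.
rows = [
    {"memory_limit_mb": 256,  "compute_limit": 0.5, "model_accuracy_r2": 0.88},
    {"memory_limit_mb": 256,  "compute_limit": 1.0, "model_accuracy_r2": 0.90},
    {"memory_limit_mb": 2048, "compute_limit": 0.5, "model_accuracy_r2": 0.91},
    {"memory_limit_mb": 2048, "compute_limit": 1.0, "model_accuracy_r2": 0.93},
]

# Memory-constrained subset: only runs under the 256 MB limit.
low_memory = [r for r in rows if r["memory_limit_mb"] <= 256]
# CPU-constrained subset: only runs at 50% compute.
cpu_constrained = [r for r in rows if r["compute_limit"] <= 0.5]
```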
Edge Scenarios
Low-Memory Systems
From the hardware profiling documentation, recommendations:
- Enable --spill-to-disk
- Reduce --chunk-size
- Keep --max-memory-mb realistic for resident process limits
CPU-Constrained Systems
Recommendations:
- Lower --max-compute-units
- Use a smaller --batch-size
- Keep --n-jobs 1 to avoid contention
Minimal Resource Scenario
Combined memory and CPU constraints:
Experiment Summary
The experiment results include a summary dictionary (engine.py:407-412):
Parallel Execution
Constraint experiments can run in parallel (engine.py:397-402); the default is n_jobs=1 (sequential execution).
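The engine's actual parallelization mechanism is not shown on this page; the standard-library sketch below only illustrates the idea of mapping the grid across workers (a process pool or joblib would suit CPU-bound benchmark runs better than threads):

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def run_single(params):
    # Placeholder for the real single-run benchmark (engine.py:376-387).
    chunk, mem, cpu = params
    return {"chunk_size": chunk, "memory_limit_mb": mem, "compute_limit": cpu}

grid = list(product([64, 1024], [256, 2048], [0.5, 1.0]))
n_jobs = 2  # with n_jobs=1 a plain sequential loop is equivalent
with ThreadPoolExecutor(max_workers=n_jobs) as pool:
    parallel_results = list(pool.map(run_single, grid))
```

Note that parallel runs share the machine, so peak-memory and latency measurements can interfere with each other; n_jobs=1 gives the cleanest numbers.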
Best Practices
- Run experiments on representative data: Use production-scale samples
- Test edge cases separately: Minimal resource scenarios may need custom grids
- Validate constraints: Verify peak usage doesn’t exceed limits
- Document findings: Save experiment reports for comparison
- Use Pareto analysis: Identify optimal trade-offs, not just best single metric
- Consider deployment environment: Match constraints to target hardware
Limitations
- Fixed parameter grid: Only tests predefined combinations
- No hyperparameter tuning: Model parameters are fixed
- No cross-run effects: Each run is independent, so warm-up and caching effects between runs are not modeled
- Coarse granularity: Limited to 2 values per parameter
Next Steps
- Benchmarking - Full statistical analysis
- Hardware Profiling - Operator-level details
- Optimization Strategies - Apply findings