Overview
Performance tuning involves balancing multiple competing objectives: latency, throughput, accuracy, memory footprint, and energy consumption. This guide provides practical strategies for interpreting benchmark results and making informed deployment decisions.
Core Trade-offs
Latency vs Accuracy
The most common trade-off in ML deployment is between inference speed and prediction quality.
- Understanding the Trade-off
- When to Prioritize Latency
- When to Prioritize Accuracy
- Finding the Balance
Quantization reduces precision but increases speed.
Key Question: Is 0.2% accuracy loss acceptable for a 28% latency improvement?
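The key question can be made concrete by computing the latency gain implied by the throughput figures quoted in the next section (a minimal sketch; per-sample latency is treated as the reciprocal of throughput, and the 0.2% accuracy delta is the one from the question above):

```python
def latency_reduction(base_throughput: float, new_throughput: float) -> float:
    """Fractional latency reduction implied by a throughput increase
    (per-sample latency is the reciprocal of throughput)."""
    return 1.0 - base_throughput / new_throughput

# Throughputs in samples/sec from the benchmark numbers below.
gain = latency_reduction(10_526, 14_706)  # onnx_fp32 -> onnx_int8
accuracy_loss = 0.002                     # 0.2% absolute accuracy drop

print(f"latency improvement: {gain:.0%}, accuracy loss: {accuracy_loss:.1%}")
```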
Throughput vs Queue Delay
High throughput doesn’t guarantee low latency. Queue delays can dominate under load.
Measure Isolated Throughput
Benchmark single-request throughput.
Example from benchmarks:
- sklearn_fp32: 8,266 samples/sec
- onnx_fp32: 10,526 samples/sec
- onnx_int8: 14,706 samples/sec
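A single-request throughput measurement can be sketched as follows (`dummy_predict` and the batch are placeholders; substitute your own model's predict call):

```python
import time

def measure_throughput(predict, batch, n_samples, iterations=100):
    """Run `predict` repeatedly on one batch and report samples/sec."""
    predict(batch)  # warmup call to exclude one-time initialization cost
    start = time.perf_counter()
    for _ in range(iterations):
        predict(batch)
    elapsed = time.perf_counter() - start
    return iterations * n_samples / elapsed

# Placeholder model: replace with e.g. a sklearn or ONNX Runtime predict call.
dummy_predict = lambda batch: [x * 2 for x in batch]
batch = list(range(256))
print(f"{measure_throughput(dummy_predict, batch, len(batch)):.0f} samples/sec")
```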
Account for Queueing
Under concurrent load, total latency includes queue wait time.
Little’s Law: L = λ × W (average number of requests in the system = arrival rate × average time each request spends in it).
When the arrival rate approaches throughput capacity, queue delay explodes.
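The blow-up near capacity is easy to see with the standard M/M/1 queueing approximation (an illustration only, not a claim about this benchmark's service-time distribution): with service rate μ and arrival rate λ, expected time in system is W = 1/(μ − λ).

```python
def mm1_time_in_system(service_rate: float, arrival_rate: float) -> float:
    """Expected total time in an M/M/1 queue: W = 1 / (mu - lambda)."""
    assert arrival_rate < service_rate, "queue is unstable at or above capacity"
    return 1.0 / (service_rate - arrival_rate)

# Service rate ~14,706 req/s (the onnx_int8 throughput from the benchmarks above).
mu = 14_706
for utilization in (0.5, 0.9, 0.99):
    w = mm1_time_in_system(mu, utilization * mu)
    print(f"{utilization:.0%} load -> {w * 1000:.3f} ms in system")
```

At 50% load the queue adds almost nothing; at 99% load the time in system is roughly fifty times larger.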
Memory vs Concurrency
Lower memory footprint enables higher concurrency on fixed hardware.
Memory Footprint Analysis
From hardware_tradeoffs.csv:
Available memory: 16 GB RAM
Maximum concurrent inferences:
INT8 quantization enables 72% more concurrent requests on the same hardware.
Memory Pressure Index
The memory_pressure_index quantifies memory efficiency. Lower values indicate better memory utilization per unit of work. Typical values:
- sklearn_fp32: 0.0005
- onnx_fp32: 0.0003 (40% better)
- onnx_int8: 0.0002 (60% better)
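The concurrency headroom can be estimated by dividing available memory by the per-inference footprint. A sketch, with hypothetical round-number footprints chosen to illustrate a gain in the ~72% range (substitute the real figures from hardware_tradeoffs.csv):

```python
def max_concurrent(available_mb: float, per_inference_mb: float) -> int:
    """How many inferences fit in memory at once."""
    return int(available_mb // per_inference_mb)

AVAILABLE_MB = 16 * 1024                   # 16 GB RAM
fp32 = max_concurrent(AVAILABLE_MB, 110)   # hypothetical fp32 footprint (MB)
int8 = max_concurrent(AVAILABLE_MB, 64)    # hypothetical int8 footprint (MB)
print(f"fp32: {fp32}, int8: {int8} (+{int8 / fp32 - 1:.0%})")
```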
Batch Size Tuning
Larger batches improve throughput but increase latency and memory.
Typical pattern:
- Batch 32: Low latency, low throughput
- Batch 256: Balanced (recommended starting point)
- Batch 1024: High throughput, high latency
Recommendations by workload:
- Real-time API: batch 32-64
- Batch processing: batch 512-1024
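The pattern above can be reproduced with a toy cost model (assumed: per-batch latency = fixed overhead + per-sample cost × batch size, so throughput rises while latency grows; the millisecond constants are illustrative):

```python
def batch_profile(batch_size, overhead_ms=2.0, per_sample_ms=0.05):
    """Toy model: per-batch latency (ms) and resulting throughput (samples/sec)."""
    latency_ms = overhead_ms + per_sample_ms * batch_size
    throughput = batch_size / (latency_ms / 1000.0)
    return latency_ms, throughput

for b in (32, 256, 1024):
    lat, thr = batch_profile(b)
    print(f"batch {b:>4}: {lat:6.1f} ms latency, {thr:8.0f} samples/sec")
```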
Interpreting Benchmark Results
Reading Statistical Summaries
The stat_benchmark_summary.csv file provides confidence intervals:
Check Confidence Intervals
Non-overlapping CIs indicate significant differences. All three scenarios are statistically distinguishable.
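Both checks, CI overlap and effect size, are a few lines of code (a sketch; the interval endpoints below are illustrative, not the values from the CSVs):

```python
import statistics

def cis_overlap(ci_a, ci_b):
    """True if two (low, high) confidence intervals overlap."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]

def cohens_d(sample_a, sample_b):
    """Effect size: mean difference over pooled standard deviation."""
    pooled = ((statistics.variance(sample_a) + statistics.variance(sample_b)) / 2) ** 0.5
    return (statistics.mean(sample_a) - statistics.mean(sample_b)) / pooled

# Illustrative 95% CIs for mean latency in ms.
print(cis_overlap((3.1, 3.3), (2.4, 2.5)))  # disjoint intervals -> distinguishable
```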
Assess Practical Significance
Check effect sizes in statistical_comparisons.csv: Cohen’s d = 1.87 is a large effect (> 0.8). This difference matters in production.
Using the Dashboard
The composite score provides a single metric for multi-objective optimization.
Customize weights based on your priorities:
- benchmark_dashboard.csv: All scenarios with composite scores
- benchmark_summary.json: Best scenario by each dimension
- benchmark_tradeoff.png: Latency vs accuracy scatter plot
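A weighted composite can be sketched like this (the metric names, normalization, and weights are illustrative, not the dashboard's exact formula; metrics are assumed normalized to [0, 1] with higher = better):

```python
def composite_score(metrics, weights):
    """Weighted sum of normalized metrics; higher is better."""
    return sum(weights[k] * metrics[k] for k in weights)

# Illustrative normalized metrics per scenario.
scenarios = {
    "onnx_fp32": {"latency": 0.70, "throughput": 0.72, "accuracy": 1.00, "memory": 0.60},
    "onnx_int8": {"latency": 1.00, "throughput": 1.00, "accuracy": 0.98, "memory": 1.00},
}
weights = {"latency": 0.3, "throughput": 0.2, "accuracy": 0.4, "memory": 0.1}
for name, m in scenarios.items():
    print(name, round(composite_score(m, weights), 3))
```

Raising the accuracy weight shifts the ranking toward onnx_fp32; raising the latency weight favors onnx_int8.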
Optimization Strategies
Choosing Between Scenarios
sklearn_fp32
When to use:
- Prototyping and experimentation
- No performance requirements
- Avoiding ONNX dependencies
- Simple deployment critical
Pros:
- No conversion step
- Native Python debugging
- Wide library support
Cons:
- 2-3× slower than ONNX
- Higher memory footprint
- Lower throughput capacity
onnx_fp32
When to use:
- Accuracy cannot be compromised
- Production deployment
- Cross-platform requirements
- Balanced performance needs
Pros:
- 2-3× faster than sklearn
- Same accuracy as training
- Optimized CPU kernels
Cons:
- Requires ONNX conversion
- Larger than INT8
onnx_int8
When to use:
- Latency-critical applications
- Resource-constrained deployment
- High throughput requirements
- Energy efficiency matters
Pros:
- 3-5× faster than sklearn
- 75% memory reduction
- Highest throughput
- Lowest energy consumption
Cons:
- Slight accuracy loss (< 1%)
- Quantization artifacts possible
Decision Framework
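The scenario guidance above can be condensed into a simple rule of thumb (the inputs and branch order are illustrative, not prescriptive):

```python
def choose_scenario(latency_critical: bool, accuracy_critical: bool,
                    production: bool) -> str:
    """Rule-of-thumb scenario selection mirroring the guidance above."""
    if latency_critical and not accuracy_critical:
        return "onnx_int8"    # fastest; slight accuracy loss acceptable
    if production or accuracy_critical:
        return "onnx_fp32"    # production speed without accuracy loss
    return "sklearn_fp32"     # prototyping; simplest deployment

print(choose_scenario(latency_critical=True, accuracy_critical=False, production=True))
```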
Production Deployment Best Practices
Pre-deployment Validation
Benchmark checklist:
- Run 30+ iterations for statistical power
- Test on production-like hardware (not dev laptops)
- Use realistic batch sizes matching production traffic
- Include warmup runs to account for cold start
- Measure 95th percentile latency, not just mean
- Test concurrent requests to validate throughput
- Monitor memory over time for leaks
- Validate accuracy on recent data (check for drift)
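Several checklist items (warmup runs, 30+ iterations, P95 rather than mean) combine into one small harness; a sketch where `fn` stands in for your inference call:

```python
import time

def benchmark(fn, iterations=30, warmup=5):
    """Warm up, then time `fn` and report (mean, p95) latency in ms."""
    for _ in range(warmup):
        fn()  # warmup runs account for cold start
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    samples.sort()
    p95 = samples[int(0.95 * (len(samples) - 1))]
    return sum(samples) / len(samples), p95

mean_ms, p95_ms = benchmark(lambda: sum(range(10_000)))
print(f"mean {mean_ms:.3f} ms, p95 {p95_ms:.3f} ms")
```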
Monitoring in Production
- Latency Metrics
- Throughput Metrics
- Accuracy Metrics
- Resource Metrics
Track percentiles, not just averages.
Alert on:
- P95 latency > threshold
- Latency variance increase
- Cold start latency spikes
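A minimal percentile-based alert check over a window of samples (the threshold values are examples; tune them to your SLA):

```python
def latency_alerts(samples_ms, p95_threshold_ms=50.0, variance_threshold=100.0):
    """Return the alert names triggered by a window of latency samples."""
    alerts = []
    ordered = sorted(samples_ms)
    p95 = ordered[int(0.95 * (len(ordered) - 1))]
    if p95 > p95_threshold_ms:
        alerts.append("p95_latency")
    mean = sum(samples_ms) / len(samples_ms)
    variance = sum((x - mean) ** 2 for x in samples_ms) / len(samples_ms)
    if variance > variance_threshold:
        alerts.append("latency_variance")
    return alerts

print(latency_alerts([12, 14, 13, 90, 15]))  # one spike inflates the variance
```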
Handling Model Updates
Re-benchmark New Models
Don’t assume a new model has the same performance. Compare against the current production baseline.
A/B Test in Production
Deploy the new model to a small traffic percentage. Monitor for regressions before full rollout.
Gradual Rollout
Increase the traffic percentage incrementally. Roll back immediately if SLA violations occur.
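The three update steps can be sketched as a loop over traffic stages with an SLA gate (the stage percentages and the `sla_ok` callback are placeholders for your own monitoring):

```python
def gradual_rollout(stages, sla_ok):
    """Walk through traffic stages; roll back (return 0) on SLA violation."""
    current = 0
    for pct in stages:
        if not sla_ok(pct):
            return 0          # immediate rollback to 0% traffic
        current = pct
    return current            # full rollout reached

# Placeholder SLA check: pretend the SLA holds only up to 50% traffic.
print(gradual_rollout([5, 25, 50, 100], sla_ok=lambda pct: pct <= 50))
```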
Common Pitfalls
Advanced Optimization Techniques
Operator-Level Optimization
Use the operator latency breakdown from hardware_tradeoffs.csv:
If preprocessing dominates (> 30%)
Optimization strategies:
- Cache preprocessed features
- Move normalization to client side
- Use faster serialization (Protobuf vs JSON)
- Batch preprocessing operations
If linear operator dominates (> 60%)
Optimization strategies:
- Apply INT8 quantization
- Use model compression (pruning, distillation)
- Enable ONNX graph optimizations
- Consider GPU acceleration for large models
If postprocessing dominates (> 25%)
Optimization strategies:
- Optimize probability thresholding
- Use vectorized NumPy operations
- Minimize data copying
- Defer formatting until necessary
Multi-Model Serving
When serving multiple models, memory and concurrency constraints differ.
Performance Tuning Workflow
Identify Bottleneck
Analyze operator-level breakdown:
- Preprocessing slow → optimize data loading
- Linear operator slow → quantize or compress model
- Postprocessing slow → vectorize operations
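The mapping from dominant operator to remedy can be written down directly, reusing the thresholds from the section above (the breakdown dict is an assumed shape, not the CSV's exact schema):

```python
REMEDIES = {
    "preprocessing": "optimize data loading / cache features",
    "linear": "quantize or compress the model",
    "postprocessing": "vectorize operations",
}
THRESHOLDS = {"preprocessing": 0.30, "linear": 0.60, "postprocessing": 0.25}

def identify_bottleneck(breakdown):
    """breakdown: operator -> fraction of total latency.
    Returns (operator, remedy) for the dominant operator, or None."""
    for op, share in sorted(breakdown.items(), key=lambda kv: -kv[1]):
        if share > THRESHOLDS.get(op, 1.0):
            return op, REMEDIES[op]
    return None

print(identify_bottleneck({"preprocessing": 0.45, "linear": 0.40, "postprocessing": 0.15}))
```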
Apply Optimization
Implement targeted improvement:
- INT8 quantization for latency
- Batch size tuning for throughput
- Memory-mapped loading for footprint
Next Steps
Statistical Benchmarking
Deep dive into rigorous performance measurement methodology
Hardware-Aware Optimization
Explore energy profiling and operator-level analysis