
Overview

Performance tuning involves balancing multiple competing objectives: latency, throughput, accuracy, memory footprint, and energy consumption. This guide provides practical strategies for interpreting benchmark results and making informed deployment decisions.

Core Trade-offs

Latency vs Accuracy

The most common trade-off in ML deployment is between inference speed and prediction quality.
Quantization reduces precision but increases speed:
FP32 (Full Precision)
├─ Accuracy: 85.6%
├─ Latency: 0.095 ms/sample
└─ Memory: 3.22 MB

INT8 (Quantized)
├─ Accuracy: 85.4% (↓ 0.2%)
├─ Latency: 0.068 ms/sample (↓ 28%)
└─ Memory: 2.41 MB (↓ 25%)
Key Question: Is 0.2% accuracy loss acceptable for 28% latency improvement?
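One way to answer that question systematically is an explicit acceptance check. A minimal sketch, assuming illustrative budgets (`max_acc_drop`, `min_speedup` are not prescribed by the benchmarks):

```python
def accept_quantized(acc_base, acc_quant, lat_base, lat_quant,
                     max_acc_drop=0.005, min_speedup=1.10):
    """Accept the quantized model only if the accuracy drop stays within
    budget AND the latency improvement clears a minimum speedup."""
    acc_drop = acc_base - acc_quant
    speedup = lat_base / lat_quant
    return acc_drop <= max_acc_drop and speedup >= min_speedup

# FP32 vs INT8 numbers from above: 0.2% accuracy drop, ~1.4x speedup
print(accept_quantized(0.856, 0.854, 0.095, 0.068))  # True
```

Encoding the decision as code makes the trade-off reviewable: the budget lives in version control rather than in someone's head.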

Throughput vs Queue Delay

High throughput doesn’t guarantee low latency. Queue delays can dominate under load.
1. Measure Isolated Throughput

Benchmark single-request throughput:
throughput_samples_per_sec = batch_size / inference_time
Example from benchmarks:
  • sklearn_fp32: 8,266 samples/sec
  • onnx_fp32: 10,526 samples/sec
  • onnx_int8: 14,706 samples/sec
2. Account for Queueing

Under concurrent load, total latency includes queue wait time:
Total Latency = Queue Delay + Inference Time
Little’s Law (L = λW) relates queue length to arrival rate and wait time; for capacity planning, the single-server M/M/1 approximation is more direct:
Average Queue Delay ≈ (ρ / (1 − ρ)) × Service Time, where ρ = Arrival Rate / Throughput
When arrival rate approaches throughput capacity, queue delay explodes.
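A small sketch of how queue delay grows with utilization, using the single-server (M/M/1) approximation (the function name is illustrative):

```python
def avg_queue_delay_ms(arrival_rate, throughput, service_time_ms):
    """M/M/1 approximation: wait = rho / (1 - rho) * service time."""
    rho = arrival_rate / throughput  # utilization
    if rho >= 1.0:
        return float("inf")  # overloaded: the queue grows without bound
    return rho / (1.0 - rho) * service_time_ms

# At 50% utilization the queue adds one service time of delay;
# near 100% utilization it blows up.
print(avg_queue_delay_ms(8_000, 16_000, 0.068))  # 0.068
```

Doubling throughput headroom does not halve queue delay; it collapses it, which is why the utilization guidance below targets well under 100%.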
3. Calculate Utilization

System utilization = Arrival Rate / Throughput
# Example: 8,000 requests/sec arrival rate
utilization_sklearn = 8000 / 8266      # 0.97 (97% - dangerous!)
utilization_onnx_int8 = 8000 / 14706   # 0.54 (54% - safe)
Keep utilization < 70% for stable latency. Above 80%, queue delays become unpredictable.
4. Right-Size for Peak Load

Provision for peak traffic, not average:
required_throughput = peak_qps * safety_factor
# safety_factor typically 1.5-2.0 for headroom
Choose scenario with sufficient throughput margin:
  • Peak load: 10,000 QPS
  • Safety factor: 1.5×
  • Required: 15,000 samples/sec
  • Best choice: onnx_int8 (14,706 samples/sec; slightly below the 15,000 target, so add an extra instance for headroom)
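The instance count falls out of the same arithmetic. A sketch (the helper name is hypothetical):

```python
import math

def instances_needed(peak_qps, per_instance_throughput, safety_factor=1.5):
    """Instances required to serve peak load with headroom."""
    return math.ceil(peak_qps * safety_factor / per_instance_throughput)

# onnx_int8 at 14,706 samples/sec against a 10,000 QPS peak:
# one instance falls just short of the 15,000 target, so provision two.
print(instances_needed(10_000, 14_706))  # 2
```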

Memory vs Concurrency

Lower memory footprint enables higher concurrency on fixed hardware.
From hardware_tradeoffs.csv:
sklearn_fp32: 4.15 MB per inference
onnx_fp32:    3.22 MB per inference
onnx_int8:    2.41 MB per inference
Available memory: 16 GB RAM. Maximum concurrent inferences:
max_concurrent_sklearn   = 16_000 MB / 4.15 MB ≈ 3,855
max_concurrent_onnx_fp32 = 16_000 MB / 3.22 MB ≈ 4,969
max_concurrent_onnx_int8 = 16_000 MB / 2.41 MB ≈ 6,639
INT8 quantization enables 72% more concurrent requests than sklearn_fp32 on the same hardware.
The memory_pressure_index quantifies memory efficiency:
memory_pressure = memory_mb / throughput_samples_per_sec
Lower values indicate better memory utilization per unit of work. Typical values:
  • sklearn_fp32: 0.0005
  • onnx_fp32: 0.0003 (40% better)
  • onnx_int8: 0.0002 (60% better)
Use case: When scaling horizontally, lower memory pressure means fewer instances needed.
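The index can be reproduced directly from the memory and throughput figures above:

```python
# (memory_mb, throughput_samples_per_sec) from hardware_tradeoffs.csv
scenarios = {
    "sklearn_fp32": (4.15, 8_266),
    "onnx_fp32":    (3.22, 10_526),
    "onnx_int8":    (2.41, 14_706),
}
for name, (memory_mb, throughput) in scenarios.items():
    # memory_pressure = memory per unit of work; lower is better
    print(f"{name}: {memory_mb / throughput:.4f}")
```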
Larger batches improve throughput but increase latency and memory:
# Test multiple batch sizes
for batch_size in 32 64 128 256 512 1024; do
    python benchmarking/statistical_benchmark.py \
        --runs 10 \
        --batch-size $batch_size
done
Typical pattern:
  • Batch 32: Low latency, low throughput
  • Batch 256: Balanced (recommended starting point)
  • Batch 1024: High throughput, high latency
Choose based on your SLA:
  • Real-time API: batch 32-64
  • Batch processing: batch 512-1024
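Given sweep results, the SLA-driven choice can be automated. A sketch with a hypothetical helper and illustrative numbers:

```python
def pick_batch_size(results, sla_latency_ms):
    """results maps batch_size -> (latency_ms_per_request, throughput).
    Return the highest-throughput batch size that meets the latency SLA."""
    within_sla = {b: tp for b, (lat, tp) in results.items()
                  if lat <= sla_latency_ms}
    return max(within_sla, key=within_sla.get) if within_sla else None

# Illustrative sweep: larger batches trade latency for throughput
sweep = {32: (0.9, 5_000), 256: (1.5, 12_000), 1024: (4.0, 20_000)}
print(pick_batch_size(sweep, sla_latency_ms=2.0))  # 256
```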

Interpreting Benchmark Results

Reading Statistical Summaries

The stat_benchmark_summary.csv provides confidence intervals:
scenario,latency_ms_per_sample_mean,latency_ms_per_sample_ci_low,latency_ms_per_sample_ci_high
sklearn_fp32,0.121,0.118,0.124
onnx_fp32,0.095,0.092,0.098
onnx_int8,0.068,0.065,0.071
1. Check Confidence Intervals

Non-overlapping CIs indicate significant differences:
sklearn_fp32: [0.118, 0.124]
onnx_fp32:    [0.092, 0.098]  ← No overlap with sklearn
onnx_int8:    [0.065, 0.071]  ← No overlap with onnx_fp32
All three scenarios are statistically distinguishable.
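The overlap check itself is one comparison; a minimal sketch:

```python
def cis_overlap(ci_a, ci_b):
    """True if two (low, high) confidence intervals overlap."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]

print(cis_overlap((0.118, 0.124), (0.092, 0.098)))  # False: distinguishable
print(cis_overlap((0.092, 0.098), (0.065, 0.071)))  # False: distinguishable
```

Note the asymmetry: non-overlapping CIs imply a significant difference, but overlapping CIs do not by themselves prove the difference is insignificant; a paired test is the stricter check.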
2. Assess Practical Significance

Check effect sizes in statistical_comparisons.csv:
baseline,scenario,metric,cohens_d_paired
onnx_fp32,onnx_int8,latency_ms_per_sample,1.87
Cohen’s d = 1.87 is a large effect (> 0.8). This difference matters in production.
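For reference, paired Cohen's d is the mean of the per-run differences divided by their standard deviation. A sketch with illustrative per-run latencies (not the benchmark's actual runs):

```python
import numpy as np

def cohens_d_paired(a, b):
    """Paired Cohen's d: mean of per-run differences over their sample std."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return diff.mean() / diff.std(ddof=1)

# Illustrative per-run latencies (ms/sample), paired by benchmark seed
fp32 = [0.095, 0.093, 0.097, 0.095]
int8 = [0.068, 0.069, 0.067, 0.068]
print(cohens_d_paired(fp32, int8) > 0.8)  # True: a large effect
```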
3

Validate Consistency

Check variance across runs. Wide confidence intervals suggest:
  • High run-to-run variability
  • Need more benchmark iterations
  • Possible interference from background processes
Narrow intervals indicate:
  • Consistent performance
  • Results are reproducible
  • Safe to deploy

Using the Dashboard

The composite score provides a single metric for multi-objective optimization:
composite_score = (
    0.4 * accuracy +
    0.2 * (1 / (1 + latency_ms_per_sample)) +
    0.2 * (1 / (1 + memory_mb)) +
    0.2 * (1 / (1 + energy_mj_proxy))
)
Customize weights based on your priorities:
# Latency-critical application
composite_score = 0.2 * accuracy + 0.6 * latency_score + ...

# Accuracy-critical application
composite_score = 0.7 * accuracy + 0.1 * latency_score + ...
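The same formula can be wrapped as a function so a weight profile is an explicit argument rather than an edited constant. A sketch (the latency-critical call below is illustrative):

```python
def composite_score(accuracy, latency_ms_per_sample, memory_mb, energy_mj_proxy,
                    w_acc=0.4, w_lat=0.2, w_mem=0.2, w_energy=0.2):
    """Weighted multi-objective score; weights should sum to 1.0.
    Lower-is-better metrics are mapped to (0, 1] via 1 / (1 + x)."""
    return (w_acc * accuracy
            + w_lat * (1 / (1 + latency_ms_per_sample))
            + w_mem * (1 / (1 + memory_mb))
            + w_energy * (1 / (1 + energy_mj_proxy)))

# Latency-critical profile: shift weight from accuracy to latency
score = composite_score(0.854, 0.068, 2.41, 1.0,
                        w_acc=0.2, w_lat=0.6, w_mem=0.1, w_energy=0.1)
```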
The dashboard generates:
  1. benchmark_dashboard.csv: All scenarios with composite scores
  2. benchmark_summary.json: Best scenario by each dimension
  3. benchmark_tradeoff.png: Latency vs accuracy scatter plot
Example interpretation:
{
  "best_composite": {"scenario": "onnx_int8", "composite_score": 0.87},
  "lowest_latency": {"scenario": "onnx_int8", "latency_ms_per_sample": 0.068},
  "lowest_memory": {"scenario": "onnx_int8", "memory_mb": 2.41}
}
ONNX INT8 wins on most dimensions — strong candidate for production.

Optimization Strategies

Choosing Between Scenarios

sklearn_fp32

When to use:
  • Prototyping and experimentation
  • No performance requirements
  • Avoiding ONNX dependencies
  • Simple deployment critical
Pros:
  • No conversion step
  • Native Python debugging
  • Wide library support
Cons:
  • ~1.3-1.8× slower than the ONNX variants in these benchmarks
  • Higher memory footprint
  • Lower throughput capacity

onnx_fp32

When to use:
  • Accuracy cannot be compromised
  • Production deployment
  • Cross-platform requirements
  • Balanced performance needs
Pros:
  • ~1.3× faster than sklearn in these benchmarks
  • Same accuracy as training
  • Optimized CPU kernels
Cons:
  • Requires ONNX conversion
  • Larger than INT8

onnx_int8

When to use:
  • Latency-critical applications
  • Resource-constrained deployment
  • High throughput requirements
  • Energy efficiency matters
Pros:
  • ~1.8× faster than sklearn in these benchmarks
  • ~42% less memory than sklearn_fp32
  • Highest throughput
  • Lowest energy consumption
Cons:
  • Slight accuracy loss (< 1%)
  • Quantization artifacts possible

Decision Framework

1. Define Requirements

Establish clear thresholds:
requirements:
  latency_p95_ms: 100
  accuracy_min: 0.850
  memory_max_mb: 10
  throughput_min_qps: 5000
  energy_budget_mj: 2.0
2. Filter by Hard Constraints

Eliminate scenarios that violate requirements:
candidates = df[
    (df['latency_p95_ms'] <= 100) &
    (df['accuracy'] >= 0.850) &
    (df['memory_mb'] <= 10) &
    (df['throughput_samples_per_sec'] >= 5000)
]
3. Rank by Optimization Objective

Sort remaining candidates by primary objective:
# If latency is most important
best = candidates.sort_values('latency_ms_per_sample').iloc[0]

# If balanced performance
best = candidates.sort_values('composite_score', ascending=False).iloc[0]
4. Validate in Staging

Test selected scenario under realistic load:
# Load testing with concurrent requests
wrk -t12 -c400 -d30s --latency http://staging-api/predict
Verify:
  • P95 latency under load
  • Memory consumption stable
  • No degradation over time

Production Deployment Best Practices

Pre-deployment Validation

Benchmark checklist:
  • Run 30+ iterations for statistical power
  • Test on production-like hardware (not dev laptops)
  • Use realistic batch sizes matching production traffic
  • Include warmup runs to account for cold start
  • Measure 95th percentile latency, not just mean
  • Test concurrent requests to validate throughput
  • Monitor memory over time for leaks
  • Validate accuracy on recent data (check for drift)

Monitoring in Production

Track percentiles, not just averages:
# Good: percentile-based SLA
assert latency_p95 < 100  # ms
assert latency_p99 < 200  # ms

# Bad: mean-based SLA (hides outliers)
assert latency_mean < 50  # ms
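Percentiles are cheap to compute from raw latency samples. A sketch with synthetic data standing in for real production traces:

```python
import numpy as np

# Synthetic right-skewed latencies, as real traces typically are
rng = np.random.default_rng(0)
latencies_ms = rng.lognormal(mean=3.0, sigma=0.5, size=10_000)

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"p50={p50:.1f} ms  p95={p95:.1f} ms  p99={p99:.1f} ms")
```

With skewed distributions like this, the mean sits well below p95; an SLA on the mean can pass while a large fraction of users see slow responses.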
Alert on:
  • P95 latency > threshold
  • Latency variance increase
  • Cold start latency spikes

Handling Model Updates

1. Re-benchmark New Models

Don’t assume a new model has the same performance profile:
# Run full benchmark suite
python benchmarking/statistical_benchmark.py --runs 30
python hardware_aware_ml/tradeoff_experiments.py
python benchmarking/dashboard.py
Compare against current production baseline.
2. A/B Test in Production

Deploy new model to small traffic percentage:
import random

if random.random() < 0.05:  # route 5% of traffic to the new model
    prediction = new_model.predict(X)
else:
    prediction = current_model.predict(X)
Monitor for regressions before full rollout.
3. Gradual Rollout

Increase traffic percentage incrementally:
Day 1: 5% → Monitor 24h
Day 2: 20% → Monitor 24h
Day 3: 50% → Monitor 24h
Day 4: 100% → Rollout complete
Roll back immediately if SLA violations occur.
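The rollback decision at each stage can be a simple gate. A sketch with hypothetical thresholds (tie them to the requirements defined earlier):

```python
def should_rollback(p95_ms, error_rate, sla_p95_ms=100.0, max_error_rate=0.01):
    """Gate for each rollout stage: roll back on any SLA violation."""
    return p95_ms > sla_p95_ms or error_rate > max_error_rate

print(should_rollback(p95_ms=120.0, error_rate=0.002))  # True: latency SLA broken
print(should_rollback(p95_ms=68.0, error_rate=0.002))   # False: proceed
```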
4. Document Performance Changes

Track model performance over time:
model_v1:
  latency_p95: 92ms
  accuracy: 0.854
  deployed: 2024-01-15

model_v2:
  latency_p95: 68ms  # ↑ 26% improvement
  accuracy: 0.856     # ↑ 0.2% improvement
  deployed: 2024-02-10

Common Pitfalls

Benchmarking on wrong hardware. Dev laptop results don’t reflect production:
  • Different CPU architecture (M1 vs x86)
  • Different memory bandwidth
  • Different power management settings
Solution: Always benchmark on production-equivalent hardware.
Ignoring cold start latency. First inference after model load is slower:
  • JIT compilation
  • Cache warming
  • Lazy initialization
Solution: Include --warmup-runs in benchmarks and monitor cold start separately.
Batch size mismatch. Benchmarking with batch 256 but serving requests individually:
  • Batch processing improves throughput
  • Single requests have higher latency
Solution: Benchmark with realistic batch sizes matching production traffic.
Not accounting for queueing. Isolated inference time ≠ end-to-end latency:
  • Queue delay under load
  • Network latency
  • Serialization overhead
Solution: Load test with concurrent requests to measure real-world latency.
Focusing only on mean metrics. Average latency hides outliers:
  • P95/P99 reveals tail latency
  • Outliers impact user experience
Solution: Always track percentiles, not just means.

Advanced Optimization Techniques

Operator-Level Optimization

Use operator latency breakdown from hardware_tradeoffs.csv:
scenario,preprocess_latency_ms,linear_operator_latency_ms,postprocess_latency_ms
sklearn_fp32,0.027,0.070,0.024
onnx_int8,0.014,0.037,0.017
Preprocessing optimization strategies:
  • Cache preprocessed features
  • Move normalization to client side
  • Use faster serialization (Protobuf vs JSON)
  • Batch preprocessing operations
Example:
# Before: preprocessing via the fitted scaler object (0.027 ms/sample above)
X_normalized = scaler.transform(X_raw)

# After: cache the scaler's mean/std once and apply plain NumPy ops
X_normalized = (X_raw - mean) / std
Model (linear operator) optimization strategies:
  • Apply INT8 quantization
  • Use model compression (pruning, distillation)
  • Enable ONNX graph optimizations
  • Consider GPU acceleration for large models
Example (saving an optimized graph via onnxruntime session options):
import onnxruntime as ort

so = ort.SessionOptions()
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
so.optimized_model_filepath = "model_optimized.onnx"
ort.InferenceSession("model.onnx", so)  # writes model_optimized.onnx
Postprocessing optimization strategies:
  • Optimize probability thresholding
  • Use vectorized NumPy operations
  • Minimize data copying
  • Defer formatting until necessary
Example:
# Before: Python-level loop over probabilities (0.024 ms/sample above)
predictions = [int(p >= threshold) for p in probs]

# After: vectorized NumPy comparison
predictions = (probs >= threshold).astype(int)

Multi-Model Serving

When serving multiple models, memory and concurrency constraints differ:
# Memory-constrained: prefer INT8
total_memory_mb = num_models * memory_per_model
max_models_fp32 = 16_000 / 3.22  # 4,969 models
max_models_int8 = 16_000 / 2.41  # 6,639 models  ← 34% more
Strategy: Use INT8 for high-cardinality model serving (per-user models, etc.).

Performance Tuning Workflow

1. Establish Baseline

Run comprehensive benchmarks:
python benchmarking/statistical_benchmark.py --runs 30 --batch-size 256
python hardware_aware_ml/tradeoff_experiments.py
2. Identify Bottleneck

Analyze operator-level breakdown:
  • Preprocessing slow → optimize data loading
  • Linear operator slow → quantize or compress model
  • Postprocessing slow → vectorize operations
3. Apply Optimization

Implement targeted improvement:
  • INT8 quantization for latency
  • Batch size tuning for throughput
  • Memory-mapped loading for footprint
4. Re-benchmark

Measure improvement:
python benchmarking/statistical_benchmark.py --runs 30
Compare statistical significance against baseline.
5. Validate Accuracy

Ensure optimization didn’t degrade predictions:
accuracy_delta = new_accuracy - baseline_accuracy
assert abs(accuracy_delta) < 0.01  # < 1% change
6. Deploy and Monitor

Roll out incrementally with monitoring:
  • A/B test 5% traffic
  • Monitor latency, accuracy, errors
  • Gradually increase to 100%

Next Steps

Statistical Benchmarking

Deep dive into rigorous performance measurement methodology

Hardware-Aware Optimization

Explore energy profiling and operator-level analysis
