## Workflow Stages

The deployment pipeline converts trained scikit-learn models into optimized ONNX artifacts for production inference:

### 1. Model Training
Produce the baseline scikit-learn artifacts:

- `artifacts/best_model.joblib` - trained pipeline
- `artifacts/threshold.txt` - optimal decision threshold
- `artifacts/lineage.json` - experiment metadata
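The training stage can be sketched as follows. The dataset, pipeline, and the 0.5 threshold are illustrative placeholders, not the project's actual experiment setup:

```python
# Minimal sketch of the training stage, assuming a binary classifier; the
# dataset, pipeline, and 0.5 threshold are illustrative placeholders.
import json
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
pipeline = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

out = Path("artifacts")
out.mkdir(exist_ok=True)
joblib.dump(pipeline, out / "best_model.joblib")            # trained pipeline
(out / "threshold.txt").write_text("0.5\n")                 # decision threshold (placeholder)
(out / "lineage.json").write_text(json.dumps({"random_state": 0}))  # experiment metadata
```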
### 2. ONNX Export
Convert the scikit-learn model to ONNX format:

- `artifacts/model.onnx` - portable ONNX graph
- `artifacts/onnx_features.json` - feature schema
- `artifacts/onnx_metadata.json` - threshold and provenance
### 3. INT8 Quantization
Apply dynamic quantization for reduced memory and latency:

- `artifacts/model.int8.onnx` - quantized model
### 4. Parity Validation
Verify numerical agreement between scikit-learn and ONNX:

- `artifacts/parity_report.json` - validation metrics
- Exit code 0 if passed, 1 if failed
### 5. Performance Benchmarking
Measure inference latency and throughput.
## Artifact Dependencies

### Dependency Graph
## Backend Selection Guide
### scikit-learn

**Use when:**

- Development and experimentation
- Python-native environment
- No latency constraints

**Trade-offs:**

- Requires full Python stack
- Larger memory footprint
- Slower inference
### ONNX

**Use when:**

- Cross-platform deployment
- C++/C# production services
- Moderate latency requirements

**Trade-offs:**

- Operator support limitations
- Conversion complexity
- Requires parity validation
### Quantized ONNX

**Use when:**

- Edge/mobile deployment
- Memory-constrained environments
- Latency-critical applications

**Trade-offs:**

- Potential accuracy degradation
- INT8 precision limits
- Must verify prediction margins
## Design Trade-offs

### Common Failure Modes
| Issue | Cause | Mitigation |
|---|---|---|
| Missing operator support | `skl2onnx` does not cover all sklearn transformers | Use supported pipeline components or implement a custom converter |
| Parity failure | Preprocessing mismatch or float32 vs float64 precision | Inspect feature distributions, check categorical encoding |
| Runtime package drift | Different ONNX Runtime versions between dev and prod | Pin the `onnxruntime` version in `requirements.txt` |
| Quantization accuracy drop | Model relies on precise float values near the threshold | Widen tolerance or skip quantization for critical features |
## Assumptions and Limitations
- Workflow targets CPU execution with ONNX Runtime CPU provider
- Model contract assumes stable feature ordering and preprocessing semantics
- Parity thresholds should be re-tuned when feature engineering changes materially
- Quantization applies to weights only (dynamic quantization), not activations
## Next Steps
- **ONNX Export** - Learn how to convert scikit-learn models to ONNX format
- **Quantization** - Optimize models with INT8 dynamic quantization
- **Parity Validation** - Validate numerical agreement between backends