Overview
The parity_check.py script validates that ONNX models produce numerically equivalent predictions to the original scikit-learn model, preventing silent regressions during deployment.
Prerequisites
Usage
Command-Line Arguments
--abs-tol
Maximum allowed absolute difference between predictions. When to adjust:
- Tighten (0.01-0.02) for high-stakes predictions
- Widen (0.06-0.10) after quantization or with float32 precision
--mean-tol
Maximum allowed mean absolute difference across all predictions. When to adjust:
- Tighten (0.005) to catch systematic bias
- Widen (0.02) if a few outliers are acceptable
--batch-size
Number of test samples to validate. Larger batches increase coverage but take longer. Recommended:
- Development: 256-512 samples
- CI pipeline: 1000+ samples
How It Works
1. Load Models and Data
deployment/parity_check.py
2. Generate Predictions
Scikit-learn: deployment/parity_check.py
deployment/parity_check.py
The _extract_proba function handles different ONNX output formats:
- 2D array with shape (n_samples, 2) → extract column 1
- List of dicts [{0: p0, 1: p1}, ...] → extract the value for key 1
3. Compute Differences
deployment/parity_check.py
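Conceptually, this step reduces the two prediction arrays to two numbers. A minimal sketch, assuming both arrays hold 1-D probabilities:

```python
import numpy as np

def compute_diffs(sklearn_proba, onnx_proba):
    """Return (max_abs_diff, mean_abs_diff) between two prediction arrays."""
    diffs = np.abs(np.asarray(sklearn_proba) - np.asarray(onnx_proba))
    return float(diffs.max()), float(diffs.mean())
```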
4. Write Report and Exit
deployment/parity_check.py
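A hedged sketch of the report-and-exit step. Field names mirror parity_report.json as documented below; the n_samples key and the function structure are assumptions, and the real script likely calls sys.exit with the returned code:

```python
import json
from pathlib import Path

def write_report(max_abs_diff, mean_abs_diff, abs_tol, mean_tol,
                 n_samples, path="artifacts/parity_report.json"):
    """Write the parity report and return the exit code (0 = pass, 1 = fail)."""
    passed = max_abs_diff <= abs_tol and mean_abs_diff <= mean_tol
    report = {
        "n_samples": n_samples,        # assumed field name
        "max_abs_diff": max_abs_diff,
        "mean_abs_diff": mean_abs_diff,
        "abs_tol": abs_tol,
        "mean_tol": mean_tol,
        "passed": passed,
    }
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(report, indent=2))
    return 0 if passed else 1
```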
Output Artifacts
artifacts/parity_report.json
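An illustrative example of the report (values are invented; the n_samples key name is an assumption, the other fields are described below):

```json
{
  "n_samples": 512,
  "max_abs_diff": 0.0123,
  "mean_abs_diff": 0.0031,
  "abs_tol": 0.05,
  "mean_tol": 0.01,
  "passed": true
}
```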
- Number of test samples compared
- max_abs_diff: Maximum absolute difference across all predictions, max(|sklearn - onnx|). Critical for catching worst-case outliers near the decision boundary
- mean_abs_diff: Mean absolute difference across all predictions, mean(|sklearn - onnx|). Critical for detecting systematic bias or drift
- abs_tol: Configured maximum allowed absolute difference (from --abs-tol)
- mean_tol: Configured maximum allowed mean difference (from --mean-tol)
- passed: true if both max_abs_diff <= abs_tol and mean_abs_diff <= mean_tol
The script exits with code 0 if passed, 1 if failed.
Example Usage
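A typical invocation might look like this (the flag values are examples in the moderate range discussed below, not documented defaults):

```shell
python deployment/parity_check.py \
  --abs-tol 0.05 \
  --mean-tol 0.01 \
  --batch-size 512
```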
Expected Output
Passing Validation
Failing Validation
Validation Workflow
Inspect report
Check artifacts/parity_report.json for metrics:
- passed: true → Proceed to deployment
- passed: false → Investigate root cause (see below)
Debug failures (if needed)
See the Debugging Parity Failures section below.
Tolerance Guidelines
- Tight (High Stakes)
- Moderate (Recommended)
- Relaxed (Post-Quantization)
- Model decisions have significant business/regulatory impact
- Near-zero false positive/negative tolerance
- Predictions are used for ranking or calibration
Debugging Parity Failures
1. Identify Outlier Samples
Modify parity_check.py to log samples with large differences:
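One way to surface the worst offenders, as a sketch (the helper name is hypothetical; it assumes the two probability arrays produced in the steps above):

```python
import numpy as np

def top_outliers(sklearn_proba, onnx_proba, k=10):
    """Return (index, abs_diff) pairs for the k samples with the largest
    absolute difference, sorted from worst downward."""
    diffs = np.abs(np.asarray(sklearn_proba) - np.asarray(onnx_proba))
    worst = np.argsort(diffs)[::-1][:k]
    return [(int(i), float(diffs[i])) for i in worst]
```

Feeding the returned indices back into the test set lets you inspect the raw feature values of the failing samples.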
2. Check Feature Preprocessing
Categorical encoding mismatch
Symptom: Large differences on samples with categorical features.
Cause: OneHotEncoder or OrdinalEncoder handles unknown categories differently.
Fix:
- Ensure handle_unknown='infrequent_if_exist' or 'ignore' in sklearn
- Verify ONNX conversion preserves encoding logic
Missing value imputation
Symptom: Differences on samples with NaN values.
Cause: SimpleImputer strategy not preserved in ONNX.
Fix:
- Check imputer strategy (mean, median, constant)
- Verify ONNX graph includes imputation nodes
Feature scaling drift
Symptom: Systematic bias (high mean_abs_diff, moderate max_abs_diff).
Cause: StandardScaler fit on different data or not saved correctly.
Fix:
- Ensure scaler is part of sklearn pipeline
- Re-export ONNX after re-training
3. Inspect ONNX Graph
Visualize the ONNX graph to identify missing or incorrect nodes:
- Missing preprocessing nodes (imputer, scaler, encoder)
- Incorrect input types (FloatTensorType vs StringTensorType)
- Operator version mismatches
4. Compare Float Precision
Test if FP64 → FP32 conversion causes drift:
5. Validate Quantized Model
If parity fails after quantization:
Integration with CI/CD
Use parity check as a deployment gate:
.github/workflows/deploy.yml
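A hedged sketch of such a gate in GitHub Actions (step names and flag values are illustrative, not taken from the repo's actual workflow):

```yaml
# excerpt from a deploy workflow
- name: Run parity check
  # a non-zero exit code fails the job and blocks deployment
  run: python deployment/parity_check.py --batch-size 1000

- name: Upload parity report for audit trail
  if: always()
  uses: actions/upload-artifact@v4
  with:
    name: parity-report
    path: artifacts/parity_report.json
```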
Common Failure Scenarios
max_abs_diff > abs_tol but mean_abs_diff OK
Diagnosis: Few outlier predictions with large errors.
Action:
- Inspect outlier samples (see Identify Outlier Samples)
- Check if outliers have unusual feature values (extreme values, rare categories)
- Widen --abs-tol if outliers are acceptable, or fix preprocessing
mean_abs_diff > mean_tol but max_abs_diff OK
Diagnosis: Systematic bias across all predictions.
Action:
- Check for feature scaling drift (StandardScaler fit on wrong data)
- Verify preprocessing pipeline is identical between sklearn and ONNX
- Re-export ONNX after re-training
Both max_abs_diff and mean_abs_diff exceed tolerances
Diagnosis: Major preprocessing mismatch or incorrect ONNX conversion.
Action:
- Verify the sklearn pipeline structure matches the ONNX graph (use netron)
- Check for custom transformers not supported by skl2onnx
- Re-export ONNX with correct initial_types
- If the failure appears after quantization, check whether the model is sensitive to INT8 precision
Parity passes on subset but fails on full test set
Diagnosis: Rare edge cases or data distribution shift.
Action:
- Increase --batch-size to 1000+ samples
- Stratify the test set to ensure coverage of rare categories
- Inspect the samples where failures occur
Best Practices
- Run parity check before and after quantization with different tolerances
- Use larger batch sizes (1000+) in CI pipelines for comprehensive coverage
- Store parity_report.json as a deployment artifact for an audit trail
- Re-tune tolerances when feature engineering changes materially
- Combine parity validation with A/B testing in production for full confidence
Next Steps
CPU Inference
Benchmark inference performance after validation
Deployment Overview
Return to deployment workflow overview