Overview
ONNX Runtime provides flexible threading options to optimize performance on multi-core systems. This guide covers thread pool configuration, intra-op and inter-op parallelism, and best practices for concurrent execution.
Threading Architecture
ONNX Runtime supports two threading implementations:
- ORT Thread Pool: Custom thread pool implementation (default)
- OpenMP: Industry-standard parallel programming framework (opt-in at build time)
OpenMP support is selected at build time with the --use_openmp flag.
Thread Pool Types
Intra-Op Thread Pool
Parallelism within a single operator:
- Matrix multiplications
- Convolution operations
- Element-wise operations on large tensors
Inter-Op Thread Pool
Parallelism between independent operators:
- Models with parallel branches
- Independent operations in the graph
- Pipeline parallelism
Configuration Examples
CPU-Bound Workloads
Models with Parallel Branches
High-Throughput Server
Execution Modes
Sequential Execution
- Lower scheduling overhead
- Operators execute one at a time
- Better for simple, linear graphs
- Default mode for most scenarios
Parallel Execution
- Higher parallelism between operators
- Better for complex graphs with independent paths
- Higher scheduling overhead
- Requires inter-op thread pool
C++ API
Basic Configuration
Custom Thread Pool
Threading Abstractions for Op Developers
ONNX Runtime provides abstractions for implementing parallel operators:
TryParallelFor
Parallel loop that takes a per-unit cost estimate to guide work partitioning.
TrySimpleParallelFor
Simplified version for uniform work.
TryBatchParallelFor
For batched operations.
ShouldParallelize
Check if parallelization is beneficial.
DegreeOfParallelism
Get available parallelism.
ParallelSection
Group multiple loops in a single parallel section.
OpenMP vs ORT Thread Pool
Building with OpenMP
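A sketch of a from-source build with OpenMP enabled, assuming the `build.sh` driver script at the repository root (flag availability varies across releases):

```shell
# Linux/macOS, from the onnxruntime repository root
./build.sh --config Release --parallel --use_openmp
```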
When to Use OpenMP
Advantages:
- Industry-standard parallelization
- Mature optimization
- Good for CPU-intensive ops
Disadvantages:
- May conflict with application-level OpenMP
- Less control over thread pool
- Build-time decision
When to Use ORT Thread Pool
Advantages:
- Full control over threading
- No conflicts with application threads
- Consistent behavior across platforms
- Runtime configuration
Use when:
- Custom threading requirements
- Embedding in existing applications
- Fine-grained control needed
Best Practices
1. Match Thread Count to Hardware
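A heuristic sketch for sizing the intra-op pool: compute-bound kernels often do best with one thread per physical core rather than per logical (SMT) core. `os.cpu_count()` reports logical cores, so halving it assumes 2-way SMT; a library such as psutil can report the exact physical count:

```python
import os

# os.cpu_count() counts logical cores; assume 2-way SMT to estimate
# physical cores. Replace with psutil.cpu_count(logical=False) for an
# exact answer.
logical = os.cpu_count() or 1
physical_estimate = max(1, logical // 2)

print(f"logical cores: {logical}, suggested intra-op threads: {physical_estimate}")
```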
2. Avoid Over-subscription
3. Start with Sequential Mode
4. Tune for Your Workload
5. Set Environment Variables
Control system-level threading (for example, OMP_NUM_THREADS and OMP_WAIT_POLICY for OpenMP builds).
6. Concurrent Inference
For concurrent requests, limit per-session threads so the combined thread count across sessions does not exceed the core count.
Platform-Specific Considerations
Linux
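On Linux, core pinning (suggested under Troubleshooting below) can be done without external tools via `os.sched_setaffinity`; this sketch pins the process, and hence the worker threads ORT spawns afterward, to a single core chosen for illustration:

```python
import os

# Linux-only: restrict this process to a fixed core set so the OS
# scheduler does not migrate ORT worker threads between cores.
# Pinning to one core here is illustrative; choose cores per workload.
available = sorted(os.sched_getaffinity(0))
os.sched_setaffinity(0, {available[0]})
print(os.sched_getaffinity(0))
```

Tools such as `taskset` or `numactl` achieve the same from the command line.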
Windows
macOS
Troubleshooting
Poor CPU Utilization
Symptoms: Low CPU usage during inference
Solutions:
- Increase intra-op threads
- Enable parallel execution mode
- Check for I/O bottlenecks
Thread Contention
Symptoms: Performance degrades with more threads
Solutions:
- Reduce thread count
- Use sequential execution
- Profile for lock contention
Inconsistent Performance
Symptoms: High latency variance
Solutions:
- Fix thread count (don’t use default)
- Disable dynamic threading
- Pin to physical cores