# Available Backends

Retto uses ONNX Runtime under the hood, supporting multiple execution providers:

- **CPU** — universal fallback; works everywhere
- **CUDA** — NVIDIA GPU acceleration
- **DirectML** — Windows GPU acceleration
- **WebAssembly** — browser-based execution
## CPU Backend

The CPU backend is always available and requires no special configuration:

```toml
[dependencies]
retto-core = { version = "0.1.5", features = ["backend-ort"] }
```

```rust
use retto_core::prelude::*;

let cfg = RettoSessionConfig {
    worker_config: RettoOrtWorkerConfig {
        device: RettoOrtWorkerDevice::CPU,
        models: RettoOrtWorkerModelProvider::default(),
    },
    ..Default::default()
};
```

The CPU backend uses optimized SIMD instructions on x86_64 and ARM architectures.
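You can check at runtime which SIMD extensions your CPU actually exposes. The sketch below uses only the Rust standard library's feature-detection macros and is not part of the Retto API:

```rust
/// Names of SIMD extensions detected at runtime on this CPU.
/// ONNX Runtime's CPU kernels select wider vector code paths
/// when these extensions are present.
fn detected_simd() -> Vec<&'static str> {
    let mut found = Vec::new();
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse2") { found.push("sse2"); }
        if is_x86_feature_detected!("avx2") { found.push("avx2"); }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") { found.push("neon"); }
    }
    found
}

fn main() {
    println!("SIMD extensions in use: {:?}", detected_simd());
}
```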
## CUDA Backend

Accelerate inference with NVIDIA GPUs using CUDA.

### Requirements

- NVIDIA GPU with CUDA support
- CUDA Toolkit 11.x or 12.x
- cuDNN library

### Enable Feature

```toml
[dependencies]
retto-core = { version = "0.1.5", features = ["backend-ort-cuda"] }
```

For the CLI:

```sh
cargo build --release --features backend-ort-cuda
```

```rust
use retto_core::prelude::*;

let cfg = RettoSessionConfig {
    worker_config: RettoOrtWorkerConfig {
        device: RettoOrtWorkerDevice::Cuda(0), // Use GPU 0
        models: RettoOrtWorkerModelProvider::default(),
    },
    ..Default::default()
};

let mut session = RettoSession::new(cfg)?;
```
### Multi-GPU Selection

Specify which GPU to use by device index:

```rust
// Use the first GPU
device: RettoOrtWorkerDevice::Cuda(0)

// Use the second GPU
device: RettoOrtWorkerDevice::Cuda(1)
```
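If the device index should be configurable at run time rather than hard-coded, one option is to read it from an environment variable. The variable name `RETTO_GPU_ID` below is an illustrative choice, not something Retto defines:

```rust
use std::env;

/// Parse a GPU index from the (hypothetical) RETTO_GPU_ID variable,
/// defaulting to device 0 when it is unset or unparsable.
fn gpu_id_from_env() -> i32 {
    env::var("RETTO_GPU_ID")
        .ok()
        .and_then(|s| s.parse::<i32>().ok())
        .unwrap_or(0)
}

fn main() {
    let id = gpu_id_from_env();
    println!("using CUDA device {id}");
    // ...then: device: RettoOrtWorkerDevice::Cuda(id)
}
```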
### CUDA Configuration

From ort_worker.rs:154, the CUDA execution provider is configured with:

```rust
RettoOrtWorkerDevice::Cuda(id) => providers.push(
    CUDAExecutionProvider::default()
        .with_arena_extend_strategy(NextPowerOfTwo)
        .with_conv_algorithm_search(Exhaustive)
        .with_device_id(id)
        .build(),
)
```

**Arena strategy:** memory allocation uses the NextPowerOfTwo strategy for efficient GPU memory management.

**Conv algorithm:** exhaustive search finds the fastest convolution algorithm for your specific GPU.
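The effect of the NextPowerOfTwo arena strategy can be illustrated with Rust's own integer helper: each growth request is rounded up to the next power of two, so repeated small growths reuse a few large blocks instead of fragmenting GPU memory. A standalone illustration, not Retto code:

```rust
fn main() {
    // Round arena growth requests up to a power of two, as the
    // NextPowerOfTwo extend strategy does for GPU allocations.
    for request in [1_000u64, 5_000, 70_000, 2_000_000] {
        println!("{request} bytes -> {} bytes", request.next_power_of_two());
    }
    // 1000 -> 1024, 5000 -> 8192, 70000 -> 131072, 2000000 -> 2097152
}
```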
## DirectML Backend

DirectML provides GPU acceleration on Windows for AMD, Intel, and NVIDIA GPUs.

### Requirements

- Windows 10 version 1903 or later
- DirectX 12 capable GPU

### Enable Feature

```toml
[dependencies]
retto-core = { version = "0.1.5", features = ["backend-ort-directml"] }
```

```rust
use retto_core::prelude::*;

let cfg = RettoSessionConfig {
    worker_config: RettoOrtWorkerConfig {
        device: RettoOrtWorkerDevice::DirectML(0),
        models: RettoOrtWorkerModelProvider::default(),
    },
    ..Default::default()
};
```
### DirectML Configuration

From ort_worker.rs:162, DirectML setup:

```rust
RettoOrtWorkerDevice::DirectML(id) => providers.push(
    DirectMLExecutionProvider::default()
        .with_device_id(id)
        .build(),
)
```

DirectML is particularly useful for AMD GPUs on Windows, which don't support CUDA.
## WebAssembly Backend

Run OCR in web browsers using WebAssembly.

### Enable Feature

```toml
[dependencies]
retto-core = { version = "0.1.5", features = ["backend-ort-wasm"] }
```

### Initialization

From ort_worker.rs:144, WASM requires explicit initialization:

```rust
#[cfg(target_family = "wasm")]
{
    ort::init()
        .with_global_thread_pool(ort::environment::GlobalThreadPoolOptions::default())
        .commit()
        .expect("Cannot initialize ort.");
}
```

WebAssembly execution is significantly slower than native backends; consider server-side processing for production use.
## Backend Selection Logic

From ort_worker.rs:152, Retto uses a fallback chain (simplified):

```rust
let mut providers = Vec::new();

// Try GPU backends first
match cfg.device {
    RettoOrtWorkerDevice::Cuda(id) => {
        providers.push(CUDAExecutionProvider::default().with_device_id(id).build());
    }
    RettoOrtWorkerDevice::DirectML(id) => {
        providers.push(DirectMLExecutionProvider::default().with_device_id(id).build());
    }
    _ => {}
};

// Always add CPU as a fallback
providers.push(CPUExecutionProvider::default().build());
```

If the GPU backend fails to initialize, execution automatically falls back to CPU.
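The same first-available pattern can be written generically. Below is a self-contained sketch of a priority chain; the backend names and initializers are stand-ins, not the ort API:

```rust
/// Try each backend initializer in priority order and return the name
/// of the first one that succeeds, mirroring ONNX Runtime's provider
/// fallback behavior.
fn pick_backend(inits: &[(&'static str, fn() -> bool)]) -> &'static str {
    inits
        .iter()
        .find(|(_, init)| init())
        .map(|(name, _)| *name)
        .unwrap_or("none")
}

fn main() {
    let chain: &[(&'static str, fn() -> bool)] = &[
        ("cuda", || false), // e.g. no NVIDIA driver present
        ("cpu", || true),   // CPU always initializes
    ];
    println!("selected backend: {}", pick_backend(chain));
}
```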
## Backend Comparison

### CPU

**Performance:** baseline (1x)

**Pros:**
- Universal compatibility
- No extra dependencies
- Stable and reliable

**Cons:**
- Slower inference
- Limited parallelism

**Use cases:**
- Development
- Low-volume processing
- Deployment without GPU access

### CUDA

**Performance:** 5-20x faster than CPU

**Pros:**
- Excellent performance
- Wide hardware support
- Mature ecosystem

**Cons:**
- NVIDIA GPUs only
- CUDA Toolkit required
- Higher setup complexity

**Use cases:**
- High-volume batch processing
- Real-time applications
- Data centers with NVIDIA GPUs

### DirectML

**Performance:** 3-10x faster than CPU

**Pros:**
- Works with AMD/Intel/NVIDIA GPUs
- Built into Windows
- Easy setup

**Cons:**
- Windows only
- Slightly slower than CUDA
- Less optimization

**Use cases:**
- Windows desktop applications
- AMD GPU systems
- Cross-vendor GPU support

### WebAssembly

**Performance:** 0.1-0.3x CPU speed

**Pros:**
- Runs in browsers
- No server required
- Privacy-preserving

**Cons:**
- Very slow
- Limited by JavaScript
- Large WASM binary

**Use cases:**
- Client-side demos
- Privacy-sensitive applications
- Offline web apps
## CLI Backend Selection

The CLI supports runtime backend selection:

```sh
retto-cli --device cpu --images ./photos/
```

From main.rs:46, the CLI parses device configuration:

```rust
let device = match cli.device {
    DeviceKind::Cpu => RettoOrtWorkerDevice::CPU,
    #[cfg(feature = "backend-ort-cuda")]
    DeviceKind::Cuda => RettoOrtWorkerDevice::Cuda(cli.device_id),
    #[cfg(feature = "backend-ort-directml")]
    DeviceKind::DirectMl => RettoOrtWorkerDevice::DirectML(cli.device_id),
};
```
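The mapping from a `--device` string to a variant can be sketched independently of the CLI framework. The enum and flag names below are assumptions modeled on the CLI above, not the actual retto-cli types:

```rust
#[derive(Debug, PartialEq)]
enum Device {
    Cpu,
    Cuda(u32),
    DirectMl(u32),
}

/// Parse a device flag such as "cpu", "cuda", or "directml",
/// pairing GPU backends with a caller-supplied device id.
fn parse_device(name: &str, id: u32) -> Option<Device> {
    match name.to_ascii_lowercase().as_str() {
        "cpu" => Some(Device::Cpu),
        "cuda" => Some(Device::Cuda(id)),
        "directml" => Some(Device::DirectMl(id)),
        _ => None, // unknown backend name
    }
}

fn main() {
    println!("{:?}", parse_device("cuda", 1));
}
```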
## Choosing the Right Backend

1. **Start with CPU.** Begin development with the CPU backend for maximum compatibility.
2. **Profile your workload.** Measure throughput and latency requirements.
3. **Enable GPU if needed:**
   - CUDA for NVIDIA GPUs and maximum performance
   - DirectML for Windows with any GPU
   - WASM for browser-based processing
4. **Optimize configuration.** Tune device IDs, batch sizes, and threading for your specific hardware.
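For the threading knob, a common starting point is the number of cores the OS makes available to the process. A sketch using only the standard library; how the resulting count is handed to the runtime (e.g. via ORT_NUM_THREADS) is up to you:

```rust
use std::thread;

/// Pick a default worker thread count: all available cores,
/// falling back to 1 if the count cannot be determined.
fn default_threads() -> usize {
    thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}

fn main() {
    println!("suggested thread count: {}", default_threads());
}
```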
## Troubleshooting

### CUDA Initialization Fails

Error: `CUDA execution provider is not available`

Solutions:
- Verify CUDA Toolkit installation: `nvcc --version`
- Check GPU availability: `nvidia-smi`
- Ensure cuDNN is installed
- Rebuild with `--features backend-ort-cuda`

### DirectML Not Available

Error: `DirectML execution provider is not available`

Solutions:
- Update Windows to version 1903 or later
- Verify GPU drivers are up to date
- Check DirectX 12 support

### WASM Out of Memory

Error: `Failed to allocate memory`

Solutions:
- Reduce image size before processing
- Use lower-resolution models
- Increase browser memory limits
## Environment Variables

Control ONNX Runtime behavior:

```sh
# Enable debug logging
export ORT_LOG_LEVEL=VERBOSE

# Set thread count
export ORT_NUM_THREADS=4

# Disable telemetry
export ORT_DISABLE_TELEMETRY=1
```
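These variables can also be set from inside the process. A minimal sketch using only the standard library; note this must run before the runtime initializes, and whether a given variable is honored depends on when ONNX Runtime reads the environment:

```rust
use std::env;

/// Set ONNX Runtime environment variables programmatically.
/// Must be called before the runtime is initialized.
fn configure_ort_env() {
    env::set_var("ORT_LOG_LEVEL", "VERBOSE");
    env::set_var("ORT_NUM_THREADS", "4");
}

fn main() {
    configure_ort_env();
    println!("ORT_LOG_LEVEL = {}", env::var("ORT_LOG_LEVEL").unwrap());
}
```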
## Next Steps

- **Model Loading** — optimize model loading for your backend
- **Rust Usage** — build applications with Retto