The CUDA execution provider enables high-performance inference on NVIDIA GPUs with optimized kernels for generative AI workloads.
Requirements
Hardware
NVIDIA GPU with Compute Capability 6.0 or higher
Recommended: RTX 20 series or newer for optimal performance
Minimum: 4GB GPU memory (varies by model size)
Software
CUDA Toolkit 11.8 or 12.x
cuDNN 8.x or later
NVIDIA driver 520.61.05 or newer (Linux) / 528.33 or newer (Windows)
The CUDA provider is included in the onnxruntime-genai-cuda package.
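The hardware floor above can be expressed as a quick check. This is a hypothetical helper, not part of any package; the 6.0 and 7.0 thresholds come from the requirements and FP16 sections of this page:

```python
MIN_COMPUTE_CAPABILITY = 6.0   # baseline requirement for the CUDA provider
FP16_COMPUTE_CAPABILITY = 7.0  # Tensor Core FP16 (Volta and newer)

def supports_cuda_provider(compute_capability: float, fp16: bool = False) -> bool:
    """Return True if a GPU's compute capability meets the requirement."""
    floor = FP16_COMPUTE_CAPABILITY if fp16 else MIN_COMPUTE_CAPABILITY
    return compute_capability >= floor

# e.g. Turing (7.5) supports both FP32 and FP16;
# Pascal (6.1) meets the baseline but not the FP16 Tensor Core floor
```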
Installation
pip install onnxruntime-genai-cuda --pre
Alternatively, download the CUDA package from the releases page. For C#/.NET projects, add the NuGet package:
dotnet add package Microsoft.ML.OnnxRuntimeGenAI.Cuda
Basic Configuration
Python API
import onnxruntime_genai as og

model_path = "path/to/model"

# Create config and set CUDA provider
config = og.Config(model_path)
config.clear_providers()
config.append_provider("cuda")

# Load model
model = og.Model(config)
tokenizer = og.Tokenizer(model)

# Generate
params = og.GeneratorParams(model)
params.set_search_options(max_length=1024)
generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("Tell me about AI"))
while not generator.is_done():
    generator.generate_next_token()
print(tokenizer.decode(generator.get_sequence(0)))
genai_config.json
Configure CUDA in your model configuration:
{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          {
            "cuda": {}
          }
        ]
      }
    }
  }
}
GPU Memory Management
Memory Allocation
The CUDA provider uses ONNX Runtime’s allocator for GPU memory:
import onnxruntime_genai as og

config = og.Config(model_path)
config.clear_providers()
config.append_provider("cuda")

# Enable CUDA graph capture and cap GPU memory usage
config.set_provider_option("cuda", "enable_cuda_graph", "1")
config.set_provider_option("cuda", "gpu_mem_limit", "4294967296")  # 4 GB

model = og.Model(config)
GPU Memory Limit: Set gpu_mem_limit to restrict CUDA memory usage (in bytes). This is useful for multi-tenant scenarios.
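Since gpu_mem_limit takes a raw byte count as a string, a small conversion helper (hypothetical, not part of the API) makes the intent clearer:

```python
def gib_to_bytes(gib: float) -> int:
    """Convert gibibytes to the raw byte count gpu_mem_limit expects."""
    return int(gib * 1024 ** 3)

# 4 GiB -> "4294967296", matching the example above
limit = str(gib_to_bytes(4))
# config.set_provider_option("cuda", "gpu_mem_limit", limit)
```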
KV Cache Management
ONNX Runtime GenAI optimizes key-value cache management:
params = og.GeneratorParams(model)

# Configure cache sharing for better memory efficiency
params.set_search_options(
    max_length=2048,
    past_present_share_buffer=True  # Reduces memory usage
)
When using beam search (num_beams > 1), past_present_share_buffer must be set to False. CUDA graph is also incompatible with beam search.
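As an illustration, a beam-search setup in genai_config.json would pair num_beams with buffer sharing disabled. The key names below follow the config's search section; treat the specific values as an assumption for this sketch:

```json
{
  "search": {
    "num_beams": 4,
    "past_present_share_buffer": false
  }
}
```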
CUDA Graph Optimization
Enable CUDA graphs to reduce kernel launch overhead:
config = og.Config(model_path)
config.clear_providers()
config.append_provider("cuda")

# Enable CUDA graph (for greedy search only)
config.set_provider_option("cuda", "enable_cuda_graph", "1")

model = og.Model(config)
CUDA graphs are only compatible with greedy search (num_beams=1). Disable for beam search scenarios.
Multi-Profile Support
Optimize for multiple sequence lengths:
{
  "model": {
    "decoder": {
      "session_options": {
        "enable_cuda_graph": "1",
        "cuda_graph_enable_multi_profile": "1"
      }
    }
  }
}
Stream Configuration
The CUDA provider uses a dedicated stream for async operations. Memory transfers are optimized using pinned host memory:
// C++ example of stream usage
void* GetStream() {
  return g_stream.get();
}

// Async memory copy on the dedicated stream
cudaMemcpyAsync(dst, src, size, cudaMemcpyDeviceToHost, GetStream());
cudaStreamSynchronize(GetStream());
Precision Options
FP16 Inference
The CUDA provider supports FP16 for reduced memory and faster inference:
# Use FP16 model variant
model_path = "model-fp16-cuda"

config = og.Config(model_path)
config.clear_providers()
config.append_provider("cuda")
model = og.Model(config)
FP32 (default)
Memory: Higher usage
Speed: Baseline performance
Precision: Full precision
Hardware: All CUDA GPUs
FP16
Memory: 50% reduction
Speed: 2-3x faster on Tensor Cores
Precision: Mixed precision
Hardware: Compute Capability 7.0+ (Volta, Turing, Ampere, Ada)
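The 50% figure follows directly from element width: FP16 stores each weight in 2 bytes instead of 4. A back-of-the-envelope sketch (the 7B parameter count is illustrative, not tied to any specific model):

```python
def weight_memory_gib(num_params: int, bytes_per_param: int) -> float:
    """Approximate weight memory in GiB for a given element width."""
    return num_params * bytes_per_param / 1024 ** 3

params_7b = 7_000_000_000           # illustrative 7B-parameter model
fp32 = weight_memory_gib(params_7b, 4)  # ~26.1 GiB
fp16 = weight_memory_gib(params_7b, 2)  # ~13.0 GiB, half the footprint
```

Activations and the KV cache add to this, so total savings in practice land near, but not exactly at, 50%.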
Advanced Configuration
Device Selection
Select a specific GPU in multi-GPU systems:
import os

# Set CUDA device before loading model
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # Use first GPU

config = og.Config(model_path)
config.clear_providers()
config.append_provider("cuda")
model = og.Model(config)
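The ONNX Runtime CUDA execution provider also accepts a device_id option. Assuming it is passed through unchanged, device selection could instead be expressed in genai_config.json (the value "1" here, meaning the second GPU, is illustrative):

```json
{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          {
            "cuda": {
              "device_id": "1"
            }
          }
        ]
      }
    }
  }
}
```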
Session Options
Fine-tune ONNX Runtime session settings:
{
  "model": {
    "decoder": {
      "session_options": {
        "enable_mem_pattern": true,
        "enable_cpu_mem_arena": false,
        "intra_op_num_threads": 1,
        "provider_options": [
          {
            "cuda": {
              "gpu_mem_limit": "8589934592",
              "arena_extend_strategy": "kSameAsRequested"
            }
          }
        ]
      }
    }
  }
}
Troubleshooting
Out of Memory Errors
# Reduce memory usage
config.set_provider_option("cuda", "gpu_mem_limit", "4294967296")

# Use smaller batch sizes
params.set_search_options(batch_size=1)

# Reduce max length
params.set_search_options(max_length=512)
Slow Performance
Enable CUDA graphs to reduce kernel launch overhead:
config.set_provider_option("cuda", "enable_cuda_graph", "1")
Switch to FP16 model variants for Tensor Core acceleration.
Share the KV cache buffer to cut memory traffic:
params.set_search_options(past_present_share_buffer=True)
Driver Compatibility
Verify CUDA installation:
# Check CUDA version
nvcc --version
# Verify driver
nvidia-smi
# Test CUDA availability
python -c "import onnxruntime_genai as og; print('CUDA available')"
Benchmarking
import time
import onnxruntime_genai as og

config = og.Config(model_path)
config.clear_providers()
config.append_provider("cuda")

model = og.Model(config)
tokenizer = og.Tokenizer(model)

prompt = "Tell me about AI"
input_tokens = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(max_length=100)

start = time.time()
generator = og.Generator(model, params)
generator.append_tokens(input_tokens)
while not generator.is_done():
    generator.generate_next_token()
elapsed = time.time() - start

# Count only newly generated tokens, not the prompt
new_tokens = len(generator.get_sequence(0)) - len(input_tokens)
print(f"Generation time: {elapsed:.2f}s")
print(f"Tokens/sec: {new_tokens / elapsed:.2f}")
Next Steps
Model Optimization: Optimize models for CUDA deployment
Memory Management: Learn advanced memory techniques