The Qualcomm AI Engine Direct (QNN) execution provider enables efficient inference on Qualcomm Hexagon NPUs found in Snapdragon mobile processors and edge devices.
Requirements
Hardware
Qualcomm Snapdragon processors with Hexagon DSP/NPU:
Snapdragon 8 Gen 1/2/3 (flagship mobile)
Snapdragon 7 series (mid-range mobile)
Snapdragon X Elite (Windows on ARM)
Qualcomm Robotics platforms
Software
Qualcomm AI Engine Direct SDK (QNN SDK)
Android NDK (for Android deployment)
Operating Systems:
Android 10+
Windows on ARM
Linux (embedded systems)
QNN provides exceptional power efficiency, making it ideal for mobile and battery-powered edge devices.
Installation
Build from Source
Android
# Install QNN SDK from Qualcomm
# https://www.qualcomm.com/developer/software/neural-processing-sdk
# Build ONNX Runtime GenAI with QNN support
git clone https://github.com/microsoft/onnxruntime-genai.git
cd onnxruntime-genai
python build.py --use_qnn --qnn_home /path/to/qnn/sdk
# Build for Android
python build.py --android \
--android_abi arm64-v8a \
--use_qnn \
--qnn_home /path/to/qnn/sdk
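After building, one quick sanity check is to confirm that the QNN execution provider was registered in the resulting onnxruntime package; a minimal sketch, assuming the freshly built wheel is installed in your environment:
import onnxruntime as ort
# "QNNExecutionProvider" should appear in this list if the build included QNN support
print(ort.get_available_providers())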
Basic Configuration
Python API
import onnxruntime_genai as og
model_path = "path/to/model"
# Create config and set QNN provider
config = og.Config(model_path)
config.clear_providers()
config.append_provider("qnn")
# Load model
model = og.Model(config)
tokenizer = og.Tokenizer(model)
# Generate
params = og.GeneratorParams(model)
params.set_search_options(max_length=512)
generator = og.Generator(model, params)
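The snippet above stops at constructing the generator; a minimal sketch of the remaining decode loop with streaming text output, using the same onnxruntime_genai API (the prompt string is illustrative):
prompt = "What is the capital of France?"
input_tokens = tokenizer.encode(prompt)
generator.append_tokens(input_tokens)
# Decode tokens to text incrementally as they are generated
stream = tokenizer.create_stream()
while not generator.is_done():
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)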
genai_config.json
{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          {
            "qnn": {}
          }
        ]
      }
    }
  }
}
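When the provider options live in genai_config.json, the model can also be loaded directly from its folder; onnxruntime-genai reads the config automatically:
import onnxruntime_genai as og
# genai_config.json inside the model folder selects the QNN provider
model = og.Model("path/to/model")
tokenizer = og.Tokenizer(model)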
Memory Management
CPU-Accessible NPU Memory
QNN uses CPU-accessible memory for NPU operations:
// QNN memory is CPU-accessible
struct QnnMemory final : DeviceBuffer {
  QnnMemory(size_t size) : owned_{true} {
    size_in_bytes_ = size;
    p_cpu_ = p_device_ = static_cast<uint8_t*>(ort_allocator_->Alloc(size_in_bytes_));
  }
  // No separate device/host transfers needed
  void CopyDeviceToCpu() override {}  // No-op
  void CopyCpuToDevice() override {}  // No-op
};
QNN memory is shared between CPU and NPU, eliminating the need for explicit data transfers and reducing latency.
NPU Configuration
Backend Selection
import onnxruntime_genai as og
config = og.Config(model_path)
config.clear_providers()
config.append_provider("qnn")
# Select QNN backend (Hexagon Tensor Processor)
config.set_provider_option("qnn", "backend_path", "QnnHtp.so")
model = og.Model(config)
The backend library name varies by platform (for example, QnnHtp.dll on Windows). The same options can also be set in genai_config.json:
{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          {
            "qnn": {
              "backend_path": "QnnHtp.so",
              "htp_performance_mode": "burst",
              "enable_htp_weight_sharing": "1"
            }
          }
        ]
      }
    }
  }
}
Burst Mode
{
  "qnn": {
    "htp_performance_mode": "burst"
  }
}
Maximum performance with higher power consumption.
Balanced Mode
{
  "qnn": {
    "htp_performance_mode": "balanced"
  }
}
Balanced performance and power efficiency.
Power Saver
{
  "qnn": {
    "htp_performance_mode": "power_saver"
  }
}
Optimized for battery life with reduced performance.
Sustained
{
  "qnn": {
    "htp_performance_mode": "sustained_high_performance"
  }
}
Sustained high performance for extended workloads.
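Choosing a mode programmatically at startup is straightforward; below is a small sketch that maps battery level to a performance mode, where get_battery_percent is a hypothetical platform-specific helper you would implement per OS:
import onnxruntime_genai as og

def select_performance_mode(battery_percent):
    # Map the available power budget to an HTP performance mode
    if battery_percent < 20:
        return "power_saver"
    if battery_percent < 50:
        return "balanced"
    return "burst"

config = og.Config("path/to/model")
config.clear_providers()
config.append_provider("qnn")
# get_battery_percent() is a hypothetical platform-specific hook
config.set_provider_option("qnn", "htp_performance_mode",
                           select_performance_mode(get_battery_percent()))
model = og.Model(config)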
Mobile Deployment
Android Integration
import ai.onnxruntime.genai.*;

public class MainActivity extends AppCompatActivity {
    private Model model;
    private Tokenizer tokenizer;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        // Load model with QNN (the provider is selected by genai_config.json in the model folder)
        String modelPath = getFilesDir() + "/model";
        model = new Model(modelPath);
        tokenizer = new Tokenizer(model);
    }

    private void generateText(String prompt) {
        int[] inputTokens = tokenizer.encode(prompt);
        GeneratorParams params = new GeneratorParams(model);
        params.setSearchOption("max_length", 256);
        Generator generator = new Generator(model, params);
        generator.appendTokens(inputTokens);
        while (!generator.isDone()) {
            generator.generateNextToken();
            int[] newTokens = generator.getNextTokens();
            // Process tokens
        }
    }
}
Pipeline Models
QNN supports pipeline models for memory-constrained devices:
{
  "model": {
    "decoder": {
      "pipeline": [
        {
          "filename": "model_part1.onnx",
          "model_id": "part1",
          "session_options": {
            "provider_options": [
              { "qnn": {} }
            ]
          },
          "reset_session_idx": -1
        },
        {
          "filename": "model_part2.onnx",
          "model_id": "part2",
          "session_options": {
            "provider_options": [
              { "qnn": {} }
            ]
          },
          "reset_session_idx": 0
        }
      ]
    }
  }
}
reset_session_idx allows releasing memory from a previous pipeline stage once it is no longer needed; in the example above, -1 leaves earlier sessions loaded, while 0 releases the session at index 0 after part2 runs. This is crucial for devices with limited RAM.
Quantization
INT8 Optimization
QNN provides native INT8 support for maximum efficiency:
import onnxruntime_genai as og
# Use INT8 quantized model
model_path = "path/to/quantized_model"
config = og.Config(model_path)
config.clear_providers()
config.append_provider("qnn")
model = og.Model(config)
Precision Configuration
{
  "qnn": {
    "backend_path": "QnnHtp.so",
    "htp_precision": "int8"
  }
}
INT8 quantization on QNN provides:
4x memory reduction relative to FP32
2-4x inference speedup
Minimal accuracy loss with proper calibration
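If you need to produce the INT8 model yourself, one common route is static QDQ quantization with the standard onnxruntime quantization tooling; a rough sketch is shown below (the data reader, sample input, and file names are illustrative, and QNN-specific quantization workflows may impose additional constraints):
import numpy as np
from onnxruntime.quantization import (
    CalibrationDataReader, QuantFormat, QuantType, quantize_static)

class SampleDataReader(CalibrationDataReader):
    # Feeds a handful of representative inputs for calibration (illustrative)
    def __init__(self, samples):
        self._iter = iter(samples)
    def get_next(self):
        return next(self._iter, None)

# Replace with real calibration data: dicts of {input_name: np.ndarray}
samples = [{"input_ids": np.zeros((1, 128), dtype=np.int64)}]

quantize_static(
    "model_fp32.onnx",
    "model_int8.onnx",
    SampleDataReader(samples),
    quant_format=QuantFormat.QDQ,       # QDQ form is what NPU backends consume
    activation_type=QuantType.QUInt8,
    weight_type=QuantType.QUInt8,
)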
Advanced Features
Context Binary Generation
Pre-compile models to context binaries for faster loading:
import onnxruntime_genai as og
config = og.Config(model_path)
config.clear_providers()
config.append_provider("qnn")
# Enable context binary caching
config.set_provider_option("qnn", "qnn_context_cache_enable", "1")
config.set_provider_option("qnn", "qnn_context_cache_path", "./qnn_cache")
model = og.Model(config)
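The effect is easy to observe by timing model construction: the first run compiles and writes the context binary, while subsequent runs load it. A minimal sketch, reusing the config from above:
import time

start = time.time()
model = og.Model(config)  # first run compiles the cache; later runs load it
print(f"Model load time: {time.time() - start:.2f}s")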
Device Filtering
{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          {
            "qnn": {},
            "device_filtering_options": {
              "hardware_device_type": "npu"
            }
          }
        ]
      }
    }
  }
}
Power Management
Battery Optimization
import onnxruntime_genai as og
config = og.Config(model_path)
config.clear_providers()
config.append_provider("qnn")
# Optimize for battery life
config.set_provider_option("qnn", "htp_performance_mode", "power_saver")
config.set_provider_option("qnn", "enable_htp_weight_sharing", "1")
model = og.Model(config)
Thermal Management
# Adjust the performance mode based on thermal state before loading the model
# (device_temperature and threshold come from a platform-specific thermal API)
if device_temperature > threshold:
    config.set_provider_option("qnn", "htp_performance_mode", "balanced")
else:
    config.set_provider_option("qnn", "htp_performance_mode", "burst")
Troubleshooting
QNN SDK Not Found
# Set QNN SDK environment variables
export QNN_SDK_ROOT=/path/to/qnn/sdk
export LD_LIBRARY_PATH=$QNN_SDK_ROOT/lib:$LD_LIBRARY_PATH
Model Loading Failures
Check Model Compatibility
Ensure your ONNX model is compatible with QNN. Not all ONNX operators are supported.
Verify Backend Path
config.set_provider_option("qnn", "backend_path", "QnnHtp.so")
Use the correct backend for your platform (QnnHtp.so, QnnCpu.so, etc.).
Enable Verbose Logging
config.set_provider_option("qnn", "qnn_log_level", "verbose")
Slow Performance
# Enable all optimizations
config.set_provider_option("qnn", "htp_performance_mode", "burst")
config.set_provider_option("qnn", "enable_htp_weight_sharing", "1")
config.set_provider_option("qnn", "qnn_context_cache_enable", "1")
Benchmarking
import time
import onnxruntime_genai as og

model_path = "path/to/model"
config = og.Config(model_path)
config.clear_providers()
config.append_provider("qnn")
config.set_provider_option("qnn", "backend_path", "QnnHtp.so")
model = og.Model(config)
tokenizer = og.Tokenizer(model)

prompt = "What is AI?"
input_tokens = tokenizer.encode(prompt)
params = og.GeneratorParams(model)
params.set_search_options(max_length=100)

start = time.time()
generator = og.Generator(model, params)
generator.append_tokens(input_tokens)
token_count = 0
while not generator.is_done():
    generator.generate_next_token()
    token_count += 1
end = time.time()

print(f"Time: {end - start:.2f}s")
print(f"Tokens/sec: {token_count / (end - start):.2f}")
print("Energy efficiency: NPU optimized")
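To separate prompt processing (prefill) from steady-state decode speed, it can also be worth recording the time to first token; a minimal sketch reusing the model, params, and input_tokens from above:
# Measure time to first token, then the steady-state decode rate
start = time.time()
generator = og.Generator(model, params)
generator.append_tokens(input_tokens)
generator.generate_next_token()
first_token_time = time.time() - start

decode_start = time.time()
decoded = 0
while not generator.is_done():
    generator.generate_next_token()
    decoded += 1
decode_time = time.time() - decode_start

print(f"Time to first token: {first_token_time:.2f}s")
if decoded and decode_time > 0:
    print(f"Decode tokens/sec: {decoded / decode_time:.2f}")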
Best Practices
Use INT8 Models: Quantize models to INT8 for best NPU performance and power efficiency.
Enable Context Caching: Pre-compile models to context binaries to reduce loading time.
Pipeline Large Models: Split large models into pipeline stages to fit in device memory.
Optimize Performance Mode: Choose the performance mode based on battery state and thermal conditions.
Next Steps
Mobile Deployment: Deploy to Android devices
Model Quantization: Optimize models for QNN