QNN Execution Provider

The QNN (Qualcomm Neural Network) Execution Provider enables hardware-accelerated inference on Qualcomm platforms, including Snapdragon mobile processors, IoT devices, and edge compute platforms.

When to Use QNN EP

Use the QNN Execution Provider when:
  • You’re deploying on Android devices with Qualcomm Snapdragon processors
  • You need to leverage Qualcomm’s AI accelerators (Hexagon DSP, AI Engine)
  • You’re building IoT or edge devices with Qualcomm chipsets
  • You want optimized inference on Qualcomm compute platforms
  • You need low-power, high-performance inference on mobile

Key Features

  • Hexagon DSP: Leverage dedicated signal processing hardware
  • AI Engine: Access specialized neural network accelerators
  • Multi-Core Optimization: Utilize multiple compute units efficiently
  • Low Power: Optimized for battery-powered devices
  • Quantization Support: INT8 and FP16 precision modes
  • Android Integration: Seamless deployment on Android devices

Prerequisites

Hardware Requirements

Supported Chipsets:
  • Snapdragon 8 Gen 2/3 (flagship smartphones)
  • Snapdragon 7 Series (upper mid-range)
  • Snapdragon 6 Series (mid-range)
  • Snapdragon 8cx Gen 3 (Windows on ARM)
  • Qualcomm IoT and Edge platforms
Recommended:
  • Snapdragon 888 or newer for best performance
  • Devices with Hexagon 698 DSP or newer

Software Requirements

  • Qualcomm Neural Processing SDK (QNN SDK)
  • Android NDK (for Android deployment)
  • ONNX Runtime with QNN support (a quick check follows this list)
  • Android API Level 29+ (Android 10+)
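
A quick way to verify the last point is to ask ONNX Runtime which execution providers it was built with. This is a minimal check assuming a Python environment; if "QNNExecutionProvider" is missing, you need a QNN-enabled build (see Build from Source below):
import onnxruntime as ort

providers = ort.get_available_providers()
print(ort.__version__, providers)
if "QNNExecutionProvider" not in providers:
    print("This ONNX Runtime build was not compiled with QNN support")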

Installation

Android (Java/Kotlin)

// app/build.gradle
dependencies {
    implementation 'com.microsoft.onnxruntime:onnxruntime-android:1.17.0'
}

Android (Native C++)

# CMakeLists.txt
add_library(onnxruntime SHARED IMPORTED)
set_target_properties(onnxruntime PROPERTIES
    IMPORTED_LOCATION ${ONNXRUNTIME_LIB_DIR}/libonnxruntime.so
    INTERFACE_INCLUDE_DIRECTORIES ${ONNXRUNTIME_INCLUDE_DIR}  # headers (onnxruntime_cxx_api.h)
)

target_link_libraries(your_app
    onnxruntime
)

Python (Linux/Development)

# Install the base ONNX Runtime package
# Note: the QNN EP itself requires a QNN-enabled build (see Build from Source below)
pip install onnxruntime

# Download Qualcomm QNN SDK
# https://developer.qualcomm.com/software/qualcomm-neural-processing-sdk

Build from Source

# Clone ONNX Runtime
git clone https://github.com/microsoft/onnxruntime.git
cd onnxruntime

# Set QNN SDK path
export QNN_SDK_ROOT=/path/to/qnn-sdk

# Build with QNN support for Android
# (requires the Android SDK and NDK; adjust ANDROID_SDK_ROOT / ANDROID_NDK_ROOT to your setup)
./build.sh --config Release \
    --android \
    --android_sdk_path $ANDROID_SDK_ROOT \
    --android_ndk_path $ANDROID_NDK_ROOT \
    --android_abi arm64-v8a \
    --android_api 29 \
    --use_qnn \
    --qnn_home $QNN_SDK_ROOT \
    --build_shared_lib

Basic Usage

Java/Kotlin (Android)

import ai.onnxruntime.*
import java.nio.FloatBuffer

// Create session options
val sessionOptions = OrtSession.SessionOptions()

// Add QNN provider (Hexagon HTP backend)
sessionOptions.addExecutionProvider("QNN", mapOf("backend_path" to "libQnnHtp.so"))

// Create environment and session
val env = OrtEnvironment.getEnvironment()
val session = env.createSession(
    context.assets.open("model.onnx").readBytes(),
    sessionOptions
)

// Prepare input
val inputName = session.inputNames.iterator().next()
val inputShape = longArrayOf(1, 3, 224, 224)
val inputBuffer = FloatArray(1 * 3 * 224 * 224)  // fill with preprocessed input data
val inputTensor = OnnxTensor.createTensor(
    env,
    FloatBuffer.wrap(inputBuffer),
    inputShape
)

// Run inference
val inputs = mapOf(inputName to inputTensor)
val outputs = session.run(inputs)

// Get result
val output = (outputs[0] as OnnxTensor).floatBuffer

// Clean up
inputTensor.close()
outputs.close()

C++ (Android NDK)

#include <onnxruntime_cxx_api.h>
#include <vector>

Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "QNNExample");
Ort::SessionOptions session_options;

// Configure QNN provider
std::unordered_map<std::string, std::string> qnn_options;
qnn_options["backend_path"] = "libQnnHtp.so";  // Hexagon backend
qnn_options["qnn_context_priority"] = "high";

session_options.AppendExecutionProvider("QNN", qnn_options);

// Create session
Ort::Session session(env, "model.onnx", session_options);

// Prepare a single input tensor (the names and shape below are illustrative;
// use your model's actual input/output names)
std::vector<const char*> input_names{"input"};
std::vector<const char*> output_names{"output"};
std::vector<int64_t> input_shape{1, 3, 224, 224};
std::vector<float> input_data(1 * 3 * 224 * 224, 0.0f);  // fill with preprocessed data

Ort::MemoryInfo memory_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
Ort::Value input_tensor = Ort::Value::CreateTensor<float>(
    memory_info, input_data.data(), input_data.size(),
    input_shape.data(), input_shape.size());

// Run inference
auto output_tensors = session.Run(Ort::RunOptions{nullptr},
                                  input_names.data(), &input_tensor, 1,
                                  output_names.data(), 1);

Python (Linux)

import onnxruntime as ort
import numpy as np

# Create session with QNN provider
session = ort.InferenceSession(
    "model.onnx",
    providers=[
        ('QNNExecutionProvider', {
            'backend_path': 'libQnnHtp.so',
            'qnn_context_priority': 'high'
        }),
        'CPUExecutionProvider'
    ]
)

# Prepare input
input_name = session.get_inputs()[0].name
x = np.random.randn(1, 3, 224, 224).astype(np.float32)

# Run inference
results = session.run(None, {input_name: x})
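
ONNX Runtime falls back to the next provider in the list if QNN fails to initialize, so it is worth confirming which providers the session actually ended up using. A one-line check against the session created above:
# If QNN failed to load, only CPUExecutionProvider will appear here and
# inference will silently run on the CPU fallback.
print(session.get_providers())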

Configuration Options

Backend Selection

QNN supports multiple hardware backends:
// Hexagon DSP (best performance)
qnn_options["backend_path"] = "libQnnHtp.so";

// CPU backend (fallback, debugging)
qnn_options["backend_path"] = "libQnnCpu.so";

// GPU backend
qnn_options["backend_path"] = "libQnnGpu.so";
// Kotlin
val options = mapOf(
    "backend_path" to "libQnnHtp.so"
)
sessionOptions.addExecutionProvider("QNN", options)

Priority Settings

// High priority for latency-critical tasks
qnn_options["qnn_context_priority"] = "high";

// Normal priority (default)
qnn_options["qnn_context_priority"] = "normal";

// Low priority for background tasks
qnn_options["qnn_context_priority"] = "low";

Profiling

// Enable profiling for performance analysis
qnn_options["profiling_level"] = "basic";  // or "detailed"
qnn_options["enable_htp_fp16_precision"] = "1";  // FP16 mode

Advanced Options

std::unordered_map<std::string, std::string> qnn_options;

// Backend configuration
qnn_options["backend_path"] = "libQnnHtp.so";
qnn_options["qnn_context_priority"] = "high";

// Performance tuning
qnn_options["enable_htp_fp16_precision"] = "1";
qnn_options["htp_performance_mode"] = "burst";  // sustained_high_performance, burst, power_saver, balanced

// Context configuration
qnn_options["qnn_saver_path"] = "/data/local/tmp/qnn_context";
qnn_options["enable_htp_weight_sharing"] = "1";

// Debugging
qnn_options["profiling_level"] = "basic";
qnn_options["rpc_control_latency"] = "100";  // microseconds

Performance Optimization

Quantization

QNN performs best with quantized models:
import onnxruntime as ort
from onnxruntime.quantization import quantize_dynamic, QuantType

# Quantize model weights to INT8
quantize_dynamic(
    "model.onnx",
    "model_int8.onnx",
    weight_type=QuantType.QInt8
)

# Use quantized model with QNN
session = ort.InferenceSession(
    "model_int8.onnx",
    providers=[('QNNExecutionProvider', {
        'backend_path': 'libQnnHtp.so'
    })]
)
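
The example above uses dynamic quantization for brevity. In practice, the Hexagon (HTP) backend is generally fed statically quantized QDQ models, which also quantize activations. The following is a rough sketch using ONNX Runtime's static quantization API; the input name, shape, and the random calibration reader are placeholders to replace with real calibration data:
import numpy as np
from onnxruntime.quantization import (CalibrationDataReader, QuantFormat,
                                      QuantType, quantize_static)

class RandomCalibrationReader(CalibrationDataReader):
    """Feeds a few random samples; replace with real calibration data."""
    def __init__(self, input_name, num_samples=16):
        self.samples = iter(
            {input_name: np.random.randn(1, 3, 224, 224).astype(np.float32)}
            for _ in range(num_samples)
        )

    def get_next(self):
        return next(self.samples, None)

quantize_static(
    "model.onnx",
    "model_qdq_int8.onnx",
    calibration_data_reader=RandomCalibrationReader("input"),  # use the model's real input name
    quant_format=QuantFormat.QDQ,
    activation_type=QuantType.QUInt8,
    weight_type=QuantType.QInt8,
)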

Performance Modes

// Maximum performance (high power)
qnn_options["htp_performance_mode"] = "burst";

// Sustained high performance
qnn_options["htp_performance_mode"] = "sustained_high_performance";

// Balanced (default)
qnn_options["htp_performance_mode"] = "balanced";

// Power saving
qnn_options["htp_performance_mode"] = "power_saver";

Context Caching

Save compiled contexts for faster initialization:
// First run: compile the graph and save the generated context binary
qnn_options["qnn_context_cache_enable"] = "1";
qnn_options["qnn_context_cache_path"] = "/data/local/tmp/model_context";
session_options.AppendExecutionProvider("QNN", qnn_options);

Ort::Session session(env, "model.onnx", session_options);  // compiles and caches the context

// Subsequent runs with the same options load the cached context (much faster startup)
Ort::Session cached_session(env, "model.onnx", session_options);

FP16 Precision

Enable FP16 for better performance:
qnn_options["enable_htp_fp16_precision"] = "1";

Android Integration

Complete Android Example

import ai.onnxruntime.*
import android.content.Context
import kotlinx.coroutines.*

class ModelInference(private val context: Context) {
    private lateinit var env: OrtEnvironment
    private lateinit var session: OrtSession
    
    suspend fun initialize() = withContext(Dispatchers.IO) {
        env = OrtEnvironment.getEnvironment()
        
        val sessionOptions = OrtSession.SessionOptions().apply {
            // Configure QNN
            val qnnOptions = mapOf(
                "backend_path" to "libQnnHtp.so",
                "qnn_context_priority" to "high",
                "enable_htp_fp16_precision" to "1"
            )
            addExecutionProvider("QNN", qnnOptions)
            
            // Additional optimizations
            setIntraOpNumThreads(4)
            setOptimizationLevel(OrtSession.SessionOptions.OptLevel.ALL_OPT)
        }
        
        // Load model from assets
        val modelBytes = context.assets.open("model.onnx").readBytes()
        session = env.createSession(modelBytes, sessionOptions)
    }
    
    suspend fun runInference(input: FloatArray): FloatArray = withContext(Dispatchers.Default) {
        val inputName = session.inputNames.first()
        val inputShape = longArrayOf(1, 3, 224, 224)
        
        // Create input tensor
        val inputTensor = OnnxTensor.createTensor(
            env,
            java.nio.FloatBuffer.wrap(input),
            inputShape
        )
        
        // Run inference
        val outputs = session.run(mapOf(inputName to inputTensor))
        
        // Extract result
        val output = (outputs[0] as OnnxTensor).floatBuffer
        val result = FloatArray(output.remaining())
        output.get(result)
        
        // Clean up
        inputTensor.close()
        outputs.close()
        
        result
    }
    
    fun close() {
        session.close()
        env.close()
    }
}

Permissions (AndroidManifest.xml)

<!-- No special permissions required for QNN -->
<!-- Optional: for loading models from external storage -->
<uses-permission android:name="android.permission.READ_EXTERNAL_STORAGE" />

Asset Packaging

// app/build.gradle
android {
    // ... other config ...
    
    aaptOptions {
        noCompress "onnx"
    }
}

Platform Support

Platform          Architecture   Support      Notes
Android           ARM64          ✅ Full       Primary platform
Android           ARMv7          ⚠️ Limited    Older devices
Linux             ARM64          ✅ Limited    Development/testing
Windows on ARM    ARM64          ✅ Limited    Snapdragon PCs
Linux             x64            ❌ No         Use CPU/CUDA instead

Supported Chipsets

Flagship (Best Performance)

  • Snapdragon 8 Gen 3
  • Snapdragon 8 Gen 2
  • Snapdragon 888/888+
  • Snapdragon 8+ Gen 1

Upper Mid-Range

  • Snapdragon 7 Gen 1/2
  • Snapdragon 778G/782G
  • Snapdragon 870

Mid-Range

  • Snapdragon 695/690
  • Snapdragon 6 Gen 1

Edge/IoT

  • Snapdragon 660/665
  • Qualcomm IoT platforms

Troubleshooting

Provider Not Available

// Check whether this ONNX Runtime build exposes the QNN provider
val providers = OrtEnvironment.getAvailableProviders()
if (providers.none { it.name == "QNN" }) {
    Log.w("QNN", "QNN provider not available")
    // Fall back to CPU
}

Backend Loading Errors

try {
    val options = mapOf("backend_path" to "libQnnHtp.so")
    sessionOptions.addExecutionProvider("QNN", options)
} catch (e: Exception) {
    Log.e("QNN", "Failed to load QNN backend: ${e.message}")
    // Try CPU backend as fallback
    val cpuOptions = mapOf("backend_path" to "libQnnCpu.so")
    sessionOptions.addExecutionProvider("QNN", cpuOptions)
}

Performance Issues

// Enable profiling to identify bottlenecks
qnn_options["profiling_level"] = "detailed";
qnn_options["enable_htp_fp16_precision"] = "1";
qnn_options["htp_performance_mode"] = "burst";

// Check logs for performance hints
// adb logcat | grep QNN

Context Save/Load Errors

# Ensure directory has correct permissions
adb shell mkdir -p /data/local/tmp/qnn_context
adb shell chmod 777 /data/local/tmp/qnn_context

# Check available space
adb shell df /data/local/tmp

Performance Comparison

Typical performance on Snapdragon 888:
Configuration   Latency   Power      Notes
CPU Only        80 ms     High       Baseline
QNN (FP32)      15 ms     Medium     Good
QNN (FP16)      8 ms      Low        Better
QNN (INT8)      4 ms      Very Low   Best
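
These numbers vary by chipset, model architecture, and thermal state, so measure on your own target device. A minimal latency-measurement sketch (Python, assuming the same 1x3x224x224 model used throughout this page):
import time
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",
    providers=[("QNNExecutionProvider", {"backend_path": "libQnnHtp.so",
                                         "htp_performance_mode": "burst"}),
               "CPUExecutionProvider"],
)
name = session.get_inputs()[0].name
x = np.random.randn(1, 3, 224, 224).astype(np.float32)

for _ in range(10):          # warm-up; the first runs include graph preparation
    session.run(None, {name: x})

runs = 100
start = time.perf_counter()
for _ in range(runs):
    session.run(None, {name: x})
print(f"Average latency: {(time.perf_counter() - start) / runs * 1000:.2f} ms")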

Best Practices

  1. Use Quantization: INT8 models run 2-4x faster
  2. Cache Contexts: Save compiled contexts to reduce init time
  3. Enable FP16: Minimal accuracy impact, significant speedup
  4. Profile First: Use profiling to identify bottlenecks
  5. Test on Device: Performance varies by chipset generation
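
Putting several of these practices together, a typical session configuration might look like the sketch below (Python; the option names mirror the examples on this page and should be validated against the ONNX Runtime version you ship):
import onnxruntime as ort

# Quantized model + HTP backend + FP16 math + sustained performance mode,
# with an explicit CPU fallback for unsupported operators.
session = ort.InferenceSession(
    "model_int8.onnx",
    providers=[
        ("QNNExecutionProvider", {
            "backend_path": "libQnnHtp.so",
            "htp_performance_mode": "sustained_high_performance",
            "enable_htp_fp16_precision": "1",
            "qnn_context_priority": "high",
        }),
        "CPUExecutionProvider",
    ],
)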

Next Steps