The Qualcomm AI Engine Direct (QNN) execution provider enables efficient inference on Qualcomm Hexagon NPUs found in Snapdragon mobile processors and edge devices.

Requirements

Hardware

  • Qualcomm Snapdragon processors with Hexagon DSP/NPU:
    • Snapdragon 8 Gen 1/2/3 (flagship mobile)
    • Snapdragon 7 series (mid-range mobile)
    • Snapdragon X Elite (Windows on ARM)
    • Qualcomm Robotics platforms

Software

  • Qualcomm Neural Processing SDK (QNN SDK)
  • Android NDK (for Android deployment)
  • Operating Systems:
    • Android 10+
    • Windows on ARM
    • Linux (embedded systems)
QNN provides exceptional power efficiency, making it ideal for mobile and battery-powered edge devices.

Installation

# Install QNN SDK from Qualcomm
# https://www.qualcomm.com/developer/software/neural-processing-sdk

# Build ONNX Runtime GenAI with QNN support
git clone https://github.com/microsoft/onnxruntime-genai.git
cd onnxruntime-genai
python build.py --use_qnn --qnn_home /path/to/qnn/sdk

Basic Configuration

Python API

import onnxruntime_genai as og

model_path = "path/to/model"

# Create config and set QNN provider
config = og.Config(model_path)
config.clear_providers()
config.append_provider("qnn")

# Load model
model = og.Model(config)
tokenizer = og.Tokenizer(model)

# Generate
prompt = "What is AI?"
input_tokens = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(max_length=512)

generator = og.Generator(model, params)
generator.append_tokens(input_tokens)

while not generator.is_done():
    generator.generate_next_token()

print(tokenizer.decode(generator.get_sequence(0)))

genai_config.json

{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          {
            "qnn": {}
          }
        ]
      }
    }
  }
}

Memory Management

CPU-Accessible NPU Memory

QNN uses CPU-accessible memory for NPU operations:
// QNN memory is CPU-accessible (simplified excerpt; members come from the DeviceBuffer base)
struct QnnMemory final : DeviceBuffer {
  QnnMemory(size_t size) : owned_{true} {
    size_in_bytes_ = size;
    p_cpu_ = p_device_ = static_cast<uint8_t*>(ort_allocator_->Alloc(size_in_bytes_));
  }
  
  // No separate device/host transfers needed
  void CopyDeviceToCpu() override {}  // No-op
  void CopyCpuToDevice() override {}  // No-op
};
QNN memory is shared between CPU and NPU, eliminating the need for explicit data transfers and reducing latency.

NPU Configuration

Backend Selection

import onnxruntime_genai as og

config = og.Config(model_path)
config.clear_providers()
config.append_provider("qnn")

# Select QNN backend
config.set_provider_option("qnn", "backend_path", "QnnHtp.so")  # Hexagon backend

model = og.Model(config)

Performance Settings

{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          {
            "qnn": {
              "backend_path": "QnnHtp.so",
              "htp_performance_mode": "burst",
              "enable_htp_weight_sharing": "1"
            }
          }
        ]
      }
    }
  }
}
The htp_performance_mode option controls the speed/power trade-off: burst delivers maximum performance at the cost of higher power consumption, while balanced and power_saver (used under Power Management below) reduce power draw at some cost in throughput.

Mobile Deployment

Android Integration

import ai.onnxruntime.genai.*;

public class MainActivity extends AppCompatActivity {
    private Model model;
    private Tokenizer tokenizer;
    
    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        
        // Load model with QNN; the QNN provider is selected by the
        // genai_config.json bundled in the model directory (see example above)
        String modelPath = getFilesDir() + "/model";
        model = new Model(modelPath);
        tokenizer = new Tokenizer(model);
    }
    
    private void generateText(String prompt) {
        int[] inputTokens = tokenizer.encode(prompt);
        
        GeneratorParams params = new GeneratorParams(model);
        params.setSearchOption("max_length", 256);
        
        Generator generator = new Generator(model, params);
        generator.appendTokens(inputTokens);
        
        while (!generator.isDone()) {
            generator.generateNextToken();
            int[] newTokens = generator.getNextTokens();
            // Process tokens
        }
    }
}

Pipeline Models

QNN supports pipeline models for memory-constrained devices:
{
  "model": {
    "decoder": {
      "pipeline": [
        {
          "filename": "model_part1.onnx",
          "model_id": "part1",
          "session_options": {
            "provider_options": [
              {"qnn": {}}
            ]
          },
          "reset_session_idx": -1
        },
        {
          "filename": "model_part2.onnx",
          "model_id": "part2",
          "session_options": {
            "provider_options": [
              {"qnn": {}}
            ]
          },
          "reset_session_idx": 0
        }
      ]
    }
  }
}
reset_session_idx allows releasing memory from previous pipeline stages, crucial for devices with limited RAM.

Quantization

INT8 Optimization

QNN provides native INT8 support for maximum efficiency:
import onnxruntime_genai as og

# Use INT8 quantized model
model_path = "path/to/quantized_model"

config = og.Config(model_path)
config.clear_providers()
config.append_provider("qnn")

model = og.Model(config)

Precision Configuration

{
  "qnn": {
    "backend_path": "QnnHtp.so",
    "htp_precision": "int8"
  }
}
INT8 quantization on QNN provides:
  • 4x memory reduction
  • 2-4x inference speedup
  • Minimal accuracy loss with proper calibration (see the calibration sketch below)
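Calibration is what keeps the accuracy loss small. The sketch below is one way to produce a QDQ INT8 model with ONNX Runtime's quantization tooling; the paths, input name (input_ids), and random calibration data are placeholders, and production decoder models are typically calibrated with real prompt data or model-specific tooling.

# Minimal calibration sketch using onnxruntime.quantization (placeholder paths/data)
import numpy as np
from onnxruntime.quantization import (
    CalibrationDataReader, QuantFormat, QuantType, quantize_static)

class SampleDataReader(CalibrationDataReader):
    """Feeds a handful of representative inputs for calibration."""
    def __init__(self, samples):
        self._iter = iter(samples)

    def get_next(self):
        return next(self._iter, None)

# Placeholder calibration batches: {input_name: numpy array}
samples = [{"input_ids": np.random.randint(0, 32000, (1, 128), dtype=np.int64)}
           for _ in range(8)]

quantize_static(
    "path/to/model.onnx",            # float source model (placeholder path)
    "path/to/quantized_model.onnx",  # INT8 output consumed by QNN
    SampleDataReader(samples),
    quant_format=QuantFormat.QDQ,    # QDQ format is what NPU backends expect
    activation_type=QuantType.QUInt8,
    weight_type=QuantType.QInt8,
)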

Advanced Features

Context Binary Generation

Pre-compile models to context binaries for faster loading:
import onnxruntime_genai as og

config = og.Config(model_path)
config.clear_providers()
config.append_provider("qnn")

# Enable context binary caching
config.set_provider_option("qnn", "qnn_context_cache_enable", "1")
config.set_provider_option("qnn", "qnn_context_cache_path", "./qnn_cache")

model = og.Model(config)
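To verify the cache is being used, time the model construction across runs; the first run compiles the model and writes the cache, and subsequent runs should load noticeably faster. A minimal timing sketch using the same config as above:

import time

start = time.time()
model = og.Model(config)  # first run populates ./qnn_cache; later runs load from it
print(f"QNN model load time: {time.time() - start:.2f}s")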

Device Filtering

{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          {
            "qnn": {},
            "device_filtering_options": {
              "hardware_device_type": "npu"
            }
          }
        ]
      }
    }
  }
}

Power Management

Battery Optimization

import onnxruntime_genai as og

config = og.Config(model_path)
config.clear_providers()
config.append_provider("qnn")

# Optimize for battery life
config.set_provider_option("qnn", "htp_performance_mode", "power_saver")
config.set_provider_option("qnn", "enable_htp_weight_sharing", "1")

model = og.Model(config)

Thermal Management

# Adjust performance based on thermal state.
# device_temperature and threshold are placeholders supplied by your
# application's thermal monitoring.
if device_temperature > threshold:
    config.set_provider_option("qnn", "htp_performance_mode", "balanced")
else:
    config.set_provider_option("qnn", "htp_performance_mode", "burst")

Troubleshooting

QNN SDK Not Found

# Set QNN SDK environment variables
export QNN_SDK_ROOT=/path/to/qnn/sdk
export LD_LIBRARY_PATH=$QNN_SDK_ROOT/lib:$LD_LIBRARY_PATH

Model Loading Failures

Ensure your ONNX model is compatible with QNN; not all ONNX operators are supported.

Use the correct backend library for your platform (QnnHtp.so, QnnCpu.so, etc.):

config.set_provider_option("qnn", "backend_path", "QnnHtp.so")

Enable verbose logging to diagnose loading failures:

config.set_provider_option("qnn", "qnn_log_level", "verbose")

Performance Issues

# Enable all optimizations
config.set_provider_option("qnn", "htp_performance_mode", "burst")
config.set_provider_option("qnn", "enable_htp_weight_sharing", "1")
config.set_provider_option("qnn", "qnn_context_cache_enable", "1")

Benchmarking

import time
import onnxruntime_genai as og

model_path = "path/to/model"
config = og.Config(model_path)
config.clear_providers()
config.append_provider("qnn")
config.set_provider_option("qnn", "backend_path", "QnnHtp.so")

model = og.Model(config)
tokenizer = og.Tokenizer(model)

prompt = "What is AI?"
input_tokens = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(max_length=100)

start = time.time()
generator = og.Generator(model, params)
generator.append_tokens(input_tokens)

token_count = 0
while not generator.is_done():
    generator.generate_next_token()
    token_count += 1

end = time.time()
print(f"Time: {end - start:.2f}s")
print(f"Tokens/sec: {token_count / (end - start):.2f}")
print(f"Energy efficiency: NPU optimized")

Best Practices

Use INT8 Models

Quantize models to INT8 for best NPU performance and power efficiency.

Enable Context Caching

Pre-compile models to context binaries to reduce loading time.

Pipeline Large Models

Split large models into pipeline stages to fit in device memory.

Optimize Performance Mode

Choose performance mode based on battery state and thermal conditions.

Next Steps

Mobile Deployment

Deploy to Android devices

Model Quantization

Optimize models for QNN
