Overview

The oneAPI backend enables deployment of neural networks on Intel/Altera FPGAs using the Intel oneAPI DPC++/SYCL compiler. It is the modern replacement for the deprecated Quartus backend, offering better streaming support and integration with Intel’s oneAPI ecosystem.

When to Use oneAPI Backend

  • Intel/Altera FPGAs: Target Agilex, Stratix 10, or newer devices
  • Modern development: Use latest Intel FPGA tools and workflows
  • Streaming architectures: Better io_stream support with task parallelism
  • Python integration: Seamless integration with Python runtime
The oneAPI backend is actively developed and is the recommended choice for Intel FPGA projects.

Installation and Setup

Prerequisites

  • Intel oneAPI Base Toolkit with DPC++ compiler
  • Intel Quartus Prime for FPGA synthesis
  • Python 3.8 or higher
  • hls4ml library installed
  • CMake 3.10 or higher

Environment Setup

# Source oneAPI environment
source /opt/intel/oneapi/setvars.sh

# Verify compiler is available
which icpx
icpx --version

# Verify Quartus (optional, for FPGA synthesis)
which quartus_sh

Configuration

Basic Configuration

Create a model configuration for the oneAPI backend:
import hls4ml

config = hls4ml.utils.config_from_keras_model(
    model,
    granularity='name',
    backend='oneAPI'
)

# Convert model
hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    output_dir='my_oneapi_project',
    backend='oneAPI',
    part='Agilex7',
    clock_period=5,
    io_type='io_parallel'
)

Configuration Options

part (string, default: "Agilex7")
FPGA device family:
  • Agilex7
  • Agilex
  • Stratix10
  • Arria10

clock_period (int, default: 5)
Clock period in nanoseconds (5 ns = 200 MHz)

hyperopt_handshake (bool, default: false)
Enable hyper-optimized handshaking between kernels

io_type (string, default: "io_parallel")
I/O implementation type:
  • io_parallel: single task with pipelining
  • io_stream: multiple tasks connected by pipes

write_tar (bool, default: false)
Compress the output directory into a .tar.gz file
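
All of the options above are keyword arguments to the converter. A minimal sketch combining them (the values are illustrative, not recommendations):
hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    output_dir='my_oneapi_project',
    backend='oneAPI',
    part='Stratix10',          # any family from the list above
    clock_period=5,
    io_type='io_stream',
    hyperopt_handshake=False,
    write_tar=True             # package the project as .tar.gz
)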

Layer Configuration

The oneAPI backend only supports Resource strategy. There is no Latency implementation.
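Because only Resource is implemented, it is simplest to pin the strategy model-wide (the complete workflow example later on this page does the same):
config['Model']['Strategy'] = 'Resource'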

Dense Layers

config['dense_layer'] = {
    'ReuseFactor': 16,
    'Strategy': 'Resource',  # Only Resource supported
    'Precision': 'ac_fixed<16,6,true>',
    'accum_t': 'ac_fixed<24,12,true>'
}

Convolutional Layers

config['conv2d_layer'] = {
    'ReuseFactor': 8,
    'ParallelizationFactor': 1,
    'Implementation': 'im2col',  # or 'Winograd', 'combination'
    'Precision': 'ac_fixed<16,6,true>'
}
Convolution Implementations:
  • im2col: Image-to-column + matrix multiply (default)
  • Winograd: Fast convolution for 3x3 filters
  • combination: Compile-time selection

Recurrent Layers

config['lstm_layer'] = {
    'ReuseFactor': 1,
    'RecurrentReuseFactor': 1,
    'Strategy': 'Resource',
    'table_size': 1024,
    'table_t': 'ac_fixed<18,8,true>'
}

Build Process

Build Commands

The oneAPI backend uses CMake for building:
# Compile the model
hls_model.compile()

# Build for different targets
report = hls_model.build(
    build_type='fpga_emu',  # Emulation
    run=False  # set True to run the executable after building
)

# Available build types:
# - 'fpga_emu': Fast emulation on CPU
# - 'fpga_sim': Accurate RTL simulation
# - 'report': Generate optimization reports
# - 'fpga': Full FPGA compilation
# - 'lib': Python-callable library

Build Targets

Target      Description           Time      Accuracy
fpga_emu    CPU emulation         Seconds   Functional
fpga_sim    RTL simulation        Minutes   Cycle-accurate
report      Optimization report   Minutes   Area/performance estimates
fpga        Full FPGA compile     Hours     Exact
lib         Shared library        Minutes   Functional
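
A common pattern is to iterate on the fast targets before committing to a multi-hour FPGA compile; a minimal sketch, assuming hls_model from the conversion above:
# Emulate first, then check the resource/performance estimates
report_emu = hls_model.build(build_type='fpga_emu', run=True)
report_est = hls_model.build(build_type='report', run=False)
# Only then: hls_model.build(build_type='fpga', run=False)  # takes hours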

CMake Build System

cd my_oneapi_project
mkdir -p build
cd build

# Configure
cmake ..

# Build targets
make fpga_emu    # Emulation
make report      # Reports
make fpga_sim    # Simulation
make fpga        # FPGA bitfile
make lib         # Python library

# Run emulation
./myproject.fpga_emu

Example Project Structure

my_oneapi_project/
├── firmware/
│   ├── myproject.cpp          # SYCL kernel implementation
│   ├── myproject.h            # Header declarations
│   ├── parameters.h           # Network parameters
│   ├── defines.h              # Macro definitions
│   ├── weights/               # Weight data files
│   └── nnet_utils/            # Utility functions
├── tb_data/
│   ├── tb_input_features.dat
│   └── tb_output_predictions.dat
├── myproject_test.cpp         # Host code (testbench)
├── CMakeLists.txt             # CMake configuration
├── build/                     # Build directory
│   ├── myproject.fpga_emu     # Emulation executable
│   ├── myproject.fpga_sim     # Simulation executable
│   ├── myproject.fpga         # FPGA executable
│   ├── libmyproject-*.so      # Python library
│   └── reports/               # Optimization reports
└── README.md

I/O Types: io_parallel vs io_stream

io_parallel

Single Task Architecture:
  • All layers execute in one SYCL task
  • Relies on pipelining within the task
  • Lower overhead for small models
  • Sequential layer execution
// Generated SYCL code structure
queue.submit([&](handler &h) {
    h.single_task<MyProject>([=]() {
        // All layers in one kernel
        layer1_output = layer1(input);
        layer2_output = layer2(layer1_output);
        output = layer3(layer2_output);
    });
});

io_stream

Multi-Task Architecture:
  • Each layer in separate task_sequence
  • Layers execute in parallel
  • Connected via SYCL pipes
  • Higher throughput for large models
  • Similar to dataflow in Vitis HLS
// Generated SYCL code structure
queue.submit([&](handler &h) {
    h.single_task<Layer1>([=]() {
        auto data = input_pipe::read();
        auto result = layer1(data);
        layer1_to_layer2_pipe::write(result);
    });
});

queue.submit([&](handler &h) {
    h.single_task<Layer2>([=]() {
        auto data = layer1_to_layer2_pipe::read();
        auto result = layer2(data);
        layer2_to_layer3_pipe::write(result);
    });
});

Choosing io_type

# Small models (< 5 layers)
config = {'Model': {'IOType': 'io_parallel'}}

# Large models (> 5 layers) or streaming data
config = {'Model': {'IOType': 'io_stream'}}

Precision Types

oneAPI backend uses Algorithmic C (AC) datatypes:
# Fixed-point: ac_fixed<width, int_width, signed>
config['layer']['Precision'] = 'ac_fixed<16,6,true>'
config['layer']['accum_t'] = 'ac_fixed<24,12,true>'

# Integer: ac_int<width, signed>
config['layer']['index_t'] = 'ac_int<8,false>'

Common Precision Settings

# High precision (16-bit)
config['layer']['Precision'] = 'ac_fixed<16,6,true>'

# Quantized (8-bit)
config['layer']['Precision'] = 'ac_fixed<8,3,true>'

# Wide accumulator
config['layer']['accum_t'] = 'ac_fixed<32,16,true>'
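
To see what a given width buys you, the quantization grid of an AC fixed-point type can be emulated in NumPy; this illustrates the number format only and is not backend code:
import numpy as np

# ac_fixed<8,3,true>: 8 total bits, 3 integer bits (incl. sign),
# so 5 fractional bits -> step = 2^-5 = 0.03125, range [-4, 4 - step]
width, int_bits = 8, 3
step = 2.0 ** (int_bits - width)
lo, hi = -2.0 ** (int_bits - 1), 2.0 ** (int_bits - 1) - step

x = np.array([0.1, 1.234, 3.999, -5.0])
print(np.clip(np.round(x / step) * step, lo, hi))
# [ 0.09375  1.21875  3.96875 -4.     ]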

Performance Optimization

Reuse Factor Tuning

# High parallelism, more resources
config['dense']['ReuseFactor'] = 1

# Balanced
config['dense']['ReuseFactor'] = 16

# Low resources, higher latency
config['dense']['ReuseFactor'] = 64
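
The trade-off can be sized up front: a Dense layer performs roughly n_in × n_out multiplications per inference, spread over ReuseFactor clock cycles. A back-of-the-envelope sketch (a rule of thumb, not an exact backend model):
# Estimate parallel multipliers for a hypothetical 64x32 Dense layer
n_in, n_out = 64, 32
for reuse in (1, 16, 64):
    print(f"ReuseFactor={reuse:>2}: ~{n_in * n_out // reuse} multipliers")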

Hyperopt Handshaking

Enable optimized communication between tasks:
hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    backend='oneAPI',
    hyperopt_handshake=True  # Enable optimized handshaking
)
Benefits:
  • Reduced latency between tasks
  • Better throughput for io_stream
  • Optimized FIFO depths

Winograd Convolution

Automatic transformation for 3x3 convolutions:
config['conv2d'] = {
    'Implementation': 'Winograd',  # or 'combination'
    'ReuseFactor': 8
}

Python Integration

Compile for Python

# Build shared library
library_path = hls_model.compile()

# Or explicitly build library
report = hls_model.build(build_type='lib', run=False)

Use in Python

import numpy as np

# Predict with compiled library
X_test = np.random.rand(100, 784).astype(np.float32)
y_pred = hls_model.predict(X_test)

print(f"Predictions shape: {y_pred.shape}")

Performance Characteristics

Resource Usage Estimates

Small MLP (3 layers, 64 neurons):
  • ALMs: 8K-20K
  • DSPs: 15-35
  • M20K: 15-40
  • Registers: 10K-30K
CNN (3 conv + 2 dense):
  • ALMs: 50K-150K
  • DSPs: 100-300
  • M20K: 100-400
  • Registers: 50K-200K

Latency Patterns

io_parallel:
  Latency    = Σ(layer_operations / parallel_operations)
  Throughput = 1 / Latency (no pipelining between inferences)

io_stream:
  Latency    = Σ(layer_latency)
  Throughput = 1 / max(layer_II) (pipelined)
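
A toy numerical example of the two models above, with made-up (not measured) cycle counts:
# Hypothetical per-layer cycle counts for a 3-layer model
layer_latency = [120, 300, 80]   # cycles to produce each layer's result
layer_ii = [10, 25, 8]           # initiation interval per layer

# io_parallel: the next inference waits for the previous one to finish
print("io_parallel: one inference per", sum(layer_latency), "cycles")

# io_stream: layers overlap, so throughput is set by the slowest II
print("io_stream:   one inference per", max(layer_ii), "cycles (steady state)")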

Clock Frequencies

  • Agilex 7: 300-450 MHz
  • Stratix 10: 300-400 MHz
  • Arria 10: 200-300 MHz

Advanced Features

EinsumDense Support

oneAPI backend supports Einsum operations:
from tensorflow.keras.layers import EinsumDense

model = Sequential([
    EinsumDense(
        equation='ab,bc->ac',
        output_shape=(64,),
        bias_axes='c'
    )
])

# Converts automatically
hls_model = hls4ml.converters.convert_from_keras_model(
    model, backend='oneAPI'
)

Custom Parallelization

config['einsum_dense'] = {
    'parallelization_factor': 4,
    'Strategy': 'resource'
}

Troubleshooting

Compiler not found

# Source oneAPI environment
source /opt/intel/oneapi/setvars.sh

# Verify installation
which icpx
icpx --version

# Check oneAPI installation
ls /opt/intel/oneapi/compiler/latest/

CMake errors

# Ensure CMake version is sufficient
cmake --version  # Should be >= 3.10

# Clean build directory
rm -rf build
mkdir build
cd build
cmake ..

Softmax shape errors

# Softmax requires 1D input in io_parallel mode

# ❌ Wrong:
model.add(Conv2D(...))
model.add(Softmax())  # Multi-dimensional input

# ✅ Correct:
model.add(Conv2D(...))
model.add(Flatten())  # Flatten to 1D
model.add(Dense(10))
model.add(Softmax())  # 1D input

Resource overflow

If the design exceeds the device's resources:
  • Increase reuse factors
  • Use io_stream for large models
  • Reduce precision
  • Partition the model into smaller graphs
Note that weight compression is not yet supported in this backend.

Invalid precision types

The oneAPI backend validates AC datatypes:
# Ensure all precision specifications are valid AC types
config['layer']['Precision'] = 'ac_fixed<16,6,true>'  # Valid
# Not: 'ap_fixed<16,6>' (Xilinx type)
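
A quick, unofficial pre-flight check can catch Xilinx-style strings before conversion; this regex is a simplification that ignores AC rounding/overflow modes:
import re

AC_TYPE = re.compile(r'^ac_(fixed|int)<\d+(,\d+)*(,(true|false))?>$')
assert AC_TYPE.match('ac_fixed<16,6,true>')   # valid AC type
assert not AC_TYPE.match('ap_fixed<16,6>')    # Xilinx type, rejected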

Differences from Quartus Backend

Feature              Quartus (i++)    oneAPI (icpx)
Compiler             Intel HLS        DPC++/SYCL
Build System         Makefile         CMake
io_stream            Limited          Full support with task_sequence
Python Integration   Not supported    Native support
Profiling            Supported        Not yet
Tracing              Supported        Not yet
BramFactor           Supported        Not yet
Active Development   No               Yes

Migration from Quartus

To migrate from Quartus to oneAPI:
# Old Quartus code
hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    backend='Quartus',
    part='Arria10'
)

# New oneAPI code
hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    backend='oneAPI',
    part='Agilex7'  # or 'Arria10', 'Stratix10'
)
Key changes:
  • AC datatypes remain compatible
  • Build system: Makefile → CMake
  • Build command: make → make <target> (after configuring with cmake)
  • Executables have extensions: .fpga_emu, .fpga_sim, .fpga

Example: Complete Workflow

import hls4ml
from tensorflow import keras
import numpy as np

# Load model
model = keras.models.load_model('my_model.h5')

# Configure
config = hls4ml.utils.config_from_keras_model(model, granularity='name')
config['Model']['Strategy'] = 'Resource'
config['Model']['ReuseFactor'] = 16
config['Model']['IOType'] = 'io_stream'

# Convert to oneAPI
hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    output_dir='oneapi_prj',
    backend='oneAPI',
    part='Agilex7',
    clock_period=4,  # 250 MHz
    hyperopt_handshake=True
)

# Compile
hls_model.compile()

# Test with emulation
print("Running emulation...")
report_emu = hls_model.build(build_type='fpga_emu', run=True)

# Generate reports
print("Generating reports...")
report = hls_model.build(build_type='report', run=False)

print(f"Estimated resources: ALM={report['ALM']}, DSP={report['DSP']}")
print(f"Estimated fmax: {report['Fmax']} MHz")

# Build for FPGA (optional - takes hours)
# report_fpga = hls_model.build(build_type='fpga', run=False)

Related Pages

  • Quartus Backend: legacy Intel HLS backend
  • Intel oneAPI Docs: official Intel oneAPI documentation
  • FIFO Depth: optimize streaming architectures
  • API Reference: Python API documentation
