The Quartus backend is deprecated and will be removed in a future version. Users should migrate to the oneAPI backend.

Overview

The Quartus backend enables deployment of neural networks on Intel/Altera FPGAs using the discontinued Intel HLS compiler. It generates C++ code that is compiled with the i++ compiler and integrated into Quartus Prime designs.

When to Use Quartus Backend

  • Legacy projects: Maintaining existing Intel HLS-based designs
  • Specific requirements: Features not yet available in oneAPI backend
    • Profiling and tracing
    • BramFactor option for weight storage
For new projects, use the oneAPI backend instead: it provides better io_stream support and is actively maintained.

Installation and Setup

Prerequisites

  • Intel HLS Compiler (ensure i++ is on PATH)
  • Quartus Prime for FPGA synthesis
  • Python 3.8 or higher
  • hls4ml library installed

Environment Setup

# Verify Intel HLS compiler is available
command -v i++

# Verify Quartus is available (for FPGA synthesis)
command -v quartus_sh

# Set Intel FPGA environment (adjust path for your installation)
source /opt/intelFPGA_pro/hls/init_hls.sh

Configuration

Basic Configuration

Create a model configuration for the Quartus backend:
import hls4ml

config = hls4ml.utils.config_from_keras_model(
    model,
    granularity='name',
    backend='Quartus'
)

# Convert model
hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    output_dir='my_quartus_project',
    backend='Quartus',
    part='Arria10',
    clock_period=5,
    io_type='io_parallel'
)

Configuration Options

The Quartus backend supports the following configuration parameters:

part (string, default: 'Arria10')
  FPGA device family: Arria10, Stratix10, or Agilex.

clock_period (int, default: 5)
  Clock period in nanoseconds (5 ns = 200 MHz).

io_type (string, default: 'io_parallel')
  I/O implementation type:
    • io_parallel: parallel data processing
    • io_stream: streaming architecture (limited support)

write_tar (bool, default: False)
  Compress the output directory into a .tar.gz archive.

Layer Configuration

The Quartus backend only supports Resource strategy. There is no Latency implementation.

Dense Layers

config['dense_layer'] = {
    'ReuseFactor': 16,
    'Strategy': 'Resource',  # Only Resource supported
    'Precision': 'ac_fixed<16,6,true>',
    'BramFactor': 0  # Weight arrays above this element count go to BRAM
}

Convolutional Layers

config['conv2d_layer'] = {
    'ReuseFactor': 8,
    'ParallelizationFactor': 1,
    'Implementation': 'im2col',  # or 'Winograd', 'combination'
    'Precision': 'ac_fixed<16,6,true>'
}
Convolution Implementations:
  • im2col: Image-to-column transformation followed by matrix multiply
  • Winograd: Winograd fast convolution (for 3x3 filters)
  • combination: Automatic selection at compile-time
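Which implementation wins depends mostly on the filter size. A tiny helper sketching that decision (our own heuristic for illustration, not an hls4ml API) could look like:

```python
# Rough selection heuristic: Winograd pays off for 3x3 filters,
# im2col is the general-purpose fallback. Passing 'combination' in the
# config instead defers the choice to the HLS compiler.
def pick_implementation(filt_h, filt_w):
    if (filt_h, filt_w) == (3, 3):
        return 'Winograd'
    return 'im2col'

print(pick_implementation(3, 3))  # Winograd
print(pick_implementation(5, 5))  # im2col
```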

Recurrent Layers

config['gru_layer'] = {
    'ReuseFactor': 1,
    'RecurrentReuseFactor': 1,
    'Strategy': 'Resource',
    'table_size': 1024,
    'table_t': 'ac_fixed<18,8,true>'
}
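table_size and table_t control the lookup tables used for the recurrent activation functions. Assuming the table spans a fixed input range, say [-8, 8) (an illustrative assumption, not a value taken from hls4ml), the per-entry granularity falls out directly:

```python
# Each of the table_size entries covers an equal slice of the assumed
# activation input range [-8, 8).
table_size = 1024
input_range = 16.0  # width of the assumed range [-8, 8)
step = input_range / table_size
print(step)  # 0.015625 per entry
```

Doubling table_size halves the quantization step of the tabulated activation at the cost of extra memory; table_t then bounds the precision of each stored value.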

Build Process

Compilation Commands

# Compile the model
hls_model.compile()

# Build with Intel HLS compiler
report = hls_model.build(
    synth=True,              # Run HLS synthesis
    fpgasynth=False,         # Run Quartus FPGA synthesis
    log_level=1,             # Logging verbosity (0, 1, 2)
    cont_if_large_area=False # Continue if area estimate exceeds device
)

Build Options

Option               Description                                   Default
-------------------  --------------------------------------------  -------
synth                Run Intel HLS synthesis                       True
fpgasynth            Run Quartus FPGA compilation                  False
log_level            Verbosity level (0-2)                         1
cont_if_large_area   Continue if design exceeds device resources   False

Build Process Details

The build process uses a Makefile:
cd my_quartus_project

# HLS synthesis only
make myproject-fpga

# HLS synthesis with Quartus compile
make myproject-fpga QUARTUS_COMPILE=--quartus-compile

# Run simulation
./myproject-fpga

Example Project Structure

my_quartus_project/
├── firmware/
│   ├── myproject.cpp          # Main implementation
│   ├── myproject.h            # Header file
│   ├── parameters.h           # Network parameters
│   ├── weights/               # Weight data
│   └── nnet_utils/            # Utility functions
├── tb_data/
│   ├── tb_input_features.dat
│   └── tb_output_predictions.dat
├── myproject_test.cpp         # Testbench
├── Makefile                   # Build system
├── myproject-fpga             # Executable (after build)
└── reports/                   # Synthesis reports
    ├── report.html
    └── lib/

Precision Types

The Quartus backend uses Algorithmic C (AC) datatypes:
# Fixed-point: ac_fixed<width, int_width, signed>
config['layer']['Precision'] = 'ac_fixed<16,6,true>'
config['layer']['accum_t'] = 'ac_fixed<24,12,true>'

# Integer: ac_int<width, signed>
config['layer']['index_t'] = 'ac_int<8,false>'

Common Precision Settings

Type         AC Datatype            Description
-----------  ---------------------  ------------------------------
Input        ac_fixed<16,6,true>    16-bit, 6 integer bits, signed
Weights      ac_fixed<8,3,true>     8-bit quantized weights
Accumulator  ac_fixed<24,12,true>   Wide accumulator
Activation   ac_fixed<16,6,true>    Activation output
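These widths translate into concrete ranges via the AC fixed-point rules (the integer width includes the sign bit). A small sketch of that arithmetic, not the actual ac_fixed type:

```python
def ac_fixed_range(width, int_width, signed=True):
    # Representable range and resolution of ac_fixed<width, int_width, signed>.
    frac_bits = width - int_width
    step = 2.0 ** -frac_bits
    if signed:
        lo = -(2.0 ** (int_width - 1))
        hi = 2.0 ** (int_width - 1) - step
    else:
        lo = 0.0
        hi = 2.0 ** int_width - step
    return lo, hi, step

print(ac_fixed_range(16, 6))  # (-32.0, 31.9990234375, 0.0009765625)
```

So ac_fixed<16,6,true> covers roughly [-32, 32) with about three decimal digits of resolution; if your inputs exceed that range, increase the integer width before the total width.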

Performance Optimization

Reuse Factor Strategy

# All layers use Resource strategy
# Reuse factor controls parallelism

# More parallel, higher resources
config['dense']['ReuseFactor'] = 1

# More serial, lower resources
config['dense']['ReuseFactor'] = 64
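The resource side of the same trade-off: the number of multiplier instances scales roughly inversely with the reuse factor. A back-of-the-envelope estimate (our simplification; the real count depends on precision and DSP packing):

```python
def multiplier_estimate(n_in, n_out, reuse_factor):
    # Resource strategy time-multiplexes each multiplier across
    # reuse_factor clock cycles, so the parallel copies shrink accordingly.
    return (n_in * n_out + reuse_factor - 1) // reuse_factor

print(multiplier_estimate(784, 64, 1))   # 50176 (fully parallel)
print(multiplier_estimate(784, 64, 64))  # 784
```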

Weight Storage Optimization

# Keep all weights in logic (LUTs/registers): leave the threshold high
config['dense']['BramFactor'] = 1_000_000_000

# Move weight arrays larger than the threshold into BRAM
config['dense']['BramFactor'] = 1000  # Threshold in elements

Winograd Convolution

For 3x3 convolutions, Winograd can reduce operations:
config['conv2d'] = {
    'Implementation': 'Winograd',  # Faster for 3x3
    'ReuseFactor': 8
}

Performance Characteristics

Resource Usage Estimates

Small MLP (3 layers, 64 neurons):
  • ALMs: 5K-15K
  • DSPs: 10-30
  • M20K: 10-50
Small CNN (3 conv + 2 dense):
  • ALMs: 30K-100K
  • DSPs: 50-200
  • M20K: 50-200

Latency Characteristics

Latency = Σ(layer_operations / parallel_factor)

For Dense layer:
  operations = n_in × n_out
  parallel_factor = n_in × n_out / reuse_factor
  
For Conv2D layer:
  operations = out_h × out_w × filt_h × filt_w × n_chan × n_filt
  parallel_factor depends on implementation
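Plugging the formulas in: for a Dense layer the estimate collapses to the reuse factor itself, since the parallel factor already divides out the operation count (pipeline fill and per-layer overheads are ignored in this sketch):

```python
def dense_latency_cycles(n_in, n_out, reuse_factor):
    operations = n_in * n_out
    parallel_factor = operations // reuse_factor
    return operations // parallel_factor  # simplifies to reuse_factor

def conv2d_operations(out_h, out_w, filt_h, filt_w, n_chan, n_filt):
    return out_h * out_w * filt_h * filt_w * n_chan * n_filt

# 784 -> 64 Dense with ReuseFactor=32
print(dense_latency_cycles(784, 64, 32))       # 32 cycles
# 26x26 output, 3x3 filters, 1 input channel, 16 filters
print(conv2d_operations(26, 26, 3, 3, 1, 16))  # 97344 MACs
```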

Clock Frequencies

  • Arria 10: 200-300 MHz typical
  • Stratix 10: 300-400 MHz typical
  • Agilex: 300-450 MHz typical
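These frequencies map back to the clock_period configuration option (given in nanoseconds) via f = 1000 / T:

```python
def period_ns_to_mhz(clock_period_ns):
    # clock_period is specified in ns; 1000 / T gives MHz.
    return 1000.0 / clock_period_ns

print(period_ns_to_mhz(5))    # 200.0 -> the default 5 ns clock
print(period_ns_to_mhz(2.5))  # 400.0 -> plausible on Stratix 10
```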

Activation Functions

The Quartus backend uses dense_tanh instead of standard tanh for compatibility with the AC datatype library.
This substitution happens automatically:
# Keras model uses tanh
Dense(64, activation='tanh')

# Quartus backend converts to dense_tanh internally

Limitations

Resource Strategy Only

# ✅ Supported
config['layer']['Strategy'] = 'Resource'

# ❌ Not supported
config['layer']['Strategy'] = 'Latency'  # Will fail
config['layer']['Strategy'] = 'Compressed'  # Not available

io_stream Limitations

  • Limited support compared to oneAPI
  • No automatic FIFO optimization
  • Streaming between layers is basic

Softmax Constraints

For io_parallel mode:
# Softmax only works on 1D tensors
model.add(Flatten())  # Required before Softmax
model.add(Dense(10, activation='softmax'))

Migration to oneAPI

To migrate from Quartus to oneAPI backend:
# Change backend specification
hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    backend='oneAPI',  # Changed from 'Quartus'
    output_dir='my_oneapi_project',
    part='Agilex7'
)

Migration Considerations

  • part parameter: Use device family name (e.g., 'Agilex7')
  • Precision types: AC datatypes remain compatible
  • Strategy: Still only Resource supported
  • BramFactor: Not yet supported in oneAPI
  • Makefile → CMake build system
  • i++ compiler → icpx (Intel oneAPI DPC++ compiler)
  • Different build targets: fpga_emu, report, fpga_sim, fpga
Not yet in oneAPI:
  • Profiling
  • Tracing
  • BramFactor
Better in oneAPI:
  • io_stream support
  • Task parallelism
  • Python integration

Troubleshooting

# Check installation
which i++

# Source Intel HLS environment
source /opt/intelFPGA_pro/hls/init_hls.sh

# Verify version
i++ --version
If the design exceeds device resources:
# Build with override flag
report = hls_model.build(
    synth=True,
    cont_if_large_area=True  # Continue despite area estimate
)
Then optimize:
  • Increase reuse factors
  • Reduce precision
  • Use BramFactor for weights
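Applied to a config dictionary, those three knobs look like this ('dense1' is a placeholder layer name; substitute whatever config_from_keras_model produced):

```python
# Starting point: a fully parallel 16-bit layer that overflowed the device.
config = {'dense1': {'ReuseFactor': 1, 'Precision': 'ac_fixed<16,6,true>'}}

# 1. More serial: reuse each multiplier over more cycles.
config['dense1']['ReuseFactor'] = 64
# 2. Narrower datapath (re-check model accuracy afterwards).
config['dense1']['Precision'] = 'ac_fixed<12,4,true>'
# 3. Store weight arrays above this element count in BRAM instead of logic.
config['dense1']['BramFactor'] = 1000

print(config['dense1'])
```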
# ❌ This will fail:
config['layer']['Strategy'] = 'Latency'

# ✅ Use Resource strategy:
config['layer']['Strategy'] = 'Resource'
config['layer']['ReuseFactor'] = 1  # For maximum parallelism
# ❌ Multi-dimensional Softmax in io_parallel
model.add(Dense(10, activation='softmax'))  # After Conv2D

# ✅ Flatten before Softmax
model.add(Flatten())
model.add(Dense(10, activation='softmax'))

Example: Complete Workflow

import hls4ml
from tensorflow import keras
import numpy as np

# Load model
model = keras.models.load_model('my_model.h5')

# Create configuration
config = hls4ml.utils.config_from_keras_model(model, granularity='name')
config['Model']['Strategy'] = 'Resource'
config['Model']['ReuseFactor'] = 32

# Set precision
for layer in config['LayerName'].keys():
    config['LayerName'][layer]['Precision'] = 'ac_fixed<16,6,true>'

# Convert to Quartus HLS
hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    output_dir='quartus_prj',
    backend='Quartus',
    part='Arria10',
    clock_period=5,
    io_type='io_parallel'
)

# Compile and test
hls_model.compile()
X_test = np.random.rand(100, 784)
y_keras = model.predict(X_test)
y_hls = hls_model.predict(X_test)

print(f"Accuracy match: {np.allclose(y_keras, y_hls, atol=1e-2)}")

# Build HLS project
report = hls_model.build(
    synth=True,
    fpgasynth=False,  # Set True for full FPGA compile
    log_level=1
)

print(f"Estimated resources: ALM={report['ALM']}, DSP={report['DSP']}")
print(f"Estimated latency: {report['Latency']} cycles")

Related Pages

  • oneAPI Backend: modern Intel FPGA backend
  • Model Conversion: learn about model conversion
  • Resource Optimization: reduce resource usage
  • Precision Guide: configure numeric precision
