Guide – How to: design an accelerator in C/C++ (Xilinx Vivado HLS)
Latest update: 2020-12-21
This guide illustrates how to create and integrate an accelerator with the ESP high-level synthesis (HLS) flow, using C++ as the specification language and Xilinx Vivado HLS to generate a corresponding RTL implementation. Most of the C/C++ flow is identical to the SystemC flow. We will refer to the guide How to: design an accelerator in SystemC for all overlapping steps.
Note: Make sure to complete the prerequisite tutorials before getting started with this one. This tutorial assumes that accelerator designers are familiar with the ESP infrastructure and know how to run basic make targets to create a simple instance of ESP, integrating just a single core.
1. Accelerator design
In this guide and in the corresponding prebuilt material we integrate an accelerator that performs multiply and accumulate (MAC) on integer vectors of configurable length. Specifically, we implement the following small kernel of computation.
// MAC
// mac_len, mac_vec and mac_n are the accelerator configuration registers
int data_in[mac_vec][mac_len];
int acc[mac_vec];
for (int iterations = 0; iterations < mac_n; iterations++) {
    load_input_data(data_in);
    for (int j = 0; j < mac_vec; j++) {
        acc[j] = 0;
        for (int i = 0; i < mac_len; i += 2)
            acc[j] += data_in[j][i] * data_in[j][i+1];
    }
    store_output_data(acc);
}
This simple example introduces users to the automation mechanisms offered by ESP. The tutorial does not, however, explore the capabilities of the HLS tool for design-space exploration.
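To make the kernel concrete, the following is a minimal software reference for a single batch, written as plain C++ outside of the ESP flow. The function name mac_reference and the flat, row-major layout of data_in are assumptions made for this sketch; they are not part of the generated accelerator.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Software reference for one batch of the MAC kernel.
// data_in holds mac_vec vectors of mac_len integers, stored back to back.
// mac_reference is a hypothetical helper name, not an ESP function.
std::vector<int> mac_reference(const std::vector<int> &data_in,
                               std::size_t mac_vec, std::size_t mac_len)
{
    assert(mac_len % 2 == 0);                     // elements are consumed in pairs
    assert(data_in.size() == mac_vec * mac_len);  // one full batch of input

    std::vector<int> acc(mac_vec, 0);
    for (std::size_t j = 0; j < mac_vec; j++)
        for (std::size_t i = 0; i < mac_len; i += 2)
            acc[j] += data_in[j * mac_len + i] * data_in[j * mac_len + i + 1];
    return acc;
}
```

For example, with two vectors of length four holding the values 1 to 8, the function returns 1*2 + 3*4 for the first vector and 5*6 + 7*8 for the second.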
Note: The users have access to prebuilt material to run the tutorial on an FPGA, without executing all the previous steps. See the ‘FPGA prototyping with prebuilt material’ section at the end of this guide.
Accelerator skeleton
ESP provides an interactive script that generates all of the hardware and software sockets to quickly integrate a new accelerator in a full SoC. The generation of the accelerator skeleton is the same for both SystemC and C/C++ flows, except for the selected design flow, which is based on Cadence Stratus HLS for SystemC and Vivado HLS for C/C++.
# Move to the ESP root folder
cd <esp>
# Run the accelerator initialization script and respond as follows
./tools/accgen/accgen.sh
=== Initializing ESP accelerator template ===
* Enter accelerator name [dummy]: mac
* Select design flow (Stratus HLS, Vivado HLS) [S]: V
* Enter ESP path [/space/esp-master]:
* Enter unique accelerator id as three hex digits [04A]: 056
* Enter accelerator registers
- register 0 name [size]: mac_len
- register 0 default value [1]: 64
- register 0 max value [64]:
- register 1 name []: mac_vec
- register 1 default value [1]: 100
- register 1 max value [100]:
- register 2 name []: mac_n
- register 2 default value [1]:
- register 2 max value [1]: 16
- register 3 name []:
* Configure PLM size and create skeleton for load and store:
- Enter data bit-width (8, 16, 32, 64) [32]:
- Enter input data size in terms of configuration registers (e.g. 2 * mac_len}) [mac_len]: mac_len * mac_vec
data_in_size_max = 6400
- Enter output data size in terms of configuration registers (e.g. 2 * mac_len) [mac_len]: mac_vec
data_out_size_max = 100
- Enter an integer chunking factor (use 1 if you want PLM size equal to data size) [1]:
Input PLM has 6400 32-bits words
Output PLM has 100 32-bits words
- Enter number of input data to be processed in batch (can be function of configuration registers) [1]: mac_n
batching_factor_max = 16
=== Generated accelerator skeleton for mac ===
You can find a description of the parameters configured by the accelerator initialization script in the Section Accelerator skeleton of the guide for the SystemC flow.
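The sizes printed by the script in the session above follow directly from the maximum register values. The sketch below reproduces that arithmetic; the constant names and the plm_words helper are hypothetical, and the relation "PLM words = data size / chunking factor" is an assumption consistent with the prompt "use 1 if you want PLM size equal to data size".

```cpp
#include <cassert>

// Maximum register values entered in the session above.
constexpr unsigned MAC_LEN_MAX = 64;
constexpr unsigned MAC_VEC_MAX = 100;
constexpr unsigned MAC_N_MAX = 16;
constexpr unsigned CHUNKING_FACTOR = 1;

// Data sizes expressed in terms of the configuration registers.
constexpr unsigned data_in_size_max = MAC_LEN_MAX * MAC_VEC_MAX;  // mac_len * mac_vec
constexpr unsigned data_out_size_max = MAC_VEC_MAX;               // mac_vec
constexpr unsigned batching_factor_max = MAC_N_MAX;               // mac_n

// Words per PLM for a given data size and chunking factor (assumed relation).
inline unsigned plm_words(unsigned data_size_max, unsigned chunking_factor)
{
    assert(chunking_factor > 0 && data_size_max % chunking_factor == 0);
    return data_size_max / chunking_factor;
}
```

With the values above, plm_words(data_in_size_max, CHUNKING_FACTOR) gives the 6400 words of the input PLM and plm_words(data_out_size_max, CHUNKING_FACTOR) gives the 100 words of the output PLM.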
Executing the initialization script with the above parameters generates the accelerator source files and testbench in C++, together with the HLS scripts. These files are located at the path accelerators/vivado_hls/mac_vivado/hw.
In addition, the accelerator’s device driver, bare-metal application and user-space Linux application are generated at the path accelerators/vivado_hls/mac_vivado/sw.
# Complete list of files generated and modified
<esp>/accelerators/vivado_hls/mac_vivado
├── hw
│   ├── mac.xml                # Accelerator description and register list
│   ├── hls                    # HLS scripts
│   │   ├── common.tcl
│   │   ├── custom.tcl         # Customizable system-level configuration
│   │   ├── directives.tcl     # User-defined HLS directives
│   │   └── Makefile
│   ├── inc                    # Folder for code header files
│   │   ├── espacc_config.h    # Data types and local memory size definitions
│   │   └── espacc.h           # Constants and defines for the ESP accelerator
│   ├── src                    # Accelerator source files
│   │   └── espacc.cc          # Accelerator specification
│   └── tb
│       └── tb.cc              # Testbench
└── sw
    ├── baremetal              # Bare-metal test application
    │   ├── mac.c
    │   └── Makefile
    └── linux
        ├── app                # Linux test application
        │   ├── mac.c
        │   └── Makefile
        ├── driver             # Linux device driver
        │   ├── mac_vivado.c
        │   ├── Kbuild
        │   └── Makefile
        └── include
            └── mac_vivado.h
Accelerator behavior implementation
Similarly to the SystemC HLS flow, this step consists of editing the compute portion of the ESP accelerator skeleton.
Source files for the MAC accelerator are generated at the path accelerators/vivado_hls/mac_vivado/hw/src/. From this folder, open espacc.cc and locate the definition of the function compute(). Scroll down to the code section marked by the comment // TODO implement compute functionality and replace the remaining code with the following.
// Compute
unsigned in_length = mac_len * mac_vec;
unsigned out_length = mac_vec;
unsigned vector_index = 0;
unsigned vector_number = 0;
int acc = 0;
for (int in_rem = in_length; in_rem > 0; in_rem -= SIZE_IN_CHUNK_DATA)
{
    unsigned in_len = in_rem > SIZE_IN_CHUNK_DATA ? SIZE_IN_CHUNK_DATA : in_rem;
    // Computing phase implementation
    for (int i = 0; i < in_len; i += 2) {
        // Multiply and accumulate
        acc += _inbuff[i] * _inbuff[i+1];
        vector_index += 2;
        // Write accumulated result
        if (vector_index == mac_len) {
            _outbuff[vector_number] = acc;
            acc = 0;
            vector_index = 0;
            vector_number++;
        }
    }
}
NOTE: The prebuilt material contains the complete source code of the MAC accelerator.
Without editing the code, the generated accelerator implements an identity function that moves data from input to output. With respect to this skeleton, the snippet above implements the following changes.
- The inner loop processes two elements of the input data per iteration (i += 2).
- A new variable acc is used to accumulate one vector of length mac_len.
- Writes to the output memory occur only when the accumulation for one vector is completed.
- Two additional counters, vector_index and vector_number, are used to keep track of the position in the current vector and of the number of the vector that is being processed.
Please take a moment to understand and get familiar with these changes in the code above.
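To experiment with these changes outside of Vivado HLS, the following is a minimal software model of the chunked compute loop. It is only a sketch: compute_model is a hypothetical name, the chunk size is passed as an argument instead of using SIZE_IN_CHUNK_DATA, and the input PLM is modeled as a pointer into a flat input array, which the generated code does not do.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Software model of the chunked MAC compute loop (not the generated code).
// chunk plays the role of SIZE_IN_CHUNK_DATA and must be even.
std::vector<int> compute_model(const std::vector<int> &in, unsigned mac_len,
                               unsigned mac_vec, unsigned chunk)
{
    assert(mac_len % 2 == 0 && chunk > 0 && chunk % 2 == 0);
    assert(in.size() == static_cast<std::size_t>(mac_len) * mac_vec);

    std::vector<int> out(mac_vec, 0);
    const unsigned in_length = mac_len * mac_vec;
    unsigned vector_index = 0;   // position inside the current vector
    unsigned vector_number = 0;  // index of the vector being accumulated
    int acc = 0;

    for (unsigned base = 0; base < in_length; base += chunk) {
        const unsigned in_rem = in_length - base;
        const unsigned in_len = in_rem > chunk ? chunk : in_rem;
        const int *inbuff = &in[base];  // models the chunk held in the input PLM

        for (unsigned i = 0; i < in_len; i += 2) {
            acc += inbuff[i] * inbuff[i + 1];  // multiply and accumulate
            vector_index += 2;
            if (vector_index == mac_len) {     // one vector completed
                out[vector_number] = acc;
                acc = 0;
                vector_index = 0;
                vector_number++;
            }
        }
    }
    return out;
}
```

Because acc, vector_index and vector_number live outside the chunk loop, a vector that straddles two chunks is still accumulated correctly; calling the model with a chunk of 6 or of 12 on three vectors of length 4 yields the same result.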
Unlike in the SystemC flow, this time we do not need to explicitly handle a ping-pong buffer. Instead, we apply the Vivado HLS dataflow directive to create a coarse-grain pipeline across the load(), compute(), and store() functions. The directive infers a ping-pong buffer for the memory elements accessed by the three functions. You can see how the default HLS directives are applied by reading the TCL script accelerators/vivado_hls/mac_vivado/hw/hls/common.tcl.
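As an illustration of the structure the dataflow directive operates on, the sketch below shows three functions that communicate only through local buffers. It is not the generated espacc.cc: the names dataflow_top, load_stage, compute_stage and store_stage, the buffer size N and the placeholder computation are all assumptions, and the in-source pragma is used here in place of the TCL directive applied by the ESP scripts. A regular C++ compiler ignores the pragma, so the code also runs as plain software.

```cpp
#include <cassert>

constexpr unsigned N = 8;  // hypothetical buffer size for this sketch

// Each stage reads one buffer and writes another, with no other shared state.
void load_stage(const int *mem_in, int buf[N])
{
    for (unsigned i = 0; i < N; i++)
        buf[i] = mem_in[i];
}

void compute_stage(const int in[N], int out[N])
{
    for (unsigned i = 0; i < N; i++)
        out[i] = in[i] * 2;  // placeholder computation
}

void store_stage(const int buf[N], int *mem_out)
{
    for (unsigned i = 0; i < N; i++)
        mem_out[i] = buf[i];
}

void dataflow_top(const int *mem_in, int *mem_out)
{
    assert(mem_in != nullptr && mem_out != nullptr);
#pragma HLS dataflow
    int inbuff[N];   // written by load_stage, read by compute_stage
    int outbuff[N];  // written by compute_stage, read by store_stage

    load_stage(mem_in, inbuff);
    compute_stage(inbuff, outbuff);
    store_stage(outbuff, mem_out);
}
```

The single-producer, single-consumer use of inbuff and outbuff is what lets the tool duplicate each buffer and overlap the three stages.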
Testbench implementation
The testbench code is generated at the path accelerators/vivado_hls/mac_vivado/hw/tb/. To complete and specialize it for the target accelerator, open tb.cc and locate the initialization of the input array inbuff and of the golden output array outbuff_gold. Replace the default initialization code with the following.
// Prepare input data
for(unsigned i = 0; i < mac_n; i++)
    for(unsigned j = 0; j < mac_len * mac_vec; j++)
        inbuff[i * in_words_adj + j] = (word_t) j % mac_vec;
for(unsigned i = 0; i < dma_in_size; i++)
    for(unsigned k = 0; k < VALUES_PER_WORD; k++)
        mem[i].word[k] = inbuff[i * VALUES_PER_WORD + k];
// Set golden output
for (int i = 0; i < mac_n; i++)
    for (int j = 0; j < mac_vec; j++) {
        outbuff_gold[i * out_words_adj + j] = 0;
        for (int k = 0; k < mac_len; k += 2)
            outbuff_gold[i * out_words_adj + j] +=
                inbuff[i * in_words_adj + j * mac_len + k] * inbuff[i * in_words_adj + j * mac_len + k + 1];
    }
For the purpose of this tutorial, the input array can be initialized with any dataset, including random numbers. However, make sure your MAC compute body doesn’t overflow the integer representation to avoid validation errors.
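One way to check a candidate dataset for overflow before using it in the testbench is to recompute the accumulation in 64 bits and compare every partial sum against the 32-bit range. The helper below is a sketch under that assumption; mac_fits_int32 is a hypothetical name and is not part of the generated testbench.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Returns true if every partial sum of every MAC vector fits in a 32-bit int.
// in holds mac_vec vectors of mac_len values, stored back to back.
bool mac_fits_int32(const std::vector<std::int32_t> &in,
                    unsigned mac_len, unsigned mac_vec)
{
    assert(mac_len % 2 == 0);
    assert(in.size() == static_cast<std::size_t>(mac_len) * mac_vec);

    const std::int64_t lo = std::numeric_limits<std::int32_t>::min();
    const std::int64_t hi = std::numeric_limits<std::int32_t>::max();

    for (std::size_t j = 0; j < mac_vec; j++) {
        std::int64_t acc = 0;  // wide accumulator, cannot overflow here
        for (std::size_t k = 0; k < mac_len; k += 2) {
            const std::size_t idx = j * mac_len + k;
            acc += static_cast<std::int64_t>(in[idx]) * in[idx + 1];
            if (acc < lo || acc > hi)
                return false;  // a 32-bit accumulator would overflow
        }
    }
    return true;
}
```

The input pattern used above (j % mac_vec with mac_len = 64 and mac_vec = 100) passes this check, since every value is below 100 and each vector sums only 32 products.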
HLS configuration
The HLS script is fully generated at the path accelerators/vivado_hls/mac_vivado/hw/hls and it defines synthesis directives for all of the FPGAs supported by ESP. For every target FPGA, two default HLS configurations are defined: dma32_w<W> and dma64_w<W>, where <W> is the width in bits of the data token (e.g. 16, 32, 64). These two configurations are necessary for integration with both 32-bit and 64-bit architectures. You are free to define more implementations in the synthesis script custom.tcl, which may or may not exist for both 32-bit and 64-bit systems, but the suffix _dma32_w<W> or _dma64_w<W> must be used when naming the HLS configurations. Look at the synthesis script common.tcl at the path accelerators/vivado_hls/mac_vivado/hw/hls to see how you can create new RTL configurations for Vivado HLS. In addition, you may set common HLS directives across all RTL configurations in the script directives.tcl, located at the same path.
Please do not modify common.tcl, unless you know what you are doing, to avoid issues with existing accelerator examples for Vivado HLS.
Simulation and RTL implementation
Choose one of the supported boards to create your new SoC instance. Design paths in this tutorial refer to the Xilinx VC707 evaluation board, but all instructions are valid for any of the supported boards.
After creating the MAC accelerator, ESP discovers it in the library of components and generates a set of make targets for it.
# Move to the Xilinx VC707 working folder
cd <esp>/socs/xilinx-vc707-xc7vx485t
# Run behavioral simulation and HLS with Vivado HLS
make mac_vivado-hls
The unit test C++-RTL co-simulation is not supported, because the C++ testbench
cannot model the DMA controller and respond to the blocking requests of the
accelerator.
Note: using the ESP accelerator template generator guarantees that the generated RTL implements an interface compliant with the ESP accelerator socket. However, if you need to substantially modify the load() and store() functions, we recommend you choose the SystemC flow. This allows you to debug DMA transactions of the accelerator after running HLS with a unit SystemC testbench that models the behavior of the DMA controller in the ESP accelerator tile.
Accelerator debug
You can debug your accelerator by executing the C++ testbench within the Vivado HLS environment with make mac_vivado-hls. Before attempting synthesis, behavioral simulation is executed and any print statement embedded in either the testbench or the accelerator source code will show on the shell.
This simulation simply executes the C++ program defined in the testbench and is therefore very fast. However, unlike in the SystemC flow, this simulation is completely untimed and will not detect any bug at the interface. With the C/C++ flow you can only debug I/O-related issues through the RTL system simulation.
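As an example of the kind of print statement that helps during behavioral simulation, the sketch below compares an output buffer against the golden output and reports each mismatching word. The function name validate and the use of std::vector are assumptions made for this example; the generated tb.cc has its own validation code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

// Compare the accelerator output against the golden output.
// Prints one line per mismatching word and returns the number of mismatches.
unsigned validate(const std::vector<int> &out, const std::vector<int> &gold)
{
    assert(out.size() == gold.size());

    unsigned errors = 0;
    for (std::size_t i = 0; i < out.size(); i++) {
        if (out[i] != gold[i]) {
            std::printf("mismatch at word %zu: got %d, expected %d\n",
                        i, out[i], gold[i]);
            errors++;
        }
    }
    return errors;
}
```

Printing the index together with both values makes it easier to tell whether a failure comes from the accumulation itself or from the position at which results are written.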
2. Accelerator integration
The integration flow is identical for both the C/C++ and the SystemC flows. Please refer to Part 2 of the guide How to: design an accelerator in SystemC.
FPGA prototyping with prebuilt material
With the provided prebuilt material, you can run the tutorial on FPGA directly. Each packet is marked with the first digits of the Git revision it was created and tested with.
The packet contains the following:
- The source code, testbench and HLS scripts for the MAC accelerator (accelerators/vivado_hls/mac)
- The bare-metal test application and the Linux device driver and test application for the MAC accelerator (soft/[ariane|leon3]/drivers/mac)
- Two working folders for Xilinx VCU118 and Xilinx VC707, each including:
  - The Linux image (linux.bin)
  - The bare-metal application (mac.bin)
  - The boot loader image (prom.bin)
  - The FPGA bitstream (top.bit)
  - The hidden configuration files for the design (.grlib_config and .esp_config)
  - A script to run the design on FPGA (runme.sh)
Note: this prebuilt package will create an accelerator named mac. This will cause a conflict with the prebuilt package for the SystemC flow. If you use both prebuilt packages, please run them separately, each in a clean ESP repository. If, instead, you follow both tutorials, you may want to change the name of one of the two accelerators to avoid this conflict.
Decompress the content of the packet from the ESP root folder to make sure all
files are extracted to the right location.
cd <esp>
tar xf ESP_CppAcc_GitRev.c7d878d.tar.gz
Enter one of the SoC instances extracted from the packet.
cd socs/cpp_acc_vc707
Follow the “UART interface” instructions from the “How to: design a single-core SoC” guide, then launch the runme.sh script.
# Execute baremetal test
./runme.sh mac
# Boot Linux
./runme.sh
Finally, from the ESP Linux terminal, run the MAC test application.
$ cd /applications/test/
$ ./mac.exe