Guide – How to: design an accelerator in C/C++ (Xilinx Vivado HLS)

Latest update: 2020-12-21

This guide illustrates how to create and integrate an accelerator with the ESP high-level synthesis (HLS) flow, using C++ as the specification language and Xilinx Vivado HLS to generate a corresponding RTL implementation. Most of the C/C++ flow is identical to the SystemC flow. We will refer to the guide How to: design an accelerator in SystemC for all overlapping steps.

1. Accelerator design
2. Accelerator integration
- FPGA prototyping with prebuilt material

Note: Make sure to complete the prerequisite tutorials before getting started with this one. This tutorial assumes that accelerator designers are familiar with the ESP infrastructure and know how to run basic make targets to create a simple instance of ESP, integrating just a single core.

1. Accelerator design

In this guide and in the corresponding prebuilt material we integrate an accelerator that performs multiply and accumulate (MAC) on integer vectors of configurable length. Specifically, we implement the following small kernel of computation.

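// MAC
// data_in holds mac_vec vectors of mac_len integers, stored back to back
int *data_in = new int[mac_vec * mac_len];
int *acc = new int[mac_vec];
for (int iterations = 0; iterations < mac_n; iterations++) {
	load_input_data(data_in);
	for (int j = 0; j < mac_vec; j++) {
		acc[j] = 0;
		// Multiply adjacent pairs of elements and accumulate the products
		for (int i = 0; i < mac_len; i += 2)
			acc[j] += data_in[j * mac_len + i] * data_in[j * mac_len + i + 1];
	}
	store_output_data(acc);
}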

This is a simple example that introduces users to the automation mechanisms offered by ESP. The tutorial does not, however, explore the capabilities of the HLS tool for design-space exploration.
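As a point of comparison for the accelerator output, the kernel above can be written as a small host-side reference function. This is only a sketch: the name mac_reference and the use of std::vector are assumptions made for illustration and are not part of the ESP-generated code.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical software reference for the MAC kernel (not part of ESP).
// data_in holds mac_vec vectors of mac_len integers, stored back to back.
std::vector<int> mac_reference(const std::vector<int> &data_in,
                               std::size_t mac_vec, std::size_t mac_len)
{
    std::vector<int> acc(mac_vec, 0);
    for (std::size_t j = 0; j < mac_vec; j++) {
        // Multiply adjacent pairs of elements and accumulate per vector
        for (std::size_t i = 0; i + 1 < mac_len; i += 2)
            acc[j] += data_in[j * mac_len + i] * data_in[j * mac_len + i + 1];
    }
    return acc;
}
```

For example, with mac_vec = 2, mac_len = 4 and the input {1, 2, 3, 4, 5, 6, 7, 8}, the function returns {14, 86}, that is 1*2 + 3*4 and 5*6 + 7*8.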

Note: The users have access to prebuilt material to run the tutorial on an FPGA, without executing all the previous steps. See the ‘FPGA prototyping with prebuilt material’ section at the end of this guide.

Accelerator skeleton

ESP provides an interactive script that generates all of the hardware and software sockets to quickly integrate a new accelerator in a full SoC. The generation of the accelerator skeleton is the same for both SystemC and C/C++ flows, except for the selected design flow, which is based on Cadence Stratus HLS for SystemC and Vivado HLS for C/C++.

# Move to the ESP root folder
cd <esp>
# Run the accelerator initialization script and respond as follows
./tools/accgen/accgen.sh
=== Initializing ESP accelerator template ===

  * Enter accelerator name [dummy]: mac
  * Select design flow (Stratus HLS, Vivado HLS) [S]: V
  * Enter ESP path [/space/esp-master]:
  * Enter unique accelerator id as three hex digits [04A]: 056
  * Enter accelerator registers
    - register 0 name [size]: mac_len
    - register 0 default value [1]: 64
    - register 0 max value [64]:
    - register 1 name []: mac_vec
    - register 1 default value [1]: 100
    - register 1 max value [100]:
    - register 2 name []: mac_n
    - register 2 default value [1]:
    - register 2 max value [1]: 16
    - register 3 name []:
  * Configure PLM size and create skeleton for load and store:
    - Enter data bit-width (8, 16, 32, 64) [32]:
    - Enter input data size in terms of configuration registers (e.g. 2 * mac_len}) [mac_len]: mac_len * mac_vec
      data_in_size_max = 6400
    - Enter output data size in terms of configuration registers (e.g. 2 * mac_len) [mac_len]: mac_vec
      data_out_size_max = 100
    - Enter an integer chunking factor (use 1 if you want PLM size equal to data size) [1]:
      Input PLM has 6400 32-bits words
      Output PLM has 100 32-bits words
    - Enter number of input data to be processed in batch (can be function of configuration registers) [1]: mac_n
      batching_factor_max = 16

=== Generated accelerator skeleton for mac ===

You can find a description of the parameters configured by the accelerator initialization script in the section Accelerator skeleton of the guide for the SystemC flow.
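The sizes reported by the script follow directly from the maximum register values entered in the dialog. The sketch below reproduces that arithmetic. The constant names are chosen for illustration, and dividing by the chunking factor is an assumption about how the script scales the PLM for factors other than 1.

```cpp
// Maximum register values entered in the dialog (names are illustrative)
constexpr unsigned mac_len_max = 64;
constexpr unsigned mac_vec_max = 100;
constexpr unsigned mac_n_max = 16;
constexpr unsigned chunking_factor = 1;

// Input size is mac_len * mac_vec, output size is mac_vec
constexpr unsigned data_in_size_max = mac_len_max * mac_vec_max;
constexpr unsigned data_out_size_max = mac_vec_max;

// With a chunking factor of 1, each PLM holds the full data size.
// Assumption: larger factors divide the PLM size accordingly.
constexpr unsigned plm_in_words = data_in_size_max / chunking_factor;
constexpr unsigned plm_out_words = data_out_size_max / chunking_factor;

// The batching factor was entered as mac_n
constexpr unsigned batching_factor_max = mac_n_max;

static_assert(plm_in_words == 6400, "matches 'Input PLM has 6400 32-bits words'");
static_assert(plm_out_words == 100, "matches 'Output PLM has 100 32-bits words'");
static_assert(batching_factor_max == 16, "matches 'batching_factor_max = 16'");
```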

Executing the initialization script with the above parameters generates the accelerator source files and testbench in C++, together with the HLS scripts. These files are located at the path accelerators/vivado_hls/mac_vivado/hw.

In addition, the accelerator’s device driver, bare-metal application and user-space Linux application are generated at the path accelerators/vivado_hls/mac_vivado/sw.

# Complete list of files generated and modified
<esp>/accelerators/vivado_hls/mac_vivado
├── hw
│   ├── mac.xml              # Accelerator description and register list
│   ├── hls                  # HLS scripts
│   │   ├── common.tcl
│   │   ├── custom.tcl       # Customizable system-level configuration
│   │   ├── directives.tcl   # User-defined HLS directives
│   │   └── Makefile
│   ├── inc                  # Folder for code header files
│   │   ├── espacc_config.h  # Data types and local memory size definitions
│   │   └── espacc.h         # Constants and defines for the ESP accelerator
│   ├── src                  # Accelerator source files
│   │   └── espacc.cc        # Accelerator specification
│   └── tb
│       └── tb.cc            # Testbench
└── sw
    ├── baremetal            # Bare-metal test application
    │   ├── mac.c
    │   └── Makefile
    └── linux
        ├── app              # Linux test application
        │   ├── mac.c
        │   └── Makefile
        ├── driver           # Linux device driver
        │   ├── mac_vivado.c
        │   ├── Kbuild
        │   └── Makefile
        └── include
            └── mac_vivado.h

Accelerator behavior implementation

Similarly to the SystemC HLS flow, this step consists of editing the compute portion of the ESP accelerator skeleton.

Source files for the MAC accelerator are generated at the path accelerators/vivado_hls/mac_vivado/hw/src/. From this folder, open espacc.cc and locate the definition of the function compute(). Scroll down to the code section marked by the comment // TODO implement compute functionality and replace the remaining code with the following.

// Compute
unsigned in_length = mac_len * mac_vec;
unsigned out_length = mac_vec;

unsigned vector_index = 0;
unsigned vector_number = 0;
int acc = 0;

for (int in_rem = in_length; in_rem > 0; in_rem -= SIZE_IN_CHUNK_DATA)
{

    unsigned in_len  = in_rem  > SIZE_IN_CHUNK_DATA  ? SIZE_IN_CHUNK_DATA  : in_rem;

    // Computing phase implementation
    for (int i = 0; i < in_len; i += 2) {

        // Multiply and accumulate
        acc += _inbuff[i] * _inbuff[i+1];

        vector_index += 2;

        // Write accumulated result
        if (vector_index == mac_len) {
            _outbuff[vector_number] = acc;

            acc = 0;
            vector_index = 0;
            vector_number++;
        }
    }
}

Note: The prebuilt material contains the complete source code of the MAC accelerator.

Without editing the code, the generated accelerator implements an identity function that moves data from input to output. With respect to this skeleton, the snippet above implements the following changes.

- The inner loop processes two elements of the input data per iteration (i += 2).
- A new variable acc is used to accumulate one vector of length mac_len.
- Writes to the output memory occur only when the accumulation for one vector is completed.
- Two additional counters, vector_index and vector_number, are used to keep track of the position in the current vector and of the number of the vector that is being processed.

Please take a moment to understand and get familiar with these changes in the code above.
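To trace the counters on a small input, the compute body can be wrapped in a standalone function and run on the host. This is a sketch under stated assumptions: the wrapper mac_compute is hypothetical, SIZE_IN_CHUNK_DATA is redefined locally as 6400 to match the input PLM size reported by the generator, and the input is assumed to have an even length and to fit in a single chunk.

```cpp
#include <vector>

// Hypothetical stand-in for the chunk size defined in the generated headers
const int SIZE_IN_CHUNK_DATA = 6400;

// Hypothetical host-side wrapper around the compute body shown above
std::vector<int> mac_compute(const std::vector<int> &inbuff,
                             int mac_len, int mac_vec)
{
    std::vector<int> outbuff(mac_vec, 0);
    int in_length = mac_len * mac_vec;
    int vector_index = 0;   // position inside the current vector
    int vector_number = 0;  // index of the vector being accumulated
    int acc = 0;

    for (int in_rem = in_length; in_rem > 0; in_rem -= SIZE_IN_CHUNK_DATA) {
        int in_len = in_rem > SIZE_IN_CHUNK_DATA ? SIZE_IN_CHUNK_DATA : in_rem;

        for (int i = 0; i < in_len; i += 2) {
            // Multiply and accumulate one pair of elements
            acc += inbuff[i] * inbuff[i + 1];
            vector_index += 2;

            // Write the result once a full vector has been consumed
            if (vector_index == mac_len) {
                outbuff[vector_number] = acc;
                acc = 0;
                vector_index = 0;
                vector_number++;
            }
        }
    }
    return outbuff;
}
```

With mac_len = 2, mac_vec = 3 and the input {1, 1, 2, 2, 3, 3}, each pair completes a vector, so the function returns {1, 4, 9}.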

Unlike in the SystemC flow, this time we do not need to explicitly handle a ping-pong buffer. Instead, we apply the Vivado HLS dataflow directive to create a coarse-grain pipeline across the load(), compute(), and store() functions. The directive infers a ping-pong buffer for the memory elements accessed by the three functions. You can see how the default HLS directives are applied by reading the Tcl script accelerators/vivado_hls/mac_vivado/hw/hls/common.tcl.
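The structure that the dataflow directive acts on can be sketched in a single file as follows. The function signatures, the buffer size N, and the top-level name are all hypothetical and much simpler than the generated skeleton. The pragma form is used here only to keep the sketch self-contained, whereas the ESP flow applies its directives from the Tcl scripts. A regular C++ compiler ignores the pragma, so the code also runs as plain software.

```cpp
// Hypothetical, simplified top level (not the generated ESP skeleton)
const int N = 8;

static void load(const int *mem_in, int inbuff[N])
{
    for (int i = 0; i < N; i++)
        inbuff[i] = mem_in[i];
}

static void compute(const int inbuff[N], int outbuff[N / 2])
{
    // One product per pair of input elements
    for (int i = 0; i < N; i += 2)
        outbuff[i / 2] = inbuff[i] * inbuff[i + 1];
}

static void store(const int outbuff[N / 2], int *mem_out)
{
    for (int i = 0; i < N / 2; i++)
        mem_out[i] = outbuff[i];
}

void top(const int *mem_in, int *mem_out)
{
#pragma HLS dataflow
    // Buffers shared by consecutive stages. Under dataflow the tool can
    // implement them as ping-pong memories so that the stages overlap.
    int inbuff[N];
    int outbuff[N / 2];

    load(mem_in, inbuff);
    compute(inbuff, outbuff);
    store(outbuff, mem_out);
}
```

Each buffer is written by exactly one function and read by exactly one function, which is the producer-consumer pattern that lets the three stages run as a pipeline.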

Testbench implementation

The testbench code is generated at the path accelerators/vivado_hls/mac_vivado/hw/tb/. To complete and specialize it for the target accelerator, open tb.cc and locate the initialization of the input array inbuff and of the golden output array outbuff_gold. Replace the default initialization code with the following.

// Prepare input data
for(unsigned i = 0; i < mac_n; i++)
    for(unsigned j = 0; j < mac_len * mac_vec; j++)
        inbuff[i * in_words_adj + j] = (word_t) j % mac_vec;

for(unsigned i = 0; i < dma_in_size; i++)
    for(unsigned k = 0; k < VALUES_PER_WORD; k++)
        mem[i].word[k] = inbuff[i * VALUES_PER_WORD + k];

// Set golden output
for (int i = 0; i < mac_n; i++)
    for (int j = 0; j < mac_vec; j++) {
        outbuff_gold[i * out_words_adj + j] = 0;
        for (int k = 0; k < mac_len; k += 2)
            outbuff_gold[i * out_words_adj + j] +=
                inbuff[i * in_words_adj + j * mac_len + k] * inbuff[i * in_words_adj + j * mac_len + k + 1];
    }

For the purpose of this tutorial, the input array can be initialized with any dataset, including random numbers. However, make sure your MAC compute body doesn’t overflow the integer representation to avoid validation errors.
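A quick way to check a candidate dataset against overflow is to bound the largest possible accumulated value. The sketch below assumes 32-bit signed results and uses hypothetical helper names. It computes the bound in 64 bits so that the check itself cannot overflow for the values used here.

```cpp
#include <cstdint>
#include <limits>

// Worst-case magnitude of one result: mac_len / 2 products, each bounded
// by max_abs_value squared (hypothetical helper, computed in 64 bits)
constexpr std::int64_t worst_case_acc(std::int64_t mac_len,
                                      std::int64_t max_abs_value)
{
    return (mac_len / 2) * max_abs_value * max_abs_value;
}

// True if the worst case still fits in a 32-bit signed integer
constexpr bool fits_in_int32(std::int64_t mac_len, std::int64_t max_abs_value)
{
    return worst_case_acc(mac_len, max_abs_value) <=
           std::numeric_limits<std::int32_t>::max();
}

// The initialization above uses j % mac_vec, so values are at most 99
// when mac_vec = 100, and mac_len = 64 gives 32 products per vector
static_assert(worst_case_acc(64, 99) == 313632, "32 * 99 * 99");
static_assert(fits_in_int32(64, 99), "tutorial dataset is safe");

// Full-range 16-bit magnitudes would not be safe with mac_len = 64
static_assert(!fits_in_int32(64, 32768), "32 * 2^30 exceeds INT32_MAX");
```

The same check can be applied to a random dataset by passing the largest absolute value the generator can produce.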

HLS configuration

The HLS scripts are generated at the path accelerators/vivado_hls/mac_vivado/hw/hls and define synthesis directives for all of the FPGAs supported by ESP. For every target FPGA, two default HLS configurations are defined: dma32_w<W> and dma64_w<W>, where <W> is the width of the data token in bits (e.g. 16, 32, 64). These two configurations are necessary for integration with both 32-bit and 64-bit architectures. You are free to define more implementations in the synthesis script custom.tcl, and they need not exist for both 32-bit and 64-bit systems, but their names must end with the suffix _dma32_w<W> or _dma64_w<W>. Look at the synthesis script common.tcl at the same path to see how to create new RTL configurations for Vivado HLS. In addition, you may set HLS directives common to all RTL configurations in the script directives.tcl, located at the same path. Please do not modify common.tcl unless you know what you are doing, to avoid issues with the existing accelerator examples for Vivado HLS.
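The naming rule for custom configurations can be expressed as a small check. The helper below is hypothetical and not part of ESP, and the configuration names in the usage note are made up for illustration. The accepted widths are an assumption that mirrors the data bit-widths offered by the generator (8, 16, 32, 64).

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of ESP): returns true if an HLS
// configuration name ends with _dma32_w<W> or _dma64_w<W>.
// Assumption: <W> is one of the widths offered by the generator.
bool has_valid_dma_suffix(const std::string &name)
{
    static const std::regex suffix("_dma(32|64)_w(8|16|32|64)$");
    return std::regex_search(name, suffix);
}
```

Under these assumptions, a name such as fast_dma32_w32 passes the check, while fast, fast_dma128_w32 and fast_dma32_w32_v2 do not.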

Simulation and RTL implementation

Choose one of the supported boards to create your new SoC instance. Design paths in this tutorial refer to the Xilinx VC707 evaluation board, but all instructions are valid for any of the supported boards.

After creating the MAC accelerator, ESP discovers it in the library of components and generates a set of make targets for it.

# Move to the Xilinx VC707 working folder
cd <esp>/socs/xilinx-vc707-xc7vx485t

# Run behavioral simulation and HLS with Vivado HLS
make mac_vivado-hls

The unit test C++-RTL co-simulation is not supported, because the C++ testbench cannot model the DMA controller and respond to the blocking requests of the accelerator.

Note: using the ESP accelerator template generator guarantees that the generated RTL implements an interface compliant with the ESP accelerator socket. However, if you need to substantially modify the load() and store() functions, we recommend you choose the SystemC flow. This allows you to debug DMA transactions of the accelerator after running HLS with a unit SystemC testbench that models the behavior of the DMA controller in the ESP accelerator tile.

Accelerator debug

You can debug your accelerator by executing the C++ testbench within the Vivado HLS environment with make mac_vivado-hls. Behavioral simulation runs before synthesis is attempted, and any print statement embedded in either the testbench or the accelerator source code shows up on the shell. This simulation simply executes the C++ program defined in the testbench and is therefore very fast. However, unlike in the SystemC flow, this simulation is completely untimed and will not detect any bug at the interface. With the C/C++ flow, you can debug I/O-related issues only through the RTL system simulation.
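A debug print in the accelerator source can be kept out of the synthesized design by guarding it with the __SYNTHESIS__ macro, which Vivado HLS defines during synthesis. The function below is a hypothetical fragment used only to illustrate the pattern. It is not taken from the generated skeleton.

```cpp
#include <cstdio>

// Hypothetical compute fragment with a debug print. The print is compiled
// only when __SYNTHESIS__ is not defined, that is during the C++
// behavioral simulation.
int mac_pair(int a, int b, int acc)
{
    int next = acc + a * b;
#ifndef __SYNTHESIS__
    std::printf("mac_pair: %d * %d + %d = %d\n", a, b, acc, next);
#endif
    return next;
}
```

Since the simulation is untimed, prints like this help with functional bugs in the compute body, not with interface or DMA issues.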

2. Accelerator integration

The integration flow is identical for both the C/C++ and the SystemC flows. Please refer to Part 2 of the guide How to: design an accelerator in SystemC.

FPGA prototyping with prebuilt material

With the provided prebuilt material, you can run the tutorial on FPGA directly. Each packet is marked with the first digits of the Git revision it was created and tested with.

The packet contains the following:

- The source code, testbench and HLS scripts for the MAC accelerator (accelerators/vivado_hls/mac)
- The bare-metal test application and the Linux device driver and test application for the MAC accelerator (soft/[ariane|leon3]/drivers/mac)
- Two working folders for Xilinx VCU118 and Xilinx VC707, each including:
  - The Linux image (linux.bin)
  - The bare-metal application (mac.bin)
  - The boot loader image (prom.bin)
  - The FPGA bitstream (top.bit)
  - The hidden configuration files for the design (.grlib_config and .esp_config)
  - A script to run the design on FPGA (runme.sh)

Note: this prebuilt package creates an accelerator named mac. This causes a conflict with the prebuilt package for the SystemC flow. If you use both prebuilt packages, please run them separately, each in a clean ESP repository. If, instead, you follow both tutorials, you may want to give the two accelerators different names to avoid this conflict.

Decompress the content of the packet from the ESP root folder to make sure all files are extracted to the right location.

cd <esp>
tar xf ESP_CppAcc_GitRev.c7d878d.tar.gz

Enter one of the SoC instances extracted from the packet.

cd socs/cpp_acc_vc707

Follow the “UART interface” instructions from the “How to: design a single-core SoC” guide, then launch the runme.sh script.

# Execute baremetal test
./runme.sh mac
# Boot Linux
./runme.sh

Finally, from the ESP Linux terminal, run the MAC test application.

$ cd /applications/test/
$ ./mac.exe