
Guide How to: design an accelerator in SystemC (Mentor Catapult HLS)

Latest update: 2022-07-07

This guide illustrates how to create and integrate an accelerator with the ESP high-level synthesis (HLS) flow. The guide uses SystemC as the design-specification language, and Mentor Catapult HLS as the HLS tool. Most of the flow is identical to the SystemC flow with Stratus HLS. We will refer to the guide How to: design an accelerator in SystemC (Cadence Stratus HLS) for all overlapping steps.

Note: Make sure to complete the prerequisite tutorials before getting started with this one. This tutorial assumes that accelerator designers are familiar with the ESP infrastructure and know how to run basic make targets to create a simple instance of ESP, integrating just a single core.

1. Accelerator design

Introduction

In this guide, we will integrate an accelerator that performs multiply and accumulate (MAC) on fixed precision vectors of configurable length. The accelerator is designed leveraging the new open-source Matchlib SystemC library, originally developed by NVIDIA and now included with the Catapult HLS release.

Specifically, we use the following small kernel of computation.

// MAC
int *data_in = new int[mac_vec * mac_len]; // mac_vec vectors, stored back to back
int *acc     = new int[mac_vec];           // one accumulator per vector
for (int iterations = 0; iterations < mac_n; iterations++) {
	load_input_data(data_in);
	for (int j = 0; j < mac_vec; j++) {
		acc[j] = 0;
		// Multiply adjacent elements and accumulate the products
		for (int i = 0; i < mac_len; i += 2)
			acc[j] += data_in[j * mac_len + i] * data_in[j * mac_len + i + 1];
	}
	store_output_data(acc);
}


This is a simple example that introduces users to the automation mechanisms offered by ESP. The tutorial does not, however, explore the capabilities of the HLS tool for design-space exploration.

Note: The users have access to prebuilt material to run the tutorial on an FPGA, without executing all the subsequent steps. See the ‘FPGA prototyping with prebuilt material’ section at the end of this guide.


Accelerator skeleton

ESP provides an interactive script that generates all of the hardware and software sockets to quickly integrate a new accelerator in a full SoC. Across the different design flows, the script provides the same set of options for generating the appropriate skeleton.

Note: The script and the generated skeleton can be very helpful to the accelerator designer. The generated skeletons have a simple structure, and it is left to the designers to modify them to their needs. The script may generate an incorrect skeleton if unsupported inputs are entered, e.g. input data size equal to zero. You can verify the correctness of the skeleton by testing it as described in the rest of the guide, before editing it.

# Move to the ESP root folder
cd <esp>
# Run the accelerator initialization script and respond as follows
./tools/accgen/accgen.sh
=== Initializing ESP accelerator template ===

  * Enter accelerator name [dummy]: mac
  * Select design flow (Stratus HLS, Vivado HLS, hls4ml, Catapult HLS) [S]: C
  * Enter ESP path [/space/esp-master]:
  * Enter unique accelerator id as three hex digits [04A]: 056
  * Enter accelerator registers
    - register 0 name [size]: mac_len
    - register 0 default value [1]: 64
    - register 0 max value [64]:
    - register 1 name []: mac_vec
    - register 1 default value [1]: 100
    - register 1 max value [100]:
    - register 2 name []: mac_n
    - register 2 default value [1]:
    - register 2 max value [1]: 16
    - register 3 name []:
  * Configure PLM size and create skeleton for load and store:
    - Enter data bit-width (8, 16, 32, 64) [32]:
    - Enter input data size in terms of configuration registers (e.g. 2 * mac_len}) [mac_len]: mac_len * mac_vec
      data_in_size_max = 6400
    - Enter output data size in terms of configuration registers (e.g. 2 * mac_len) [mac_len]: mac_vec
      data_out_size_max = 100
    - Enter an integer chunking factor (use 1 if you want PLM size equal to data size) [1]:
      Input PLM has 6400 32-bits words
      Output PLM has 100 32-bits words
    - Enter number of input data to be processed in batch (can be function of configuration registers) [1]: mac_n
      batching_factor_max = 16

=== Generated accelerator skeleton for mac ===


You can find a description of the parameters configured by the accelerator initialization script in the section “Accelerator skeleton” of the guide for the SystemC flow with Stratus HLS.
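The sizes reported by the script follow from the maximum register values and the size expressions entered above. The sketch below reproduces that arithmetic in plain C++. The constant names are hypothetical, and dividing by the chunking factor to obtain the PLM size is an assumption based on the prompt text (a factor of 1 makes the PLM size equal to the data size).

```cpp
#include <cassert>
#include <cstddef>

// Maximum register values entered in the accgen session above.
constexpr std::size_t MAC_LEN_MAX = 64;
constexpr std::size_t MAC_VEC_MAX = 100;
constexpr std::size_t MAC_N_MAX   = 16;
constexpr std::size_t DATA_BITS   = 32;  // data bit-width
constexpr std::size_t CHUNKING    = 1;   // chunking factor

// Size expressions entered at the prompts: "mac_len * mac_vec" and "mac_vec".
constexpr std::size_t data_in_size_max  = MAC_LEN_MAX * MAC_VEC_MAX;
constexpr std::size_t data_out_size_max = MAC_VEC_MAX;

// Assumption: the PLM holds the maximum data size divided by the chunking factor.
constexpr std::size_t plm_in_words  = data_in_size_max / CHUNKING;
constexpr std::size_t plm_out_words = data_out_size_max / CHUNKING;

// Worst-case input footprint in main memory for a full batch, in bytes.
constexpr std::size_t batch_in_bytes_max =
    MAC_N_MAX * data_in_size_max * (DATA_BITS / 8);

static_assert(data_in_size_max == 6400, "matches data_in_size_max in the transcript");
static_assert(data_out_size_max == 100, "matches data_out_size_max in the transcript");
```

With a chunking factor of 1, the input PLM holds one full batch item (6400 words) and the batching register only controls how many times the load, compute, and store sequence repeats.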

Executing the initialization script with the above parameters generates the accelerator source files and testbench in SystemC, together with the HLS scripts. These files are located at the path accelerators/catapult_hls/mac_sysc_catapult/hw.

In addition, the accelerator’s device driver, bare-metal application, and user-space Linux application are generated at the path accelerators/catapult_hls/mac_sysc_catapult/sw.

# Complete list of files generated and modified
<esp>/accelerators/catapult_hls/mac_sysc_catapult
├── hw
│   ├── mac_sysc.xml          # Accelerator description and register list
│   ├── hls                   # HLS scripts
│   │   ├── build_prj.tcl     # Synthesis script
│   │   ├── build_prj_top.tcl # Synthesis script configuration
│   │   ├── rtl_sim.tcl       # RTL simulation script
│   │   └── Makefile
│   ├── inc                    # Folder for code header files
│   │   ├── mac_conf_info.hpp  # Configuration class definition
│   │   ├── mac_data_types.hpp # Accelerator's data-type specifications
│   │   ├── mac_specs.hpp      # Accelerator's scratchpads specifications
│   │   ├── mem_wrap.hpp       # Memory wrapper for Matchlib scratchpads PLMs
│   │   └── mac.hpp            # ESP accelerator definition and memory binding
│   ├── mem_bank               # Single memory bank models and libraries
│   ├── src                    # Accelerator source files
│   │   └── mac.cpp            # Main SystemC processes description (config, load,
│   │                          # compute_read, compute, store_read, store)
│   └── tb                     # SystemC testbench 
│       ├── sc_main.cpp        
│       ├── system.hpp         
│       ├── testbench.cpp      
│       └── testbench.hpp      
└── sw
    ├── baremetal            # Bare-metal test application
    │   ├── mac.c
    │   └── Makefile
    └── linux
        ├── app              # Linux test application
        │   ├── mac.c
        │   └── Makefile
        ├── driver           # Linux device driver
        │   ├── mac_sysc_catapult.c
        │   ├── Kbuild
        │   └── Makefile
        └── include
            └── mac_sysc_catapult.h

In this tutorial, the design style adopted for the accelerator’s structure is strongly driven by the latency-insensitive Matchlib Connections and Matchlib Scratchpad APIs. To better pipeline and unroll the computation kernel, the read requests to the local PLMs are split across two processes. In the compute phase, for example, the read address request is pushed to the scratchpad interface from the compute_read process via a dedicated Matchlib Connection, while the corresponding read response is popped from the compute process, as shown below.

// SCRATCHPAD READ ACCESS API

void mac::compute_read() {
    ...
    if (ping_pong)
        in_ping_ra.Push(rreq);
    else
        in_pong_ra.Push(rreq);
    ...
}

void mac::compute() {
    ...
    FPDATA_WORD data;
    if (ping_pong)
        data = in_ping_rd.Pop().data[0];
    else
        data = in_pong_rd.Pop().data[0];
    ...
}

This applies to any process that performs read accesses to PLMs mapped to Matchlib scratchpads. In the ESP accelerator model, this is the case for the compute and store processes. As a consequence, we deviate from the standard 4-process structure (configure, load, compute and store) adopted for the other flows supported by ESP, and shift to a 6-process structure: configure, load, compute_read, compute, store_read and store.
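The split between a process that issues read addresses and a process that consumes read data can be modeled in plain C++. This is only a rough sketch: Channel, scratchpad_serve, and run_split_read are hypothetical names, std::queue stands in for a Matchlib Connection, and the stages run one after another here, whereas the hardware processes run concurrently and can stall each other.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Hypothetical stand-in for a latency-insensitive channel.
// Handshaking and back-pressure are not modeled.
template <typename T>
struct Channel {
    std::queue<T> q;
    void Push(const T& v) { q.push(v); }
    T Pop() {
        assert(!q.empty());
        T v = q.front();
        q.pop();
        return v;
    }
    bool Empty() const { return q.empty(); }
};

// Stand-in for the scratchpad: answers each queued address with the stored word.
void scratchpad_serve(const std::vector<int>& plm,
                      Channel<std::size_t>& read_addr,
                      Channel<int>& read_data) {
    while (!read_addr.Empty())
        read_data.Push(plm[read_addr.Pop()]);
}

// Role of compute_read: issue one read request per PLM word.
void compute_read(std::size_t len, Channel<std::size_t>& read_addr) {
    for (std::size_t i = 0; i < len; i++)
        read_addr.Push(i);
}

// Role of compute: consume the responses in order, without handling addresses.
int compute(std::size_t len, Channel<int>& read_data) {
    int sum = 0;
    for (std::size_t i = 0; i < len; i++)
        sum += read_data.Pop();
    return sum;
}

// Sequential driver that chains the three stages.
int run_split_read(const std::vector<int>& plm) {
    Channel<std::size_t> read_addr;
    Channel<int> read_data;
    compute_read(plm.size(), read_addr);
    scratchpad_serve(plm, read_addr, read_data);
    return compute(plm.size(), read_data);
}
```

Because the consumer never computes addresses, its loop body reduces to a pop and the arithmetic, which is what makes it easier for the HLS tool to pipeline.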



Accelerator behavior implementation

As in the SystemC Stratus HLS flow, this step consists of editing the compute portion of the ESP accelerator skeleton.

The source file to edit for the MAC accelerator is generated at the path accelerators/catapult_hls/mac_sysc_catapult/hw/src/mac.cpp. In this file, locate the definition of the SystemC process compute(). Scroll down to the code section delimited by the comments // Compute Kernel and // End Compute kernel and replace it with the following.

// Compute Kernel
		
FPDATA acc_fx=0;
int vec_indx=0;
int vec_num=0;

#pragma hls_pipeline_init_interval 2
#pragma pipeline_stall_mode flush
for (int  i=0; i < in_len; i+=2)
{
    FPDATA_WORD op[2];
    
    if (ping_pong){
        op[0]=in_ping_rd.Pop().data[0];
        op[1]=in_ping_rd.Pop().data[0];
    }
    else{
        op[0]=in_pong_rd.Pop().data[0];
        op[1]=in_pong_rd.Pop().data[0];
    }

    FPDATA op1_fx=0;
    FPDATA op3_fx=0;

    int2fx(op[0],op1_fx);
    int2fx(op[1],op3_fx);

    // Multiply and accumulate
    acc_fx+=op1_fx * op3_fx;

    vec_indx+=2;

    // Write accumulated result
    if (vec_indx == mac_len){
       FPDATA_WORD acc=0;
       fx2int(acc_fx,acc);
       plm_WR<out_as, outwp> rreq;

       rreq.indx[0]=vec_num;
       rreq.data[0]=acc;

       if (out_ping_pong)
          out_ping_w.Push(rreq);
       else
          out_pong_w.Push(rreq);

       vec_num++;
       vec_indx=0;
       acc_fx=0;
    }
}
		
// End Compute kernel


NOTE: The prebuilt material contains the complete source code of the MAC accelerator.

Without editing the code, the generated accelerator implements an identity function that moves data from input to output. With respect to this skeleton, the snippet above implements the following changes.

  • The inner loop multiplies two elements of the input data per iteration (i += 2).
  • A new variable acc_fx is used to accumulate over one vector of length mac_len.
  • Writes to the output memory occur only when the accumulation for one vector is completed.
  • Two additional counters, vec_indx and vec_num, are used to keep track of the position in the current vector and of the number of the vector that is being processed.
  • Switching between the output ping-pong buffers occurs at a different rate with respect to input buffers. Hence a new variable out_ping_pong is used to control which output buffer should be written to.

Please take a moment to understand and get familiar with these changes in the code above.
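To see why a single flat loop with two counters is equivalent to a nested loop over vectors, the following sketch compares the two in plain C++. It uses ordinary integers instead of the fixed-point FPDATA type and omits the ping-pong buffers. The function names mac_reference and mac_streaming are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Nested-loop reference: one accumulation pass per vector.
std::vector<long long> mac_reference(const std::vector<int>& in,
                                     std::size_t mac_vec, std::size_t mac_len) {
    assert(in.size() >= mac_vec * mac_len && mac_len % 2 == 0);
    std::vector<long long> out(mac_vec, 0);
    for (std::size_t j = 0; j < mac_vec; j++)
        for (std::size_t i = 0; i < mac_len; i += 2)
            out[j] += static_cast<long long>(in[j * mac_len + i]) * in[j * mac_len + i + 1];
    return out;
}

// Single flat loop over the whole input, mirroring the kernel structure:
// vec_indx tracks the position inside the current vector and vec_num
// selects the output slot written when a vector is complete.
std::vector<long long> mac_streaming(const std::vector<int>& in, std::size_t mac_len) {
    assert(mac_len >= 2 && mac_len % 2 == 0 && in.size() % mac_len == 0);
    std::vector<long long> out(in.size() / mac_len, 0);
    long long acc = 0;
    std::size_t vec_indx = 0;
    std::size_t vec_num = 0;
    for (std::size_t i = 0; i < in.size(); i += 2) {
        acc += static_cast<long long>(in[i]) * in[i + 1];
        vec_indx += 2;
        if (vec_indx == mac_len) {  // end of the current vector
            out[vec_num] = acc;     // write the accumulated result once
            vec_num++;
            vec_indx = 0;
            acc = 0;
        }
    }
    return out;
}
```

For the input {1, 2, 3, 4, 5, 6, 7, 8} with mac_len equal to 4, both functions produce 14 and 86. The flat form has a single loop for the HLS tool to pipeline, at the cost of the extra counters.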


Testbench implementation

The testbench code is generated at the path accelerators/catapult_hls/mac_sysc_catapult/hw/tb/. To complete and specialize it for the target accelerator, open testbench.cpp and locate the computation of the golden output array gold. Replace the default initialization code with the following.

for (int i = 0; i < mac_n; i++)
    for (int j = 0; j < mac_vec; j++)
    {
        gold[i * out_words_adj + j] = 0;
        FPDATA acc = 0;
        for (int k = 0; k < mac_len; k += 2)
        {
            FPDATA data1;
            FPDATA data2;
            int2fx(in[i * in_words_adj + j * mac_len + k], data1);
            int2fx(in[i * in_words_adj + j * mac_len + k + 1], data2);
            acc += data1 * data2;
        }
        FPDATA_WORD acc_int;
        fx2int(acc, acc_int);
        gold[i * out_words_adj + j] = acc_int;
    }


For the purpose of this tutorial, the input array can be initialized with any dataset, including random numbers. However, make sure that the MAC computation does not overflow the fixed-point representation; otherwise the validation will report errors.
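One way to check a candidate dataset before using it is to run the accumulation in double precision and test every partial sum against the representable range. The sketch below assumes a signed 32-bit format with 16 integer bits (sign included), which is the default half-and-half split for a 32-bit data width. The helper names fits_fixed and mac_fits are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Assumed format: signed fixed point with 16 integer bits, sign included.
constexpr int INT_BITS = 16;

// True when -2^(INT_BITS-1) <= value < 2^(INT_BITS-1).
bool fits_fixed(double value) {
    const double limit = std::ldexp(1.0, INT_BITS - 1);  // 2^15 = 32768
    return value >= -limit && value < limit;
}

// Accumulates the products of adjacent elements of one vector and reports
// whether every partial sum stays inside the representable range.
bool mac_fits(const std::vector<double>& vec) {
    double acc = 0.0;
    for (std::size_t k = 0; k + 1 < vec.size(); k += 2) {
        acc += vec[k] * vec[k + 1];
        if (!fits_fixed(acc))
            return false;
    }
    return true;
}
```

For example, a vector of 64 elements all equal to 31.25 accumulates to 31250, which fits, while 64 elements equal to 32.0 accumulate to 32768, which is just outside the assumed range.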


Simulation and RTL implementation

The HLS script is fully generated at the path accelerators/catapult_hls/mac_sysc_catapult/hw/hls, and it defines synthesis directives for all of the FPGAs supported by ESP. For behavioral simulation, HLS, and post-HLS RTL simulation, the user must set the environment variable DMA_WIDTH to select the appropriate version for the chosen SoC architecture (32 bits if using the Leon3 or Ibex cores, 64 bits if using the Ariane core).

Choose one of the supported boards to create your new SoC instance. Design paths in this tutorial refer to the Xilinx VC707 evaluation board and to a target SoC with a 64-bit architecture, but all instructions are valid for any of the supported boards and any system architecture.

After creating the MAC accelerator, ESP discovers it in the library of components and generates a set of make targets for it.

# Move to the Xilinx VC707 working folder
cd <esp>/socs/xilinx-vc707-xc7vx485t

# Run behavioral simulation
DMA_WIDTH=64 make mac_sysc_catapult-exe

# Generate RTL with HLS
DMA_WIDTH=64 make mac_sysc_catapult-hls

# Simulate RTL implementation
DMA_WIDTH=64 make mac_sysc_catapult-sim


Accelerator debug

At every simulation stage, you may encounter issues that require debugging.

If the simulation output is incorrect at the behavioral level, you can debug your implementation as you would debug any C++ program. If you need to change compile flags, in order to run a debugger, you can do so by modifying the Makefile located at accelerators/catapult_hls/common/systemc.mk.

If errors occur during RTL simulation that do not occur in behavioral simulation, you can use the RTL simulator to visualize waveforms. To do so, edit the RTL-simulation command in accelerators/catapult_hls/mac_sysc_catapult/hw/hls-work-virtex7/rtl_sim.tcl, replacing SIMTOOL=msim sim with SIMTOOL=msim simgui to open the simulator GUI. Then rerun the RTL simulation target as specified above and debug the design using the waveforms in the simulator’s GUI.



2. Accelerator customization

Matchlib Scratchpad management

In this section, we show how to parallelize the scratchpad memory accesses in order to improve the kernel execution latency. In the specific case of our MAC accelerator, we may want to fetch the two values to multiply and accumulate in the same clock cycle. To do so, the scratchpad must be split into 2 memory banks so that 2 words can be read in the same cycle. The Matchlib scratchpad APIs provide the user with a dedicated parameter to decide how many memory banks the scratchpad should be mapped to. Once this value is set, the decoding logic necessary to address the target value and avoid bank conflicts is automatically inferred.

Open the specs file located at accelerators/catapult_hls/mac_sysc_catapult/hw/inc/mac_specs.hpp and set both PLM_IN_RP (# read ports of input PLM) and inbks (# memory banks of input PLM) to 2.

  • From accelerators/catapult_hls/mac_sysc_catapult/hw/src open mac.cpp, locate the comment //Send read memory requests to input PLM and apply the following changes to the corresponding loop:
    • change the i iterator update rule to i+=2
    • replace rreq.indx[0]=i; with
      #pragma hls_unroll yes
      for (uint16_t k = 0; k < 2; k++)
       rreq.indx[k]=i+k;
      
  • From accelerators/catapult_hls/mac_sysc_catapult/hw/src open mac.cpp, locate the comment //Retrive read memory responses from input PLM and apply the following changes to the corresponding loop:
    • Change the loop II (initiation interval) to 1, as a result of the parallelized memory accesses.
    • Substitute the following code section:
      if (ping_pong){
         op[0]=in_ping_rd.Pop().data[0];
         op[1]=in_ping_rd.Pop().data[0];
      }
      else{
         op[0]=in_pong_rd.Pop().data[0];
         op[1]=in_pong_rd.Pop().data[0];
      }
      

      with:

      plm_RRs<inrp> rrsp;
      if (ping_pong)
         rrsp=in_ping_rd.Pop();
      else
         rrsp=in_pong_rd.Pop();
      #pragma hls_unroll yes
      for (uint16_t k = 0; k < 2; k++)
         op[k]=rrsp.data[k];
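The reason two banks suffice for this access pattern can be illustrated with a small address-decoding model. The sketch assumes low-order interleaving, where the bank is selected by the address modulo the number of banks. This mapping is an assumption made for illustration and is not taken from this guide. BankAddr, decode, and conflict are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Location of a word inside a banked memory.
struct BankAddr {
    std::size_t bank;  // which bank holds the word
    std::size_t row;   // position of the word inside that bank
};

// Assumed mapping: consecutive addresses rotate across the banks.
BankAddr decode(std::size_t addr, std::size_t num_banks) {
    assert(num_banks > 0);
    return BankAddr{addr % num_banks, addr / num_banks};
}

// Two accesses in the same cycle conflict when they target the same bank.
bool conflict(std::size_t a, std::size_t b, std::size_t num_banks) {
    return decode(a, num_banks).bank == decode(b, num_banks).bank;
}
```

Under this mapping the addresses i and i+1 requested in each iteration always land in different banks when there are two banks, so both words can be returned in a single response. With a single bank the same pair conflicts, which is consistent with the original loop needing two pops and an initiation interval of 2.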
      

Fixed point precision

Given the accelerator data width that you specified in the initialization script, the default fixed-point representation assigns half of the bits to the integer part and half to the fractional part. To change the representation, for example to increase precision for a specific range of numbers, set FPDATA_IL in accelerators/catapult_hls/mac_sysc_catapult/hw/inc/mac_specs.hpp to the desired value.
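The trade-off behind the integer-length setting can be quantified with a short calculation. The sketch below assumes a signed format in which the integer length includes the sign bit, so that a word of W bits with I integer bits spans -2^(I-1) up to 2^(I-1) minus one step, with a step of 2^-(W-I). FixedFormat and the three helper functions are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

// Total word width and integer length (sign bit included, by assumption).
struct FixedFormat {
    int total_bits;
    int int_bits;
};

// Smallest representable increment: 2^-(fractional bits).
double resolution(FixedFormat f) {
    return std::ldexp(1.0, -(f.total_bits - f.int_bits));
}

// Most negative representable value: -2^(int_bits - 1).
double min_value(FixedFormat f) {
    return -std::ldexp(1.0, f.int_bits - 1);
}

// Largest representable value: one increment below 2^(int_bits - 1).
double max_value(FixedFormat f) {
    return std::ldexp(1.0, f.int_bits - 1) - resolution(f);
}
```

With 32 total bits, lowering the integer length from 16 to 8 shrinks the range from roughly ±32768 to roughly ±128 and makes the step 256 times finer. Under these assumptions, a smaller integer length suits data confined to a narrow range.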

3. Accelerator integration

The integration flow is identical for both the C/C++ and the SystemC flows. Please refer to Part 2 of the guide How to: design an accelerator in SystemC.


User application implementation

In this tutorial we select the RISC-V Ariane core and use the corresponding paths to the software source code. Please note, however, that all instructions are valid for Leon3 and Ibex systems, as well.

Both baremetal and Linux test applications for the MAC accelerator are generated at the path <esp>/accelerators/catapult_hls/mac_sysc_catapult/sw. To complete them, you need to apply the same edit to both the baremetal and Linux applications. The changes consist of initializing inputs and golden outputs, similarly to what is done for the SystemC testbench.

Move to the path <esp>/accelerators/catapult_hls/mac_sysc_catapult/sw/baremetal, open mac.c, locate the init_buf() function, and replace its body with the following code.

int i;
int j;
int k = 0; // offset added to the input pattern, incremented per batch item
float out_gold;

// Initialize the inputs with a repeating fixed-point pattern
for (i = 0; i < mac_n; i++) {
    for (j = 0; j < mac_len * mac_vec; j++) {
        float data = ((i * 8 + j + k) % 32) + 0.25;
        token_t data_fxd = float_to_fixed32(data, 16);
        in[i * in_words_adj + j] = data_fxd;
    }
    k++;
}

// Compute the golden outputs from the quantized inputs
for (i = 0; i < mac_n; i++)
    for (j = 0; j < mac_vec; j++) {
        out_gold = 0;
        for (k = 0; k < mac_len; k += 2) {
            float data1 = fixed32_to_float(in[i * in_words_adj + j * mac_len + k], 16);
            float data2 = fixed32_to_float(in[i * in_words_adj + j * mac_len + k + 1], 16);
            out_gold += data1 * data2;
        }
        gold[i * out_words_adj + j] = float_to_fixed32(out_gold, 16);
    }


Now move to <esp>/accelerators/catapult_hls/mac_sysc_catapult/sw/linux/app, open mac.c and replace the body of init_buffer() with the same code shown above.

Note: this code is just a port to C of the C++ code used for the SystemC testbench.
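The index arithmetic in this code strides each batch item by in_words_adj and out_words_adj rather than by the raw data size. This guide does not define those variables. One plausible reading, stated here purely as an assumption, is that they are the data sizes rounded up to a whole number of DMA beats. The sketch below illustrates that reading in C++ with hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of data words carried by one DMA beat.
// Returns 0 when a data word is wider than a beat.
std::size_t words_per_beat(std::size_t dma_width_bits, std::size_t data_bits) {
    return dma_width_bits / data_bits;
}

// Hypothetical helper: round a word count up to a whole number of beats.
std::size_t words_adj(std::size_t n, std::size_t dma_width_bits, std::size_t data_bits) {
    const std::size_t wpb = words_per_beat(dma_width_bits, data_bits);
    if (wpb <= 1)
        return n;  // nothing to align
    return ((n + wpb - 1) / wpb) * wpb;
}

// Flat index of element k of vector j inside batch item i,
// matching the expression used in the initialization code.
std::size_t in_index(std::size_t i, std::size_t j, std::size_t k,
                     std::size_t mac_len, std::size_t in_words_adj) {
    return i * in_words_adj + j * mac_len + k;
}
```

Under this assumption, the default configuration (6400 input words, 32-bit data, 64-bit DMA) needs no padding, while an odd word count such as 5 would be padded to 6.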


SoC configuration

The final steps of the tutorial coincide with those presented in the tutorial about designing a single core SoC. We recommend you review those steps if you are not familiar with ESP.

# Move to the Xilinx VC707 working folder
cd <esp>/socs/xilinx-vc707-xc7vx485t


Follow the “Debug link configuration” instructions from the “How to: design a single-core SoC” guide. Then configure the SoC using the ESP configuration GUI.

# Run the ESP configuration GUI
make esp-xconfig


Select Ariane in the “CPU Architecture” frame and disable the caches from the “Cache configuration” frame. Select a 2x2 layout and set 1 memory tile, 1 processor tile, 1 I/O tile and 1 MAC tile. The implementation for MAC will default to _dma64.


RTL simulation

Users can run a full-system RTL simulation of the MAC accelerator driven by the baremetal application running on the processor tile.

The bare-metal simulation is slow; to shorten it, you may want to reduce the default values of mac_len and mac_vec in the bare-metal C application.

# Compile baremetal application
make mac_sysc_catapult-baremetal

# Modelsim
TEST_PROGRAM=./soft-build/<cpu>/baremetal/mac_sysc_catapult.exe make sim[-gui]

# Incisive
TEST_PROGRAM=./soft-build/<cpu>/baremetal/mac_sysc_catapult.exe make ncsim[-gui]


<cpu> corresponds to ariane because we selected the Ariane core in the “SoC Configuration” step.


FPGA prototyping

Follow the “FPGA prototyping” instructions from the “How to: design a single-core SoC” guide.

The only difference is that, just like for the RTL simulation, you need to specify the TEST_PROGRAM variable when launching the bare-metal test on FPGA:

TEST_PROGRAM=./soft-build/<cpu>/baremetal/mac_sysc_catapult.exe make fpga-run

To test the Linux application, run the following commands after logging into Linux from the serial connection to the ESP instance running on FPGA:

$ cd /applications/test/
$ ./mac_sysc_catapult.exe

====== mac_sysc_catapult.0 ======

  .mac_n = 1
  .mac_vec = 100
  .mac_len = 64

  ** START **
  > Test time: 13575640 ns
    - mac_sysc_catapult.0 time: 1134480 ns

  ** DONE **
+ Test PASSED

====== mac_sysc_catapult.0 ======


FPGA prototyping with prebuilt material

With the provided prebuilt material, you can run the tutorial on FPGA directly. Each packet is marked with the first digits of the Git revision it was created and tested with.

The packet contains the following:

  • The source code, testbench and HLS scripts for the MAC accelerator (accelerators/catapult_hls/mac)
  • The bare-metal test application and the Linux device driver and test application for the MAC accelerator (soft/[ariane|leon3]/drivers/mac)
  • Two working folders for Xilinx VCU118 and Xilinx VC707, each including:
    • The Linux image (linux.bin)
    • The bare-metal application (mac.bin)
    • The boot loader image (prom.bin)
    • The FPGA bitstream (top.bit)
    • The hidden configuration files for the design (.grlib_config and .esp_config)
    • A script to run the design on FPGA (runme.sh)

Decompress the content of the packet from the ESP root folder to make sure all files are extracted to the right location.

cd <esp>
tar xf ESP_SystemcAcc_GitRev.ddaca94.tar.gz


Enter one of the SoC instances extracted from the packet.

cd socs/systemc_acc_vc707


Follow the “UART interface” instructions from the “How to: design a single-core SoC” guide, then launch the runme.sh script.

# Execute baremetal test
./runme.sh mac
# Boot Linux
./runme.sh


Finally, from the ESP Linux terminal, run the MAC test application.

$ cd /applications/test/
$ ./mac.exe

