Guide – How to: design a partially-reconfigurable SoC

Latest update: 2023-04-17

Dynamic partial reconfiguration (DPR) makes it possible to modify a portion of the FPGA logic at runtime, without reloading the full bitstream. In the context of ESP, DPR allows accelerators to be swapped on the fly. ESP accelerator tiles are partially reconfigurable, while all other types of tiles form the static part of the design.

DPR also enables incremental compilation of ESP SoC designs during iterative FPGA implementations, where only some of the accelerators of the SoC are modified in each iteration. The incremental flow detects the newly modified reconfigurable tiles in the SoC design and generates new partial bitstreams (PBS) only for them, which reduces design recompilation time.

This guide introduces the ESP DPR FPGA flow and runtime swapping of accelerators using baremetal and Linux applications. It also illustrates the incremental compilation for accelerators inside ESP SoCs.

The DPR flow has been fully tested on the Xilinx VC707 evaluation board. Support for the VCU118 and VCU128 targets is still in development.

1. DPR on a single accelerator SoC
2. Runtime reconfiguration
- Baremetal applications
- Baremetal Application Skeleton

1. DPR on a single accelerator SoC

The first part of the tutorial covers the DPR flow with a default SoC configuration, which consists of a 2x2 mesh with one processor tile, one memory tile, one auxiliary tile, and a single accelerator tile. This guide illustrates DPR using the Adder accelerator from the ESP repo and the MAC accelerator that was designed using the C/C++ Vivado HLS flow. Both accelerators can be found inside <esp>/accelerators/vivado_hls/. ESP supports DPR for SoCs with Leon3 or Ariane processor cores.

This tutorial assumes that system designers are familiar with the ESP infrastructure and the various ESP accelerator design flows, and know how to run the basic make targets to create a simple instance of ESP that integrates a single core. Please make sure the DPR environment is also set up correctly before following this guide.

The DPR flow is executed in two phases. In the first phase we create an SoC with an instance of the MAC accelerator and implement it on the FPGA. The first phase compiles the entire SoC (the CPU, memory, and I/O tiles together with the MAC accelerator). This phase ends by generating a full bitstream for the entire SoC and a partial bitstream for the MAC accelerator.

The second phase is an incremental compilation of the design, in which the MAC is replaced with the Adder accelerator and the design is reimplemented. Unlike the first phase, the incremental compilation only compiles the Adder accelerator. The same holds for any subsequent implementation that modifies only the accelerator tile.

HLS and SoC configuration

For this tutorial we target the popular Xilinx VC707 evaluation board, which is based on a Virtex-7 FPGA (XC7VX485T).

# Move to the Xilinx VC707 working folder
cd <esp>/socs/xilinx-vc707-xc7vx485t

Then generate the RTL for both accelerators inside the design folder using the following make targets.

# Generate the RTL for both accelerators
make adder_vivado-hls
make mac_vivado-hls 

When the RTL generation finishes, the SoC is first configured with an instance of MAC accelerator. The SoC configuration can be visualized and modified with the ESP configuration GUI:

make esp-xconfig

Note: Make sure you check the Enable Partial-reconf button on the GUI.

FPGA implementation

The FPGA compilation of the SoC is fully automated using a single make target.

make vivado-syn-dpr

This make target starts off by replacing the accelerators inside the static part with black-box instances and then synthesizes the static part and the reconfigurable accelerator RTL using separate Vivado instances. It then automates the DPR floorplanning using the open-source tool FLORA. Finally, it performs place and route (P&R) and bitstream generation by plugging the synthesized accelerator netlist into the static netlist.

At the end of this step, the full bitstream of the design is located at <esp>/socs/xilinx-vc707-xc7vx485t/vivado_dpr/Bitstreams/acc_bs.bit, while the PBS for the MAC accelerator is located inside <esp>/socs/xilinx-vc707-xc7vx485t/partial_bitstreams/mac_vivado_<tile_id>.bin.

# List of files generated and modified by the DPR flow
<esp>/socs/<soc_design_folder>
├── partial_bitstreams
│   └──  mac_vivado_<tile_id>.bin  # accelerator PBS
│        
├── socgen             
│   └── esp            
│       └── pbs_map.h  #PBS descriptor
│               
└── vivado_dpr
    ├── Synth       #folder for synthesized RTL of the design
    │   ├── Static              #synthesized static netlist
    │   └── mac_vivado_<tile_id>     #synthesized acc netlist  
    │   
    ├── Implement    
    │   └── top_dpr  #routed static part  
    │   
    └── Bitstreams
        └── acc_bs.bit  #full bitstream of the design

The flow also generates a PBS descriptor (a struct named bs_descriptor) inside <esp>/socs/xilinx-vc707-xc7vx485t/socgen/esp/pbs_map.h that enables software to manage PBS at runtime. The descriptor contains specific information about the name, size, and tile_id of the PBS.

#define LEN_DEVNAME_MAX 32
typedef struct pbs_map {
    char name[LEN_DEVNAME_MAX];  //name of the accelerator
    unsigned pbs_size;           //size of the PBS in bytes
    unsigned long long pbs_addr; //PBS address offset in memory
    unsigned pbs_tile_id;        //accelerator tile_id
} pbs_map;
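
To illustrate how software might consume this descriptor, the following is a minimal sketch that searches a table of pbs_map entries for a given accelerator name and tile. The table contents (sizes, address offsets, tile ids), the accelerator name strings, and the find_pbs helper are hypothetical and exist only for illustration; the real table is generated by the flow in pbs_map.h and may be organized differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LEN_DEVNAME_MAX 32

typedef struct pbs_map {
    char name[LEN_DEVNAME_MAX];   /* name of the accelerator */
    unsigned pbs_size;            /* size of the PBS in bytes */
    unsigned long long pbs_addr;  /* PBS address offset in memory */
    unsigned pbs_tile_id;         /* accelerator tile_id */
} pbs_map;

/* Hypothetical table: names, sizes, offsets, and tile ids are made up. */
static const pbs_map example_table[] = {
    { "adder_vivado", 0x1000, 0x0000ULL, 1 },
    { "mac_vivado",   0x1800, 0x1000ULL, 1 },
};
#define EXAMPLE_TABLE_LEN (sizeof(example_table) / sizeof(example_table[0]))

/* Hypothetical helper: return the index of the first entry that matches
 * both the accelerator name and the tile id, or -1 if there is none. */
static int find_pbs(const pbs_map *table, size_t len,
                    const char *name, unsigned tile_id)
{
    assert(table != NULL && name != NULL);
    for (size_t i = 0; i < len; i++) {
        if (table[i].pbs_tile_id == tile_id &&
            strncmp(table[i].name, name, LEN_DEVNAME_MAX) == 0)
            return (int)i;
    }
    return -1;
}
```

With this example table, find_pbs(example_table, EXAMPLE_TABLE_LEN, "mac_vivado", 1) selects the second entry, and a lookup for a tile that has no PBS returns -1.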

Incremental compilation

Following the successful implementation of the first SoC instance, subsequent accelerators are compiled using the incremental flow. To do so, MAC is replaced with Adder in the accelerator tile and the design is recompiled.

Once again, launch the ESP configuration GUI and replace MAC with Adder inside the accelerator tile. Then run the make target for an incremental compilation:

# Make target for incremental compilation
make vivado-syn-dpr-acc

Note: The make target for incremental DPR compilation is slightly different from the one for a full DPR compilation.

The incremental FPGA flow performs the following:

- Compares the old and the new SoC configurations to detect the newly modified accelerator tiles.
- Performs RTL synthesis only for the modified accelerators.
- Checks whether the old accelerator tile contains enough resources for the new accelerator.
  - Best case: if there are enough resources, it performs P&R only on the new accelerator, on top of the previously routed static part, and generates PBS for the new accelerators.
  - Worst case: if there are not enough resources, it creates a new floorplan for all accelerator tiles (modified and unmodified) and performs P&R on the full SoC (including the static part).

Note: Even in the worst case, the incremental compilation reduces the compilation time because the static part and the unmodified accelerators are not re-synthesized.
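
The decision logic of the incremental flow described above can be sketched in C as follows. This is only an illustration of the reasoning, not the code of the ESP flow, which works on the SoC configuration and Vivado reports. The tile_cfg record, the use of a single LUT count as the resource measure, and the plan_incremental function are all assumptions introduced for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ACC_NAME_LEN 32

/* Hypothetical per-tile record used only in this sketch. */
typedef struct tile_cfg {
    char acc_name[ACC_NAME_LEN]; /* accelerator assigned to the tile */
    unsigned region_luts;        /* resources of the tile's floorplan region (assumed metric) */
    unsigned needed_luts;        /* resources required by the assigned accelerator */
} tile_cfg;

typedef enum {
    PLAN_NOTHING,             /* no accelerator tile changed */
    PLAN_PNR_ACC_ONLY,        /* best case: P&R of the new accelerators only */
    PLAN_REFLOORPLAN_FULL_PNR /* worst case: new floorplan and P&R of the full SoC */
} plan_t;

/* Compare the old and new configurations tile by tile. Sets modified[i]
 * to 1 for every tile whose accelerator changed (these are the only ones
 * that need RTL synthesis) and returns the implementation plan. */
static plan_t plan_incremental(const tile_cfg *old_cfg, const tile_cfg *new_cfg,
                               size_t n, int *modified)
{
    assert(old_cfg != NULL && new_cfg != NULL && modified != NULL);
    plan_t plan = PLAN_NOTHING;
    for (size_t i = 0; i < n; i++) {
        modified[i] = strncmp(old_cfg[i].acc_name, new_cfg[i].acc_name,
                              ACC_NAME_LEN) != 0;
        if (!modified[i])
            continue;
        if (new_cfg[i].needed_luts > old_cfg[i].region_luts)
            plan = PLAN_REFLOORPLAN_FULL_PNR; /* new accelerator does not fit */
        else if (plan == PLAN_NOTHING)
            plan = PLAN_PNR_ACC_ONLY;         /* fits in the old region */
    }
    return plan;
}
```

A single tile that no longer fits is enough to force the worst-case plan, which matches the description above: the floorplan is then recreated for all accelerator tiles.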

This step finishes by adding the newly generated Adder PBS to <esp>/socs/xilinx-vc707-xc7vx485t/partial_bitstreams/ and updating the PBS descriptor inside <esp>/socs/xilinx-vc707-xc7vx485t/socgen/esp/pbs_map.h.

2. Runtime reconfiguration

The auxiliary tile integrates a partial reconfiguration controller (PRC) IP from Xilinx that is responsible for loading accelerator PBS at runtime. The ESP software stack provides baremetal and Linux device driver APIs to control the reconfiguration from software. The drivers are also seamlessly compiled together with the applications.

Enabling DPR requires minimal modification to the source code of the baremetal or Linux applications that are generated in the various accelerator design flows.

Baremetal applications

To swap accelerators at runtime, baremetal applications need to include a single header file and make a single driver API call.

The reconfiguration is triggered using the reconfigure_FPGA function call from the driver API:

int reconfigure_FPGA(struct esp_device *dev, int pbs_id);

At a high level, the function initializes and configures the PRC based on the PBS descriptor, decouples the accelerator tile, and finally starts the reconfiguration. The return value indicates whether the reconfiguration succeeded.
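
Because the return value reports the outcome, an application may want to check it before invoking the accelerator. The sketch below shows one way to do this. It compiles without the ESP headers by using stand-ins: the struct esp_device layout and the body of reconfigure_FPGA are stubs, and load_pbs_checked is a hypothetical helper. It also assumes that 0 means success and any other value means failure; this convention is not stated in this guide and should be checked against prc_utils.h.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-in for the ESP device type so that the sketch is self-contained. */
struct esp_device { unsigned tile_id; };

/* Stub with the signature shown in this guide. ASSUMPTION: 0 means
 * success and nonzero means failure. The stub fails only for the
 * pbs_id stored in stub_fail_pbs_id, to exercise the error path. */
static int stub_fail_pbs_id = -1;

int reconfigure_FPGA(struct esp_device *dev, int pbs_id)
{
    (void)dev;
    return (pbs_id == stub_fail_pbs_id) ? -1 : 0;
}

/* Hypothetical helper: load a PBS and report a failure, so that the
 * caller does not invoke an accelerator that was never loaded. */
static int load_pbs_checked(struct esp_device *dev, int pbs_id)
{
    assert(dev != NULL);
    int rc = reconfigure_FPGA(dev, pbs_id);
    if (rc != 0)
        fprintf(stderr, "reconfiguration with PBS %d failed (rc=%d)\n",
                pbs_id, rc);
    return rc;
}
```

In an application, the accelerator would be invoked only when load_pbs_checked returns 0 under the assumed convention.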

The following is a skeleton of an example baremetal application that can be used to switch between the Adder and MAC accelerators that were designed in this guide. The full C source of the application can be found in <esp>/soft/common/apps/baremetal/dpr_multi_acc/dpr_multi_acc.c

Baremetal Application Skeleton

#include "prc_utils.h"  //Contains DPR-related declarations and definitions

#define DEV_NAME_ADDER "sld,adder_vivado"  //name of the adder accelerator in ESP
#define DEV_NAME_MAC "sld,mac_vivado"  //name of the mac accelerator in ESP

int main(int argc, char * argv[])
{
    struct esp_device *dev_tile_1;

    //Probe the tile
    probe(&dev_tile_1, VENDOR_SLD, SLD_ACC_TILE_1, DEV_NAME_ADDER);

    //Load the adder PBS
    reconfigure_FPGA(dev_tile_1, 0);

    /*<invoke Adder accelerator and wait until done>*/

    //Load the MAC PBS
    reconfigure_FPGA(dev_tile_1, 1);

    /*<invoke MAC accelerator and wait until done>*/

    return 0;
}