Guide – How to: use multicasting via P2P communication
Latest update: 2025-01-08
This guide illustrates how to create an instance of ESP with multiple dummy accelerators and how to develop a multicasting application that delivers data from the producer to the consumers via P2P communication, using both a baremetal application and the ESP API. We will focus on how to reconfigure the coherence model and packetization at run time for the producer invocation and how to set up multicasting communication.
- 1. Multi-dummy-accelerator SoC
- 2. ESP user-space baremetal application
- 3. ESP user-space API
- 4. FPGA execution
For this tutorial, we assume that you are already familiar with the ESP build infrastructure and at least with the steps presented in the guide How to: design a single-core SoC. For better understanding, we recommend that you read the guide How to: design an accelerator in SystemC as well.
In this guide and in the corresponding prebuilt material we use the example application dummy_multicast_p2p, which drives 17 identical dummy accelerators (1 producer and up to 16 consumers). This example invokes multicasting multiple times to confirm its validity, where each test is performed with a different source/destination configuration:
- Dummy Multicast P2P - Single producer multicasting to 16 consumers via non-coherent (or LLC-coherent) DMA P2P communication
- Dummy Multicast P2P Packetized - Three producers simultaneously multicasting to 5, 5, and 4 consumers using packetization via non-coherent (or LLC-coherent) DMA P2P communication
- Dummy Multicast YX - JZ
1. Multi-dummy-accelerator SoC
Generate accelerator RTL
Enter a working folder for one of the supported FPGAs.
cd <esp>/socs/xilinx-vcu128-xcvu37p
This tutorial leverages the dummy accelerator, which is designed in SystemC and
implemented with Cadence Stratus HLS.
Generate the dummy accelerator for the target FPGA by running the HLS target
make dummy_stratus-hls
SoC configuration
Once HLS completes, the ESP configuration GUI will find the RTL available in four flavors: 64, 128, 256, and 512 bits. Please use the implementation that matches the NoC bitwidth, which is the default configuration in the GUI.
Open the ESP configuration GUI.
make esp-xconfig
Select a 5x4 NoC configuration and add 17 dummy accelerators, one Ariane processor core, one I/O tile, and one memory tile. Then set “DMA NoC Planes (4,6) Bitwidth” to at least 256 bits, because we have up to 16 multicast destinations: bitwidths of 64, 128, 256, and 512 bits support up to 4, 11, 25, and 32 multicast destinations, respectively. To enable multicasting, check “Enable Multicast on DMA Planes.” This option inserts multicast routers instead of unicast routers. Then select 16 in the “Maximum Multicast Destinations:” dropdown.
For packetization, the NoC router FIFO queue depth must be an integer multiple of the packet size to guarantee deadlock-free simultaneous multicasting. A deeper FIFO queue improves performance, while a shallower one saves area; the default depth of 4 is recommended.
In order to test the coherence reconfiguration, we must enable the ESP cache hierarchy by checking “Use Caches” in the cache configuration tab. For faster logic synthesis and to save one HLS run, we suggest using the “SystemVerilog” implementation of the caches.
For your convenience, the ESP configuration file corresponding to this tutorial is available at the path <esp>/soft/common/apps/examples/dummy_multicast_p2p/esp_config. You can apply this configuration by overwriting the hidden configuration file in the current working folder and then configuring ESP in batch mode.
cp <esp>/soft/common/apps/examples/dummy_multicast_p2p/esp_config <esp>/socs/xilinx-vcu128-xcvu37p/socgen/esp/.esp_config
make esp-config
Now you may open the configuration GUI and double check all settings.
Finally, you may launch logic synthesis with Vivado.
# From a clean working folder
make vivado-syn
2. ESP user-space baremetal application
Example baremetal application - single multicast
The example baremetal application is located at the path <esp>/soft/common/apps/baremetal/dummy_multicast_p2p. First, open and analyze the C file dummy_multicast_p2p.c.
This file includes all run-time configurations necessary to execute baremetal multicasting: TOKENS, BATCH, mask, MCAST_PACKET, MCAST_PACKET_SIZE, NDESTS, PRODUCER_ACC_NUM, and COHERENCE.
TOKENS indicates how many 64-bit data words the producer multicasts to the consumers in one packet. BATCH indicates how many packets the producer multicasts to the consumers; every consumer-producer synchronization is followed by one packet multicast. MCAST_PACKET indicates whether the multicast is one long packet or multiple packetized smaller packets. MCAST_PACKET_SIZE indicates how many flits (including head and tail) are in one packetized packet. NDESTS indicates the number of multicast destinations. PRODUCER_ACC_NUM indicates the producer accelerator number for the multicast. COHERENCE indicates the cache coherence mode.
The accelerator number starts from 0 (top-left tile) and increases to the right first, then toward the bottom: the bottom-right tile has the highest accelerator number.
Closely examine the flow of the baremetal application. We have implemented the multicast application by allocating memory with aligned_malloc, populating the page table with ptable, initializing data in the memory buffers with init_buf, setting up each accelerator for multicast via P2P with p2p_setup, initiating multicasting via the status configuration registers with CMD_MASK_START, and validating the memory buffer with validate_dummy for every accelerator used during multicasting.
Enter the appropriate FPGA working folder and compile the baremetal application.
cd <esp>/socs/xilinx-vcu128-xcvu37p
make dummy_multicast_p2p-baremetal
After the baremetal compilation, you can observe the RTL behavior.
TEST_PROGRAM=./soft-build/<cpu_name>/baremetal/dummy_multicast_p2p.exe make sim-gui
When done, the baremetal application should print either “PASS” or “FAIL” in the RTL simulation.
Please refer to our TBD 2025 paper to get information about accelerator multicasting in ESP.
Please refer to our ASPDAC 2019 paper to get information about accelerator coherence in ESP.
The parallel configuration selects a different coherence model for each accelerator and then runs them all in parallel. All dummy accelerators operate on a different region in memory on independent data sets.
Note: when using multicasting, which uses point-to-point communication, all accelerators in the chain must be configured to use the same coherence model. This can be non coherent, LLC coherent, or coherent with recalls, but not fully coherent. Point-to-point communication occurs without involving the cache hierarchy, thus the fully-coherent configuration cannot be used. Please refer to the DATE 2020 paper and the ASPDAC 2019 paper for more details.
Example baremetal application - packetized multicast
The example baremetal application is located at the path
<esp>/soft/common/apps/baremetal/dummy_multicast_p2p_packetized
.
First you should open and analyze the c file dummy_multicast_p2p_packetized.c
.
This file includes the same run-time configuration parameters as the single-multicast application above (TOKENS, BATCH, mask, MCAST_PACKET, MCAST_PACKET_SIZE, NDESTS, PRODUCER_ACC_NUM, and COHERENCE), with the same meanings, and the accelerator numbering follows the same tile order.
Closely examine the flow of the baremetal application. Now, each producer is configured with its own set of consumers, and the application tests every producer-consumer configuration in turn. With packetization, all 14 consumers initiate data requests to the producers and the three producers multicast simultaneously.
Enter the appropriate FPGA working folder and compile the baremetal application.
cd <esp>/socs/xilinx-vcu128-xcvu37p
make dummy_multicast_p2p_packetized-baremetal
After the baremetal compilation, you can observe the RTL behavior.
TEST_PROGRAM=./soft-build/<cpu_name>/baremetal/dummy_multicast_p2p_packetized.exe make sim-gui
When done, the baremetal application should print either “PASS” or “FAIL” in the RTL simulation.
Please refer to our TBD 2025 paper to get information about accelerator multicasting in ESP.
Please refer to our ASPDAC 2019 paper to get information about accelerator coherence in ESP.
The configuration selects a coherence model for all accelerators. All dummy accelerators operate on a different region in memory on independent data sets.
Note: when using multicasting, which uses point-to-point communication, all accelerators in the chain must be configured to use the same coherence model. This can be non coherent, LLC coherent, or coherent with recalls, but not fully coherent. Point-to-point communication occurs without involving the cache hierarchy, thus the fully-coherent configuration cannot be used. Please refer to the DATE 2020 paper and the ASPDAC 2019 paper for more details.
3. ESP user-space API
Example application - single multicast
The example application is located at the path <esp>/soft/common/apps/examples/dummy_multicast_p2p. First, open and analyze the configuration header file dummy_multicast_p2p_cfg.h. This file includes all run-time configurations that we wish to execute using the ESP API. Each configuration is defined as an esp_thread_info_t array whose size corresponds to the number of accelerators involved in a particular invocation. The esp_desc field points to a structure containing the configuration parameters for the dummy accelerator: struct dummy_stratus_access.
For instance, the configuration dummy_cfg_p2p lists all 17 dummy accelerators, with devnames dummy_stratus.0, dummy_stratus.1, through dummy_stratus.16, respectively. This configuration runs all 17 accelerators concurrently. All accelerators have the same coherence model: non-coherent or LLC-coherent. Please refer to our ASPDAC 2019 paper to get information about accelerator coherence in ESP.
The configuration dummy_cfg_p2p configures the accelerators to operate in a chain by using point-to-point communication. The producer dummy, dummy_stratus.producer_acc_num, reads data from memory (.src_offset = producer_acc_num * TOKENS * BATCH * sizeof(token_t) in the structure the esp_desc field points to), but then sends the result directly to the other accelerators (.esp.p2p_store = 1). The other dummies, dummy_stratus.0 - 16, leverage point-to-point communication for their read operations: instead of requesting data from memory, each requests data from dummy_stratus.producer_acc_num (.esp.p2p_srcs = {"dummy_stratus.producer_acc_num", "", "", ""}) and stores the results back to memory. Note that ESP supports a maximum fan-in of 4 accelerators for point-to-point communication. For single multicast, packetization is disabled (mcast_packet = 0).
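Putting those fields together, a single consumer entry might look like the sketch below. The field names are the ones discussed above, but the surrounding structure is abridged and the values are illustrative, so treat this as pseudocode rather than a copy of the actual header:

```c
/* Abridged sketch of one consumer's dummy_stratus_access descriptor,
 * using the fields discussed above; not a copy of the real header. */
struct dummy_stratus_access consumer_desc = {
    .src_offset    = 0,   /* unused here: input arrives via P2P, not memory */
    .esp.p2p_store = 0,   /* consumers write their results back to memory */
    .esp.p2p_srcs  = {"dummy_stratus.producer_acc_num", "", "", ""},
                          /* P2P fan-in of at most 4 sources */
    .mcast_packet  = 0,   /* packetization off for single multicast */
};
```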
Note: when using multicasting, which uses point-to-point communication, all accelerators in the chain must be configured to use the same coherence model. This can be non coherent, LLC coherent, or coherent with recalls, but not fully coherent. Point-to-point communication occurs without involving the cache hierarchy, thus the fully-coherent configuration cannot be used. Please refer to the DATE 2020 paper and the ASPDAC 2019 paper for more details.
The main application dummy_multicast_p2p.c is based on the multicast unit test. The application multicasts from a configurable producer to a configurable number of accelerators using the macros defined in the header file. We have extended the application by repeating the input data initialization, the accelerator invocation, and the output validation for every configuration present in the header file.
The application reuses the same memory buffer for all configurations; that is, we only call esp_alloc() once. Depending on the test, we prepare the data for all 17 dummy accelerators. After multicasting, the data saved in memory are compared against the golden output and checked for errors.
To run a configuration, we simply need to call esp_run(<cfg>, <number of accelerators>). The ESP library spawns a thread for each accelerator, copies the configuration into the accelerators' registers, and runs them in parallel. Once all threads terminate, esp_run() returns to the caller, which can safely run validation.
Example application - packetized multicast
The example application is located at the path
<esp>/soft/common/apps/examples/dummy_multicast_p2p_packetized
.
The header file is almost identical to the dummy_multicast_p2p
example in the previous section. Now, consumers are configured to
three different producers. Also, because there are three different
producers, packetization parameters are turned on, with the
packetization packet size matching the size of the DMA NoC plane
FIFO queue depth.
The main application dummy_multicast_p2p_packetized.c is nearly identical to dummy_multicast_p2p.c in the previous example. Now, the functions init_buffer() and validate_buffer() take an offset_num argument to make sure that each producer multicasts a different set of data, which validates the NoC's functionality. These functions are called once for each producer.
4. FPGA execution
Baremetal
Once logic synthesis has completed, program the FPGA.
# Program FPGA
make fpga-program
Open a UART interface and make sure that the esplink application can reach the FPGA (How to: design a single-core SoC).
Now you can run the baremetal application on the FPGA.
TEST_PROGRAM=./soft-build/<cpu_name>/baremetal/dummy_multicast_p2p.exe make fpga-run
Now, in the UART interface, you will see whether each test configuration passed or failed.
For the packetized multicast, use the following command.
TEST_PROGRAM=./soft-build/<cpu_name>/baremetal/dummy_multicast_p2p_packetized.exe make fpga-run
The packetized application will iterate through all possible sets of the producer-consumer configuration array for three sources and print whether the test passed or failed.
Linux
Once logic synthesis has completed, compile Linux, the example application and prepare the bootable image.
# Compile Linux, drivers and libraries
make linux
# Compile example application
make examples
# Update Linux image
make linux
Open a UART interface and make sure that the esplink application can reach the FPGA (How to: design a single-core SoC).
Now you can program the FPGA and run the Linux image
make fpga-run-linux
Once Linux boots, login as root, using the password openesp. Then, run the test application. Press enter when prompted.
cd /examples/dummy_multicast_p2p
./dummy_multicast_p2p.exe
Feel free to change the application and test even more configurations of the 17 dummy accelerators. After Linux boots, you can upload a new application using scp!
FPGA prototyping with prebuilt material
With the provided prebuilt material, you can run the tutorial on FPGA directly. The package is marked with the first digits of the Git revision it was created and tested with.
The package contains the following:
- Working folder for Xilinx VCU128, each including:
  - The Linux image (linux.bin)
  - The boot loader image (prom.bin)
  - The FPGA bitstream (top.bit)
  - The hidden configuration files for the design (.grlib_config and .esp_config)
  - A script to run the design on FPGA (runme.sh)
Decompress the content of the package from the ESP root folder to make sure all files are extracted to the right location.
cd <esp>
tar xf ESP_MultiAcc_GitRev.5f0f335.tar.gz #TBD
Enter one of the SoC instances extracted from the package.
cd socs/multicast_vcu128
Follow the “UART interface” instructions from the “How to: design a single-core SoC” guide, then launch the runme.sh script.
./runme.sh
Once Linux boots, login as root, using the password openesp. Then, run the test application. Press enter when prompted.
cd /examples/dummy_multicast_p2p
./dummy_multicast_p2p.exe
Similarly, test packetized multicast with:
cd /examples/dummy_multicast_p2p_packetized
./dummy_multicast_p2p_packetized.exe