With the recent release of High Bandwidth Memory (HBM) based FPGA boards, developers can now exploit unprecedented external memory bandwidth. This allows more memory-bound applications to benefit from FPGA acceleration. However, we found that it is not easy to fully utilize the available bandwidth when developing some applications with high-level synthesis (HLS) tools. This is due to the limitations of existing HLS tools when accessing the large number of independent external memory channels on HBM boards. In this paper, we measure the performance of three recent representative HBM FPGA boards (Intel's Stratix 10 MX and Xilinx's Alveo U50/U280 boards) with microbenchmarks and analyze the HLS overhead. Next, we propose HLS-based optimization techniques to improve the effective bandwidth when a processing element (PE) accesses multiple HBM channels or when multiple PEs access a single HBM channel. Our experiments demonstrate that the effective bandwidth improves by 2.4X-3.8X. We also provide a list of insights for future improvement of the HBM FPGA HLS design flow.
High Bandwidth Memory (HBM) provides massive aggregated memory bandwidth by exposing multiple memory channels to the processing units. To achieve high performance, an accelerator built on top of an FPGA configured with HBM (i.e., FPGA-HBM platform) n
Even with generational improvements in DRAM technology, memory access latency still remains the major bottleneck for application accelerators, primarily due to limitations in memory interface IPs which cannot fully account for variations in target ap
FPGAs have become emerging computing infrastructures for accelerating applications in datacenters. Meanwhile, high-level synthesis (HLS) tools have been proposed to ease the programming of FPGAs. Even with HLS, irregular data-intensive applications r
Hybrid memory systems, comprised of emerging non-volatile memory (NVM) and DRAM, have been proposed to address the growing memory demand of applications. Emerging NVM technologies, such as phase-change memories (PCM), memristor, and 3D XPoint, have h
This paper presents the FPGA hardware design of a turbo decoder for the cdma2000 standard. The work includes a study and mathematical analysis of the turbo decoding process, based on the MAX-Log-MAP algorithm. Results of decoding for a packet size of