
Optimizing Memory Performance of Xilinx FPGAs under Vitis

Added by Ruoshi Li
Publication date: 2020
Language: English





Plenty of research effort has been devoted to FPGA-based acceleration, due to its low latency and high energy efficiency. However, programming FPGAs in low-level hardware description languages such as Verilog generally requires good knowledge of hardware design details and hands-on experience. Fortunately, the FPGA community has been working to address this low-programmability issue. For example, Xilinx released the Vitis unified software platform with the intention of making programming FPGAs as easy as programming GPUs. Even though Vitis is proven to increase programmability, high performance cannot be obtained directly without careful design of the hardware pipeline and memory subsystem. In this paper, we focus on the memory subsystem, comprehensively and systematically benchmarking the effect of optimization methods on memory performance. Building on the benchmark results, we quantitatively analyze the typical memory access patterns of a broad range of applications, including AI, HPC, and databases. Further, we provide the corresponding optimization direction for each memory access pattern so as to improve overall performance.
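To make the optimization space concrete, below is a minimal Vitis HLS sketch of two techniques this kind of memory benchmarking typically covers: widening the AXI master ports to 512 bits and pipelining the access loop so the HLS compiler can infer burst transfers. The kernel name and sizes are illustrative, not taken from the paper; the pragmas themselves are standard Vitis HLS directives.

```cpp
// Hypothetical Vitis HLS kernel: wide ports + pipelined loop.
#include <ap_int.h>

extern "C" void vadd(const ap_uint<512>* in0,  // 512-bit wide reads
                     const ap_uint<512>* in1,
                     ap_uint<512>* out,
                     int n512) {               // count of 512-bit words
// Separate bundles give each pointer its own AXI master channel.
#pragma HLS INTERFACE m_axi port=in0 offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=in1 offset=slave bundle=gmem1
#pragma HLS INTERFACE m_axi port=out offset=slave bundle=gmem2

    for (int i = 0; i < n512; ++i) {
#pragma HLS PIPELINE II=1  // consecutive accesses enable burst inference
        ap_uint<512> a = in0[i];
        ap_uint<512> b = in1[i];
        ap_uint<512> c;
        // Add the sixteen packed 32-bit lanes independently.
        for (int w = 0; w < 16; ++w) {
#pragma HLS UNROLL
            unsigned aw = a.range(32 * w + 31, 32 * w).to_uint();
            unsigned bw = b.range(32 * w + 31, 32 * w).to_uint();
            c.range(32 * w + 31, 32 * w) = aw + bw;
        }
        out[i] = c;
    }
}
```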



Related research

FPGA-based data processing in datacenters is increasing in popularity due to the demands of modern workloads and the ensuing necessity for specialization in hardware. Driven by this trend, vendors are rapidly adapting reconfigurable devices to suit data- and compute-intensive workloads. Inclusion of High Bandwidth Memory (HBM) in FPGA devices is a recent example. HBM promises to overcome the bandwidth bottleneck often faced by FPGA-based accelerators due to their throughput-oriented design. In this paper, we study the usage and benefits of HBM on FPGAs from a data analytics perspective. We consider three workloads that are often performed in analytics-oriented databases and implement them on an FPGA, showing in which cases they benefit from HBM: range selection, hash join, and stochastic gradient descent for linear model training. We integrate our designs into a columnar database (MonetDB) and show the trade-offs arising from the integration related to data movement and partitioning. In certain cases, FPGA+HBM-based solutions are able to surpass the highest performance provided by either a 2-socket POWER9 system or a 14-core Xeon E5 by up to 1.8x (selection), 12.9x (join), and 3.2x (SGD).
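As a hedged illustration of the simplest of the three workloads, here is a range-selection kernel sketch whose ports can be assigned to distinct HBM pseudo-channels at link time (kernel and argument names are hypothetical; the sp mapping is the standard Vitis connectivity option, not this paper's exact configuration):

```cpp
// Hypothetical range-selection kernel over a columnar input.
// Each argument can be mapped to its own HBM pseudo-channel in a
// v++ config file, e.g.:
//   [connectivity]
//   sp=range_select_1.keys:HBM[0]
//   sp=range_select_1.match:HBM[1]
extern "C" void range_select(const int* keys,  // input column
                             char* match,      // output selection flags
                             int lo, int hi, int n) {
#pragma HLS INTERFACE m_axi port=keys  offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=match offset=slave bundle=gmem1
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1  // streaming scan, one key per cycle
        int k = keys[i];
        match[i] = (k >= lo && k <= hi) ? 1 : 0;
    }
}
```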
Cutting-edge embedded system applications, such as self-driving cars and unmanned drone software, are reliant on integrated CPU/GPU platforms for their DNN-driven workloads, such as perception and other highly parallel components. In this work, we set out to explore the hidden performance implications of GPU memory management methods on integrated CPU/GPU architectures. Through a series of experiments on micro-benchmarks and real-world workloads, we find that the performance under different memory management methods may vary according to application characteristics. Based on this observation, we develop a performance model that can predict system overhead for each memory management method based on application characteristics. Guided by the performance model, we further propose a runtime scheduler. By conducting per-task memory management policy switching and kernel overlapping, the scheduler can significantly relieve system memory pressure and reduce multitasking co-run response time. We have implemented and extensively evaluated our system prototype on the NVIDIA Jetson TX2, Drive PX2, and Xavier AGX platforms, using both the Rodinia benchmark suite and two real-world case studies of drone software and autonomous driving software.
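For readers unfamiliar with the design space, the sketch below enumerates the memory management methods typically compared on integrated CPU/GPU platforms, using only standard CUDA runtime calls. The switching heuristic at the end is a placeholder of my own; the paper derives the choice from a fitted performance model, not this rule of thumb.

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdlib>

// Three common GPU memory management methods on integrated SoCs.
enum class MemPolicy { DeviceCopy, Unified, ZeroCopy };

void* alloc_buffer(size_t bytes, MemPolicy p, void** dev_view) {
    void* host = nullptr;
    switch (p) {
    case MemPolicy::DeviceCopy:    // explicit cudaMemcpy staging later
        cudaMalloc(dev_view, bytes);
        host = std::malloc(bytes);
        break;
    case MemPolicy::Unified:       // single migrating address space
        cudaMallocManaged(&host, bytes);
        *dev_view = host;
        break;
    case MemPolicy::ZeroCopy:      // GPU accesses pinned host memory
        cudaHostAlloc(&host, bytes, cudaHostAllocMapped);
        cudaHostGetDevicePointer(dev_view, host, 0);
        break;
    }
    return host;
}

// Placeholder policy (NOT the paper's model): data reused across
// kernels favors a resident copy; single-pass streaming favors
// zero-copy, since integrated SoCs share physical DRAM anyway.
MemPolicy choose_policy(bool data_reused_across_kernels) {
    return data_reused_across_kernels ? MemPolicy::Unified
                                      : MemPolicy::ZeroCopy;
}
```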
The proliferation of fast, dense, byte-addressable nonvolatile memory suggests that data might be kept in pointer-rich in-memory format across program runs and even process and system crashes. For full generality, such data requires dynamic memory allocation, and while the allocator could in principle be rolled into each data structure, it is desirable to make it a separate abstraction. Toward this end, we introduce recoverability, a correctness criterion for persistent allocators, together with a nonblocking allocator, Ralloc, that satisfies this criterion. Ralloc is based on the LRMalloc of Leite and Rocha, with three key innovations. First, we persist just enough information during normal operation to permit correct reconstruction of the heap after a full-system crash. Our reconstruction mechanism performs garbage collection (GC) to identify and remedy any failure-induced memory leaks. Second, we introduce the notion of filter functions, which identify the locations of pointers within persistent blocks to mitigate the limitations of conservative GC. Third, to allow persistent regions to be mapped at an arbitrary address, we employ position-independent (offset-based) pointers for both data and metadata. Experiments show Ralloc to be performance-competitive with both Makalu, the state-of-the-art lock-based persistent allocator, and such transient allocators as LRMalloc and JEMalloc. In particular, reliance on GC and offline metadata reconstruction allows Ralloc to pay almost nothing for persistence during normal operation.
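The third innovation, position-independent pointers, has a well-known minimal form: store a signed offset from the pointer's own location instead of an absolute address, so the persistent region can be mapped at any base address after a crash. The sketch below illustrates the idea; the type name and layout are hypothetical, not Ralloc's actual code.

```cpp
#include <cstdint>

// Minimal offset-based ("position-independent") pointer sketch.
// off_ holds the signed distance from this object to its target,
// so the containing region may be remapped anywhere; 0 means null.
template <typename T>
class OffsetPtr {
    intptr_t off_;

public:
    OffsetPtr() : off_(0) {}

    void set(T* target) {
        off_ = target
             ? reinterpret_cast<intptr_t>(target) -
               reinterpret_cast<intptr_t>(this)
             : 0;
    }

    T* get() const {
        return off_
             ? reinterpret_cast<T*>(
                   reinterpret_cast<intptr_t>(this) + off_)
             : nullptr;
    }
};
```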
DNA sequencing is the physical/biochemical process of identifying the location of the four bases (Adenine, Guanine, Cytosine, Thymine) in a DNA strand. As semiconductor technology revolutionized computing, modern DNA sequencing technology (termed Next Generation Sequencing, NGS) revolutionized genomic research. As a result, modern NGS platforms can sequence hundreds of millions of short DNA fragments in parallel. The sequenced DNA fragments, representing the output of NGS platforms, are termed reads. Besides genomic variations, NGS imperfections induce noise in reads. Mapping each read to (the most similar portion of) a reference genome of the same species, i.e., read mapping, is a common critical first step in a diverse set of emerging bioinformatics applications. Mapping represents a search-heavy, memory-intensive similarity-matching problem and can therefore greatly benefit from near-memory processing. Intuition suggests using the fast associative search that Ternary Content Addressable Memory (TCAM) enables by construction. However, the excessive energy consumption and the lack of support for similarity matching (under NGS- and genomic-variation-induced noise) render direct application of TCAM infeasible, irrespective of volatility, even though only non-volatile TCAM can accommodate the large memory footprint in an area-efficient way. This paper introduces GeNVoM, a scalable, energy-efficient and high-throughput solution. Instead of optimizing an algorithm developed for general-purpose computers or GPUs, GeNVoM rethinks the algorithm and the non-volatile TCAM-based accelerator design together from the ground up. Thereby GeNVoM can improve throughput by up to 113.5x and energy consumption by up to 210.9x when compared to a GPU baseline (3.6x and 1.36x, respectively, against an accelerator baseline), where the baseline represents one of the highest-throughput implementations known.
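To clarify what a TCAM gives "by construction" and why noisy similarity matching does not map onto it directly, here is a small software model of a ternary lookup (purely illustrative, not GeNVoM's design): each row stores a value plus a care mask, and a query matches only where every cared-about bit agrees exactly.

```cpp
#include <cstdint>
#include <vector>

// Software model of one TCAM row: a stored value plus a care mask.
// A bit whose mask bit is 0 is a "don't care". This exact match with
// wildcards is all a TCAM natively provides; it has no notion of the
// approximate matching that noisy reads require.
struct TcamRow {
    uint64_t value;
    uint64_t care;  // 1 = bit must match, 0 = don't care
};

// Hardware compares all rows in parallel; this loop models that.
int tcam_lookup(const std::vector<TcamRow>& rows, uint64_t key) {
    for (size_t i = 0; i < rows.size(); ++i)
        if (((key ^ rows[i].value) & rows[i].care) == 0)
            return static_cast<int>(i);  // index of first matching row
    return -1;                           // no match
}
```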
Juhyun Bae, Ling Liu, Yanzhao Wu (2021)
We present RDMAbox, a set of low-level RDMA optimizations that provide better performance than previous approaches. The optimizations are packaged in easy-to-use kernel and user space libraries for applications and systems in the data center. We demonstrate the flexibility and effectiveness of RDMAbox by implementing a kernel remote paging system and a user space file system using RDMAbox. RDMAbox employs two optimization techniques. First, we suggest RDMA request merging and chaining to further reduce the total number of I/O operations issued to the RDMA NIC. The I/O merge queue at the same time functions as a traffic regulator to enforce admission control and avoid overloading the NIC. Second, we propose Adaptive Polling to achieve higher efficiency of polling Work Completions than existing busy polling, while maintaining the low CPU overhead of event-triggered notification. Our implementation of a remote paging system with RDMAbox outperforms existing representative solutions with up to 4x throughput improvement and up to 83% decrease in average tail latency in big-data workloads, and up to 83% reduction in completion time in machine learning workloads. Our implementation of a user space file system based on RDMAbox achieves up to 5.9x higher throughput over existing representative solutions.
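Request chaining has a direct expression in the standard verbs API: several work requests are linked through wr.next so that a single ibv_post_send() (one doorbell) posts the whole batch. The sketch below shows the mechanism with that standard API; queue-pair setup, memory registration, and error handling are omitted, and the function name and batching policy are illustrative rather than RDMAbox's actual code.

```cpp
#include <infiniband/verbs.h>
#include <cstring>

// Post up to 16 RDMA writes as one chained batch (one doorbell).
// sges[i] describes local buffer i; raddrs[i]/rkey identify the
// remote destination. Assumes 1 <= n <= 16 and registered memory.
int post_write_chain(ibv_qp* qp, ibv_sge* sges, uint64_t* raddrs,
                     uint32_t rkey, int n) {
    ibv_send_wr wrs[16];
    std::memset(wrs, 0, sizeof(wrs));
    for (int i = 0; i < n; ++i) {
        wrs[i].wr_id = i;
        wrs[i].sg_list = &sges[i];
        wrs[i].num_sge = 1;
        wrs[i].opcode = IBV_WR_RDMA_WRITE;
        wrs[i].wr.rdma.remote_addr = raddrs[i];
        wrs[i].wr.rdma.rkey = rkey;
        wrs[i].next = (i + 1 < n) ? &wrs[i + 1] : nullptr;  // chain
    }
    // Signal only the last WR to cut completion-processing overhead.
    wrs[n - 1].send_flags = IBV_SEND_SIGNALED;

    ibv_send_wr* bad = nullptr;
    return ibv_post_send(qp, &wrs[0], &bad);  // one call posts all n
}
```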
