Programmable FPGA-based Memory Controller

144 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Sasindu Wijeratne

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Sasindu Wijeratne - Sanket Pattnaik - Zhiyu Chen

هندسة العتاد الذكاء الاصطناعي النظم الموزعة والتوازية والحوسبة العنقودية

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Even with generational improvements in DRAM technology, memory access latency still remains the major bottleneck for application accelerators, primarily due to limitations in memory interface IPs which cannot fully account for variations in target applications, the algorithms used, and accelerator architectures. Since developing memory controllers for different applications is time-consuming, this paper introduces a modular and programmable memory controller that can be configured for different target applications on available hardware resources. The proposed memory controller efficiently supports cache-line accesses along with bulk memory transfers. The user can configure the controller depending on the available logic resources on the FPGA, memory access pattern, and external memory specifications. The modular design supports various memory access optimization techniques including, request scheduling, internal caching, and direct memory access. These techniques contribute to reducing the overall latency while maintaining high sustained bandwidth. We implement the system on a state-of-the-art FPGA and evaluate its performance using two widely studied domains: graph analytics and deep learning workloads. We show improved overall memory access time up to 58% on CNN and GCN workloads compared with commercial memory controller IPs.

قيم البحث

86 - Fei Wen , Mian Qin , Paul V. Gratz 2020

Hybrid memory systems, comprised of emerging non-volatile memory (NVM) and DRAM, have been proposed to address the growing memory demand of applications. Emerging NVM technologies, such as phase-change memories (PCM), memristor, and 3D XPoint, have h igher capacity density, minimal static power consumption and lower cost per GB. However, NVM has longer access latency and limited write endurance as opposed to DRAM. The different characteristics of two memory classes point towards the design of hybrid memory systems containing multiple classes of main memory. In the iterative and incremental development of new architectures, the timeliness of simulation completion is critical to project progression. Hence, a highly efficient simulation method is needed to evaluate the performance of different hybrid memory system designs. Design exploration for hybrid memory systems is challenging, because it requires emulation of the full system stack, including the OS, memory controller, and interconnect. Moreover, benchmark applications for memory performance test typically have much larger working sets, thus taking even longer simulation warm-up period. In this paper, we propose a FPGA-based hybrid memory system emulation platform. We target at the mobile computing system, which is sensitive to energy consumption and is likely to adopt NVM for its power efficiency. Here, because the focus of our platform is on the design of the hybrid memory system, we leverage the on-board hard IP ARM processors to both improve simulation performance while improving accuracy of the results. Thus, users can implement their data placement/migration policies with the FPGA logic elements and evaluate new designs quickly and effectively. Results show that our emulation platform provides a speedup of 9280x in simulation time compared to the software counterpart Gem5.

هندسة العتاد

A Flexible High-Bandwidth Low-Latency Multi-Port Memory Controller

110 - Xuan-Thuan Nguyen , Duc-Hung Le , Trong-Tu Bui 2017

Multi-port memory controllers (MPMCs) have become increasingly important in many modern applications due to the tremendous growth in bandwidth requirement. Many approaches so far have focused on improving either the memory access latency or the bandw idth utilization for specific applications. Moreover, the application systems are likely to require certain adjustments to connect with an MPMC, since the MPMC interface is limited to a single-clock and single data-width domain. In this paper, we propose efficient techniques to improve the flexibility, latency, and bandwidth of an MPMC. Firstly, MPMC interfaces employ a pair of dual-clock dual-port FIFOs at each port, so any multi-clock multi-data-width application system can connect to an MPMC without requiring extra resources. Secondly, memory access latency is significantly reduced because parallel FIFOs temporarily keep the data transfer between the application system and memory. Lastly, a proposed arbitration scheme, namely window-based first-come-first-serve, considerably enhances the bandwidth utilization. Depending on the applications, MPMC can be properly configured by updating several internal configuration registers. The experimental results in an Altera Cyclone FPGA prove that MPMC is fully operational at 150 MHz and supports up to 32 concurrent connections at various clocks and data widths. More significantly, achieved bandwidth utilization is approximately 93.2% of the theoretical bandwidth, and the access latency is minimized as compared to previous designs.

هندسة العتاد

Block Convolution: Towards Memory-Efficient Inference of Large-Scale CNNs on FPGA

75 - Gang Li , Zejian Liu , Fanrong Li 2021

Deep convolutional neural networks have achieved remarkable progress in recent years. However, the large volume of intermediate results generated during inference poses a significant challenge to the accelerator design for resource-constraint FPGA. D ue to the limited on-chip storage, partial results of intermediate layers are frequently transferred back and forth between on-chip memory and off-chip DRAM, leading to a non-negligible increase in latency and energy consumption. In this paper, we propose block convolution, a hardware-friendly, simple, yet efficient convolution operation that can completely avoid the off-chip transfer of intermediate feature maps at run-time. The fundamental idea of block convolution is to eliminate the dependency of feature map tiles in the spatial dimension when spatial tiling is used, which is realized by splitting a feature map into independent blocks so that convolution can be performed separately on individual blocks. We conduct extensive experiments to demonstrate the efficacy of the proposed block convolution on both the algorithm side and the hardware side. Specifically, we evaluate block convolution on 1) VGG-16, ResNet-18, ResNet-50, and MobileNet-V1 for ImageNet classification task; 2) SSD, FPN for COCO object detection task, and 3) VDSR for Set5 single image super-resolution task. Experimental results demonstrate that comparable or higher accuracy can be achieved with block convolution. We also showcase two CNN accelerators via algorithm/hardware co-design based on block convolution on memory-limited FPGAs, and evaluation shows that both accelerators substantially outperform the baseline without off-chip transfer of intermediate feature maps.

هندسة العتاد النظم الموزعة والتوازية والحوسبة العنقودية

On-FPGA Training with Ultra Memory Reduction: A Low-Precision Tensor Method

112 - Kaiqi Zhang , Cole Hawkins , Xiyuan Zhang 2021

Various hardware accelerators have been developed for energy-efficient and real-time inference of neural networks on edge devices. However, most training is done on high-performance GPUs or servers, and the huge memory and computing costs prevent tra ining neural networks on edge devices. This paper proposes a novel tensor-based training framework, which offers orders-of-magnitude memory reduction in the training process. We propose a novel rank-adaptive tensorized neural network model, and design a hardware-friendly low-precision algorithm to train this model. We present an FPGA accelerator to demonstrate the benefits of this training method on edge devices. Our preliminary FPGA implementation achieves $59times$ speedup and $123times$ energy reduction compared to embedded CPU, and $292times$ memory reduction over a standard full-size training.

هندسة العتاد

JANUS: an FPGA-based System for High Performance Scientific Computing

589 - F. Belletti , M. Cotallo , A. Cruz 2008

This paper describes JANUS, a modular massively parallel and reconfigurable FPGA-based computing system. Each JANUS module has a computational core and a host. The computational core is a 4x4 array of FPGA-based processing elements with nearest-neigh bor data links. Processors are also directly connected to an I/O node attached to the JANUS host, a conventional PC. JANUS is tailored for, but not limited to, the requirements of a class of hard scientific applications characterized by regular code structure, unconventional data manipulation instructions and not too large data-base size. We discuss the architecture of this configurable machine, and focus on its use on Monte Carlo simulations of statistical mechanics. On this class of application JANUS achieves impressive performances: in some cases one JANUS processing element outperfoms high-end PCs by a factor ~ 1000. We also discuss the role of JANUS on other classes of scientific applications.

هندسة العتاد