FlashAbacus: A Self-Governing Flash-Based Accelerator for Low-Power Systems

128 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Myoungsoo Jung

تاريخ النشر 2018

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Jie Zhang - Myoungsoo Jung

هندسة العتاد

قم بزيارة صفحتنا على فيسبوك

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Energy efficiency and computing flexibility are some of the primary design constraints of heterogeneous computing. In this paper, we present FlashAbacus, a data-processing accelerator that self-governs heterogeneous kernel executions and data storage accesses by integrating many flash modules in lightweight multiprocessors. The proposed accelerator can simultaneously process data from different applications with diverse types of operational functions, and it allows multiple kernels to directly access flash without the assistance of a host-level file system or an I/O runtime library. We prototype FlashAbacus on a multicore-based PCIe platform that connects to FPGA-based flash controllers with a 20 nm node process. The evaluation results show that FlashAbacus can improve the bandwidth of data processing by 127%, while reducing energy consumption by 78.4%, as compared to a conventional method of heterogeneous computing. blfootnote{This paper is accepted by and will be published at 2018 EuroSys. This document is presented to ensure timely dissemination of scholarly and technical work.

قيم البحث

114 - Petar Jokic , Erfan Azarkhish , Andrea Bonetti 2021

Implementing embedded neural network processing at the edge requires efficient hardware acceleration that couples high computational performance with low power consumption. Driven by the rapid evolution of network architectures and their algorithmic features, accelerator designs are constantly updated and improved. To evaluate and compare hardware design choices, designers can refer to a myriad of accelerator implementations in the literature. Surveys provide an overview of these works but are often limited to system-level and benchmark-specific performance metrics, making it difficult to quantitatively compare the individual effect of each utilized optimization technique. This complicates the evaluation of optimizations for new accelerator designs, slowing-down the research progress. This work provides a survey of neural network accelerator optimization approaches that have been used in recent works and reports their individual effects on edge processing performance. It presents the list of optimizations and their quantitative effects as a construction kit, allowing to assess the design choices for each building block separately. Reported optimizations range from up to 10000x memory savings to 33x energy reductions, providing chip designers an overview of design choices for implementing efficient low power neural network accelerators.

هندسة العتاد التعلم الآلي

A Low-Power Accelerator for Deep Neural Networks with Enlarged Near-Zero Sparsity

83 - Yuxiang Huan , Yifan Qin , Yantian You 2017

It remains a challenge to run Deep Learning in devices with stringent power budget in the Internet-of-Things. This paper presents a low-power accelerator for processing Deep Neural Networks in the embedded devices. The power reduction is realized by avoiding multiplications of near-zero valued data. The near-zero approximation and a dedicated Near-Zero Approximation Unit (NZAU) are proposed to predict and skip the near-zero multiplications under certain thresholds. Compared with skipping zero-valued computations, our design achieves 1.92X and 1.51X further reduction of the total multiplications in LeNet-5 and Alexnet respectively, with negligible lose of accuracy. In the proposed accelerator, 256 multipliers are grouped into 16 independent Processing Lanes (PL) to support up to 16 neuron activations simultaneously. With the help of data pre-processing and buffering in each PL, multipliers can be clock-gated in most of the time even the data is excessively streaming in. Designed and simulated in UMC 65 nm process, the accelerator operating at 500 MHz is $>$ 4X faster than the mobile GPU Tegra K1 in processing the fully-connected layer FC8 of Alexnet, while consuming 717X less energy.

هندسة العتاد

A Novel Low Power Non-Volatile SRAM Cell with Self Write Termination

376 - Kanika Monga , Akul Malhotra , Nitin Chaturvedi 2019

A non-volatile SRAM cell is proposed for low power applications using Spin Transfer Torque-Magnetic Tunnel Junction (STT-MTJ) devices. This novel cell offers non-volatile storage, thus allowing selected blocks of SRAM to be switched off during standb y operation. To further increase the power savings, a write termination circuit is designed which detects completion of MTJ write and closes the bidirectional current path for the MTJ. A reduction of 25.81% in the number of transistors and a reduction of 2.95% in the power consumption is achieved in comparison to prior work on write termination circuits.

هندسة العتاد التقنيات الناشئة

DNNExplorer: A Framework for Modeling and Exploring a Novel Paradigm of FPGA-based DNN Accelerator

82 - Xiaofan Zhang , Hanchen Ye , Junsong Wang 2020

Existing FPGA-based DNN accelerators typically fall into two design paradigms. Either they adopt a generic reusable architecture to support different DNN networks but leave some performance and efficiency on the table because of the sacrifice of desi gn specificity. Or they apply a layer-wise tailor-made architecture to optimize layer-specific demands for computation and resources but loose the scalability of adaptation to a wide range of DNN networks. To overcome these drawbacks, this paper proposes a novel FPGA-based DNN accelerator design paradigm and its automation tool, called DNNExplorer, to enable fast exploration of various accelerator designs under the proposed paradigm and deliver optimized accelerator architectures for existing and emerging DNN networks. Three key techniques are essential for DNNExplorers improved performance, better specificity, and scalability, including (1) a unique accelerator design paradigm with both high-dimensional design space support and fine-grained adjustability, (2) a dynamic design space to accommodate different combinations of DNN workloads and targeted FPGAs, and (3) a design space exploration (DSE) engine to generate optimized accelerator architectures following the proposed paradigm by simultaneously considering both FPGAs computation and memory resources and DNN networks layer-wise characteristics and overall complexity. Experimental results show that, for the same FPGAs, accelerators generated by DNNExplorer can deliver up to 4.2x higher performances (GOP/s) than the state-of-the-art layer-wise pipelined solutions generated by DNNBuilder for VGG-like DNN with 38 CONV layers. Compared to accelerators with generic reusable computation units, DNNExplorer achieves up to 2.0x and 4.4x DSP efficiency improvement than a recently published accelerator design from academia (HybridDNN) and a commercial DNN accelerator IP (Xilinx DPU), respectively.

هندسة العتاد

Optimizing FPGA-based Accelerator Design for Large-Scale Molecular Similarity Search

224 - Hongwu Peng , Shiyang Chen , Zhepeng Wang 2021

Molecular similarity search has been widely used in drug discovery to identify structurally similar compounds from large molecular databases rapidly. With the increasing size of chemical libraries, there is growing interest in the efficient accelerat ion of large-scale similarity search. Existing works mainly focus on CPU and GPU to accelerate the computation of the Tanimoto coefficient in measuring the pairwise similarity between different molecular fingerprints. In this paper, we propose and optimize an FPGA-based accelerator design on exhaustive and approximate search algorithms. On exhaustive search using BitBound & folding, we analyze the similarity cutoff and folding level relationship with search speedup and accuracy, and propose a scalable on-the-fly query engine on FPGAs to reduce the resource utilization and pipeline interval. We achieve a 450 million compounds-per-second processing throughput for a single query engine. On approximate search using hierarchical navigable small world (HNSW), a popular algorithm with high recall and query speed. We propose an FPGA-based graph traversal engine to utilize a high throughput register array based priority queue and fine-grained distance calculation engine to increase the processing capability. Experimental results show that the proposed FPGA-based HNSW implementation has a 103385 query per second (QPS) on the Chembl database with 0.92 recall and achieves a 35x speedup than the existing CPU implementation on average. To the best of our knowledge, our FPGA-based implementation is the first attempt to accelerate molecular similarity search algorithms on FPGA and has the highest performance among existing approaches.

هندسة العتاد

سجل دخول لتتمكن من نشر تعليقات

التعليقات (0)

no comments...

سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها

المعهد الوطني الجزائري للبحث الزراعي

تفاصيل إضافية المزيد من الجامعات

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

FlashAbacus: A Self-Governing Flash-Based Accelerator for Low-Power Systems

اسأل ChatGPT حول البحث

ﻻ يوجد ملخص باللغة العربية

اقرأ أيضاً