Exploring the Feasibility of Using 3D XPoint as an In-Memory Computing Accelerator

69 0 0.0 ( 0 )

Download Cite

Added by Masoud Zabihi

Publication date 2021

fields Informatics Engineering

and research's language is English

Authors Masoud Zabihi - Salonik Resch - Husrev C{i}lasun

Hardware Architecture

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

This paper describes how 3D XPoint memory arrays can be used as in-memory computing accelerators. We first show that thresholded matrix-vector multiplication (TMVM), the fundamental computational kernel in many applications including machine learning, can be implemented within a 3D XPoint array without requiring data to leave the array for processing. Using the implementation of TMVM, we then discuss the implementation of a binary neural inference engine. We discuss the application of the core concept to address issues such as system scalability, where we connect multiple 3D XPoint arrays, and power integrity, where we analyze the parasitic effects of metal lines on noise margins. To assure power integrity within the 3D XPoint array during this implementation, we carefully analyze the parasitic effects of metal lines on the accuracy of the implementations. We quantify the impact of parasitics on limiting the size and configuration of a 3D XPoint array, and estimate the maximum acceptable size of a 3D XPoint subarray.

rate research

AccSS3D: Accelerator for Spatially Sparse 3D DNNs

83 - Om Ji Omer , Prashant Laddha , Gurpreet S Kalsi 2020

Semantic understanding and completion of real world scenes is a foundational primitive of 3D Visual perception widely used in high-level applications such as robotics, medical imaging, autonomous driving and navigation. Due to the curse of dimensionality, compute and memory requirements for 3D scene understanding grow in cubic complexity with voxel resolution, posing a huge impediment to realizing real-time energy efficient deployments. The inherent spatial sparsity present in the 3D world due to free space is fundamentally different from the channel-wise sparsity that has been extensively studied. We present ACCELERATOR FOR SPATIALLY SPARSE 3D DNNs (AccSS3D), the first end-to-end solution for accelerating 3D scene understanding by exploiting the ample spatial sparsity. As an algorithm-dataflow-architecture co-designed system specialized for spatially-sparse 3D scene understanding, AccSS3D includes novel spatial locality-aware metadata structures, a near-zero latency and spatial sparsity-aware dataflow optimizer, a surface orientation aware pointcloud reordering algorithm and a codesigned hardware accelerator for spatial sparsity that exploits data reuse through systolic and multicast interconnects. The SSpNNA accelerator core together with the 64 KB of L1 memory requires 0.92 mm2 of area in 16nm process at 1 GHz. Overall, AccSS3D achieves 16.8x speedup and a 2232x energy efficiency improvement for 3D sparse convolution compared to an Intel-i7-8700K 4-core CPU, which translates to a 11.8x end-to-end 3D semantic segmentation speedup and a 24.8x energy efficiency improvement (iso technology node)

Hardware Architecture

Analysing digital in-memory computing for advanced finFET node

70 - Veerendra S Devaraddi 2021

Digital In-memory computing improves energy efficiency and throughput of a data-intensive process, which incur memory thrashing and, resulting multiple same memory accesses in a von Neumann architecture. Digital in-memory computing involves accessing multiple SRAM cells simultaneously, which may result in a bit flip when not timed critically. Therefore we discuss the transient voltage characteristics of the bitlines during an SRAM compute. To improve the packaging density and also avoid MOSFET down-scaling issues, we use a 7-nm predictive PDK which uses a finFET node. The finFET process has discrete fins and a lower Voltage supply, which makes the design of in-memory compute SRAM difficult. In this paper, we design a 6T SRAM cell in 7-nm finFET node and compare its SNMs with a UMC 28nm node implementation. Further, we design and simulate the rest of the SRAM peripherals, and in-memory computation for an advanced finFET node.

Hardware Architecture Systems and Control Systems and Control

An 8-bit In Resistive Memory Computing Core with Regulated Passive Neuron and Bit Line Weight Mapping

146 - Yewei Zhang , Kejie Huang (Senior Member 2020

The rapid development of Artificial Intelligence (AI) and Internet of Things (IoT) increases the requirement for edge computing with low power and relatively high processing speed devices. The Computing-In-Memory(CIM) schemes based on emerging resistive Non-Volatile Memory(NVM) show great potential in reducing the power consumption for AI computing. However, the device inconsistency of the non-volatile memory may significantly degenerate the performance of the neural network. In this paper, we propose a low power Resistive RAM (RRAM) based CIM core to not only achieve high computing efficiency but also greatly enhance the robustness by bit line regulator and bit line weight mapping algorithm. The simulation results show that the power consumption of our proposed 8-bit CIM core is only 3.61mW (256*256). The SFDR and SNDR of the CIM core achieve 59.13 dB and 46.13 dB, respectively. The proposed bit line weight mapping scheme improves the top-1 accuracy by 2.46% and 3.47% for AlexNet and VGG16 on ImageNet Large Scale Visual Recognition Competition 2012 (ILSVRC 2012) in 8-bit mode, respectively.

Hardware Architecture

NERO: A Near High-Bandwidth Memory Stencil Accelerator for Weather Prediction Modeling

71 - Gagandeep Singh , Dionysios Diamantopoulos , Christoph Hagleitner 2020

Ongoing climate change calls for fast and accurate weather and climate modeling. However, when solving large-scale weather prediction simulations, state-of-the-art CPU and GPU implementations suffer from limited performance and high energy consumption. These implementations are dominated by complex irregular memory access patterns and low arithmetic intensity that pose fundamental challenges to acceleration. To overcome these challenges, we propose and evaluate the use of near-memory acceleration using a reconfigurable fabric with high-bandwidth memory (HBM). We focus on compound stencils that are fundamental kernels in weather prediction models. By using high-level synthesis techniques, we develop NERO, an FPGA+HBM-based accelerator connected through IBM CAPI2 (Coherent Accelerator Processor Interface) to an IBM POWER9 host system. Our experimental results show that NERO outperforms a 16-core POWER9 system by 4.2x and 8.3x when running two different compound stencil kernels. NERO reduces the energy consumption by 22x and 29x for the same two kernels over the POWER9 system with an energy efficiency of 1.5 GFLOPS/Watt and 17.3 GFLOPS/Watt. We conclude that employing near-memory acceleration solutions for weather prediction modeling is promising as a means to achieve both high performance and high energy efficiency.

Hardware Architecture

DNNExplorer: A Framework for Modeling and Exploring a Novel Paradigm of FPGA-based DNN Accelerator

82 - Xiaofan Zhang , Hanchen Ye , Junsong Wang 2020

Existing FPGA-based DNN accelerators typically fall into two design paradigms. Either they adopt a generic reusable architecture to support different DNN networks but leave some performance and efficiency on the table because of the sacrifice of design specificity. Or they apply a layer-wise tailor-made architecture to optimize layer-specific demands for computation and resources but loose the scalability of adaptation to a wide range of DNN networks. To overcome these drawbacks, this paper proposes a novel FPGA-based DNN accelerator design paradigm and its automation tool, called DNNExplorer, to enable fast exploration of various accelerator designs under the proposed paradigm and deliver optimized accelerator architectures for existing and emerging DNN networks. Three key techniques are essential for DNNExplorers improved performance, better specificity, and scalability, including (1) a unique accelerator design paradigm with both high-dimensional design space support and fine-grained adjustability, (2) a dynamic design space to accommodate different combinations of DNN workloads and targeted FPGAs, and (3) a design space exploration (DSE) engine to generate optimized accelerator architectures following the proposed paradigm by simultaneously considering both FPGAs computation and memory resources and DNN networks layer-wise characteristics and overall complexity. Experimental results show that, for the same FPGAs, accelerators generated by DNNExplorer can deliver up to 4.2x higher performances (GOP/s) than the state-of-the-art layer-wise pipelined solutions generated by DNNBuilder for VGG-like DNN with 38 CONV layers. Compared to accelerators with generic reusable computation units, DNNExplorer achieves up to 2.0x and 4.4x DSP efficiency improvement than a recently published accelerator design from academia (HybridDNN) and a commercial DNN accelerator IP (Xilinx DPU), respectively.

Hardware Architecture

comments

Fetching comments

Yarmouk Private University

Additional details More universities

Exploring the Feasibility of Using 3D XPoint as an In-Memory Computing Accelerator

Ask ChatGPT about the research

No Arabic abstract

Read More