Memory System Designed for Multiply-Accumulate (MAC) Engine Based on Stochastic Computing

Added by Xinyue Zhang

Publication date 2019

Field Electronic Engineering

Research language English

Authors Xinyue Zhang - Yuan Wang - Yawen Zhang

Signal Processing

Convolutional neural networks (CNNs) achieve excellent performance on tasks such as image recognition and natural language processing, at the cost of high power consumption. Stochastic computing (SC) is an attractive paradigm for low-power applications that performs arithmetic operations with simple logic and low hardware cost. However, conventional memory structures, designed and optimized for binary computing, incur extra data conversion costs that significantly decrease energy efficiency. This paper therefore proposes a new memory system for an SC-based multiply-accumulate (MAC) engine used in CNNs, which remains compatible with conventional memory systems. The overall energy consumption of the new computing structure is 0.91 pJ, which is 82.1% lower than that of the conventional structure, and the energy efficiency reaches 164.8 TOPS/W.
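The "simple logic" behind SC arithmetic can be illustrated with a minimal sketch. It assumes the common unipolar encoding, in which a value in [0, 1] is the fraction of ones in a bitstream and a single AND gate multiplies two independent streams. The function names, the stream length, and the accumulation by counting ones are illustrative assumptions, not the circuit described in the paper.

```python
import random


def to_bitstream(p, length, rng):
    """Encode a value p in [0, 1] as a unipolar stochastic bitstream.

    Each bit is 1 with probability p, so the fraction of ones approximates p.
    """
    return [1 if rng.random() < p else 0 for _ in range(length)]


def from_bitstream(bits):
    """Decode a unipolar bitstream by counting its ones."""
    return sum(bits) / len(bits)


def sc_multiply(a_bits, b_bits):
    """Multiply two independent unipolar streams with a bitwise AND."""
    return [a & b for a, b in zip(a_bits, b_bits)]


def sc_mac(xs, ws, length=4096, seed=0):
    """Hypothetical MAC: AND-gate products, accumulated by counting ones."""
    rng = random.Random(seed)
    total = 0.0
    for x, w in zip(xs, ws):
        product = sc_multiply(to_bitstream(x, length, rng),
                              to_bitstream(w, length, rng))
        total += from_bitstream(product)
    return total


print(sc_mac([0.5, 0.25], [0.5, 0.8]))  # close to 0.5*0.5 + 0.25*0.8 = 0.45
```

The result is only an estimate: its error shrinks as the stream length grows, which is the latency-versus-accuracy trade-off that SC designs have to manage.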

An Energy-Efficient Mixed-Signal Parallel Multiply-Accumulate (MAC) Engine Based on Stochastic Computing

Xinyue Zhang, Jiahao Song, Yuan Wang - 2019

Convolutional neural networks (CNNs) have achieved excellent performance on various tasks, but deploying CNNs at the edge is constrained by the high energy consumption of the convolution operation. Stochastic computing (SC) is an attractive paradigm that performs arithmetic operations with simple logic gates and low hardware cost. This paper presents an energy-efficient mixed-signal multiply-accumulate (MAC) engine based on SC. A parallel architecture is adopted to address the latency problem of SC. Simulation results show that the overall energy consumption of the design is 5.03 pJ per 26-input MAC operation in 28 nm CMOS technology.

Signal Processing

On Memory System Design for Stochastic Computing

S. Karen Khatamifard, M. Hassan Najafi, Ali Ghoreyshi - 2017

Growing uncertainty in design parameters (and therefore in design functionality) renders stochastic computing, which represents and processes data as quantized probabilities, particularly promising. However, due to the difference in data representation, integrating conventional memory (designed and optimized for non-stochastic computing) into stochastic computing systems inevitably incurs a significant data conversion overhead. Barely any stochastic computing proposal to date covers the memory impact. In this paper, as the first study of its kind to the best of our knowledge, we rethink the memory system design for stochastic computing. The result is a seamless stochastic system, StochMem, which features analog memory to trade the energy and area overhead of data conversion for computation accuracy. In this manner, StochMem can reduce the energy (area) overhead by up to 52.8% (93.7%) at the cost of at most 0.7% loss in computation accuracy.
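The data conversion overhead mentioned above can be made concrete with a small sketch of the conventional conversion path, assuming a comparator-based stochastic number generator driven by an 8-bit linear-feedback shift register (LFSR) and a counter for the way back. This is a generic textbook arrangement, not the StochMem design; all function names and the choice of feedback polynomial are assumptions made for illustration.

```python
def lfsr8(seed=1):
    """Yield the states of an 8-bit maximal-length Fibonacci LFSR.

    The feedback implements x^8 + x^4 + x^3 + x^2 + 1, so the state visits
    every value from 1 to 255 once before repeating.
    """
    state = seed & 0xFF
    if state == 0:
        raise ValueError("LFSR seed must be non-zero")
    while True:
        yield state
        bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 4)) & 1
        state = (state >> 1) | (bit << 7)


def binary_to_stochastic(value, length=255, seed=1):
    """Comparator-based generator: emit 1 whenever the LFSR state <= value."""
    rng = lfsr8(seed)
    return [1 if next(rng) <= value else 0 for _ in range(length)]


def stochastic_to_binary(bits):
    """Counter-based converter: the binary result is the number of ones."""
    return sum(bits)


stream = binary_to_stochastic(64)
print(stochastic_to_binary(stream))  # 64 over one full 255-cycle LFSR period
```

Every operand read from a binary memory needs a generator like this, and every result written back needs a counter running for the full stream length, which is where the extra energy and area go.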

Emerging Technologies

MRAM Co-designed Processing-in-Memory CNN Accelerator for Mobile and IoT Applications

Baohua Sun, Daniel Liu, Leo Yu - 2018

We designed a device for convolutional neural network applications with non-volatile MRAM and a co-designed computing-in-memory architecture. It has been successfully fabricated in a 22 nm CMOS silicon process. It provides more than 40 MB of MRAM with an energy efficiency of 9.9 TOPS/W, and it supports multiple models on a single chip for mobile and IoT applications.

Signal Processing

Radio-Frequency Multiply-And-Accumulate Operations with Spintronic Synapses

N. Leroux - 2020

Exploiting the physics of nanoelectronic devices is a major lead for implementing compact, fast, and energy efficient artificial intelligence. In this work, we propose an original road in this direction, where assemblies of spintronic resonators used as artificial synapses can classify analogue radio-frequency signals directly without digitalization. The resonators convert the radio-frequency input signals into direct voltages through the spin-diode effect. In the process, they multiply the input signals by a synaptic weight, which depends on their resonance frequency. We demonstrate through physical simulations with parameters extracted from experimental devices that frequency-multiplexed assemblies of resonators implement the cornerstone operation of artificial neural networks, the Multiply-And-Accumulate (MAC), directly on microwave inputs. The results show that even with a non-ideal realistic model, the outputs obtained with our architecture remain comparable to that of a traditional MAC operation. Using a conventional machine learning framework augmented with equations describing the physics of spintronic resonators, we train a single layer neural network to classify radio-frequency signals encoding 8x8 pixel handwritten digits pictures. The spintronic neural network recognizes the digits with an accuracy of 99.96 %, equivalent to purely software neural networks. This MAC implementation offers a promising solution for fast, low-power radio-frequency classification applications, and a new building block for spintronic deep neural networks.

Disordered Systems and Neural Networks Mesoscale and Nanoscale Physics

Clio: A Hardware-Software Co-Designed Disaggregated Memory System

Zhiyuan Guo, Yizhou Shan, Xuhao Luo - 2021

Memory disaggregation has attracted great attention recently because of its benefits in efficient memory utilization and ease of management. So far, memory disaggregation research has taken one of two approaches: building or emulating memory nodes with either regular servers or raw memory devices with no processing power. The former incurs higher monetary cost and faces tail latency and scalability limitations, while the latter introduces performance, security, and management problems. Server-based memory nodes and memory nodes with no processing power are two extreme approaches. We seek a sweet spot in the middle by proposing a hardware-based memory disaggregation solution that has the right amount of processing power at memory nodes. Furthermore, we take a clean-slate approach by starting from the requirements of memory disaggregation and designing a memory-disaggregation-native system. We propose a hardware-based disaggregated memory system, Clio, that virtualizes and manages disaggregated memory at the memory node. Clio includes a new hardware-based virtual memory system, a customized network system, and a framework for computation offloading. In building Clio, we not only co-design OS functionalities, hardware architecture, and the network system, but also co-design the compute node and memory node. We prototyped Clio's memory node with an FPGA and implemented its client-node functionalities in a user-space library. Clio achieves 100 Gbps throughput and an end-to-end latency of 2.5 µs at the median and 3.2 µs at the 99th percentile. Clio scales much better and has orders of magnitude lower tail latency than RDMA, it saves 1.1x to 3.4x energy compared to CPU-based and SmartNIC-based disaggregated memory systems, and it is 2.7x faster than software-based SmartNIC solutions.

Distributed Parallel and Cluster Computing