ترغب بنشر مسار تعليمي؟ اضغط هنا

The inherent dynamics of the neuron membrane potential in Spiking Neural Networks (SNNs) allows processing of sequential learning tasks, avoiding the complexity of recurrent neural networks. The highly-sparse spike-based computations in such spatio-t emporal data can be leveraged for energy-efficiency. However, the membrane potential incurs additional memory access bottlenecks in current SNN hardware. To that effect, we propose a 10T-SRAM compute-in-memory (CIM) macro, specifically designed for state-of-the-art SNN inference. It consists of a fused weight (WMEM) and membrane potential (VMEM) memory and inherently exploits sparsity in input spikes leading to 97.4% reduction in energy-delay-product (EDP) at 85% sparsity (typical of SNNs considered in this work) compared to the case of no sparsity. We propose staggered data mapping and reconfigurable peripherals for handling different bit-precision requirements of WMEM and VMEM, while supporting multiple neuron functionalities. The proposed macro was fabricated in 65nm CMOS technology, achieving an energy-efficiency of 0.99TOPS/W at 0.85V supply and 200MHz frequency for signed 11-bit operations. We evaluate the SNN for sentiment classification from the IMDB dataset of movie reviews and achieve within 1% accuracy of an LSTM network with 8.5x lower parameters.
There is widespread interest in emerging technologies, especially resistive crossbars for accelerating Deep Neural Networks (DNNs). Resistive crossbars offer a highly-parallel and efficient matrix-vector-multiplication (MVM) operation. MVM being the most dominant operation in DNNs makes crossbars ideally suited. However, various sources of device and circuit non-idealities lead to errors in the MVM output, thereby reducing DNN accuracy. Towards that end, we propose crossbar re-mapping strategies to mitigate line-resistance induced accuracy degradation in DNNs, without having to re-train the learned weights, unlike most prior works. Line-resistances degrade the voltage levels along the crossbar columns, thereby inducing more errors at the columns away from the drivers. We rank the DNN weights and kernels based on a sensitivity analysis, and re-arrange the columns such that the most sensitive kernels are mapped closer to the drivers, thereby minimizing the impact of errors on the overall accuracy. We propose two algorithms $-$ static remapping strategy (SRS) and dynamic remapping strategy (DRS), to optimize the crossbar re-arrangement of a pre-trained DNN. We demonstrate the benefits of our approach on a standard VGG16 network trained using CIFAR10 dataset. Our results show that SRS and DRS limit the accuracy degradation to 2.9% and 2.1%, respectively, compared to a 5.6% drop from an as it is mapping of weights and kernels to crossbars. We believe this work brings an additional aspect for optimization, which can be used in tandem with existing mitigation techniques, such as in-situ compensation, technology aware training and re-training approaches, to enhance system performance.
Deep neural networks are a biologically-inspired class of algorithms that have recently demonstrated state-of-the-art accuracies involving large-scale classification and recognition tasks. Indeed, a major landmark that enables efficient hardware acce lerators for deep networks is the recent advances from the machine learning community that have demonstrated aggressively scaled deep binary networks with state-of-the-art accuracies. In this paper, we demonstrate how deep binary networks can be accelerated in modified von-Neumann machines by enabling binary convolutions within the SRAM array. In general, binary convolutions consist of bit-wise XNOR followed by a population-count (popcount). We present a charge sharing XNOR and popcount operation in 10 transistor SRAM cells. We have employed multiple circuit techniques including dual-read-worldines (Dual-RWL) along with a dual-stage ADC that overcomes the inaccuracies of a low precision ADC, to achieve a fairly accurate popcount. In addition, a key highlight of the present work is the fact that we propose sectioning of the SRAM array by adding switches onto the read-bitlines, thereby achieving improved parallelism. This is beneficial for deep networks, where the kernel size grows and requires to be stored in multiple sub-banks. As such, one needs to evaluate the partial popcount from multiple sub-banks and sum them up for achieving the final popcount. For n-sections per sub-array, we can perform n convolutions within one particular sub-bank, thereby improving overall system throughput as well as the energy efficiency. Our results at the array level show that the energy consumption and delay per-operation was 1.914pJ and 45ns, respectively. Moreover, an energy improvement of 2.5x, and a performance improvement of 4x was achieved by using the proposed sectioned-SRAM, compared to a non-sectioned SRAM design.
The inherent stochasticity in many nano-scale devices makes them prospective candidates for low-power computations. Such devices have been demonstrated to exhibit probabilistic switching between two stable states to achieve stochastic behavior. Recen tly, superparamagnetic nanomagnets (having low energy barrier EB $sim$ 1kT) have shown promise of achieving stochastic switching at GHz rates, with very low currents. On the other hand, voltage-controlled switching of nanomagnets through the Magneto-electric (ME) effect has shown further improvements in energy efficiency. In this simulation paper, we first analyze the stochastic switching characteristics of such super-paramagnetic nanomagnets in a voltage-controlled spintronic device. We study the influence of external bias on the switching behavior. Subsequently, we show that our proposed device leverages the voltage controlled stochasticity in performing low-voltage 8-bit analog to digital
Silicon-based Static Random Access Memories (SRAM) and digital Boolean logic have been the workhorse of the state-of-art computing platforms. Despite tremendous strides in scaling the ubiquitous metal-oxide-semiconductor transistor, the underlying te xtit{von-Neumann} computing architecture has remained unchanged. The limited throughput and energy-efficiency of the state-of-art computing systems, to a large extent, results from the well-known textit{von-Neumann bottleneck}. The energy and throughput inefficiency of the von-Neumann machines have been accentuated in recent times due to the present emphasis on data-intensive applications like artificial intelligence, machine learning textit{etc}. A possible approach towards mitigating the overhead associated with the von-Neumann bottleneck is to enable textit{in-memory} Boolean computations. In this manuscript, we present an augmented version of the conventional SRAM bit-cells, called textit{the X-SRAM}, with the ability to perform in-memory, vector Boolean computations, in addition to the usual memory storage operations. We propose at least six different schemes for enabling in-memory vector computations including NAND, NOR, IMP (implication), XOR logic gates with respect to different bit-cell topologies $-$ the 8T cell and the 8$^+$T Differential cell. In addition, we also present a novel textit{`read-compute-store} scheme, wherein the computed Boolean function can be directly stored in the memory without the need of latching the data and carrying out a subsequent write operation. The feasibility of the proposed schemes has been verified using predictive transistor models and Monte-Carlo variation analysis.
Despite huge success of artificial intelligence, hardware systems running these algorithms consume orders of magnitude higher energy compared to the human brain, mainly due to heavy data movements between the memory unit and the computation cores. Sp iking neural networks (SNNs) built using bio-plausible neuron and synaptic models have emerged as the power-efficient choice for designing cognitive applications. These algorithms involve several lookup-table (LUT) based function evaluations such as high-order polynomials and transcendental functions for solving complex neuro-synaptic models, that typically require additional storage. To that effect, we propose `SPARE - an in-memory, distributed processing architecture built on ROM-embedded RAM technology, for accelerating SNNs. ROM-embedded RAMs allow storage of LUTs, embedded within a typical memory array, without additional area overhead. Our proposed architecture consists of a 2-D array of Processing Elements (PEs). Since most of the computations are done locally within each PE, unnecessary data transfers are restricted, thereby alleviating the von-Neumann bottleneck. We evaluate SPARE for two different ROM-Embedded RAM structures - CMOS based ROM-Embedded SRAMs (R-SRAMs) and STT-MRAM based ROM-Embedded MRAMs (R-MRAMs). Moreover, we analyze trade-offs in terms of energy, area and performance, for using the two technologies on a range of image classification benchmarks. Furthermore, we leverage the additional storage density to implement complex neuro-synaptic functionalities. This enhances the utility of the proposed architecture by provisioning implementation of any neuron/synaptic behavior as necessitated by the application. Our results show up-to 1.75x, 1.95x and 1.95x improvement in energy, iso-storage area, and iso-area performance, respectively, by using neural network accelerators built on ROM-embedded RAM primitives.
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا