No Arabic abstract
The rapid development of Artificial Intelligence (AI) and Internet of Things (IoT) increases the requirement for edge computing with low power and relatively high processing speed devices. The Computing-In-Memory(CIM) schemes based on emerging resistive Non-Volatile Memory(NVM) show great potential in reducing the power consumption for AI computing. However, the device inconsistency of the non-volatile memory may significantly degenerate the performance of the neural network. In this paper, we propose a low power Resistive RAM (RRAM) based CIM core to not only achieve high computing efficiency but also greatly enhance the robustness by bit line regulator and bit line weight mapping algorithm. The simulation results show that the power consumption of our proposed 8-bit CIM core is only 3.61mW (256*256). The SFDR and SNDR of the CIM core achieve 59.13 dB and 46.13 dB, respectively. The proposed bit line weight mapping scheme improves the top-1 accuracy by 2.46% and 3.47% for AlexNet and VGG16 on ImageNet Large Scale Visual Recognition Competition 2012 (ILSVRC 2012) in 8-bit mode, respectively.
We have designed and tested a parallel 8-bit ERSFQ arithmetic logic unit (ALU). The ALU design employs wave-pipelined instruction execution and features modular bit-slice architecture that is easily extendable to any number of bits and adaptable to current recycling. A carry signal synchronized with an asynchronous instruction propagation provides the wave-pipeline operation of the ALU. The ALU instruction set consists of 14 arithmetical and logical instructions. It has been designed and simulated for operation up to a 10 GHz clock rate at the 10-kA/cm2 fabrication process. The ALU is embedded into a shift-register-based high-frequency testbed with on-chip clock generator to allow for comprehensive high frequency testing for all possible operands. The 8-bit ERSFQ ALU, comprising 6840 Josephson junctions, has been fabricated with MIT Lincoln Lab 10-kA/cm2 SFQ5ee fabrication process featuring eight Nb wiring layers and a high-kinetic inductance layer needed for ERSFQ technology. We evaluated the bias margins for all instructions and various operands at both low and high frequency clock. At low frequency, clock and all instruction propagation through ALU were observed with bias margins of +/-11% and +/-9%, respectively. Also at low speed, the ALU exhibited correct functionality for all arithmetical and logical instructions with +/-6% bias margins. We tested the 8-bit ALU for all instructions up to 2.8 GHz clock frequency.
Processing-using-DRAM has been proposed for a limited set of basic operations (i.e., logic operations, addition). However, in order to enable full adoption of processing-using-DRAM, it is necessary to provide support for more complex operations. In this paper, we propose SIMDRAM, a flexible general-purpose processing-using-DRAM framework that (1) enables the efficient implementation of complex operations, and (2) provides a flexible mechanism to support the implementation of arbitrary user-defined operations. The SIMDRAM framework comprises three key steps. The first step builds an efficient MAJ/NOT representation of a given desired operation. The second step allocates DRAM rows that are reserved for computation to the operations input and output operands, and generates the required sequence of DRAM commands to perform the MAJ/NOT implementation of the desired operation in DRAM. The third step uses the SIMDRAM control unit located inside the memory controller to manage the computation of the operation from start to end, by executing the DRAM commands generated in the second step of the framework. We design the hardware and ISA support for SIMDRAM framework to (1) address key system integration challenges, and (2) allow programmers to employ new SIMDRAM operations without hardware changes. We evaluate SIMDRAM for reliability, area overhead, throughput, and energy efficiency using a wide range of operations and seven real-world applications to demonstrate SIMDRAMs generality. Using 16 DRAM banks, SIMDRAM provides (1) 88x and 5.8x the throughput, and 257x and 31x the energy efficiency, of a CPU and a high-end GPU, respectively, over 16 operations; (2) 21x and 2.1x the performance of the CPU and GPU, over seven real-world applications. SIMDRAM incurs an area overhead of only 0.2% in a high-end CPU.
We have designed and tested a parallel 8-bit ERSFQ binary shifter that is one of the essential circuits in the design of the energy-efficient superconducting CPU. The binary shifter performs a bi-directional SHIFT instruction of an 8-bit argument. It consists of a bi-direction triple-port shift register controlled by two (left and right) shift pulse generators asynchronously generating a set number of shift pulses. At first clock cycle, an 8-bit word is loaded into the binary shifter and a 3-bit shift argument is loaded into the desired shift-pulse generator. Next, the generator produces the required number of shift SFQ pulses (from 0 to 7) asynchronously, with a repetition rate set by the internal generator delay of ~ 30 ps. These SFQ pulses are applied to the left (positive) or the right (negative) input of the binary shifter. Finally, after the shift operation is completed, the resulting 8-bit word goes to the parallel output. The complete 8-bit ERSFQ binary shifter, consisting of 820 Josephson junctions, was simulated and optimized using PSCAN2. It was fabricated in MIT Lincoln Lab 10-kA/cm2 SFQ5ee fabrication process with a high-kinetic inductance layer. We have successfully tested the binary shifter at both the LSB-to-MSB and MSB-to-LSB propagation regimes for all eight shift arguments. A single shift operation on a single input word demonstrated operational margins of +/-16% of the dc bias current. The correct functionality of the 8-bit ERSFQ binary shifter with the large, exhaustive data pattern was observed within +/-10% margins of the dc bias current. In this paper, we describe the design and present the test results for the ERSFQ 8-bit parallel binary shifter.
Computations implemented on a physical system are fundamentally limited by the laws of physics. A prominent example for a physical law that bounds computations is the Landauer principle. According to this principle, erasing a bit of information requires a concentration of probability in phase space, which by Liouvilles theorem is impossible in pure Hamiltonian dynamics. It therefore requires dissipative dynamics with heat dissipation of at least $k_BTlog 2$ per erasure of one bit. Using a concrete example, we show that when the dynamic is confined to a single energy shell it is possible to concentrate the probability on this shell using Hamiltonian dynamic, and therefore to implement an erasable bit with no thermodynamic cost.
Hyperdimensional Computing (HDC) is an emerging computational framework that mimics important brain functions by operating over high-dimensional vectors, called hypervectors (HVs). In-memory computing implementations of HDC are desirable since they can significantly reduce data transfer overheads. All existing in-memory HDC platforms consider binary HVs where each dimension is represented with a single bit. However, utilizing multi-bit HVs allows HDC to achieve acceptable accuracies in lower dimensions which in turn leads to higher energy efficiencies. Thus, we propose a highly accurate and efficient multi-bit in-memory HDC inference platform called MIMHD. MIMHD supports multi-bit operations using ferroelectric field-effect transistor (FeFET) crossbar arrays for multiply-and-add and FeFET multi-bit content-addressable memories for associative search. We also introduce a novel hardware-aware retraining framework (HWART) that trains the HDC model to learn to work with MIMHD. For six popular datasets and 4000 dimension HVs, MIMHD using 3-bit (2-bit) precision HVs achieves (i) average accuracies of 92.6% (88.9%) which is 8.5% (4.8%) higher than binary implementations; (ii) 84.1x (78.6x) energy improvement over a GPU, and (iii) 38.4x (34.3x) speedup over a GPU, respectively. The 3-bit $times$ is 4.3x and 13x faster and more energy-efficient than binary HDC accelerators while achieving similar accuracies.