
ERSFQ 8-bit Parallel Binary Shifter for Energy-Efficient Superconducting CPU

Added by Igor Vernik
Publication date: 2019
Language: English





We have designed and tested a parallel 8-bit ERSFQ binary shifter, one of the essential circuits in the design of an energy-efficient superconducting CPU. The binary shifter performs a bi-directional SHIFT instruction on an 8-bit argument. It consists of a bi-directional, triple-port shift register controlled by two (left and right) shift-pulse generators that asynchronously generate a set number of shift pulses. In the first clock cycle, an 8-bit word is loaded into the binary shifter and a 3-bit shift argument is loaded into the desired shift-pulse generator. Next, the generator asynchronously produces the required number of shift SFQ pulses (from 0 to 7), with a repetition rate set by the internal generator delay of ~30 ps. These SFQ pulses are applied to the left (positive) or the right (negative) input of the binary shifter. Finally, after the shift operation is completed, the resulting 8-bit word goes to the parallel output. The complete 8-bit ERSFQ binary shifter, consisting of 820 Josephson junctions, was simulated and optimized using PSCAN2. It was fabricated in the MIT Lincoln Lab 10-kA/cm2 SFQ5ee fabrication process with a high-kinetic-inductance layer. We have successfully tested the binary shifter in both the LSB-to-MSB and MSB-to-LSB propagation regimes for all eight shift arguments. A single shift operation on a single input word demonstrated operational margins of +/-16% of the dc bias current. Correct functionality of the 8-bit ERSFQ binary shifter with a large, exhaustive data pattern was observed within +/-10% margins of the dc bias current. In this paper, we describe the design and present the test results for the ERSFQ 8-bit parallel binary shifter.
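As a rough illustration of the instruction flow described above, the following Python sketch models the shifter at the functional level only; it is not an RSFQ circuit model, the function name is ours, and a logical (zero-fill) shift is assumed since the abstract does not specify the fill behavior.

```python
def ersfq_shift(word: int, amount: int, direction: str) -> int:
    """Functional model of one SHIFT instruction of the 8-bit binary shifter.

    word      -- 8-bit input loaded in parallel on the first clock cycle
    amount    -- 3-bit shift argument (0..7) loaded into a shift-pulse generator
    direction -- 'left' (positive input) or 'right' (negative input)
    """
    assert 0 <= word <= 0xFF and 0 <= amount <= 7
    # The selected generator emits `amount` SFQ pulses asynchronously
    # (~30 ps apart in hardware); each pulse shifts the register by one bit.
    for _ in range(amount):
        if direction == 'left':
            word = (word << 1) & 0xFF   # MSB shifted out, zero fills the LSB
        else:
            word = word >> 1            # LSB shifted out, zero fills the MSB
    return word                         # result appears on the parallel output

# Example: shift 0b00010110 left by 3 positions -> 0b10110000
assert ersfq_shift(0b00010110, 3, 'left') == 0b10110000
```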




We have designed and tested a parallel 8-bit ERSFQ arithmetic logic unit (ALU). The ALU design employs wave-pipelined instruction execution and features a modular bit-slice architecture that is easily extendable to any number of bits and adaptable to current recycling. A carry signal synchronized with the asynchronous instruction propagation provides the wave-pipelined operation of the ALU. The ALU instruction set consists of 14 arithmetic and logical instructions. It has been designed and simulated for operation at clock rates up to 10 GHz in the 10-kA/cm2 fabrication process. The ALU is embedded in a shift-register-based high-frequency testbed with an on-chip clock generator to allow comprehensive high-frequency testing for all possible operands. The 8-bit ERSFQ ALU, comprising 6840 Josephson junctions, has been fabricated in the MIT Lincoln Lab 10-kA/cm2 SFQ5ee fabrication process, which features eight Nb wiring layers and the high-kinetic-inductance layer needed for ERSFQ technology. We evaluated the bias margins for all instructions and various operands at both low and high clock frequencies. At low frequency, clock propagation and instruction propagation through the ALU were observed with bias margins of +/-11% and +/-9%, respectively. Also at low speed, the ALU exhibited correct functionality for all arithmetic and logical instructions with +/-6% bias margins. We tested the 8-bit ALU for all instructions at clock frequencies up to 2.8 GHz.
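A purely functional sketch of the modular bit-slice idea is given below; the instruction subset (ADD, AND, OR, XOR) is illustrative, since the 14-instruction set is not enumerated in the abstract, and the wave-pipelined timing is not modeled.

```python
def alu_slice(a: int, b: int, carry_in: int, op: str) -> tuple:
    """One 1-bit slice: combines one bit of each operand with the incoming
    carry and returns (result_bit, carry_out)."""
    if op == 'ADD':
        s = a ^ b ^ carry_in
        c = (a & b) | (carry_in & (a ^ b))
        return s, c
    if op == 'AND':
        return a & b, 0
    if op == 'OR':
        return a | b, 0
    if op == 'XOR':
        return a ^ b, 0
    raise ValueError(f'unknown instruction {op}')

def alu_8bit(a: int, b: int, op: str) -> int:
    """Chain eight identical slices LSB-to-MSB; the same chain extends to any width."""
    result, carry = 0, 0
    for i in range(8):
        bit, carry = alu_slice((a >> i) & 1, (b >> i) & 1, carry, op)
        result |= bit << i
    return result

assert alu_8bit(0x3C, 0x0F, 'ADD') == 0x4B
assert alu_8bit(0x3C, 0x0F, 'AND') == 0x0C
```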
The rapid development of Artificial Intelligence (AI) and the Internet of Things (IoT) increases the demand for edge-computing devices with low power consumption and relatively high processing speed. Computing-In-Memory (CIM) schemes based on emerging resistive Non-Volatile Memory (NVM) show great potential for reducing the power consumption of AI computing. However, device inconsistency of the non-volatile memory may significantly degrade the performance of the neural network. In this paper, we propose a low-power Resistive RAM (RRAM) based CIM core that not only achieves high computing efficiency but also greatly enhances robustness through a bit line regulator and a bit line weight mapping algorithm. Simulation results show that the power consumption of our proposed 8-bit CIM core is only 3.61 mW for a 256x256 array. The SFDR and SNDR of the CIM core reach 59.13 dB and 46.13 dB, respectively. The proposed bit line weight mapping scheme improves the top-1 accuracy by 2.46% and 3.47% for AlexNet and VGG16, respectively, on the ImageNet Large Scale Visual Recognition Competition 2012 (ILSVRC 2012) dataset in 8-bit mode.
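As a loose behavioral sketch of the compute-in-memory idea (not the paper's bit line regulator or weight mapping algorithm), the snippet below models one RRAM bit line as a set of conductances with a small random device inconsistency; the conductance model, noise level, and function name are assumptions for illustration.

```python
import numpy as np

def cim_bitline_mac(inputs, weights, sigma=0.05, seed=0):
    """Analog MAC on one bit line: weights act as cell conductances G, inputs
    as applied voltages V, and the bit-line current is I = sum(G * V).
    `sigma` models the relative device inconsistency of the RRAM cells."""
    rng = np.random.default_rng(seed)
    g = np.asarray(weights, dtype=float)
    g_actual = g * (1.0 + sigma * rng.standard_normal(g.shape))  # cell variation
    return float(np.dot(g_actual, np.asarray(inputs, dtype=float)))

# Ideal vs. variation-affected result on a toy 8-element column
x = [1, 0, 1, 1, 0, 1, 0, 1]
w = [3, 7, 2, 5, 1, 4, 6, 2]
print(float(np.dot(x, w)), round(cim_bitline_mac(x, w), 2))
```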
Dawen Xu, Cheng Chu, Cheng Liu (2021)
Deformable convolution networks (DCNs), proposed to address image recognition with geometric or photometric variations, typically involve deformable convolution, which convolves on arbitrary locations of the input features. The locations change with different inputs and induce considerable dynamic and irregular memory accesses that cannot be handled by classic neural network accelerators (NNAs). Moreover, the bilinear interpolation (BLI) operation required to obtain deformed features in DCNs also cannot be deployed directly on existing NNAs. Although a general-purpose processor (GPP) seated alongside classic NNAs can process the deformable convolution, processing on a GPP can be extremely slow due to the lack of parallel computing capability. To address this problem, we develop a DCN accelerator on existing NNAs that supports both standard convolution and deformable convolution. Specifically, for the dynamic and irregular accesses in DCNs, we divide both the input and output features into tiles and build a tile dependency table (TDT) to track the irregular tile dependencies at runtime. With the TDT, we further develop an on-chip tile scheduler to handle the dynamic and irregular accesses efficiently. In addition, we propose a novel mapping strategy to enable parallel BLI processing on NNAs and apply layer fusion techniques for more energy-efficient DCN processing. According to our experiments, the proposed accelerator achieves orders of magnitude higher performance and energy efficiency than typical computing architectures, including ARM, ARM+TPU, and GPU, with a 6.6% chip area penalty relative to a classic NNA.
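The bilinear interpolation (BLI) step that deformable convolution needs at fractional sampling positions can be sketched as follows; the NumPy formulation and names are illustrative and do not reflect the accelerator's tile dependency table or mapping strategy.

```python
import numpy as np

def bli_sample(feature: np.ndarray, y: float, x: float) -> float:
    """Bilinearly interpolate a 2-D feature map at a fractional location (y, x),
    assumed to lie inside the map."""
    h, w = feature.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * feature[y0, x0] + wx * feature[y0, x1]
    bottom = (1 - wx) * feature[y1, x0] + wx * feature[y1, x1]
    return (1 - wy) * top + wy * bottom

# Deformable convolution samples at (integer grid position + learned offset):
fmap = np.arange(16, dtype=float).reshape(4, 4)
print(bli_sample(fmap, 1.5, 2.25))   # 8.25, between rows 1-2 and columns 2-3
```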
Image bitmaps are widely used in in-memory applications and consume large amounts of storage space and energy. Compared with legacy DRAM, non-volatile memories (NVMs) are suitable for bitmap storage due to their advantages in capacity and power savings. However, NVMs suffer from higher latency and energy consumption for writes than for reads. Although compressing the data of write accesses to NVMs on the fly reduces the bit-writes in NVMs, existing precise or approximate compression schemes show limited performance improvements for bitmap data, due to the irregular data patterns and variance in the data. We observe that bitmap data exhibit pixel-level similarity because of the analogous contents of adjacent pixels. By exploiting this pixel-level similarity, we propose SimCom, an efficient similarity-aware compression scheme in the hardware layer that compresses the data of each write access on the fly. The idea behind SimCom is to compress runs of similar words into pairs of base words and run lengths. With the aid of domain knowledge of images, SimCom adaptively selects an appropriate compression mode to achieve an efficient trade-off between image quality and memory performance. We implement SimCom on GEM5 with NVMain and evaluate its performance with real-world workloads. Our results demonstrate that SimCom reduces write latency by 33.0% and 34.8% and saves 28.3% and 29.0% of energy compared with the state-of-the-art FPC and BDI schemes, respectively, with a minor quality loss of 3%.
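A simplified software sketch of the base-word-plus-run idea is shown below; the similarity threshold and word values are illustrative assumptions and do not reproduce the hardware compression modes.

```python
def simcom_compress(words, similar=lambda a, b: abs(a - b) <= 2):
    """Encode a write block as a list of (base word, run length) pairs,
    merging runs of adjacent words that pass the similarity test."""
    pairs = []
    for w in words:
        if pairs and similar(pairs[-1][0], w):
            pairs[-1] = (pairs[-1][0], pairs[-1][1] + 1)   # extend the current run
        else:
            pairs.append((w, 1))                           # start a new run
    return pairs

def simcom_decompress(pairs):
    """Approximate reconstruction: every word in a run takes the base value."""
    return [base for base, run in pairs for _ in range(run)]

block = [118, 119, 118, 120, 64, 64, 65, 200]   # pixel-like words from a bitmap
packed = simcom_compress(block)
print(packed)                                   # [(118, 4), (64, 3), (200, 1)]
print(simcom_decompress(packed))                # lossy but pixel-similar output
```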
Deep neural network (DNN) accelerators have received considerable attention in recent years due to the energy they save compared to mainstream hardware. Low-voltage operation of DNN accelerators allows energy consumption to be reduced further; however, it causes bit-level failures in the memory storing the quantized DNN weights. In this paper, we show that a combination of robust fixed-point quantization, weight clipping, and random bit error training (RandBET) significantly improves robustness against random bit errors in (quantized) DNN weights. This leads to high energy savings from both low-voltage operation and low-precision quantization. Our approach generalizes across operating voltages and accelerators, as demonstrated on bit errors profiled from SRAM arrays. We also discuss why weight clipping alone is already quite an effective way to achieve robustness against bit errors. Moreover, we specifically discuss the trade-offs involved among accuracy, robustness, and precision: without losing more than 1% in accuracy compared to a normally trained 8-bit DNN, we can reduce energy consumption on CIFAR-10 by 20%. Higher energy savings, e.g., 30%, are possible at the cost of 2.5% accuracy, even for 4-bit DNNs.
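A minimal sketch of the kind of perturbation RandBET trains against, i.e., random bit flips in 8-bit quantized weights, is given below; the error rate and the use of uint8 weights are illustrative assumptions, and the actual quantization and training procedure are not reproduced.

```python
import numpy as np

def flip_random_bits(q_weights: np.ndarray, p: float, seed: int = 0) -> np.ndarray:
    """Flip each bit of each uint8 weight independently with probability p."""
    rng = np.random.default_rng(seed)
    bits = np.unpackbits(q_weights[..., None], axis=-1)       # shape (..., 8)
    flips = (rng.random(bits.shape) < p).astype(np.uint8)     # random error mask
    return np.packbits(bits ^ flips, axis=-1).squeeze(-1)

w = np.array([17, 200, 3, 255], dtype=np.uint8)               # quantized weights
print(flip_random_bits(w, p=0.01))                            # occasionally differs from w
```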