ﻻ يوجد ملخص باللغة العربية
It remains a challenge to run Deep Learning in devices with stringent power budget in the Internet-of-Things. This paper presents a low-power accelerator for processing Deep Neural Networks in the embedded devices. The power reduction is realized by avoiding multiplications of near-zero valued data. The near-zero approximation and a dedicated Near-Zero Approximation Unit (NZAU) are proposed to predict and skip the near-zero multiplications under certain thresholds. Compared with skipping zero-valued computations, our design achieves 1.92X and 1.51X further reduction of the total multiplications in LeNet-5 and Alexnet respectively, with negligible lose of accuracy. In the proposed accelerator, 256 multipliers are grouped into 16 independent Processing Lanes (PL) to support up to 16 neuron activations simultaneously. With the help of data pre-processing and buffering in each PL, multipliers can be clock-gated in most of the time even the data is excessively streaming in. Designed and simulated in UMC 65 nm process, the accelerator operating at 500 MHz is $>$ 4X faster than the mobile GPU Tegra K1 in processing the fully-connected layer FC8 of Alexnet, while consuming 717X less energy.
Implementing embedded neural network processing at the edge requires efficient hardware acceleration that couples high computational performance with low power consumption. Driven by the rapid evolution of network architectures and their algorithmic
Energy efficiency and computing flexibility are some of the primary design constraints of heterogeneous computing. In this paper, we present FlashAbacus, a data-processing accelerator that self-governs heterogeneous kernel executions and data storage
The number of parameters in deep neural networks (DNNs) is scaling at about 5$times$ the rate of Moores Law. To sustain the pace of growth of the DNNs, new technologies and computing architectures are needed. Photonic computing systems are promising
Deep convolutional networks are well-known for their high computational and memory demands. Given limited resources, how does one design a network that balances its size, training time, and prediction accuracy? A surprisingly effective approach to tr
Ongoing climate change calls for fast and accurate weather and climate modeling. However, when solving large-scale weather prediction simulations, state-of-the-art CPU and GPU implementations suffer from limited performance and high energy consumptio