Being able to learn from complex data with phase information is imperative for many signal processing applications. Today's real-valued deep neural networks (DNNs) have shown efficiency in latent information analysis but fall short when applied to the complex domain. Deep complex networks (DCNs), in contrast, can learn from complex data but have high computational costs; therefore, they cannot satisfy the instant decision-making requirements of many deployable systems dealing with short observations or short signal bursts. Recently, the Binarized Complex Neural Network (BCNN), which integrates DCNs with binarized neural networks (BNNs), has shown great potential in classifying complex data in real time. In this paper, we propose a structural-pruning-based accelerator for BCNN that delivers more than 5000 frames/s of inference throughput on edge devices. The high performance comes from both the algorithm and the hardware sides. On the algorithm side, we apply structural pruning to the original BCNN models and obtain 20$\times$ pruning rates with negligible accuracy loss; on the hardware side, we propose a novel 2D convolution accelerator for the binary complex neural network. Experimental results show that the proposed design sustains over 90% utilization and achieves inference throughput of 5882 frames/s and 4938 frames/s for complex NIN-Net and ResNet-18, respectively, on the CIFAR-10 dataset with an Alveo U280 board.
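As a rough illustration of the structural pruning mentioned on the algorithm side, the sketch below scores the output channels of a binary complex convolution layer by weight magnitude and keeps only a small fraction of them. The function and tensor names (prune_output_channels, w_real, w_imag, keep_ratio) are hypothetical, not taken from the paper; the 0.05 keep ratio simply mirrors the stated 20$\times$ pruning rate.

```python
# Minimal sketch of channel-level structural pruning for a binary complex
# convolution layer. Names and the scoring rule are illustrative placeholders,
# not the paper's actual implementation.
import numpy as np

def prune_output_channels(w_real, w_imag, keep_ratio=0.05):
    """Keep only the output channels with the largest combined L1 weight
    magnitude; a 0.05 keep ratio corresponds to a roughly 20x pruning rate."""
    # w_real, w_imag: (out_ch, in_ch, k, k) real/imaginary weight parts
    scores = np.abs(w_real).sum(axis=(1, 2, 3)) + np.abs(w_imag).sum(axis=(1, 2, 3))
    n_keep = max(1, int(round(keep_ratio * w_real.shape[0])))
    keep = np.sort(np.argsort(scores)[-n_keep:])   # indices of channels to retain
    return w_real[keep], w_imag[keep], keep

# Example: prune a randomly initialized 64-channel layer down to ~3 channels.
w_r = np.random.randn(64, 32, 3, 3)
w_i = np.random.randn(64, 32, 3, 3)
w_r_p, w_i_p, kept = prune_output_channels(w_r, w_i)
print(w_r_p.shape, kept)
```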
Binarized neural networks (BNNs) show great promise in edge-side applications with resource-limited hardware, but raise concerns of reduced accuracy. Motivated by complex neural networks, in this paper we introduce complex representations into BNNs and propose the binary complex neural network (BCNN), a novel network design that processes binary complex inputs and weights through complex convolution while still harvesting the extraordinary computational efficiency of BNNs. To ensure a fast convergence rate, we propose a novel BCNN-based batch normalization function and weight initialization function. Experimental results on CIFAR-10 and ImageNet using state-of-the-art network models (e.g., ResNet, ResNetE and NIN) show that BCNN achieves better accuracy than the original BNN models. BCNN improves on BNNs by strengthening their learning capability through complex representations and by extending their applicability to complex-valued input data. The source code of BCNN will be released on GitHub.
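The complex convolution that BCNN builds on can be illustrated with the standard complex product (a + bi)(c + di) = (ac - bd) + (ad + bc)i, realized as four real convolutions over binarized operands; each of those real convolutions is the part that maps onto XNOR/popcount hardware. The sketch below is a minimal NumPy illustration under that assumption, with a toy conv2d_valid helper rather than the authors' implementation.

```python
# Minimal sketch of a binary complex convolution, assuming inputs and weights
# are already binarized to {-1, +1}. Illustrative only, not the paper's code.
import numpy as np

def conv2d_valid(x, w):
    """Naive 'valid' 2D correlation of a single-channel map x with kernel w."""
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def binary_complex_conv(x_re, x_im, w_re, w_im):
    # (a + bi)(c + di) = (ac - bd) + (ad + bc)i, accumulated by convolution
    real = conv2d_valid(x_re, w_re) - conv2d_valid(x_im, w_im)
    imag = conv2d_valid(x_re, w_im) + conv2d_valid(x_im, w_re)
    return real, imag

# Example on a random binarized 8x8 input and 3x3 kernel.
sign = lambda a: np.where(a >= 0, 1.0, -1.0)
x_re, x_im = sign(np.random.randn(8, 8)), sign(np.random.randn(8, 8))
w_re, w_im = sign(np.random.randn(3, 3)), sign(np.random.randn(3, 3))
print(binary_complex_conv(x_re, x_im, w_re, w_im)[0].shape)  # (6, 6)
```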
Research has shown that deep neural networks contain significant redundancy, and thus that high classification accuracy can be achieved even when weights and activations are quantized down to binary values. Network binarization on FPGAs greatly increases area efficiency by replacing resource-hungry multipliers with lightweight XNOR gates. However, an FPGA's fundamental building block, the K-LUT, is capable of implementing far more than an XNOR: it can perform any K-input Boolean operation. Inspired by this observation, we propose LUTNet, an end-to-end hardware-software framework for the construction of area-efficient FPGA-based neural network accelerators using the native LUTs as inference operators. We describe the realization of both unrolled and tiled LUTNet architectures, with the latter enabling smaller, less power-hungry deployment than the former at the cost of area efficiency, energy efficiency and throughput. For both varieties, we demonstrate that exploiting LUT flexibility allows for far heavier pruning than was possible in prior works, resulting in significant area savings while achieving comparable accuracy. Against the state-of-the-art binarized neural network implementation, we achieve up to twice the area efficiency for several standard network models when performing inference on popular datasets. We also demonstrate that even greater energy efficiency improvements are obtainable.
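A minimal sketch of the core LUTNet idea follows: a learned truth table with 2^K entries replaces the fixed XNOR between one weight and one activation, so a single lookup can realize any Boolean function of K binary inputs. The table contents below are random placeholders rather than trained LUTNet parameters.

```python
# Minimal sketch of a K-input LUT acting as an inference operator.
# The truth table is a random stand-in for learned LUT contents.
import numpy as np

K = 4
truth_table = np.random.randint(0, 2, size=2 ** K)   # one output bit per input pattern

def lut_eval(bits):
    """Evaluate the K-LUT on K binary inputs (0/1) by indexing its truth table."""
    index = 0
    for b in bits:                     # pack the K bits into an integer address
        index = (index << 1) | int(b)
    return truth_table[index]

# Example: feed four binary activations through the LUT.
print(lut_eval([1, 0, 1, 1]))
```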
The ever-growing computational demands of increasingly complex machine learning models frequently necessitate the use of powerful cloud-based infrastructure for their training. Binary neural networks are known to be promising candidates for on-device inference due to their extreme compute and memory savings over higher-precision alternatives. However, their existing training methods require the concurrent storage of high-precision activations for all layers, generally making learning on memory-constrained devices infeasible. In this paper, we demonstrate that the backward propagation operations needed for binary neural network training are strongly robust to quantization, thereby making on-the-edge learning with modern models a practical proposition. We introduce a low-cost binary neural network training strategy exhibiting sizable reductions in memory footprint and energy while inducing little to no accuracy loss versus Courbariaux & Bengio's standard approach. These resource decreases are primarily enabled through the retention of activations exclusively in binary format. Against the latter algorithm, our drop-in replacement sees coincident memory requirement and energy consumption drops of 2--6$\times$, while reaching similar test accuracy in comparable time, across a range of small-scale models trained to classify popular datasets. We also demonstrate from-scratch ImageNet training of a binarized ResNet-18, achieving a 3.12$\times$ memory reduction. Such savings will allow unnecessary cloud offloading to be avoided, reducing latency, increasing energy efficiency and safeguarding privacy.
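The sketch below illustrates, under simplifying assumptions, why retaining activations only in binary form can suffice when training a binarized layer: the weight gradient of a binary linear layer is formed from the binarized inputs alone, and the gradient passed to earlier layers uses a straight-through-style estimate. It is not the authors' training code; all names are illustrative.

```python
# Minimal sketch: backward pass of a binarized linear layer that stores only
# the binary activations from the forward pass. Illustrative, not the paper's code.
import numpy as np

sign = lambda a: np.where(a >= 0, 1.0, -1.0)

def binary_linear_forward(x, w):
    x_b, w_b = sign(x), sign(w)
    y = x_b @ w_b.T
    return y, x_b                      # retain only the binary activations

def binary_linear_backward(grad_y, x_b, w):
    grad_w = grad_y.T @ x_b            # weight gradient needs binary activations only
    grad_x = grad_y @ sign(w)          # straight-through-style estimate for earlier layers
    return grad_w, grad_x

# Example: one forward/backward step for a 4-sample batch.
x, w = np.random.randn(4, 16), np.random.randn(8, 16)
y, x_b = binary_linear_forward(x, w)
grad_w, grad_x = binary_linear_backward(np.random.randn(4, 8), x_b, w)
print(grad_w.shape, grad_x.shape)
```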
Deep Neural Networks (DNNs) have revolutionized numerous applications, but the demand for ever more performance remains unabated. Scaling DNN computations to larger clusters is generally done by distributing tasks in batch mode using methods such as distributed synchronous SGD. Among the issues with this approach is that, to make the distributed cluster work with high utilization, the workload distributed to each node must be large, which implies nontrivial growth in the SGD mini-batch size. In this paper, we propose a framework called FPDeep, which uses a hybrid of model and layer parallelism to configure distributed reconfigurable clusters to train DNNs. This approach has numerous benefits. First, the design does not suffer from batch-size growth. Second, novel workload and weight partitioning leads to balanced loads of both across nodes. And third, the entire system is a fine-grained pipeline. This leads to high parallelism and utilization and also minimizes the time features need to be cached while waiting for back-propagation. As a result, storage demand is reduced to the point where only on-chip memory is used for the convolution layers. We evaluate FPDeep with the AlexNet, VGG-16, and VGG-19 benchmarks. Experimental results show that FPDeep scales well to a large number of FPGAs, with the limiting factor being the FPGA-to-FPGA bandwidth. With 6 transceivers per FPGA, FPDeep shows linear scaling up to 83 FPGAs. Energy efficiency is evaluated with respect to GOPs/J; FPDeep provides, on average, 6.36$\times$ higher energy efficiency than comparable GPU servers.
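As a hedged illustration of the balanced workload partitioning idea, the sketch below splits per-layer compute across a number of devices so that each carries roughly an equal share, allowing a single layer to span devices as in fine-grained pipelining. The layer FLOP counts and device count are placeholders, not FPDeep's actual partitioning algorithm.

```python
# Minimal sketch of balanced, fine-grained workload partitioning across devices.
# Inputs are illustrative; this is not FPDeep's partitioner.
def partition_layers(layer_flops, n_devices):
    """Return, per device, a list of (layer_index, fraction_of_layer) pairs."""
    total = sum(layer_flops)
    bounds = [total * d / n_devices for d in range(n_devices + 1)]
    assignment, start = [[] for _ in range(n_devices)], 0.0
    for i, flops in enumerate(layer_flops):
        end = start + flops
        for d in range(n_devices):
            # Overlap of this layer's compute interval with device d's quota
            overlap = min(end, bounds[d + 1]) - max(start, bounds[d])
            if overlap > 1e-9:
                assignment[d].append((i, overlap / flops))
        start = end
    return assignment

# Example: five layers split across three devices, each taking ~1/3 of the work.
print(partition_layers([120, 60, 200, 80, 40], 3))
```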
In this paper, we propose a novel progressive parameter pruning method for Convolutional Neural Network acceleration, named Structured Probabilistic Pruning (SPP), which effectively prunes weights of convolutional layers in a probabilistic manner. Unlike existing deterministic pruning approaches, where unimportant weights are permanently eliminated, SPP introduces a pruning probability for each weight, and pruning is guided by sampling from these probabilities. A mechanism is designed to increase and decrease the pruning probabilities based on importance criteria during training. Experiments show that, with a 4$\times$ speedup, SPP can accelerate AlexNet with only 0.3% loss of top-5 accuracy and VGG-16 with 0.8% loss of top-5 accuracy in ImageNet classification. Moreover, SPP can be directly applied to accelerate multi-branch CNNs, such as ResNet, without specific adaptations. Our 2$\times$-accelerated ResNet-50 suffers only 0.8% loss of top-5 accuracy on ImageNet. We further show the effectiveness of SPP on transfer learning tasks.
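The sketch below illustrates one probabilistic pruning step of the kind described above: every weight carries a pruning probability, a binary mask is sampled from those probabilities each iteration, and the probabilities are nudged up for weights ranked unimportant (here, by magnitude) and down otherwise. The update rule and hyper-parameters are illustrative stand-ins, not SPP's exact formulation.

```python
# Minimal sketch of a probabilistic pruning step: sample a mask from per-weight
# pruning probabilities and adjust the probabilities by an importance criterion.
# Hyper-parameters and the magnitude criterion are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)

def spp_step(w, prune_prob, target_ratio=0.75, delta=0.05):
    # Importance criterion: the smallest `target_ratio` fraction of weights
    # by magnitude is treated as unimportant in this iteration.
    threshold = np.quantile(np.abs(w), target_ratio)
    unimportant = np.abs(w) < threshold
    # Increase pruning probability for unimportant weights, decrease otherwise.
    prune_prob = np.clip(prune_prob + np.where(unimportant, delta, -delta), 0.0, 1.0)
    # Sample a mask from the probabilities and apply it for this iteration only.
    mask = rng.random(w.shape) >= prune_prob
    return w * mask, prune_prob

# Example: one pruning step on a random bank of 3x3 convolution kernels.
w = rng.standard_normal((16, 8, 3, 3))
p = np.zeros(w.shape)
w_masked, p = spp_step(w, p)
print(float((w_masked == 0).mean()))   # fraction of weights masked this step
```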