Interstellar: Using Halide's Scheduling Language to Analyze DNN Accelerators

Added by Xuan Yang

Publication date 2018

Field: Informatics Engineering

Research language: English

Authors Xuan Yang - Mingyu Gao - Qiaoyi Liu

Distributed Parallel and Cluster Computing

We show that DNN accelerator micro-architectures and their program mappings represent specific choices of loop order and hardware parallelism for computing the seven nested loops of DNNs, which enables us to create a formal taxonomy of all existing dense DNN accelerators. Surprisingly, the loop transformations needed to create these hardware variants can be precisely and concisely represented by Halide's scheduling language. By modifying the Halide compiler to generate hardware, we create a system that can fairly compare these prior accelerators. As long as proper loop blocking schemes are used, and the hardware can support mapping replicated loops, many different hardware dataflows yield similar energy efficiency with good performance. This is because the loop blocking can ensure that most data references stay on-chip with good locality and the processing units have high resource utilization. How resources are allocated, especially in the memory system, has a large impact on energy and performance. By optimizing hardware resource allocation while keeping throughput constant, we achieve up to 4.2X energy improvement for Convolutional Neural Networks (CNNs), and 1.6X and 1.8X improvement for Long Short-Term Memories (LSTMs) and multi-layer perceptrons (MLPs), respectively.
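To make the "seven nested loops" and "loop blocking" in the abstract concrete, here is a minimal Python sketch. It is not the authors' Halide code or their hardware generator. The loop names (B, K, C, Y, X, FY, FX), the stride-1, no-padding convolution, and the block sizes `kb` and `cb` are assumptions chosen for illustration. The blocked version reorders and tiles the output-channel and input-channel loops but computes the same result, which is the sense in which a schedule changes the loop nest without changing the algorithm.

```python
def zeros4(d0, d1, d2, d3):
    """Build a 4-D nested list of integer zeros."""
    return [[[[0] * d3 for _ in range(d2)] for _ in range(d1)] for _ in range(d0)]


def conv_naive(inp, w, B, K, C, Y, X, FY, FX):
    """Seven-loop convolution, stride 1, no padding.

    inp has shape [B][C][Y+FY-1][X+FX-1]; w has shape [K][C][FY][FX].
    """
    out = zeros4(B, K, Y, X)
    for b in range(B):                      # batch
        for k in range(K):                  # output channels
            for c in range(C):              # input channels
                for y in range(Y):          # output rows
                    for x in range(X):      # output columns
                        for fy in range(FY):        # filter rows
                            for fx in range(FX):    # filter columns
                                out[b][k][y][x] += (
                                    inp[b][c][y + fy][x + fx] * w[k][c][fy][fx]
                                )
    return out


def conv_blocked(inp, w, B, K, C, Y, X, FY, FX, kb=2, cb=2):
    """Same computation with the K and C loops blocked (tiled).

    kb and cb are hypothetical block sizes. Each (k0, c0) block touches only
    kb * cb filters, so that working set could stay in a small local buffer.
    """
    out = zeros4(B, K, Y, X)
    for b in range(B):
        for k0 in range(0, K, kb):          # outer loop over K blocks
            for c0 in range(0, C, cb):      # outer loop over C blocks
                for y in range(Y):
                    for x in range(X):
                        for k in range(k0, min(k0 + kb, K)):      # inner K
                            for c in range(c0, min(c0 + cb, C)):  # inner C
                                for fy in range(FY):
                                    for fx in range(FX):
                                        out[b][k][y][x] += (
                                            inp[b][c][y + fy][x + fx]
                                            * w[k][c][fy][fx]
                                        )
    return out
```

Because the sketch uses integer arithmetic, the two functions return identical nested lists for any block sizes; with floating-point data the reordering could change rounding slightly.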

A Sum-of-Ratios Multi-Dimensional-Knapsack Decomposition for DNN Resource Scheduling

Menglu Yu, Chuan Wu, Bo Ji (2021)

In recent years, to sustain the resource-intensive computational needs of training deep neural networks (DNNs), it has become widely accepted that exploiting the parallelism in large-scale computing clusters is critical for the efficient deployment of DNN training jobs. However, existing resource schedulers for traditional computing clusters are not well suited for DNN training, which results in unsatisfactory job completion times. The limitations of these resource scheduling schemes motivate us to propose a new computing cluster resource scheduling framework that leverages the special layered structure of DNN jobs and significantly improves their job completion times. Our contributions in this paper are three-fold: i) We develop a new analytical model for resource scheduling that considers DNNs' layered structure, which enables us to analytically formulate the resource scheduling optimization problem for DNN training in computing clusters; ii) Based on the proposed analytical performance model, we then develop an efficient resource scheduling algorithm for the widely adopted parameter-server architecture, using a sum-of-ratios multi-dimensional-knapsack decomposition (SMD) method to offer a strong performance guarantee; iii) We conduct extensive numerical experiments to demonstrate the effectiveness of the proposed scheduling algorithm and its superior performance over the state of the art.

Distributed Parallel and Cluster Computing

Power-Based Attacks on Spatial DNN Accelerators

Ge Li, Mohit Tiwari (2021)

With the proliferation of DNN-based applications, the confidentiality of DNN models is an important commercial goal. Spatial accelerators, which parallelize matrix/vector operations, are used to improve the energy efficiency of DNN computation. Recently, model extraction attacks on simple accelerators, either with a single processing element or running a binarized network, were demonstrated using a methodology derived from differential power analysis (DPA) attacks on cryptographic devices. This paper investigates the vulnerability of realistic spatial accelerators using a general 8-bit number representation. We investigate two systolic array architectures with weight-stationary dataflow: (1) a 3 $\times$ 1 array for a dot-product operation, and (2) a 3 $\times$ 3 array for matrix-vector multiplication. Both are implemented on the SAKURA-G FPGA board. We show that both architectures are ultimately vulnerable. A conventional DPA succeeds fully on the 1D array, requiring 20K power measurements. However, the 2D array exhibits higher security even with 460K traces. We show that this is because the 2D array intrinsically entails multiple MACs simultaneously dependent on the same input. However, we find that a novel template-based DPA with multiple profiling phases is able to fully break the 2D array with only 40K traces. Corresponding countermeasures need to be investigated for spatial DNN accelerators.

Cryptography and Security Machine Learning

Bit Error Robustness for Energy-Efficient DNN Accelerators

David Stutz, Nandhini Chandramoorthy, Matthias Hein (2020)

Deep neural network (DNN) accelerators have received considerable attention in recent years because they save energy compared to mainstream hardware. Low-voltage operation of DNN accelerators reduces energy consumption significantly further, but causes bit-level failures in the memory storing the quantized DNN weights. In this paper, we show that a combination of robust fixed-point quantization, weight clipping, and random bit error training (RandBET) significantly improves robustness against random bit errors in (quantized) DNN weights. This leads to high energy savings from both low-voltage operation and low-precision quantization. Our approach generalizes across operating voltages and accelerators, as demonstrated on bit errors from profiled SRAM arrays. We also discuss why weight clipping alone is already quite an effective way to achieve robustness against bit errors. Moreover, we specifically discuss the trade-offs involved regarding accuracy, robustness and precision: without losing more than 1% in accuracy compared to a normally trained 8-bit DNN, we can reduce energy consumption on CIFAR-10 by 20%. Higher energy savings of, e.g., 30%, are possible at the cost of 2.5% accuracy, even for 4-bit DNNs.
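The error model described above, independent random bit flips in quantized weights, can be sketched in a few lines of Python. This is an illustration only and not the RandBET training procedure itself. The function names, the unsigned 8-bit integer codes, and the per-bit flip probability `p` are assumptions introduced here.

```python
import random


def clip(weights, w_max):
    """Clip every weight to the range [-w_max, w_max]."""
    return [max(-w_max, min(w_max, w)) for w in weights]


def inject_bit_errors(codes, p, bits=8, rng=None):
    """Flip each bit of every integer code independently with probability p.

    codes are assumed to be unsigned integers that fit in `bits` bits.
    A seeded generator is used by default so runs are repeatable.
    """
    rng = rng or random.Random(0)
    noisy = []
    for code in codes:
        mask = 0
        for bit in range(bits):
            if rng.random() < p:
                mask |= 1 << bit    # mark this bit position for flipping
        noisy.append(code ^ mask)   # XOR flips exactly the marked bits
    return noisy
```

In a training loop one would presumably apply `inject_bit_errors` to the quantized weights before the forward pass; how the paper integrates this step is not specified in the abstract.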

Machine Learning Hardware Architecture Cryptography and Security

Extending Sparse Tensor Accelerators to Support Multiple Compression Formats

Eric Qin, Geonhwa Jeong, William Won (2021)

Sparsity, which occurs in both scientific applications and Deep Learning (DL) models, has been a key target of optimization within recent ASIC accelerators due to the potential memory and compute savings. These applications use data stored in a variety of compression formats. We demonstrate that both the compactness of different compression formats and the compute efficiency of the algorithms enabled by them vary across tensor dimensions and amount of sparsity. Since DL and scientific workloads span across all sparsity regions, there can be numerous format combinations for optimizing memory and compute efficiency. Unfortunately, many proposed accelerators operate on one or two fixed format combinations. This work proposes hardware extensions to accelerators for supporting numerous format combinations seamlessly and demonstrates ~4X speedup over performing format

Distributed Parallel and Cluster Computing

Learning to Optimize DAG Scheduling in Heterogeneous Environment

Jinhong Luo, Xijun Li, Mingxuan Yuan (2021)

Directed Acyclic Graph (DAG) scheduling in a heterogeneous environment aims to assign on-the-fly jobs to a cluster of heterogeneous computing executors so as to minimize the makespan while meeting all scheduling requirements. The problem has attracted more attention than ever with the rapid development of heterogeneous cloud computing. Even a small reduction in the makespan of DAG scheduling can both bring large profits to service providers and raise the level of service for users. Although DAG scheduling plays an important role in the cloud computing industry, existing solutions still leave considerable room for improvement, especially in making use of topological dependencies between jobs. In this paper, we propose a task-duplication-based learning algorithm, called Lachesis, for the distributed DAG scheduling problem. Our approach first perceives the topological dependencies between jobs using a specially designed graph convolutional network (GCN) to select the task most likely to be executed. The task is then assigned to a specific executor, with consideration given to duplicating all its precedent tasks according to a sophisticated heuristic method. We have conducted extensive experiments over standard workload data to evaluate our solution. The experimental results suggest that the proposed algorithm can achieve up to a 26.7% reduction in makespan and a 35.2% improvement in speedup ratio over seven strong baseline algorithms, including state-of-the-art heuristic methods and a variety of deep reinforcement learning based algorithms.

Distributed Parallel and Cluster Computing