DNN-Chip Predictor: An Analytical Performance Predictor for DNN Accelerators with Various Dataflows and Hardware Architectures

305 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Yang Zhao

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Yang Zhao - Chaojian Li - Yue Wang

التعلم الآلي النظم الموزعة والتوازية والحوسبة العنقودية معالجة الإشارات

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

The recent breakthroughs in deep neural networks (DNNs) have spurred a tremendously increased demand for DNN accelerators. However, designing DNN accelerators is non-trivial as it often takes months/years and requires cross-disciplinary knowledge. To enable fast and effective DNN accelerator development, we propose DNN-Chip Predictor, an analytical performance predictor which can accurately predict DNN accelerators energy, throughput, and latency prior to their actual implementation. Our Predictor features two highlights: (1) its analytical performance formulation of DNN ASIC/FPGA accelerators facilitates fast design space exploration and optimization; and (2) it supports DNN accelerators with different algorithm-to-hardware mapping methods (i.e., dataflows) and hardware architectures. Experiment results based on 2 DNN models and 3 different ASIC/FPGA implementations show that our DNN-Chip Predictors predicted performance differs from those of chip measurements of FPGA/ASIC implementation by no more than 17.66% when using different DNN models, hardware architectures, and dataflows. We will release code upon acceptance.

قيم البحث

369 - Hyoukjun Kwon , Prasanth Chatarasi , Michael Pellauer 2018

The data partitioning and scheduling strategies used by DNN accelerators to leverage reuse and perform staging are known as dataflow, and they directly impact the performance and energy efficiency of DNN accelerator designs. An accelerator microarchi tecture dictates the dataflow(s) that can be employed to execute a layer or network. Selecting an optimal dataflow for a layer shape can have a large impact on utilization and energy efficiency, but there is a lack of understanding on the choices and consequences of dataflows, and of tools and methodologies to help architects explore the co-optimization design space. In this work, we first introduce a set of data-centric directives to concisely specify the space of DNN dataflows in a compilerfriendly form. We then show how these directives can be analyzed to infer various forms of reuse and to exploit them using hardware capabilities. We codify this analysis into an analytical cost model, MAESTRO (Modeling Accelerator Efficiency via Spatio-Temporal Reuse and Occupancy), that estimates various cost-benefit tradeoffs of a dataflow including execution time and energy efficiency for a DNN model and hardware configuration. We demonstrate the use of MAESTRO to drive a hardware design space exploration (DSE) experiment, which searches across 480M designs to identify 2.5M valid designs at an average rate of 0.17M designs per second, including Pareto-optimal throughput- and energy-optimized design points.

النظم الموزعة والتوازية والحوسبة العنقودية التعلم الآلي

Bit Error Robustness for Energy-Efficient DNN Accelerators

64 - David Stutz , Nandhini Chandramoorthy , Matthias Hein 2020

Deep neural network (DNN) accelerators received considerable attention in past years due to saved energy compared to mainstream hardware. Low-voltage operation of DNN accelerators allows to further reduce energy consumption significantly, however, ca uses bit-level failures in the memory storing the quantized DNN weights. In this paper, we show that a combination of robust fixed-point quantization, weight clipping, and random bit error training (RandBET) improves robustness against random bit errors in (quantized) DNN weights significantly. This leads to high energy savings from both low-voltage operation as well as low-precision quantization. Our approach generalizes across operating voltages and accelerators, as demonstrated on bit errors from profiled SRAM arrays. We also discuss why weight clipping alone is already a quite effective way to achieve robustness against bit errors. Moreover, we specifically discuss the involved trade-offs regarding accuracy, robustness and precision: Without losing more than 1% in accuracy compared to a normally trained 8-bit DNN, we can reduce energy consumption on CIFAR-10 by 20%. Higher energy savings of, e.g., 30%, are possible at the cost of 2.5% accuracy, even for 4-bit DNNs.

التعلم الآلي هندسة العتاد التشفير والأمن

METRO: A Software-Hardware Co-Design of Interconnections for Spatial DNN Accelerators

414 - Zhao Wang , Guangyu Sun , Jingchen Zhu 2021

Tiled spatial architectures have proved to be an effective solution to build large-scale DNN accelerators. In particular, interconnections between tiles are critical for high performance in these tile-based architectures. In this work, we identify th e inefficiency of the widely used traditional on-chip networks and the opportunity of software-hardware co-design. We propose METRO with the basic idea of decoupling the traffic scheduling policies from hardware fabrics and moving them to the software level. METRO contains two modules working in synergy: METRO software scheduling framework to coordinate the traffics and METRO hardware facilities to deliver the data based on software configurations. We evaluate the co-design using different flit sizes for synthetic study, illustrating its effectiveness under various hardware resource constraints, in addition to a wide range of DNN models selected from real-world workloads. The results show that METRO achieves 56.3% communication speedup on average and up to 73.6% overall processing time reduction compared with traditional on-chip network designs.

هندسة العتاد

Memory-Efficient Pipeline-Parallel DNN Training

98 - Deepak Narayanan , Amar Phanishayee , Kaiyu Shi 2020

Many state-of-the-art ML results have been obtained by scaling up the number of parameters in existing models. However, parameters and activations for such large models often do not fit in the memory of a single accelerator device; this means that it is necessary to distribute training of large models over multiple accelerators. In this work, we propose PipeDream-2BW, a system that supports memory-efficient pipeline parallelism. PipeDream-2BW uses a novel pipelining and weight gradient coalescing strategy, combined with the double buffering of weights, to ensure high throughput, low memory footprint, and weight update semantics similar to data parallelism. In addition, PipeDream-2BW automatically partitions the model over the available hardware resources, while respecting hardware constraints such as memory capacities of accelerators and interconnect topologies. PipeDream-2BW can accelerate the training of large GPT and BERT language models by up to 20$times$ with similar final model accuracy.

التعلم الآلي النظم الموزعة والتوازية والحوسبة العنقودية التعلم الالي

Efficient Algorithms for Device Placement of DNN Graph Operators

120 - Jakub Tarnawski , Amar Phanishayee , Nikhil R. Devanur 2020

Modern machine learning workloads use large models, with complex structures, that are very expensive to execute. The devices that execute complex models are becoming increasingly heterogeneous as we see a flourishing of domain-specific accelerators b eing offered as hardware accelerators in addition to CPUs. These trends necessitate distributing the workload across multiple devices. Recent work has shown that significant gains can be obtained with model parallelism, i.e, partitioning a neural networks computational graph onto multiple devices. In particular, this form of parallelism assumes a pipeline of devices, which is fed a stream of samples and yields high throughput for training and inference of DNNs. However, for such settings (large models and multiple heterogeneous devices), we require automated algorithms and toolchains that can partition the ML workload across devices. In this paper, we identify and isolate the structured optimization problem at the core of device placement of DNN operators, for both inference and training, especially in modern pipelined settings. We then provide algorithms that solve this problem to optimality. We demonstrate the applicability and efficiency of our approaches using several contemporary DNN computation graphs.

التعلم الآلي النظم الموزعة والتوازية والحوسبة العنقودية التعلم الالي