
A Learned Performance Model for Tensor Processing Units

Posted by Samuel Kaufman
Publication date: 2020
Research field: Informatics Engineering
Paper language: English





Accurate hardware performance models are critical to efficient code generation. They can be used by compilers to make heuristic decisions, by superoptimizers as a minimization objective, or by autotuners to find an optimal configuration for a specific program. However, they are difficult to develop because contemporary processors are complex, and the recent proliferation of deep learning accelerators has increased the development burden. We demonstrate a method of learning performance models from a corpus of tensor computation graph programs for Tensor Processing Unit (TPU) instances. We show that our learned model outperforms a heavily-optimized analytical performance model on two tasks -- tile-size selection and operator fusion -- and that it helps an autotuner discover faster programs in a setting where access to TPUs is limited or expensive.
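The abstract leaves the model details to the paper (which learns a neural network over tensor computation graphs); the overall workflow, however, can be sketched: featurize each graph program, pair it with a measured runtime, and fit a regressor on the corpus. The opcode set, features, corpus, and linear model below are illustrative assumptions, not the paper's method.

```python
# Toy sketch of learning a performance model from a corpus of tensor
# computation graphs. The paper uses a neural network over compiler graphs;
# here we only aggregate per-op features and fit a linear model, to
# illustrate the overall workflow. All names and numbers are hypothetical.
import numpy as np

OPCODES = ["conv", "matmul", "add", "reduce"]  # hypothetical opcode set

def featurize_graph(ops):
    """ops: list of (opcode, flops, bytes_accessed) tuples for one program.
    Returns a fixed-length feature vector of per-opcode FLOP and byte totals."""
    feat = np.zeros(2 * len(OPCODES))
    for opcode, flops, nbytes in ops:
        i = OPCODES.index(opcode)
        feat[2 * i] += flops
        feat[2 * i + 1] += nbytes
    return feat

# Hypothetical training corpus: (graph, measured runtime in microseconds).
corpus = [
    ([("matmul", 2e9, 6e6), ("add", 1e6, 8e6)], 950.0),
    ([("conv", 4e9, 1.2e7), ("reduce", 5e5, 4e6)], 2100.0),
    ([("matmul", 1e9, 3e6), ("reduce", 2e5, 2e6)], 520.0),
]

X = np.stack([featurize_graph(g) for g, _ in corpus])
y = np.array([t for _, t in corpus])

# Least-squares fit of runtime against the aggregated graph features.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

new_graph = [("matmul", 3e9, 9e6), ("add", 2e6, 1.6e7)]
print("predicted runtime (us):", featurize_graph(new_graph) @ w)
```

A learned predictor of this kind can then stand in for hardware measurements inside a compiler heuristic or an autotuner's search loop, which is exactly the setting the abstract describes.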




Read also

Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015, that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X-30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X-80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
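As a sanity check on the quoted figures, the 92 TOPS peak follows directly from the size of the matrix unit and the chip's clock rate (the TPU paper reports a clock of roughly 700 MHz); the arithmetic below is only that back-of-the-envelope check, not a quote from the paper.

```python
# Back-of-the-envelope check of the quoted peak throughput: a 256x256
# systolic array holds 65,536 MACs, and each MAC counts as 2 ops
# (multiply + accumulate). The ~700 MHz clock is the published TPU figure.
macs = 256 * 256            # 65,536 8-bit MAC units
ops_per_mac = 2             # multiply + accumulate
clock_hz = 700e6            # reported TPU clock rate
peak_tops = macs * ops_per_mac * clock_hz / 1e12
print(f"peak throughput: {peak_tops:.1f} TOPS")  # ~91.8, i.e. the quoted 92 TOPS
```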
A computational fluid dynamics (CFD) simulation framework for predicting complex flows is developed on the Tensor Processing Unit (TPU) platform. The TPU architecture features accelerated dense matrix multiplication, large high-bandwidth memory, and a fast inter-chip interconnect, which makes it attractive for high-performance scientific computing. The CFD framework solves the variable-density Navier-Stokes equations using a low-Mach approximation, and the governing equations are discretized by a finite difference method on a collocated structured mesh. It uses the graph-based TensorFlow as the programming paradigm. The accuracy and performance of this framework are studied both numerically and analytically, with a specific focus on the effects of TPU-native single-precision floating-point arithmetic on solution accuracy. The algorithm and implementation are validated with canonical 2D and 3D Taylor-Green vortex simulations. To demonstrate the capability for simulating turbulent flows, simulations are conducted for two configurations: decaying homogeneous isotropic turbulence and a turbulent planar jet. Both simulations show good statistical agreement with reference solutions. The performance analysis shows linear weak scaling and super-linear strong scaling up to a full TPU v3 pod with 2048 cores.
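A minimal sketch of the programming style described here: a finite-difference stencil written as TensorFlow tensor ops, which is how such a framework can ride on TensorFlow's graph execution and TPU backends. The stencil choice, grid size, and test field below are illustrative assumptions, not taken from the framework.

```python
# Minimal sketch (not the framework's code): a second-order central-difference
# Laplacian on a periodic 2D mesh written as TensorFlow ops, illustrating how
# finite-difference stencils map onto the graph-based TensorFlow paradigm.
import math
import tensorflow as tf

def laplacian(u, dx):
    # Periodic shifts along each axis fetch the neighbouring grid values.
    return (tf.roll(u, 1, axis=0) + tf.roll(u, -1, axis=0)
            + tf.roll(u, 1, axis=1) + tf.roll(u, -1, axis=1)
            - 4.0 * u) / dx**2

n = 64
dx = 1.0 / n
x = tf.linspace(0.0, 1.0 - dx, n)                    # periodic grid in [0, 1)
u = tf.sin(2.0 * math.pi * x)[:, None] * tf.ones([1, n])
print(laplacian(u, dx).shape)                        # (64, 64)
```

Note that the computation above runs in single precision by default, which is the accuracy question the abstract says the paper studies.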
The term performance portability has been used informally in computing to refer to a variety of notions, which generally include: 1) the ability to run one application across multiple hardware platforms; and 2) achieving some notional level of performance on these platforms. However, there has been a noticeable lack of consensus on the precise meaning of the term, and authors' conclusions regarding their success (or failure) in achieving performance portability have thus been subjective. Comparing one approach to performance portability with another has generally been marked by vague claims and verbose, qualitative explanations. This paper presents a concise definition of performance portability, along with a simple metric that accurately captures the performance and portability of an application across different platforms. The utility of this metric is then demonstrated by applying it retroactively to previous work.
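The metric commonly associated with this line of work is a harmonic mean of per-platform performance efficiencies that drops to zero if the application fails to run on any platform in the set. The sketch below is a hedged reading of that idea; the variable names and the choice of efficiency definition (e.g. achieved fraction of peak, or of best-known performance) are assumptions, not quotes from the paper.

```python
# Sketch of a harmonic-mean performance-portability metric (illustrative).
# `efficiencies` holds the per-platform performance efficiency of an
# application for a problem, or None where the application does not run.
def performance_portability(efficiencies):
    if any(e is None or e <= 0 for e in efficiencies):
        return 0.0  # unsupported on some platform: no portability credit
    return len(efficiencies) / sum(1.0 / e for e in efficiencies)

print(performance_portability([0.8, 0.5, 0.9]))   # harmonic mean, ~0.69
print(performance_portability([0.8, None, 0.9]))  # 0.0: fails on one platform
```

The harmonic mean is what makes the metric "accurately capture" both aspects: a single very poor platform drags the score down far more than an arithmetic mean would.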
The Kernel Polynomial Method (KPM) is one of the fast diagonalization methods used for simulations of quantum systems in condensed matter physics and chemistry. The algorithm is difficult to parallelize on a cluster computer or a supercomputer because of its fine-grained recursive calculations. This paper proposes an implementation of the KPM on recent graphics processing units (GPUs), where the recursive calculations can be parallelized in a massively parallel environment. The paper also presents performance evaluations with realistic simulation parameters: one case with increased computational intensity and one with increased memory usage. It concludes that the GPU implementation delivers much higher performance than the CPU and reduces the overall simulation time.
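The "fine-grained recursive calculations" are the Chebyshev recursion at the core of the KPM. A minimal NumPy sketch (not the paper's GPU code; matrix size and moment count are illustrative) shows why each step depends on the two previous vectors, so the parallelism lives inside each matrix-vector product rather than across recursion steps:

```python
# Chebyshev moments mu_n = <r| T_n(H) |r> via the KPM recursion
# v_{n+1} = 2 H v_n - v_{n-1}. H is assumed rescaled so its spectrum
# lies in [-1, 1], as the KPM requires.
import numpy as np

def kpm_moments(H, r, num_moments):
    v_prev, v_cur = r.copy(), H @ r          # T_0 r and T_1 r
    moments = [r @ v_prev, r @ v_cur]
    for _ in range(2, num_moments):
        v_prev, v_cur = v_cur, 2.0 * (H @ v_cur) - v_prev
        moments.append(r @ v_cur)
    return np.array(moments)

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n))
H = (A + A.T) / 2
H /= 1.1 * np.max(np.abs(np.linalg.eigvalsh(H)))   # rescale spectrum into [-1, 1]
r = rng.standard_normal(n)
r /= np.linalg.norm(r)
print(kpm_moments(H, r, 8))
```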
Many cloud service providers (CSPs) provide on-demand service at a price with a small delay. We propose a QoS-differentiated model in which multiple SLAs deliver both on-demand service for latency-critical users and delayed services for delay-tolerant users at lower prices. Two architectures are considered to fulfill the SLAs. The first is based on priority queues; the second simply separates servers into multiple modules, each dedicated to one SLA. As an ecosystem, we show that the proposed framework is dominant-strategy incentive compatible. Although the first architecture appears more prevalent in the literature, we prove the superiority of the second, under which we further leverage queueing theory to determine the optimal SLA delays and prices. Finally, the viability of the proposed framework is validated through numerical comparison with the on-demand service, showing a revenue improvement in excess of 200%. Our results can help CSPs design optimal delay-differentiated services and choose appropriate serving architectures.
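As a rough illustration of the queueing reasoning, not the paper's actual model: under the server-separation architecture each SLA class can be analyzed as its own queue, and with an M/M/1 approximation the mean response time of a module is 1/(mu - lambda), so how the total capacity is split across classes directly sets the delay each SLA can promise. The arrival and service rates below are made up.

```python
# Illustrative M/M/1 approximation per SLA module (not the paper's model).
# Mean response time = 1 / (service_rate - arrival_rate).
def mm1_response_time(arrival_rate, service_rate):
    if arrival_rate >= service_rate:
        raise ValueError("queue is unstable: arrival rate >= service rate")
    return 1.0 / (service_rate - arrival_rate)

# Hypothetical split: the latency-critical class gets more capacity per unit load.
print(mm1_response_time(arrival_rate=8.0, service_rate=12.0))  # 0.25 (fast SLA)
print(mm1_response_time(arrival_rate=8.0, service_rate=9.0))   # 1.0  (delayed SLA)
```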
