High-Performance Deep Learning via a Single Building Block

392 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Evangelos Georganas

تاريخ النشر 2019

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Evangelos Georganas - Kunal Banerjee - Dhiraj Kalamkar

التعلم الآلي النظم الموزعة والتوازية والحوسبة العنقودية التعلم الالي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Deep learning (DL) is one of the most prominent branches of machine learning. Due to the immense computational cost of DL workloads, industry and academia have developed DL libraries with highly-specialized kernels for each workload/architecture, leading to numerous, complex code-bases that strive for performance, yet they are hard to maintain and do not generalize. In this work, we introduce the batch-reduce GEMM kernel and show how the most popular DL algorithms can be formulated with this kernel as the basic building-block. Consequently, the DL library-development degenerates to mere (potentially automatic) tuning of loops around this sole optimized kernel. By exploiting our new kernel we implement Recurrent Neural Networks, Convolution Neural Networks and Multilayer Perceptron training and inference primitives in just 3K lines of high-level code. Our primitives outperform vendor-optimized libraries on multi-node CPU clusters, and we also provide proof-of-concept CNN kernels targeting GPUs. Finally, we demonstrate that the batch-reduce GEMM kernel within a tensor compiler yields high-performance CNN primitives, further amplifying the viability of our approach.

قيم البحث

83 - Georgios Damaskinos , Rachid Guerraoui , Anne-Marie Kermarrec 2020

Federated Learning (FL) is very appealing for its privacy benefits: essentially, a global model is trained with updates computed on mobile devices while keeping the data of users local. Standard FL infrastructures are however designed to have no ener gy or performance impact on mobile devices, and are therefore not suitable for applications that require frequent (online) model updates, such as news recommenders. This paper presents FLeet, the first Online FL system, acting as a middleware between the Android OS and the machine learning application. FLeet combines the privacy of Standard FL with the precision of online learning thanks to two core components: (i) I-Prof, a new lightweight profiler that predicts and controls the impact of learning tasks on mobile devices, and (ii) AdaSGD, a new adaptive learning algorithm that is resilient to delayed updates. Our extensive evaluation shows that Online FL, as implemented by FLeet, can deliver a 2.3x quality boost compared to Standard FL, while only consuming 0.036% of the battery per day. I-Prof can accurately control the impact of learning tasks by improving the prediction accuracy up to 3.6x (computation time) and up to 19x (energy). AdaSGD outperforms alternative FL approaches by 18.4% in terms of convergence speed on heterogeneous data.

التعلم الآلي النظم الموزعة والتوازية والحوسبة العنقودية التعلم الالي

Robust Training in High Dimensions via Block Coordinate Geometric Median Descent

105 - Anish Acharya , Abolfazl Hashemi , Prateek Jain 2021

Geometric median (textsc{Gm}) is a classical method in statistics for achieving a robust estimation of the uncorrupted data; under gross corruption, it achieves the optimal breakdown point of 0.5. However, its computational complexity makes it infeas ible for robustifying stochastic gradient descent (SGD) for high-dimensional optimization problems. In this paper, we show that by applying textsc{Gm} to only a judiciously chosen block of coordinates at a time and using a memory mechanism, one can retain the breakdown point of 0.5 for smooth non-convex problems, with non-asymptotic convergence rates comparable to the SGD with textsc{Gm}.

التعلم الآلي النظم الموزعة والتوازية والحوسبة العنقودية التحسين والتحكم

PyTorch: An Imperative Style, High-Performance Deep Learning Library

183 - Adam Paszke , Sam Gross , Francisco Massa 2019

Deep learning frameworks have often focused on either usability or speed, but not both. PyTorch is a machine learning library that shows that these two goals are in fact compatible: it provides an imperative and Pythonic programming style that suppor ts code as a model, makes debugging easy and is consistent with other popular scientific computing libraries, while remaining efficient and supporting hardware accelerators such as GPUs. In this paper, we detail the principles that drove the implementation of PyTorch and how they are reflected in its architecture. We emphasize that every aspect of PyTorch is a regular Python program under the full control of its user. We also explain how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance. We demonstrate the efficiency of individual subsystems, as well as the overall speed of PyTorch on several common benchmarks.

التعلم الآلي البرمجيات الرياضية التعلم الالي

Accelerating Deep Learning Inference via Learned Caches

74 - Arjun Balasubramanian , Adarsh Kumar , Yuhan Liu 2021

Deep Neural Networks (DNNs) are witnessing increased adoption in multiple domains owing to their high accuracy in solving real-world problems. However, this high accuracy has been achieved by building deeper networks, posing a fundamental challenge t o the low latency inference desired by user-facing applications. Current low latency solutions trade-off on accuracy or fail to exploit the inherent temporal locality in prediction serving workloads. We observe that caching hidden layer outputs of the DNN can introduce a form of late-binding where inference requests only consume the amount of computation needed. This enables a mechanism for achieving low latencies, coupled with an ability to exploit temporal locality. However, traditional caching approaches incur high memory overheads and lookup latencies, leading us to design learned caches - caches that consist of simple ML models that are continuously updated. We present the design of GATI, an end-to-end prediction serving system that incorporates learned caches for low-latency DNN inference. Results show that GATI can reduce inference latency by up to 7.69X on realistic workloads.

التعلم الآلي النظم الموزعة والتوازية والحوسبة العنقودية الأداء

No Peek: A Survey of private distributed deep learning

127 - Praneeth Vepakomma , Tristan Swedish , Ramesh Raskar 2018

We survey distributed deep learning models for training or inference without accessing raw data from clients. These methods aim to protect confidential patterns in data while still allowing servers to train models. The distributed deep learning metho ds of federated learning, split learning and large batch stochastic gradient descent are compared in addition to private and secure approaches of differential privacy, homomorphic encryption, oblivious transfer and garbled circuits in the context of neural networks. We study their benefits, limitations and trade-offs with regards to computational resources, data leakage and communication efficiency and also share our anticipated future trends.

التعلم الآلي النظم الموزعة والتوازية والحوسبة العنقودية التعلم الالي