Towards Demystifying Serverless Machine Learning Training

278 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Jiawei Jiang

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Jiawei Jiang - Shaoduo Gan - Yue Liu

النظم الموزعة والتوازية والحوسبة العنقودية التعلم الآلي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

The appeal of serverless (FaaS) has triggered a growing interest on how to use it in data-intensive applications such as ETL, query processing, or machine learning (ML). Several systems exist for training large-scale ML models on top of serverless infrastructures (e.g., AWS Lambda) but with inconclusive results in terms of their performance and relative advantage over serverful infrastructures (IaaS). In this paper we present a systematic, comparative study of distributed ML training over FaaS and IaaS. We present a design space covering design choices such as optimization algorithms and synchronization protocols, and implement a platform, LambdaML, that enables a fair comparison between FaaS and IaaS. We present experimental results using LambdaML, and further develop an analytic model to capture cost/performance tradeoffs that must be considered when opting for a serverless infrastructure. Our results indicate that ML training pays off in serverless only for models with efficient (i.e., reduced) communication and that quickly converge. In general, FaaS can be much faster but it is never significantly cheaper than IaaS.

قيم البحث

109 - Malte S. Kurz 2021

This paper explores serverless cloud computing for double machine learning. Being based on repeated cross-fitting, double machine learning is particularly well suited to exploit the high level of parallelism achievable with serverless computing. It a llows to get fast on-demand estimations without additional cloud maintenance effort. We provide a prototype Python implementation texttt{DoubleML-Serverless} for the estimation of double machine learning models with the serverless computing platform AWS Lambda and demonstrate its utility with a case study analyzing estimation times and costs.

النظم الموزعة والتوازية والحوسبة العنقودية التعلم الآلي التعلم الالي

Harvesting Idle Resources in Serverless Computing via Reinforcement Learning

430 - Hanfei Yu , Hao Wang , Jian Li 2021

Serverless computing has become a new cloud computing paradigm that promises to deliver high cost-efficiency and simplified cloud deployment with automated resource scaling at a fine granularity. Users decouple a cloud application into chained functi ons and preset each serverless functions memory and CPU demands at megabyte-level and core-level, respectively. Serverless platforms then automatically scale the number of functions to accommodate the workloads. However, the complexities of chained functions make it non-trivial to accurately determine the resource demands of each function for users, leading to either resource over-provision or under-provision for individual functions. This paper presents FaaSRM, a new resource manager (RM) for serverless platforms that maximizes resource efficiency by dynamically harvesting idle resources from functions over-supplied to functions under-supplied. FaaSRM monitors each functions resource utilization in real-time, detects over-provisioning and under-provisioning, and applies deep reinforcement learning to harvest idle resources safely using a safeguard mechanism and accelerate functions efficiently. We have implemented and deployed a FaaSRM prototype in a 13-node Apache OpenWhisk cluster. Experimental results on the OpenWhisk cluster show that FaaSRM reduces the execution time of 98% of function invocations by 35.81% compared to the baseline RMs by harvesting idle resources from 38.8% of the invocations and accelerating 39.2% of the invocations.

النظم الموزعة والتوازية والحوسبة العنقودية التعلم الآلي

Byzantine-Tolerant Machine Learning

83 - Peva Blanchard , El Mahdi El Mhamdi , Rachid Guerraoui 2017

The growth of data, the need for scalability and the complexity of models used in modern machine learning calls for distributed implementations. Yet, as of today, distributed machine learning frameworks have largely ignored the possibility of arbitra ry (i.e., Byzantine) failures. In this paper, we study the robustness to Byzantine failures at the fundamental level of stochastic gradient descent (SGD), the heart of most machine learning algorithms. Assuming a set of $n$ workers, up to $f$ of them being Byzantine, we ask how robust can SGD be, without limiting the dimension, nor the size of the parameter space. We first show that no gradient descent update rule based on a linear combination of the vectors proposed by the workers (i.e, current approaches) tolerates a single Byzantine failure. We then formulate a resilience property of the update rule capturing the basic requirements to guarantee convergence despite $f$ Byzantine workers. We finally propose Krum, an update rule that satisfies the resilience property aforementioned. For a $d$-dimensional learning problem, the time complexity of Krum is $O(n^2 cdot (d + log n))$.

النظم الموزعة والتوازية والحوسبة العنقودية التعلم الآلي الحوسبة العصبية والتطورية

Towards Scalable Distributed Training of Deep Learning on Public Cloud Clusters

226 - Shaohuai Shi , Xianhao Zhou , Shutao Song 2020

Distributed training techniques have been widely deployed in large-scale deep neural networks (DNNs) training on dense-GPU clusters. However, on public cloud clusters, due to the moderate inter-connection bandwidth between instances, traditional stat e-of-the-art distributed training systems cannot scale well in training large-scale models. In this paper, we propose a new computing and communication efficient top-k sparsification communication library for distributed training. To further improve the system scalability, we optimize I/O by proposing a simple yet efficient multi-level data caching mechanism and optimize the update operation by introducing a novel parallel tensor operator. Experimental results on a 16-node Tencent Cloud cluster (each node with 8 Nvidia Tesla V100 GPUs) show that our system achieves 25%-40% faster than existing state-of-the-art systems on CNNs and Transformer. We finally break the record on DAWNBench on training ResNet-50 to 93% top-5 accuracy on ImageNet.

النظم الموزعة والتوازية والحوسبة العنقودية الذكاء الاصطناعي

Towards Distributed Machine Learning in Shared Clusters: A Dynamically-Partitioned Approach

110 - Peng Sun , Yonggang Wen , Ta Nguyen Binh Duong 2017

Many cluster management systems (CMSs) have been proposed to share a single cluster with multiple distributed computing systems. However, none of the existing approaches can handle distributed machine learning (ML) workloads given the following crite ria: high resource utilization, fair resource allocation and low sharing overhead. To solve this problem, we propose a new CMS named Dorm, incorporating a dynamically-partitioned cluster management mechanism and an utilization-fairness optimizer. Specifically, Dorm uses the container-based virtualization technique to partition a cluster, runs one application per partition, and can dynamically resize each partition at application runtime for resource efficiency and fairness. Each application directly launches its tasks on the assigned partition without petitioning for resources frequently, so Dorm imposes flat sharing overhead. Extensive performance evaluations showed that Dorm could simultaneously increase the resource utilization by a factor of up to 2.32, reduce the fairness loss by a factor of up to 1.52, and speed up popular distributed ML applications by a factor of up to 2.72, compared to existing approaches. Dorms sharing overhead is less than 5% in most cases.

النظم الموزعة والتوازية والحوسبة العنقودية