
Distributed Training of Deep Learning Models: A Taxonomic Perspective

Published by: Matthias Langer
Publication date: 2020
Research field: Informatics Engineering
Paper language: English





Distributed deep learning systems (DDLS) train deep neural network models by utilizing the distributed resources of a cluster. Developers of DDLS must make many decisions to process their particular workloads efficiently in their chosen environment. The advent of GPU-based deep learning and the ever-increasing size of datasets and deep neural network models, combined with the bandwidth constraints of cluster environments, require developers of DDLS to be innovative in order to train high-quality models quickly. Comparing DDLS side-by-side is difficult due to their extensive feature lists and architectural deviations. We aim to shed light on the fundamental principles at work when training deep neural networks in a cluster of independent machines by analyzing the general properties of deep learning workloads and how such workloads can be distributed across a cluster to achieve collaborative model training. In doing so, we provide an overview of the techniques used by contemporary DDLS and discuss their influence and implications on the training process. To conceptualize and compare DDLS, we group the different techniques into categories, thus establishing a taxonomy of distributed deep learning systems.
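As a hedged illustration of the data-parallel pattern that recurs throughout this taxonomy, the toy Python sketch below replicates the parameters on several workers, lets each worker compute a gradient on its own shard of the minibatch, and averages the gradients before a synchronous update. The model, shapes, and learning rate are illustrative assumptions, not anything from the paper.

```python
# Toy data-parallel synchronous SGD: each worker computes a gradient on
# its shard of the minibatch; gradients are averaged before the update.
import numpy as np

def local_gradient(w, x_shard, y_shard):
    """Least-squares gradient on one worker's shard (toy linear model)."""
    pred = x_shard @ w
    return 2.0 * x_shard.T @ (pred - y_shard) / len(x_shard)

rng = np.random.default_rng(0)
w = np.zeros(4)                                # replicated parameters
X, y = rng.normal(size=(64, 4)), rng.normal(size=64)

num_workers, lr = 4, 0.05
for step in range(100):
    # each worker sees a disjoint shard of the global minibatch
    grads = [local_gradient(w, xs, ys)
             for xs, ys in zip(np.array_split(X, num_workers),
                               np.array_split(y, num_workers))]
    # synchronous aggregation (all-reduce / parameter-server average)
    w -= lr * np.mean(grads, axis=0)
```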


Read also

Scale of data and scale of computation infrastructures together enable the current deep learning renaissance. However, training large-scale deep architectures demands both algorithmic improvement and careful system configuration. In this paper, we focus on employing the system approach to speed up large-scale training. Via lessons learned from our routine benchmarking effort, we first identify bottlenecks and overheads that hinder data parallelism. We then devise guidelines that help practitioners configure an effective system and fine-tune parameters to achieve the desired speedup. Specifically, we develop a procedure for setting the minibatch size and choosing computation algorithms. We also derive lemmas for determining the quantity of key components, such as the number of GPUs and parameter servers. Experiments and examples show that these guidelines help effectively speed up large-scale deep learning training.
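The paper's actual lemmas are not reproduced here; the following back-of-the-envelope Python sketch only illustrates the kind of sizing calculation involved, estimating how many parameter servers are needed so that per-step gradient traffic fits within their aggregate bandwidth. The traffic model and all numbers are assumptions for illustration.

```python
# Hedged sizing sketch (not the paper's lemmas): how many parameter
# servers are needed to absorb the workers' gradient push/pull traffic?
import math

def min_parameter_servers(num_workers, model_bytes, steps_per_sec,
                          server_bandwidth_bytes):
    # each worker pushes a full gradient and pulls fresh parameters once
    # per step, so total traffic is 2 * workers * model_size * step_rate
    traffic = 2 * num_workers * model_bytes * steps_per_sec
    return math.ceil(traffic / server_bandwidth_bytes)

# e.g. 32 workers, a 100 MB model, 5 steps/s, 10 Gb/s per server
print(min_parameter_servers(32, 100e6, 5, 10e9 / 8))  # -> 26
```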
LingFei Dai, Boyu Diao, Chao Li (2021)
Distributed training is an effective way to accelerate the training of large-scale deep learning models. However, the parameter exchange and synchronization of distributed stochastic gradient descent introduce a large amount of communication overhead. Gradient compression is an effective method to reduce this overhead. Among synchronous SGD compression methods, many Top-$k$ sparsification based gradient compression methods have been proposed to reduce communication. However, the centralized approach based on parameter servers suffers from a single point of failure and limited scalability, while decentralized approaches with global parameter exchange may reduce the convergence rate of training. In contrast with Top-$k$ based methods, we propose a gradient compression method with global gradient vector sketching, named global-sketching SGD (gs-SGD), which uses the Count-Sketch structure to store gradients and thereby reduce the loss of accuracy during training. gs-SGD has better convergence efficiency on deep learning models and a communication complexity of $O(\log d \cdot \log P)$, where $d$ is the number of model parameters and $P$ is the number of workers. We conducted experiments on GPU clusters to verify that our method has better convergence efficiency than global Top-$k$ and sketching-based methods. In addition, gs-SGD achieves 1.3-3.1x higher throughput than gTop-$k$, and 1.1-1.2x higher throughput than the original Sketched-SGD.
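A minimal Count-Sketch illustration of the idea behind gs-SGD is sketched below: gradient coordinates are hashed into a small table with random signs and later recovered by a median query. Class and parameter names are invented for illustration and do not reflect the authors' implementation.

```python
# Toy Count-Sketch for gradient compression: hash each coordinate into a
# small table with a random sign, then recover it by a median-of-rows query.
import numpy as np

class CountSketch:
    def __init__(self, rows, cols, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.table = np.zeros((rows, cols))
        self.buckets = rng.integers(0, cols, size=(rows, dim))  # hash buckets
        self.signs = rng.choice([-1.0, 1.0], size=(rows, dim))  # random signs

    def add(self, grad):
        for r in range(len(self.table)):
            np.add.at(self.table[r], self.buckets[r], self.signs[r] * grad)

    def query(self):
        # signed table lookups, median across rows, estimate each coordinate
        est = self.signs * self.table[np.arange(len(self.table))[:, None],
                                      self.buckets]
        return np.median(est, axis=0)

grad = np.random.default_rng(1).normal(size=1000)
cs = CountSketch(rows=5, cols=200, dim=1000)
cs.add(grad)          # what a worker would communicate (the small table)
approx = cs.query()   # what the receiver approximately reconstructs
```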
Distributed training techniques have been widely deployed for large-scale deep neural network (DNN) training on dense-GPU clusters. However, on public cloud clusters, due to the moderate inter-connection bandwidth between instances, traditional state-of-the-art distributed training systems cannot scale well when training large-scale models. In this paper, we propose a new computation- and communication-efficient top-k sparsification communication library for distributed training. To further improve system scalability, we optimize I/O by proposing a simple yet efficient multi-level data caching mechanism and optimize the update operation by introducing a novel parallel tensor operator. Experimental results on a 16-node Tencent Cloud cluster (each node with 8 Nvidia Tesla V100 GPUs) show that our system is 25%-40% faster than existing state-of-the-art systems on CNNs and Transformer. We finally break the record on DAWNBench for training ResNet-50 to 93% top-5 accuracy on ImageNet.
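For context, the sketch below shows plain top-k gradient sparsification with local error feedback, the general technique such communication libraries build on; the function name and the residual buffer are illustrative assumptions, not this system's code.

```python
# Top-k sparsification with error feedback: keep only the k largest
# gradient entries for communication, accumulate the rest locally.
import numpy as np

def topk_sparsify(grad, residual, k):
    """Return indices/values of the k largest-magnitude entries and the new residual."""
    corrected = grad + residual                      # add back past error
    idx = np.argpartition(np.abs(corrected), -k)[-k:]
    values = corrected[idx]
    new_residual = corrected.copy()
    new_residual[idx] = 0.0                          # what was not sent
    return idx, values, new_residual

grad = np.random.default_rng(2).normal(size=10_000)
residual = np.zeros_like(grad)
idx, vals, residual = topk_sparsify(grad, residual, k=100)
# only (idx, vals) -- 1% of the gradient -- needs to be communicated
```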
We design and implement a distributed multinode synchronous SGD algorithm without altering hyperparameters, compressing data, or altering algorithmic behavior. We perform a detailed analysis of scaling and identify optimal design points for different networks. We demonstrate scaling of CNNs on hundreds of nodes and present what we believe to be record training throughputs. A 512-minibatch VGG-A CNN training run is scaled 90x on 128 nodes. 256-minibatch VGG-A and OverFeat-FAST networks are scaled 53x and 42x respectively on a 64-node cluster. We also demonstrate the generality of our approach via best-in-class 6.5x scaling for a 7-layer DNN on 16 nodes. Thereafter we attempt to democratize deep learning by training on an Ethernet-based AWS cluster, showing ~14x scaling on 16 nodes.
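The reported scaling figures translate into parallel efficiency via simple speedup-over-nodes arithmetic, as in the short sketch below (nothing paper-specific beyond the numbers quoted above).

```python
# Parallel efficiency = achieved speedup / number of nodes.
def scaling_efficiency(speedup, nodes):
    return speedup / nodes

for name, speedup, nodes in [("VGG-A, 512 minibatch", 90, 128),
                             ("VGG-A, 256 minibatch", 53, 64),
                             ("OverFeat-FAST",        42, 64),
                             ("7-layer DNN",          6.5, 16)]:
    print(f"{name}: {scaling_efficiency(speedup, nodes):.0%} efficiency")
```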
Deep learning emerges as an important new resource-intensive workload and has been successfully applied in computer vision, speech, natural language processing, and so on. Distributed deep learning is becoming a necessity to cope with growing data and model sizes. Its computation is typically characterized by a simple tensor data abstraction to model multi-dimensional matrices, a data-flow graph to model computation, and iterative executions with relatively frequent synchronizations, making it substantially different from Map/Reduce-style distributed big data computation. RPC, commonly used as the communication primitive, has been adopted by popular deep learning frameworks such as TensorFlow, which uses gRPC. We show that RPC is sub-optimal for distributed deep learning computation, especially on an RDMA-capable network. The tensor abstraction and data-flow graph, coupled with an RDMA network, offer the opportunity to remove unnecessary overhead (e.g., memory copies) without sacrificing programmability and generality. In particular, from a data-access point of view, a remote machine is abstracted simply as a device on an RDMA channel, with a simple memory interface for allocating, reading, and writing memory regions. Our graph analyzer examines both the data-flow graph and the tensors to optimize memory allocation and remote data access using this interface. The result is up to 25x speedup in representative deep learning benchmarks over the standard gRPC in TensorFlow, and up to 169% improvement even over an RPC implementation optimized for RDMA, leading to faster convergence in the training process.
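The memory-region abstraction described above can be pictured with the hypothetical Python sketch below, in which a remote machine behaves like a device offering allocate/read/write on registered regions; every class and method name here is invented for illustration and does not correspond to the paper's or any RDMA library's API.

```python
# Hypothetical stand-in for a remote machine exposed as a device with a
# simple allocate/read/write memory-region interface (illustration only).
from dataclasses import dataclass

@dataclass
class MemoryRegion:
    device: str    # e.g. a peer reachable over the RDMA channel
    address: int   # remote base address registered for direct access
    length: int    # size of the region in bytes

class RemoteDevice:
    """Toy stand-in for an RDMA-backed remote machine."""
    def __init__(self, name):
        self.name, self._mem, self._next = name, {}, 0

    def allocate(self, nbytes):
        region = MemoryRegion(self.name, self._next, nbytes)
        self._mem[region.address] = bytearray(nbytes)
        self._next += nbytes
        return region

    def write(self, region, payload: bytes):
        self._mem[region.address][:len(payload)] = payload   # one-sided write

    def read(self, region) -> bytes:
        return bytes(self._mem[region.address])               # one-sided read

# a graph analyzer could place a tensor's output buffer directly in the
# consumer's region, avoiding the intermediate copies an RPC layer would add
peer = RemoteDevice("worker-3")
buf = peer.allocate(16)
peer.write(buf, b"gradient-bytes!")
```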
