Advanced search powered by artificial intelligence

New community

Subscribe to the gold package and get unlimited access to Shamra Academy

ShadowSync: Performing Synchronization in the Background for Highly Scalable Distributed Training

88 0 0.0 ( 0 )

Download Cite

Added by Qinqing Zheng

Publication date 2020

fields Informatics Engineering

and research's language is English

Authors Qinqing Zheng - Bor-Yiing Su - Jiyan Yang

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

Recommendation systems are often trained with a tremendous amount of data, and distributed training is the workhorse to shorten the training time. While the training throughput can be increased by simply adding more workers, it is also increasingly challenging to preserve the model quality. In this paper, we present shadowsync, a distributed framework specifically tailored to modern scale recommendation system training. In contrast to previous works where synchronization happens as part of the training process, shadowsync separates the synchronization from training and runs it in the background. Such isolation significantly reduces the synchronization overhead and increases the synchronization frequency, so that we are able to obtain both high throughput and excellent model quality when training at scale. The superiority of our procedure is confirmed by experiments on training deep neural networks for click-through-rate prediction tasks. Our framework is capable to express data parallelism and/or model parallelism, generic to host various types of synchronization algorithms, and readily applicable to large scale problems in other areas.

rate research

DistGNN: Scalable Distributed Training for Large-Scale Graph Neural Networks

141 - Vasimuddin Md , Sanchit Misra , Guixiang Ma 2021

Full-batch training on Graph Neural Networks (GNN) to learn the structure of large graphs is a critical problem that needs to scale to hundreds of compute nodes to be feasible. It is challenging due to large memory capacity and bandwidth requirements on a single compute node and high communication volumes across multiple nodes. In this paper, we present DistGNN that optimizes the well-known Deep Graph Library (DGL) for full-batch training on CPU clusters via an efficient shared memory implementation, communication reduction using a minimum vertex-cut graph partitioning algorithm and communication avoidance using a family of delayed-update algorithms. Our results on four common GNN benchmark datasets: Reddit, OGB-Products, OGB-Papers and Proteins, show up to 3.7x speed-up using a single CPU socket and up to 97x speed-up using 128 CPU sockets, respectively, over baseline DGL implementations running on a single CPU socket

Machine Learning Distributed Parallel and Cluster Computing

Divide-and-Shuffle Synchronization for Distributed Machine Learning

102 - Weiyan Wang , Cengguang Zhang , Liu Yang 2020

Distributed Machine Learning suffers from the bottleneck of synchronization to all-reduce workers updates. Previous works mainly consider better network topology, gradient compression, or stale updates to speed up communication and relieve the bottleneck. However, all these works ignore the importance of reducing the scale of synchronized elements and inevitable serial executed operators. To address the problem, our work proposes the Divide-and-Shuffle Synchronization(DS-Sync), which divides workers into several parallel groups and shuffles group members. DS-Sync only synchronizes the workers in the same group so that the scale of a group is much smaller. The shuffle of workers maintains the algorithms convergence speed, which is interpreted in theory. Comprehensive experiments also show the significant improvements in the latest and popular models like Bert, WideResnet, and DeepFM on challenging datasets.

Machine Learning Distributed Parallel and Cluster Computing Machine Learning

Machine Learning Systems for Highly-Distributed and Rapidly-Growing Data

166 - Kevin Hsieh 2019

The usability and practicality of any machine learning (ML) applications are largely influenced by two critical but hard-to-attain factors: low latency and low cost. Unfortunately, achieving low latency and low cost is very challenging when ML depends on real-world data that are highly distributed and rapidly growing (e.g., data collected by mobile phones and video cameras all over the world). Such real-world data pose many challenges in communication and computation. For example, when training data are distributed across data centers that span multiple continents, communication among data centers can easily overwhelm the limited wide-area network bandwidth, leading to prohibitively high latency and high cost. In this dissertation, we demonstrate that the latency and cost of ML on highly-distributed and rapidly-growing data can be improved by one to two orders of magnitude by designing ML systems that exploit the characteristics of ML algorithms, ML model structures, and ML training/serving data. We support this thesis statement with three contributions. First, we design a system that provides both low-latency and low-cost ML serving (inferencing) over large-scale and continuously-growing datasets, such as videos. Second, we build a system that makes ML training over geo-distributed datasets as fast as training within a single data center. Third, we present a first detailed study and a system-level solution on a fundamental and largely overlooked problem: ML training over non-IID (i.e., not independent and identically distributed) data partitions (e.g., facial images collected by cameras varies according to the demographics of each cameras location).

Machine Learning Distributed Parallel and Cluster Computing Machine Learning

Tune: A Research Platform for Distributed Model Selection and Training

305 - Richard Liaw , Eric Liang , Robert Nishihara 2018

Modern machine learning algorithms are increasingly computationally demanding, requiring specialized hardware and distributed computation to achieve high performance in a reasonable time frame. Many hyperparameter search algorithms have been proposed for improving the efficiency of model selection, however their adaptation to the distributed compute environment is often ad-hoc. We propose Tune, a unified framework for model selection and training that provides a narrow-waist interface between training scripts and search algorithms. We show that this interface meets the requirements for a broad range of hyperparameter search algorithms, allows straightforward scaling of search to large clusters, and simplifies algorithm implementation. We demonstrate the implementation of several state-of-the-art hyperparameter search algorithms in Tune. Tune is available at http://ray.readthedocs.io/en/latest/tune.html.

Machine Learning Distributed Parallel and Cluster Computing Machine Learning

Convolutional Neural Network Training with Distributed K-FAC

386 - J. Gregory Pauloski , Zhao Zhang , Lei Huang 2020

Training neural networks with many processors can reduce time-to-solution; however, it is challenging to maintain convergence and efficiency at large scales. The Kronecker-factored Approximate Curvature (K-FAC) was recently proposed as an approximation of the Fisher Information Matrix that can be used in natural gradient optimizers. We investigate here a scalable K-FAC design and its applicability in convolutional neural network (CNN) training at scale. We study optimization techniques such as layer-wise distribution strategies, inverse-free second-order gradient evaluation, and dynamic K-FAC update decoupling to reduce training time while preserving convergence. We use residual neural networks (ResNet) applied to the CIFAR-10 and ImageNet-1k datasets to evaluate the correctness and scalability of our K-FAC gradient preconditioner. With ResNet-50 on the ImageNet-1k dataset, our distributed K-FAC implementation converges to the 75.9% MLPerf baseline in 18-25% less time than does the classic stochastic gradient descent (SGD) optimizer across scales on a GPU cluster.

Machine Learning Distributed Parallel and Cluster Computing Machine Learning

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

ShadowSync: Performing Synchronization in the Background for Highly Scalable Distributed Training

Ask ChatGPT about the research

No Arabic abstract

Read More

suggested questions