Machine Learning Systems for Highly-Distributed and Rapidly-Growing Data

Published by: Kevin Hsieh

Publication date: 2019

Research field: Informatics Engineering

Paper language: English

Author: Kevin Hsieh

Machine Learning; Distributed, Parallel, and Cluster Computing

The usability and practicality of any machine learning (ML) application are largely influenced by two critical but hard-to-attain factors: low latency and low cost. Unfortunately, achieving both is very challenging when ML depends on real-world data that are highly distributed and rapidly growing (e.g., data collected by mobile phones and video cameras all over the world). Such real-world data pose many challenges in communication and computation. For example, when training data are distributed across data centers that span multiple continents, communication among data centers can easily overwhelm the limited wide-area network bandwidth, leading to prohibitively high latency and high cost. In this dissertation, we demonstrate that the latency and cost of ML on highly-distributed and rapidly-growing data can be improved by one to two orders of magnitude by designing ML systems that exploit the characteristics of ML algorithms, ML model structures, and ML training/serving data. We support this thesis statement with three contributions. First, we design a system that provides both low-latency and low-cost ML serving (inference) over large-scale and continuously-growing datasets, such as videos. Second, we build a system that makes ML training over geo-distributed datasets as fast as training within a single data center. Third, we present a first detailed study of, and a system-level solution to, a fundamental and largely overlooked problem: ML training over non-IID (i.e., not independent and identically distributed) data partitions (e.g., facial images collected by cameras vary according to the demographics of each camera's location).
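To make the non-IID notion in the third contribution concrete, here is a minimal sketch in Python. It is not taken from the dissertation: the function names, the synthetic labels, and the sort-by-label scheme are assumptions chosen for illustration. It contrasts a label-skewed split, where each partition sees only a few classes, with a shuffled split that approximates IID partitions.

```python
import random
from collections import Counter

def label_skewed_partitions(labels, num_partitions):
    """Sort sample indices by label and cut contiguous chunks (hypothetical non-IID split)."""
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    size = len(order) // num_partitions  # any remainder samples are dropped
    return [order[p * size:(p + 1) * size] for p in range(num_partitions)]

def shuffled_partitions(labels, num_partitions, rng):
    """Shuffle sample indices before cutting chunks (approximately IID split)."""
    order = list(range(len(labels)))
    rng.shuffle(order)
    size = len(order) // num_partitions
    return [order[p * size:(p + 1) * size] for p in range(num_partitions)]

def label_histograms(labels, partitions):
    """Count how many samples of each label land in each partition."""
    return [Counter(labels[i] for i in part) for part in partitions]

# Synthetic data: 400 samples, 4 balanced classes (an assumption for the demo).
labels = [i % 4 for i in range(400)]
skewed = label_histograms(labels, label_skewed_partitions(labels, 4))
mixed = label_histograms(labels, shuffled_partitions(labels, 4, random.Random(0)))

for name, histograms in (("label-skewed", skewed), ("shuffled", mixed)):
    print(name, [dict(h) for h in histograms])
```

In the label-skewed split each partition holds a single class, which loosely mirrors the camera example: every site observes a different slice of the overall distribution.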

102 - Weiyan Wang, Cengguang Zhang, Liu Yang 2020

Distributed machine learning suffers from a synchronization bottleneck when all-reducing workers' updates. Previous work mainly considers better network topologies, gradient compression, or stale updates to speed up communication and relieve the bottleneck. However, these approaches overlook the importance of reducing the scale of synchronized elements and the inevitable serially executed operators. To address the problem, our work proposes Divide-and-Shuffle Synchronization (DS-Sync), which divides workers into several parallel groups and shuffles group members. DS-Sync synchronizes only the workers in the same group, so the scale of each synchronization is much smaller. Shuffling the workers maintains the algorithm's convergence speed, which we explain theoretically. Comprehensive experiments also show significant improvements on recent, popular models such as BERT, WideResNet, and DeepFM on challenging datasets.
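The divide-and-shuffle idea can be sketched as follows. This is a toy illustration, not the DS-Sync implementation: each worker holds a single scalar, group membership is re-drawn with a uniform random shuffle every round (the paper's actual shuffling rule may differ), and `ds_sync_round` is a hypothetical name. It shows that averaging only within small groups preserves the global mean while shuffling lets information spread across all workers over successive rounds.

```python
import random

def ds_sync_round(values, num_groups, rng):
    """One illustrative round: shuffle workers, split into groups, average within each group."""
    order = list(range(len(values)))
    rng.shuffle(order)  # assumption: uniform random regrouping every round
    synced = list(values)
    for g in range(num_groups):
        members = order[g::num_groups]  # every num_groups-th worker joins group g
        if not members:
            continue
        group_mean = sum(values[i] for i in members) / len(members)
        for i in members:
            synced[i] = group_mean  # synchronization stays inside the group
    return synced

rng = random.Random(0)
workers = [float(i) for i in range(8)]  # toy per-worker parameters
initial_mean = sum(workers) / len(workers)

for _ in range(20):
    workers = ds_sync_round(workers, num_groups=2, rng=rng)

final_mean = sum(workers) / len(workers)
spread = max(workers) - min(workers)
print(f"mean before={initial_mean:.3f} after={final_mean:.3f} spread={spread:.6f}")
```

Each round touches groups of four workers rather than all eight, which is the scale reduction the abstract describes. How quickly the spread shrinks depends on the shuffles drawn.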

Machine Learning; Distributed, Parallel, and Cluster Computing

Scaling-up Distributed Processing of Data Streams for Machine Learning

66 - Matthew Nokleby, Haroon Raja, 2020

Emerging applications of machine learning in numerous areas involve continuous gathering of and learning from streams of data. Real-time incorporation of streaming data into the learned models is essential for improved inference in these applications. Further, these applications often involve data that are either inherently gathered at geographically distributed entities or that are intentionally distributed across multiple machines for memory, computational, and/or privacy reasons. Training of models in this distributed, streaming setting requires solving stochastic optimization problems in a collaborative manner over communication links between the physical entities. When the streaming data rate is high compared to the processing capabilities of compute nodes and/or the rate of the communication links, this poses a challenging question: how can one best leverage the incoming data for distributed training under constraints on computing capabilities and/or communication rate? A large body of research has emerged in recent decades to tackle this and related problems. This paper reviews recently developed methods that focus on large-scale distributed stochastic optimization in the compute- and bandwidth-limited regime, with an emphasis on convergence analysis that explicitly accounts for the mismatch between computation, communication, and streaming rates. In particular, it focuses on methods that solve: (i) distributed stochastic convex problems, and (ii) distributed principal component analysis, which is a nonconvex problem with geometric structure that permits global convergence. For such methods, the paper discusses recent advances in distributed algorithmic designs for high-rate streaming data. Further, it reviews guarantees underlying these methods, which show there exist regimes in which systems can learn from distributed, streaming data at order-optimal rates.

Machine Learning; Distributed, Parallel, and Cluster Computing; Signal Processing

Toward Understanding the Impact of Staleness in Distributed Machine Learning

283 - Wei Dai, Yi Zhou, Nanqing Dong 2018

Many distributed machine learning (ML) systems adopt non-synchronous execution in order to alleviate the network communication bottleneck, resulting in stale parameters that do not reflect the latest updates. Despite much development in large-scale ML, the effects of staleness on learning are inconclusive, as it is challenging to directly monitor or control staleness in complex distributed environments. In this work, we study the convergence behaviors of a wide array of ML models and algorithms under delayed updates. Our extensive experiments reveal the rich diversity of the effects of staleness on the convergence of ML algorithms and offer insights into seemingly contradictory reports in the literature. The empirical findings also inspire a new convergence analysis of stochastic gradient descent in non-convex optimization under staleness, matching the best-known convergence rate of $O(1/\sqrt{T})$.
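A small sketch can illustrate how delayed updates change convergence. It is a toy model, not the paper's experimental setup: the objective is the one-dimensional quadratic f(w) = 0.5 * w**2, the gradient is deterministic rather than stochastic, the delay is fixed, and `delayed_gradient_descent` is a hypothetical helper. Each update uses a gradient evaluated on a parameter value that is `staleness` steps old.

```python
def delayed_gradient_descent(step_size, staleness, steps=200, start=1.0):
    """Minimize f(w) = 0.5 * w**2 with gradients computed on iterates `staleness` steps old."""
    # Pad the history so a stale read is defined from the very first step.
    history = [start] * (staleness + 1)
    for _ in range(steps):
        stale_w = history[-(staleness + 1)]  # parameter the gradient was computed on
        gradient = stale_w                   # f'(w) = w for this toy objective
        history.append(history[-1] - step_size * gradient)
    return history

fresh = delayed_gradient_descent(step_size=0.5, staleness=0)
stale = delayed_gradient_descent(step_size=0.5, staleness=5)

fresh_final = abs(fresh[-1])
stale_peak = max(abs(w) for w in stale[-10:])  # iterates oscillate, so inspect a window
print(f"no staleness: |w| = {fresh_final:.3e}")
print(f"staleness 5:  peak |w| over last 10 steps = {stale_peak:.3e}")
```

With the same step size, the up-to-date run shrinks toward zero while the delayed run oscillates with growing amplitude. Lowering `step_size` (for example to 0.1) makes the delayed run converge as well in this toy setting, which hints at why reported effects of staleness can look contradictory across configurations.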

Machine Learning; Distributed, Parallel, and Cluster Computing

GADMM: Fast and Communication Efficient Framework for Distributed Machine Learning

102 - Anis Elgabli, Jihong Park, Amrit S. Bedi 2019

When the data is distributed across multiple servers, lowering the communication cost between the servers (or workers) while solving the distributed learning problem is an important problem and is the focus of this paper. In particular, we propose a fast and communication-efficient decentralized framework to solve the distributed machine learning (DML) problem. The proposed algorithm, Group Alternating Direction Method of Multipliers (GADMM), is based on the Alternating Direction Method of Multipliers (ADMM) framework. The key novelty in GADMM is that it solves the problem in a decentralized topology where at most half of the workers are competing for the limited communication resources at any given time. Moreover, each worker exchanges the locally trained model only with two neighboring workers, thereby training a global model with a lower amount of communication overhead in each exchange. We prove that GADMM converges to the optimal solution for convex loss functions, and numerically show that it converges faster and is more communication-efficient than state-of-the-art communication-efficient algorithms such as the Lazily Aggregated Gradient (LAG) and dual averaging, in linear and logistic regression tasks on synthetic and real datasets. Furthermore, we propose Dynamic GADMM (D-GADMM), a variant of GADMM, and prove its convergence under the time-varying network topology of the workers.
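The alternating head/tail structure described above can be sketched on a toy problem. This is an illustration under stated assumptions, not the authors' code: workers sit on a chain, each worker n holds a scalar model and the local loss 0.5 * (theta - a_n)**2 (so the global optimum is the mean of the a_n), and the closed-form local update below follows from standard ADMM on the neighbor constraints theta_n = theta_{n+1}. The name `gadmm_scalar` and the penalty `rho` are hypothetical.

```python
def gadmm_scalar(targets, rho=1.0, iterations=500):
    """Chain-topology ADMM sketch: even-indexed (head) and odd-indexed (tail) workers alternate."""
    n = len(targets)
    theta = [0.0] * n        # one scalar model per worker
    lam = [0.0] * (n - 1)    # lam[i] is the dual for the constraint theta[i] == theta[i + 1]

    def local_update(i):
        # Closed-form minimizer of the local augmented Lagrangian for a quadratic loss.
        numerator = targets[i]
        degree = 0
        if i > 0:  # left neighbor exists
            numerator += lam[i - 1] + rho * theta[i - 1]
            degree += 1
        if i < n - 1:  # right neighbor exists
            numerator += -lam[i] + rho * theta[i + 1]
            degree += 1
        return numerator / (1.0 + rho * degree)

    for _ in range(iterations):
        for i in range(0, n, 2):   # heads update using their tail neighbors' latest models
            theta[i] = local_update(i)
        for i in range(1, n, 2):   # tails update using the heads' fresh models
            theta[i] = local_update(i)
        for i in range(n - 1):     # dual ascent on each neighbor constraint
            lam[i] += rho * (theta[i] - theta[i + 1])
    return theta

targets = [1.0, 2.0, 6.0, 7.0]
models = gadmm_scalar(targets)
consensus_value = sum(targets) / len(targets)
print(models, consensus_value)
```

Only half of the workers transmit in each phase, and every worker talks to at most two neighbors, which matches the communication pattern the abstract highlights. The quadratic loss is chosen only because it admits a one-line local solve.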

Machine Learning; Distributed, Parallel, and Cluster Computing; Information Theory

Distributed Machine Learning via Sufficient Factor Broadcasting

205 - Pengtao Xie, Jin Kyu Kim, Yi Zhou 2015

Matrix-parameterized models, including multiclass logistic regression and sparse coding, are used in machine learning (ML) applications ranging from computer vision to computational biology. When these models are applied to large-scale ML problems starting at millions of samples and tens of thousands of classes, their parameter matrix can grow at an unexpected rate, resulting in high parameter synchronization costs that greatly slow down distributed learning. To address this issue, we propose a Sufficient Factor Broadcasting (SFB) computation model for efficient distributed learning of a large family of matrix-parameterized models, which share the following property: the parameter update computed on each data sample is a rank-1 matrix, i.e., the outer product of two sufficient factors (SFs). By broadcasting the SFs among worker machines and reconstructing the update matrices locally at each worker, SFB improves communication efficiency (communication costs are linear in the parameter matrix's dimensions, rather than quadratic) without affecting computational correctness. We present a theoretical convergence analysis of SFB, and empirically corroborate its efficiency on four different matrix-parameterized ML models.
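The rank-1 property is easy to see for multiclass logistic regression, where the per-sample gradient of the cross-entropy loss with respect to the weight matrix is the outer product of (predicted probabilities minus the one-hot label) and the feature vector. The sketch below is illustrative only: the helper names, the tiny dimensions, and the zero-initialized weights are assumptions, and no networking is modeled. It computes the two sufficient factors, reconstructs the full update locally, and compares how many numbers would need to be sent.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sufficient_factors(weights, features, label):
    """Return (u, v) such that the per-sample gradient w.r.t. `weights` equals outer(u, v)."""
    logits = [sum(w * x for w, x in zip(row, features)) for row in weights]
    probs = softmax(logits)
    u = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
    v = list(features)
    return u, v

def outer(u, v):
    """Reconstruct the rank-1 update matrix from its two factors."""
    return [[ui * vj for vj in v] for ui in u]

num_classes, num_features = 4, 6  # toy sizes chosen for the demo
weights = [[0.0] * num_features for _ in range(num_classes)]
features = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5]

u, v = sufficient_factors(weights, features, label=2)
update = outer(u, v)  # what a receiving worker would rebuild locally

floats_sent_sfb = len(u) + len(v)               # grows like K + D
floats_sent_full = num_classes * num_features   # grows like K * D
print(floats_sent_sfb, floats_sent_full)
```

For a matrix with K rows and D columns, sending the factors costs K + D numbers per sample instead of K * D, which is the linear-versus-quadratic saving the abstract refers to.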

Machine Learning; Distributed, Parallel, and Cluster Computing