GSPMD: General and Scalable Parallelization for ML Computation Graphs

Added by Yuanzhong Xu

Publication date 2021

Field Informatics Engineering

Research language English

Authors Yuanzhong Xu - HyoukJoong Lee - Dehao Chen

Distributed Parallel and Cluster Computing Machine Learning

We present GSPMD, an automatic, compiler-based parallelization system for common machine learning computation graphs. It allows users to write programs in the same way as for a single device, then give hints through a few annotations on how to distribute tensors, based on which GSPMD will parallelize the computation. Its representation of partitioning is simple yet general, allowing it to express different or mixed paradigms of parallelism on a wide variety of models. GSPMD infers the partitioning for every operator in the graph based on limited user annotations, making it convenient to scale up existing single-device programs. It solves several technical challenges for production usage, such as static shape constraints, uneven partitioning, exchange of halo data, and nested operator partitioning. These techniques allow GSPMD to achieve 50% to 62% compute utilization on 128 to 2048 Cloud TPUv3 cores for models with up to one trillion parameters. GSPMD produces a single program for all devices, which adjusts its behavior based on a run-time partition ID, and uses collective operators for cross-device communication. This property allows the system itself to be scalable: the compilation time stays constant with increasing number of devices.
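The "single program for all devices" idea in the abstract can be illustrated with a small pure-Python simulation. This is only a sketch under stated assumptions, not GSPMD's API or algorithm: the function names (`shard_bounds`, `spmd_matmul_shard`, `all_gather`, `partitioned_matmul`) are hypothetical, the "devices" are loop iterations, and uneven partitioning is handled here by letting trailing shards be shorter, whereas a compiler bound by static shape constraints would have to handle it differently.

```python
# Hypothetical simulation of SPMD partitioning; not GSPMD's actual interface.

def shard_bounds(num_rows, num_partitions, partition_id):
    """Return the [start, stop) row range owned by one partition.

    Shards have size ceil(num_rows / num_partitions); trailing partitions
    may own fewer rows (or none) when the split is uneven.
    """
    shard_size = -(-num_rows // num_partitions)  # ceiling division
    start = min(partition_id * shard_size, num_rows)
    stop = min(start + shard_size, num_rows)
    return start, stop


def spmd_matmul_shard(a, b, num_partitions, partition_id):
    """The one program every device runs: multiply only its own rows of a by b.

    Behavior differs per device solely through the partition_id argument.
    """
    start, stop = shard_bounds(len(a), num_partitions, partition_id)
    inner = range(len(b))
    cols = range(len(b[0]))
    return [
        [sum(a[i][k] * b[k][j] for k in inner) for j in cols]
        for i in range(start, stop)
    ]


def all_gather(shards):
    """Stand-in for a collective operator: concatenate per-device row shards."""
    return [row for shard in shards for row in shard]


def partitioned_matmul(a, b, num_partitions):
    """Run the same shard program once per simulated device, then gather."""
    shards = [
        spmd_matmul_shard(a, b, num_partitions, pid)
        for pid in range(num_partitions)
    ]
    return all_gather(shards)


if __name__ == "__main__":
    a = [[1, 2], [3, 4], [5, 6]]
    b = [[1, 0], [0, 1]]
    # Three rows over two devices: an uneven split of 2 rows and 1 row.
    print(partitioned_matmul(a, b, num_partitions=2))
```

Because every simulated device executes the same `spmd_matmul_shard` function, adding devices changes only the range of partition IDs and not the program itself, which loosely mirrors why the abstract says compilation time need not grow with the device count.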

Blink: Fast and Generic Collectives for Distributed ML

Guanhua Wang, Shivaram Venkataraman, Amar Phanishayee 2019

Model parameter synchronization across GPUs introduces high overheads for data-parallel training at scale. Existing parameter synchronization protocols cannot effectively leverage available network resources in the face of ever increasing hardware heterogeneity. To address this, we propose Blink, a collective communication library that dynamically generates optimal communication primitives by packing spanning trees. We propose techniques to minimize the number of trees generated and extend Blink to leverage heterogeneous communication channels for faster data transfers. Evaluations show that compared to the state-of-the-art (NCCL), Blink can achieve up to 8x faster model synchronization, and reduce end-to-end training time for image classification tasks by up to 40%.

Distributed Parallel and Cluster Computing Machine Learning

Revelio: ML-Generated Debugging Queries for Distributed Systems

Pradeep Dogga 2021

A major difficulty in debugging distributed systems lies in manually determining which of the many available debugging tools to use and how to query its logs. Our own study of a production debugging workflow confirms the magnitude of this burden. This paper explores whether a machine-learning model can assist developers in distributed systems debugging. We present Revelio, a debugging assistant that takes user reports and system logs as input, and outputs debugging queries that developers can use to find a bug's root cause. The key challenges lie in (1) combining inputs of different types (e.g., natural language reports and quantitative logs) and (2) generalizing to unseen faults. Revelio addresses these by employing deep neural networks to uniformly embed diverse input sources and potential queries into a high-dimensional vector space. In addition, it exploits observations from production systems to factorize query generation into two computationally and statistically simpler learning tasks. To evaluate Revelio, we built a testbed with multiple distributed applications and debugging tools. By injecting faults and training on logs and reports from 800 Mechanical Turkers, we show that Revelio includes the most helpful query in its predicted list of top-3 relevant queries 96% of the time. Our developer study confirms the utility of Revelio.

Distributed Parallel and Cluster Computing Machine Learning

Scalable and Secure Computation Among Strangers: Resource-Competitive Byzantine Protocols

John Augustine, Valerie King, Anisur R. Molla 2019

Motivated, in part, by the rise of permissionless systems such as Bitcoin where arbitrary nodes (whose identities are not known a priori) can join and leave at will, we extend established research in scalable Byzantine agreement to a more practical model where each node (initially) does not know the identity of other nodes. A node can send to new destinations only by sending to random (or arbitrary) nodes, or responding (if it chooses) to messages received from those destinations. We assume a synchronous and fully-connected network, with a full-information, but static Byzantine adversary. A general drawback of existing Byzantine protocols is that the communication cost incurred by the honest nodes may not be proportional to that incurred by the Byzantine nodes; in fact, it can be significantly higher. Our goal is to design Byzantine protocols for fundamental problems which are resource competitive, i.e., the number of bits sent by honest nodes is not much more than the number sent by Byzantine nodes. We describe a randomized scalable algorithm to solve Byzantine agreement, leader election, and committee election in this model. Our algorithm sends an expected $O((T+n)\log n)$ bits and has latency $O(\mathrm{polylog}(n))$, where $n$ is the number of nodes, and $T$ is the minimum of $n^2$ and the number of bits sent by adversarially controlled nodes. The algorithm is resilient to $(1/4-\epsilon)n$ Byzantine nodes for any fixed $\epsilon > 0$, and succeeds with high probability. Our work can be considered as a first application of resource-competitive analysis to fundamental Byzantine problems. To complement our algorithm we also show lower bounds for resource-competitive Byzantine agreement. We prove that, in general, one cannot hope to design Byzantine protocols that have communication cost that is significantly smaller than the cost of the Byzantine adversary.

Distributed Parallel and Cluster Computing Data Structures and Algorithms

Trusted Multi-Party Computation and Verifiable Simulations: A Scalable Blockchain Approach

Ravi Kiran Raman, Roman Vaculin, Michael Hind 2018

Large-scale computational experiments, often running over weeks and over large datasets, are used extensively in fields such as epidemiology, meteorology, computational biology, and healthcare to understand phenomena, and design high-stakes policies affecting everyday health and economy. For instance, the OpenMalaria framework is a computationally-intensive simulation used by various non-governmental and governmental agencies to understand malarial disease spread and effectiveness of intervention strategies, and subsequently design healthcare policies. Given that such shared results form the basis of inferences drawn, technological solutions designed, and day-to-day policies drafted, it is essential that the computations are validated and trusted. In particular, in a multi-agent environment involving several independent computing agents, a notion of trust in results generated by peers is critical in facilitating transparency, accountability, and collaboration. Using a novel combination of distributed validation of atomic computation blocks and a blockchain-based immutable audit mechanism, this work proposes a universal framework for distributed trust in computations. In particular, we address the scalability problem by reducing the storage and communication costs using a lossy compression scheme. This framework guarantees not only verifiability of final results, but also the validity of local computations, and its cost-benefit tradeoffs are studied using a synthetic example of training a neural network.

Distributed Parallel and Cluster Computing Information Theory Systems and Control

Scalable Computation of Monge Maps with General Costs

Jiaojiao Fan, Shu Liu, Shaojun Ma 2021

Monge map refers to the optimal transport map between two probability distributions and provides a principled approach to transform one distribution to another. In spite of the rapid developments of the numerical methods for optimal transport problems, computing the Monge maps remains challenging, especially for high dimensional problems. In this paper, we present a scalable algorithm for computing the Monge map between two probability distributions. Our algorithm is based on a weak form of the optimal transport problem, thus it only requires samples from the marginals instead of their analytic expressions, and can accommodate optimal transport between two distributions with different dimensions. Our algorithm is suitable for general cost functions, compared with other existing methods for estimating Monge maps using samples, which are usually for quadratic costs. The performance of our algorithms is demonstrated through a series of experiments with both synthetic and realistic data.

Machine Learning Optimization and Control
