Massively Parallel Computation via Remote Memory Access

210 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Soheil Behnezhad

تاريخ النشر 2019

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Soheil Behnezhad - Laxman Dhulipala - Hossein Esfandiari

النظم الموزعة والتوازية والحوسبة العنقودية بنى وهياكل البيانات والخوارزميات

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

We introduce the Adaptive Massively Parallel Computation (AMPC) model, which is an extension of the Massively Parallel Computation (MPC) model. At a high level, the AMPC model strengthens the MPC model by storing all messages sent within a round in a distributed data store. In the following round, all machines are provided with random read access to the data store, subject to the same constraints on the total amount of communication as in the MPC model. Our model is inspired by the previous empirical studies of distributed graph algorithms using MapReduce and a distributed hash table service. This extension allows us to give new graph algorithms with much lower round complexities compared to the best known solutions in the MPC model. In particular, in the AMPC model we show how to solve maximal independent set in $O(1)$ rounds and connectivity/minimum spanning tree in $O(loglog_{m/n} n)$ rounds both using $O(n^delta)$ space per machine for constant $delta < 1$. In the same memory regime for MPC, the best known algorithms for these problems require polylog $n$ rounds. Our results imply that the 2-Cycle conjecture, which is widely believed to hold in the MPC model, does not hold in the AMPC model.

قيم البحث

86 - Melanie Cambus , Davin Choo , Havu Miikonen 2021

Identifying clusters of similar elements in a set is a common task in data analysis. With the immense growth of data and physical limitations on single processor speed, it is necessary to find efficient parallel algorithms for clustering tasks. In th is paper, we study the problem of correlation clustering in bounded arboricity graphs with respect to the Massively Parallel Computation (MPC) model. More specifically, we are given a complete graph where the edges are either positive or negative, indicating whether pairs of vertices are similar or dissimilar. The task is to partition the vertices into clusters with as few disagreements as possible. That is, we want to minimize the number of positive inter-cluster edges and negative intra-cluster edges. Consider an input graph $G$ on $n$ vertices such that the positive edges induce a $lambda$-arboric graph. Our main result is a 3-approximation ($textit{in expectation}$) algorithm to correlation clustering that runs in $mathcal{O}(log lambda cdot textrm{poly}(log log n))$ MPC rounds in the $textit{strongly sublinear memory regime}$. This is obtained by combining structural properties of correlation clustering on bounded arboricity graphs with the insights of Fischer and Noever (SODA 18) on randomized greedy MIS and the $texttt{PIVOT}$ algorithm of Ailon, Charikar, and Newman (STOC 05). Combined with known graph matching algorithms, our structural property also implies an exact algorithm and algorithms with $textit{worst case}$ $(1+epsilon)$-approximation guarantees in the special case of forests, where $lambda=1$.

النظم الموزعة والتوازية والحوسبة العنقودية بنى وهياكل البيانات والخوارزميات

Equivalence Classes and Conditional Hardness in Massively Parallel Computations

141 - Danupon Nanongkai , Michele Scquizzato 2020

The Massively Parallel Computation (MPC) model serves as a common abstraction of many modern large-scale data processing frameworks, and has been receiving increasingly more attention over the past few years, especially in the context of classical gr aph problems. So far, the only way to argue lower bounds for this model is to condition on conjectures about the hardness of some specific problems, such as graph connectivity on promise graphs that are either one cycle or two cycles, usually called the one cycle vs. two cycles problem. This is unlike the traditional arguments based on conjectures about complexity classes (e.g., $textsf{P} eq textsf{NP}$), which are often more robust in the sense that refuting them would lead to groundbreaking algorithms for a whole bunch of problems. In this paper we present connections between problems and classes of problems that allow the latter type of arguments. These connections concern the class of problems solvable in a sublogarithmic amount of rounds in the MPC model, denoted by $textsf{MPC}(o(log N))$, and some standard classes concerning space complexity, namely $textsf{L}$ and $textsf{NL}$, and suggest conjectures that are robust in the sense that refuting them would lead to many surprisingly fast new algorithms in the MPC model. We also obtain new conditional lower bounds, and prove new reductions and equivalences between problems in the MPC model.

النظم الموزعة والتوازية والحوسبة العنقودية بنى وهياكل البيانات والخوارزميات

Compression and Sieve: Reducing Communication in Parallel Breadth First Search on Distributed Memory Systems

223 - Huiwei Lv , Guangming Tan , Mingyu Chen 2012

For parallel breadth first search (BFS) algorithm on large-scale distributed memory systems, communication often costs significantly more than arithmetic and limits the scalability of the algorithm. In this paper we sufficiently reduce the communicat ion cost in distributed BFS by compressing and sieving the messages. First, we leverage a bitmap compression algorithm to reduce the size of messages before communication. Second, we propose a novel distributed directory algorithm, cross directory, to sieve the redundant data in messages. Experiments on a 6,144-core SMP cluster show our algorithm outperforms the baseline implementation in Graph500 by 2.2 times, reduces its communication time by 79.0%, and achieves a performance rate of 12.1 GTEPS (billion edge visits per second)

النظم الموزعة والتوازية والحوسبة العنقودية بنى وهياكل البيانات والخوارزميات

Communication-free Massively Distributed Graph Generation

92 - Daniel Funke , Sebastian Lamm , Ulrich Meyer 2017

Analyzing massive complex networks yields promising insights about our everyday lives. Building scalable algorithms to do so is a challenging task that requires a careful analysis and an extensive evaluation. However, engineering such algorithms is o ften hindered by the scarcity of publicly~available~datasets. Network generators serve as a tool to alleviate this problem by providing synthetic instances with controllable parameters. However, many network generators fail to provide instances on a massive scale due to their sequential nature or resource constraints. Additionally, truly scalable network generators are few and often limited in their realism. In this work, we present novel generators for a variety of network models that are frequently used as benchmarks. By making use of pseudorandomization and divide-and-conquer schemes, our generators follow a communication-free paradigm. The resulting generators are thus embarrassingly parallel and have a near optimal scaling behavior. This allows us to generate instances of up to $2^{43}$ vertices and $2^{47}$ edges in less than 22 minutes on 32768 cores. Therefore, our generators allow new graph families to be used on an unprecedented scale.

النظم الموزعة والتوازية والحوسبة العنقودية بنى وهياكل البيانات والخوارزميات الشبكات الاجتماعية والمعلومات

Massively parallel lattice-Boltzmann codes on large GPU clusters

82 - E. Calore , A. Gabbana , J. Kraus 2017

This paper describes a massively parallel code for a state-of-the art thermal lattice- Boltzmann method. Our code has been carefully optimized for performance on one GPU and to have a good scaling behavior extending to a large number of GPUs. Version s of this code have been already used for large-scale studies of convective turbulence. GPUs are becoming increasingly popular in HPC applications, as they are able to deliver higher performance than traditional processors. Writing efficient programs for large clusters is not an easy task as codes must adapt to increasingly parallel architectures, and the overheads of node-to-node communications must be properly handled. We describe the structure of our code, discussing several key design choices that were guided by theoretical models of performance and experimental benchmarks. We present an extensive set of performance measurements and identify the corresponding main bot- tlenecks; finally we compare the results of our GPU code with those measured on other currently available high performance processors. Our results are a production-grade code able to deliver a sustained performance of several tens of Tflops as well as a design and op- timization methodology that can be used for the development of other high performance applications for computational physics.

النظم الموزعة والتوازية والحوسبة العنقودية

سجل دخول لتتمكن من نشر تعليقات