Specializing Coherence, Consistency, and Push/Pull for GPU Graph Analytics

61 0 0.0 ( 0 )

Download Cite

Added by Wesley Darvin

Publication date 2020

fields Informatics Engineering

and research's language is English

Authors Giordano Salvador - Wesley H. Darvin - Muhammad Huzaifa

Distributed Parallel and Cluster Computing

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

This work provides the first study to explore the interaction of update propagation with and without fine-grained synchronization (push vs. pull), emerging coherence protocols (GPU vs. DeNovo coherence), and software-centric consistency models (DRF0, DRF1, and DRFrlx) for graph workloads on emerging integrated GPU-CPU systems with native unified shared memory. We study 6 graph applications with 6 graph inputs for a total of 36 workloads running on 12 system (hardware+software) configurations reflecting the above design space of update propagation, coherence, and memory consistency. We make three key contributions. First, we show that there is no single best system configuration for all workloads, motivating systems with flexible coherence and consistency support. Second, we develop a model to accurately predict the best system configuration -- this model can be used by software designers to decide on push vs. pull and the consistency model and by flexible hardware to invoke the appropriate coherence and consistency configuration for the given workload. Third, we show that the design dimensions explored here are inter-dependent, reinforcing the need for software-hardware co-design in the above design dimensions. For example, software designers deciding on push vs. pull must consider the consistency model supported by hardware -- in some cases, push maybe better if hardware supports DRFrlx while pull may be better if hardware does not support DRFrlx.

rate research

To Push or To Pull: On Reducing Communication and Synchronization in Graph Computations

76 - Maciej Besta , Michal Podstawski , Linus Groner 2020

We reduce the cost of communication and synchronization in graph processing by analyzing the fastest way to process graphs: pushing the updates to a shared state or pulling the updates to a private state.We investigate the applicability of this push-pull dichotomy to various algorithms and its impact on complexity, performance, and the amount of used locks, atomics, and reads/writes. We consider 11 graph algorithms, 3 programming models, 2 graph abstractions, and various families of graphs. The conducted analysis illustrates surprising differences between push and pull variants of different algorithms in performance, speed of convergence, and code complexity; the insights are backed up by performance data from hardware counters.We use these findings to illustrate which variant is faster for each algorithm and to develop generic strategies that enable even higher speedups. Our insights can be used to accelerate graph processing engines or libraries on both massively-parallel shared-memory machines as well as distributed-memory systems.

Distributed Parallel and Cluster Computing Data Structures and Algorithms

A New Frontier for Pull-Based Graph Processing

107 - Samuel Grossman , Christos Kozyrakis 2019

The trade-off between pull-based and push-based graph processing engines is well-understood. On one hand, pull-based engines can achieve higher throughput because their workloads are read-dominant, rather than write-dominant, and can proceed without synchronization between threads. On the other hand, push-based engines are much better able to take advantage of the frontier optimization, which leverages the fact that often only a small subset of the graph needs to be accessed to complete an iteration of a graph processing application. Hybrid engines attempt to overcome this trade-off by dynamically switching between push and pull, but there are two key disadvantages with this approach. First, applications must be implemented twice (once for push and once for pull), and second, processing throughput is reduced for iterations that run with push. We propose a radically different solution: rebuild the frontier optimization entirely such that it is well-suited for a pull-based engine. In doing so, we remove the only advantage that a push-based engine had over a pull-based engine, making it possible to eliminate the push-based engine entirely. We introduce Wedge, a pull-only graph processing framework that transforms the traditional source-oriented vertex-based frontier into a pull-friendly format called the Wedge Frontier. The transformation itself is expensive even when parallelized, so we introduce two key optimizations to make it practical. First, we perform the transformation only when the resulting Wedge Frontier is sufficiently sparse. Second, we coarsen the granularity of the representation of elements in the Wedge Frontier. These optimizations respectively improve Wedges performance by up to 5x and 2x, enabling it to outperform Grazelle, Ligra, and GraphMat respectively by up to 2.8x, 4.9x, and 185.5x.

Distributed Parallel and Cluster Computing

A Comparison of Push and Pull Techniques for Ajax

483 - Engin Bozdag , Ali Mesbah , Arie van Deursen 2007

Ajax applications are designed to have high user interactivity and low user-perceived latency. Real-time dynamic web data such as news headlines, stock tickers, and auction updates need to be propagated to the users as soon as possible. However, Ajax still suffers from the limitations of the Webs request/response architecture which prevents servers from pushing real-time dynamic web data. Such applications usually use a pull style to obtain the latest updates, where the client actively requests the changes based on a predefined interval. It is possible to overcome this limitation by adopting a push style of interaction where the server broadcasts data when a change occurs on the server side. Both these options have their own trade-offs. This paper explores the fundamental limits of browser-based applications and analyzes push solutions for Ajax technology. It also shows the results of an empirical study comparing push and pull.

Software Engineering Performance

Push-Pull Optimization of Quantum Controls

99 - Priya Batra , V. R. Krithika , T. S. Mahesh 2019

Quantum optimal control involves setting up an objective function that evaluates the quality of an operator representing the realized process w.r.t. the target process. Here we propose a stronger objective function which incorporates not only the target operator but also a set of its orthogonal operators. We find significantly superior convergence of optimization routines with the combined influences of all the operators. We refer to this method as the $textit{push-pull}$ optimization. In particular, we describe adopting the push-pull optimization to a gradient based approach and a variational-principle based approach. We carry out extensive numerical simulations of the push-pull optimization of quantum controls on a pair of Ising coupled qubits. Finally, we demonstrate its experimental application by preparing a long-lived singlet-order in a two-qubit system using NMR techniques.

Quantum Physics Systems and Control Systems and Control

Push-pull operators on convex polytopes

82 - Valentina Kiritchenko 2020

A classical result of Schubert calculus is an inductive description of Schubert cycles using divided difference (or push-pull) operators in Chow rings. We define convex geometric analogs of push-pull operators and describe their applications to the theory of Newton-Okounkov convex bodies. Convex geometric push-pull operators yield an inductive construction of Newton-Okounkov polytopes of Bott-Samelson varieties. In particular, we construct a Minkowski sum of Feigin-Fourier-Littelmann-Vinberg polytopes using convex geometric push-pull operators in type A.

Algebraic Geometry Representation Theory