
Optimised allgatherv, reduce_scatter and allreduce communication in message-passing systems

Added by Andreas Jocksch
Publication date: 2020
Language: English





Collective communications, namely the patterns allgatherv, reduce_scatter, and allreduce in message-passing systems, are optimised based on measurements taken at installation time of the library. The algorithms used are set up in an initialisation phase of the communication, similar to the method used in so-called persistent collective communication introduced in the literature. For allgatherv and reduce_scatter, the existing algorithms, recursive multiply/divide and cyclic shift (Bruck's algorithm), are applied with a flexible number of communication ports per node. The algorithms for equal message sizes are also used for non-equal message sizes, together with a heuristic for rank reordering. The two communication patterns are applied in a plasma physics application that uses a specialised matrix-vector multiplication. For the allreduce pattern, the cyclic shift algorithm is applied with a prefix operation. The data is gathered and scattered by the cores within a node, and the communication algorithms are applied across the nodes. In general, our routines outperform their non-persistent counterparts in established MPI libraries by up to one order of magnitude or show equal performance, with a few exceptions for particular numbers of nodes and message sizes.
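The set-up-once-then-communicate scheme the abstract refers to resembles the persistent collective interface standardised in MPI 4.0. The sketch below is not the authors' library; assuming an MPI 4.0 implementation, it only illustrates how an allgatherv with a one-off initialisation phase and repeated starts looks from the application side (buffer names and message sizes are hypothetical).

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Hypothetical non-equal message sizes: rank r contributes r+1 ints. */
        int sendcount = rank + 1;
        int *recvcounts = malloc(nprocs * sizeof(int));
        int *displs     = malloc(nprocs * sizeof(int));
        int total = 0;
        for (int r = 0; r < nprocs; r++) {
            recvcounts[r] = r + 1;
            displs[r] = total;
            total += recvcounts[r];
        }

        int *sendbuf = malloc(sendcount * sizeof(int));
        int *recvbuf = malloc(total * sizeof(int));
        for (int i = 0; i < sendcount; i++) sendbuf[i] = rank;

        /* One-off initialisation phase (MPI 4.0 persistent collective);
           algorithm selection can happen here rather than on every call. */
        MPI_Request req;
        MPI_Allgatherv_init(sendbuf, sendcount, MPI_INT,
                            recvbuf, recvcounts, displs, MPI_INT,
                            MPI_COMM_WORLD, MPI_INFO_NULL, &req);

        /* The communication itself can then be started many times. */
        for (int iter = 0; iter < 10; iter++) {
            MPI_Start(&req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }

        MPI_Request_free(&req);
        free(sendbuf); free(recvbuf); free(recvcounts); free(displs);
        MPI_Finalize();
        return 0;
    }

In the paper's approach the initialisation phase additionally chooses between the cyclic-shift and recursive multiply/divide variants based on installation-time measurements; MPI_Allgatherv_init merely provides the hook at which such a selection could take place.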

Related research

Hagit Attiya, Sweta Kumari, 2020
We investigate the minimal number of failures that can partition a system where processes communicate both through shared memory and by message passing. We prove that this number precisely captures the resilience that can be achieved by algorithms that implement a variety of shared objects, like registers and atomic snapshots, and solve common tasks, like randomized consensus, approximate agreement and renaming. This has implications for the m&m-model and for the hybrid, cluster-based model.
We prove that in asynchronous message-passing systems where at most one process may crash, there is no lock-free strongly linearizable implementation of a weak object that we call Test-or-Set (ToS). This object allows a single distinguished process to apply the set operation once, and a different distinguished process to apply the test operation also once. Since this weak object can be directly implemented by a single-writer single-reader (SWSR) register (and other common objects such as max-register, snapshot and counter), this result implies that there is no 1-resilient lock-free strongly linearizable implementation of a SWSR register (and of these other objects) in message-passing systems. We also prove that there is no 1-resilient lock-free write strongly-linearizable implementation of a 2-writer 1-reader (2W1R) register in asynchronous message-passing systems.
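For concreteness, the sequential behaviour of the Test-or-Set object described in this abstract fits in a few lines. The sketch below (names are ours) only restates the object's specification; it is not an implementation in the message-passing model, which is precisely what the paper shows cannot be done lock-free and strongly linearizably when one process may crash.

    #include <stdbool.h>

    /* Sequential specification of Test-or-Set (ToS): one distinguished
       process may call tos_set() once, another may call tos_test() once;
       test reports whether the set took effect before it. */
    typedef struct { bool set_applied; } tos_t;

    static void tos_set(tos_t *o)        { o->set_applied = true; }
    static bool tos_test(const tos_t *o) { return o->set_applied; }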
Message-passing models of distributed computing vary along numerous dimensions: degree of synchrony, kind of faults, number of faults... Unfortunately, the sheer number of models and their subtle distinctions hinder our ability to design a general theory of message-passing models. One way out of this conundrum restricts communication to proceed by round. A great variety of message-passing models can then be captured in the Heard-Of model, through predicates on the messages sent in a round and received during or before this round. Then, the issue is to find the most accurate Heard-Of predicate to capture a given model. This is straightforward in synchronous models, because waiting for the upper bound on communication delay ensures that all available messages are received, while not waiting forever. On the other hand, asynchrony allows unbounded message delays. Is there nonetheless a meaningful characterization of asynchronous models by a Heard-Of predicate? We formalize this characterization by introducing Delivered collections: the collections of all messages delivered at each round, whether late or not. Predicates on Delivered collections capture message-passing models. The question is to determine which Heard-Of predicates can be generated by a given Delivered predicate. We answer this by formalizing strategies for when to change round. Thanks to a partial order on these strategies, we also find the best strategy for multiple models, where best intuitively means it waits for as many messages as possible while not waiting forever. Finally, a strategy for changing round that never blocks a process forever implements a Heard-Of predicate. This allows us to translate the order on strategies into an order on Heard-Of predicates. The characterizing predicate for a model is then the greatest element for that order, if it exists.
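As a rough illustration of what a round-change strategy looks like (our own naming, not the paper's formalism), the rule below ends a round once messages from n - f distinct processes have been heard: the classic choice that waits for as many messages as possible without risking to wait forever when at most f processes crash.

    #include <stdbool.h>
    #include <stddef.h>

    /* A strategy decides, from the set of senders heard in the current
       round, whether the process may advance to the next round. */
    typedef bool (*round_strategy)(const bool heard[], size_t n, size_t f);

    /* Wait for messages from at least n - f distinct processes
       (assumes f <= n). */
    static bool wait_n_minus_f(const bool heard[], size_t n, size_t f) {
        size_t count = 0;
        for (size_t i = 0; i < n; i++)
            if (heard[i]) count++;
        return count >= n - f;
    }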
The allreduce operation is one of the most commonly used communication routines in distributed applications. To improve its bandwidth and to reduce network traffic, this operation can be accelerated by offloading it to network switches, that aggregate the data received from the hosts, and send them back the aggregated result. However, existing solutions provide limited customization opportunities and might provide suboptimal performance when dealing with custom operators and data types, with sparse data, or when reproducibility of the aggregation is a concern. To deal with these problems, in this work we design a flexible programmable switch by using as a building block PsPIN, a RISC-V architecture implementing the sPIN programming model. We then design, model, and analyze different algorithms for executing the aggregation on this architecture, showing performance improvements compared to state-of-the-art approaches.
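Whatever the offload mechanism, the result delivered to the hosts has to match the semantics of an ordinary allreduce. The snippet below shows those reference semantics with standard MPI; buffer names are hypothetical and this is not the sPIN/PsPIN interface described in the abstract.

    #include <mpi.h>

    /* Reference semantics an in-network aggregation must reproduce:
       every rank contributes n doubles and receives their element-wise
       global sum. */
    void reference_allreduce(const double *local, double *global, int n) {
        MPI_Allreduce(local, global, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    }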
We study the problem of privately emulating shared memory in message-passing networks. The system includes clients that store and retrieve replicated information on N servers, out of which e are malicious. When a client accesses a malicious server, the data field of that server's response might differ from the value it originally stored. However, all other control variables in the server's reply and the protocol actions follow the server algorithm. For the coded atomic storage (CAS) algorithms by Cadambe et al., we present an enhancement that ensures no information leakage and malicious fault-tolerance. We also consider recovery after the occurrence of transient faults that violate the assumptions according to which the system is to behave. After their last occurrence, transient faults leave the system in an arbitrary state (while the program code stays intact). We present a self-stabilizing algorithm, which recovers after the occurrence of transient faults. This addition to Cadambe et al. considers asynchronous settings as long as no transient faults occur. The recovery from transient faults that bring the system counters (close) to their maximal values may include the use of a global reset procedure, which requires the system run to be controlled by a fair scheduler. After the recovery period, the safety properties are provided for asynchronous system runs that are not necessarily controlled by fair schedulers. Since the recovery period is bounded and the occurrence of transient faults is extremely rare, we call this design criterion self-stabilization in the presence of seldom fairness. Our self-stabilizing algorithm uses bounded storage during asynchronous executions (that are not necessarily fair). To the best of our knowledge, we are the first to address privacy and self-stabilization in the context of emulating atomic shared memory in networked systems.