Optimised allgatherv, reduce_scatter and allreduce communication in message-passing systems

227 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Andreas Jocksch

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Andreas Jocksch - Noe Ohana - Emmanuel Lanti

النظم الموزعة والتوازية والحوسبة العنقودية

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Collective communications, namely the patterns allgatherv, reduce_scatter, and allreduce in message-passing systems are optimised based on measurements at the installation time of the library. The algorithms used are set up in an initialisation phase of the communication, similar to the method used in so-called persistent collective communication introduced in the literature. For allgatherv and reduce_scatter the existing algorithms, recursive multiply/divide and cyclic shift (Brucks algorithm) are applied with a flexible number of communication ports per node. The algorithms for equal message sizes are used with non-equal message sizes together with a heuristic for rank reordering. The two communication patterns are applied in a plasma physics application that uses a specialised matrix-vector multiplication. For the allreduce pattern the cyclic shift algorithm is applied with a prefix operation. The data is gathered and scattered by the cores within the node and the communication algorithms are applied across the nodes. In general our routines outperform the non-persistent counterparts in established MPI libraries by up to one order of magnitude or show equal performance, with a few exceptions of number of nodes and message sizes.

قيم البحث

85 - Hagit Attiya , Sweta Kumari , 2020

We investigate the minimal number of failures that can partition a system where processes communicate both through shared memory and by message passing. We prove that this number precisely captures the resilience that can be achieved by algorithms th at implement a variety of shared objects, like registers and atomic snapshots, and solve common tasks, like randomized consensus, approximate agreement and renaming. This has implications for the m&m-model and for the hybrid, cluster-based model.

النظم الموزعة والتوازية والحوسبة العنقودية

An Impossibility Result on Strong Linearizability in Message-Passing Systems

195 - David Yu Cheng Chan , Vassos Hadzilacos , Xing Hu 2021

We prove that in asynchronous message-passing systems where at most one process may crash, there is no lock-free strongly linearizable implementation of a weak object that we call Test-or-Set (ToS). This object allows a single distinguished process t o apply the set operation once, and a different distinguished process to apply the test operation also once. Since this weak object can be directly implemented by a single-writer single-reader (SWSR) register (and other common objects such as max-register, snapshot and counter), this result implies that there is no $1$-resilient lock-free strongly linearizable implementation of a SWSR register (and of these other objects) in message-passing systems. We also prove that there is no $1$-resilient lock-free emph{write} strongly-linearizable implementation of a 2-writer 1-reader (2W1R) register in asynchronous message-passing systems.

النظم الموزعة والتوازية والحوسبة العنقودية

Characterizing Asynchronous Message-Passing Models Through Rounds

60 - Adam Shimi , Aurelie Hurault , Philippe Queinnec 2018

Message-passing models of distributed computing vary along numerous dimensions: degree of synchrony, kind of faults, number of faults... Unfortunately, the sheer number of models and their subtle distinctions hinder our ability to design a general th eory of message-passing models. One way out of this conundrum restricts communication to proceed by round. A great variety of message-passing models can then be captured in the Heard-Of model, through predicates on the messages sent in a round and received during or before this round. Then, the issue is to find the most accurate Heard-Of predicate to capture a given model. This is straightforward in synchronous models, because waiting for the upper bound on communication delay ensures that all available messages are received, while not waiting forever. On the other hand, asynchrony allows unbounded message delays. Is there nonetheless a meaningful characterization of asynchronous models by a Heard-Of predicate? We formalize this characterization by introducing Delivered collections: the collections of all messages delivered at each round, whether late or not. Predicates on Delivered collections capture message-passing models. The question is to determine which Heard-Of predicates can be generated by a given Delivered predicate. We answer this by formalizing strategies for when to change round. Thanks to a partial order on these strategies, we also find the best strategy for multiple models, where best intuitively means it waits for as many messages as possible while not waiting forever. Finally, a strategy for changing round that never blocks a process forever implements a Heard-Of predicate. This allows us to translate the order on strategies into an order on Heard-Of predicates. The characterizing predicate for a model is then the greatest element for that order, if it exists.

النظم الموزعة والتوازية والحوسبة العنقودية

Flare: Flexible In-Network Allreduce

91 - Daniele De Sensi , Salvatore Di Girolamo , Saleh Ashkboos 2021

The allreduce operation is one of the most commonly used communication routines in distributed applications. To improve its bandwidth and to reduce network traffic, this operation can be accelerated by offloading it to network switches, that aggregat e the data received from the hosts, and send them back the aggregated result. However, existing solutions provide limited customization opportunities and might provide suboptimal performance when dealing with custom operators and data types, with sparse data, or when reproducibility of the aggregation is a concern. To deal with these problems, in this work we design a flexible programmable switch by using as a building block PsPIN, a RISC-V architecture implementing the sPIN programming model. We then design, model, and analyze different algorithms for executing the aggregation on this architecture, showing performance improvements compared to state-of-the-art approaches.

النظم الموزعة والتوازية والحوسبة العنقودية هندسة العتاد بنية الشبكات والإنترنت

Self-Stabilizing and Private Distributed Shared Atomic Memory in Seldomly Fair Message Passing Networks

107 - Shlomi Dolev , Thomas Petig , Elad Michael Schiller 2018

We study the problem of privately emulating shared memory in message-passing networks. The system includes clients that store and retrieve replicated information on N servers, out of which e are malicious. When a client access a malicious server, the data field of that server response might be different than the value it originally stored. However, all other control variables in the server reply and protocol actions are according to the server algorithm. For the coded atomic storage (CAS) algorithms by Cadambe et al., we present an enhancement that ensures no information leakage and malicious fault-tolerance. We also consider recovery after the occurrence of transient faults that violate the assumptions according to which the system is to behave. After their last occurrence, transient faults leave the system in an arbitrary state (while the program code stays intact). We present a self-stabilizing algorithm, which recovers after the occurrence of transient faults. This addition to Cadambe et al. considers asynchronous settings as long as no transient faults occur. The recovery from transient faults that bring the system counters (close) to their maximal values may include the use of a global reset procedure, which requires the system run to be controlled by a fair scheduler. After the recovery period, the safety properties are provided for asynchronous system runs that are not necessarily controlled by fair schedulers. Since the recovery period is bounded and the occurrence of transient faults is extremely rare, we call this design criteria self-stabilization in the presence of seldom fairness. Our self-stabilizing algorithm uses a bounded storage during asynchronous executions (that are not necessarily fair). To the best of our knowledge, we are the first to address privacy and self-stabilization in the context of emulating atomic shared memory in networked systems.

النظم الموزعة والتوازية والحوسبة العنقودية

سجل دخول لتتمكن من نشر تعليقات