Collective communication algorithms are an important component of distributed computation. Indeed, in the case of deep learning, collective communication is the Amdahl's Law bottleneck of data-parallel training. This paper introduces SCCL (for Synthesized Collective Communication Library), a systematic approach to synthesize collective communication algorithms that are explicitly tailored to a particular hardware topology. SCCL synthesizes algorithms along the Pareto frontier spanning from latency-optimal to bandwidth-optimal implementations of a collective. The paper demonstrates how to encode SCCL's synthesis as a quantifier-free SMT formula which can be discharged to a theorem prover. We further demonstrate how to scale our synthesis by exploiting symmetries in topologies and collectives. We synthesize and introduce novel latency-optimal and bandwidth-optimal algorithms not seen in the literature on two popular hardware topologies. We also show how SCCL efficiently lowers algorithms to implementations on two hardware architectures (NVIDIA and AMD) and demonstrate competitive performance with hand-optimized collective communication libraries.
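As a rough illustration of what such an encoding can look like, the following sketch (using the Z3 SMT solver's Python bindings) asks whether a single chunk starting at node 0 of a 3-node ring can reach every node within a fixed step budget. The tiny topology, the variable names, and the one-hop-per-step model are our own illustrative choices; SCCL's actual encoding covers many chunks, bandwidth constraints, and full collectives.

    # Toy SMT encoding in the spirit of synthesis-by-solver: can a chunk
    # starting at node 0 of a 3-node ring reach all nodes within T steps?
    # Illustrative sketch only; not SCCL's actual formula.
    from z3 import Bool, Solver, Or, And, Not, sat

    N, T = 3, 2                                # ring size and step budget (assumed)
    ring = {0: [2, 1], 1: [0, 2], 2: [1, 0]}   # neighbors in the 3-node ring

    # has[n][t] means "node n holds the chunk at step t".
    has = [[Bool(f"has_{n}_{t}") for t in range(T + 1)] for n in range(N)]
    s = Solver()

    # Initially only node 0 holds the chunk.
    s.add(has[0][0])
    for n in range(1, N):
        s.add(Not(has[n][0]))

    # A node holds the chunk at step t+1 iff it already held it, or some
    # neighbor held it at step t (one hop per step, ignoring link bandwidth).
    for n in range(N):
        for t in range(T):
            s.add(has[n][t + 1] == Or(has[n][t], *[has[m][t] for m in ring[n]]))

    # Goal: every node holds the chunk by step T (a broadcast).
    s.add(And(*[has[n][T] for n in range(N)]))

    print("feasible in", T, "steps:", s.check() == sat)

Sweeping the budget T and re-checking satisfiability is one simple way such an encoding can locate the latency-optimal schedule for a fixed topology.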
The performance of collective operations has been a critical issue since the advent of MPI. Many algorithms have been proposed for each MPI collective operation, but none has proved optimal in all situations. Different algorithms demonstrate superior performance depending on the platform, the message size, the number of processes, etc. MPI implementations select the collective algorithm empirically, by executing a simple runtime decision function. While efficient, this approach does not guarantee the optimal selection. As a more accurate but equally efficient alternative, the use of analytical performance models of collective algorithms for the selection process was proposed and studied. Unfortunately, previous attempts in this direction have not been successful. We revisit the analytical model-based approach and propose two innovations that significantly improve the selection accuracy of analytical models: (1) we derive the analytical models from the code implementing the algorithms rather than from their high-level mathematical definitions, which results in more detailed models; (2) we estimate model parameters separately for each collective algorithm and include the execution of this algorithm in the corresponding communication experiment. We experimentally demonstrate the accuracy and efficiency of our approach using Open MPI broadcast and gather algorithms on a Grid'5000 cluster.
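To make the model-based selection mechanism concrete, here is a minimal sketch assuming simple Hockney-style cost models for two broadcast algorithms. The formulas, parameter values, and function names are illustrative stand-ins, not Open MPI's measured models or decision function.

    # Model-based algorithm selection, sketched with Hockney-style models
    # T(m, p) = (#rounds) * (alpha + beta * m). Parameter values are assumed.
    import math

    ALPHA, BETA = 5e-6, 1e-9   # per-message latency (s) and per-byte cost (s/B)

    def t_linear_bcast(m, p):
        # Root sends the whole m-byte message to each of the p - 1 others.
        return (p - 1) * (ALPHA + BETA * m)

    def t_binomial_bcast(m, p):
        # Binomial tree: ceil(log2 p) rounds, whole message per round.
        return math.ceil(math.log2(p)) * (ALPHA + BETA * m)

    def select_bcast(m, p):
        """Pick the algorithm with the smallest predicted runtime."""
        models = {"linear": t_linear_bcast, "binomial": t_binomial_bcast}
        return min(models, key=lambda name: models[name](m, p))

    # Small messages on many processes favor the binomial tree.
    print(select_bcast(1024, 64))   # -> 'binomial'

The paper's point is about where the per-algorithm formulas and the alpha/beta-style parameters come from: deriving them from the implementation code and measuring them per algorithm makes a selector like `select_bcast` far more likely to match the empirically best choice.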
Given a Boolean predicate $\Pi$ on labeled networks (e.g., proper coloring, leader election, etc.), a self-stabilizing algorithm for $\Pi$ is a distributed algorithm that can start from any initial configuration of the network (i.e., every node has an arbitrary value assigned to each of its variables) and eventually converge to a configuration satisfying $\Pi$. It is known that leader election does not have a deterministic self-stabilizing algorithm using a constant-size register at each node; i.e., for some networks, some of their nodes must have registers whose sizes grow with the size $n$ of the network. On the other hand, it is also known that leader election can be solved by a deterministic self-stabilizing algorithm using registers of $O(\log \log n)$ bits per node in any $n$-node bounded-degree network. We show that this latter space complexity is optimal. Specifically, we prove that every deterministic self-stabilizing algorithm solving leader election must use $\Omega(\log \log n)$-bit registers per node in some $n$-node networks. In addition, we show that our lower bounds go beyond leader election and apply to all problems that cannot be solved by anonymous algorithms.
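To fix the notion of self-stabilization in code, the following toy simulation (our own illustration, unrelated to the paper's leader-election lower bound) properly 3-colors a ring starting from an arbitrary initial configuration. Under a central daemon, each move eliminates all conflicts at the moving node while leaving other edges untouched, so the number of monochromatic edges strictly decreases and the system converges to a configuration satisfying the predicate.

    # Toy self-stabilizing proper 3-coloring of a ring (textbook example,
    # assumed here for illustration; not the paper's algorithm).
    import random

    def neighbors(i, n):
        return [(i - 1) % n, (i + 1) % n]

    def stabilize(colors):
        n = len(colors)
        while True:
            # A node is "enabled" if it conflicts with some neighbor.
            enabled = [i for i in range(n)
                       if colors[i] in {colors[j] for j in neighbors(i, n)}]
            if not enabled:
                return colors                  # legitimate configuration reached
            i = random.choice(enabled)         # central daemon activates one node
            used = {colors[j] for j in neighbors(i, n)}
            colors[i] = min(c for c in range(3) if c not in used)

    print(stabilize([0, 0, 0, 0, 0, 0]))       # arbitrary start, proper at the end

Note that each node's register here holds a color from a constant-size palette; the abstract's point is that leader election, unlike such locally checkable colorings, provably cannot be solved with constant-size registers.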
Observer design typically requires observability of the underlying system, which may be hard to verify for nonlinear systems, and guarantees only asymptotic convergence of the estimation error, which may be insufficient to satisfy performance conditions in finite time. This paper develops a method to design Luenberger-type observers for nonlinear systems which guarantee the largest possible domain of attraction for the state estimation error regardless of the initialization of the system. The observer design procedure is posed as a two-step problem. In the first step, the error dynamics are abstractly represented as a linear equation on the space of Radon measures. Thereafter, the problem of identifying the largest set of initial errors that can be driven to within the user-specified error target set in finite time for all possible initial states, together with the corresponding observer gains, is formulated as an infinite-dimensional linear program on measures. This optimization problem is solved using Lasserre's relaxations, via a sequence of semidefinite programs with vanishing conservatism. By post-processing the solution of step one, the set of gains that maximize the size of tolerable initial errors is identified in step two. To demonstrate the feasibility of the presented approach, two examples are presented.
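For reference, a Luenberger-type observer is a copy of the plant driven by the output error. The sketch below simulates one for a linear discrete-time system with numpy; the matrices and the gain L are illustrative choices, and the measure-valued gain optimization that is the paper's actual contribution is not shown.

    # Minimal Luenberger observer for a linear discrete-time system
    # (illustrative values; not the paper's nonlinear design procedure).
    import numpy as np

    A = np.array([[1.0, 0.1],
                  [0.0, 1.0]])    # double-integrator-like dynamics
    C = np.array([[1.0, 0.0]])    # only the first state is measured
    L = np.array([[0.5],
                  [1.0]])         # assumed gain; A - L C has eigenvalues inside
                                  # the unit circle, so the error contracts

    x = np.array([1.0, -1.0])     # true (unknown) state
    xh = np.zeros(2)              # observer state, arbitrary initialization

    for k in range(50):
        y = C @ x                               # measurement
        xh = A @ xh + (L @ (y - C @ xh)).ravel()  # copy of plant + output injection
        x = A @ x

    print("final estimation error:", np.linalg.norm(x - xh))

The error obeys e(k+1) = (A - L C) e(k), so any gain stabilizing A - L C gives asymptotic convergence from every initial error; the paper instead asks which initial errors reach a target set in finite time, and which gain makes that set of initial errors largest.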
Collective intelligence is the ability of a group to perform more effectively than any individual alone. Diversity among group members is a key condition for the emergence of collective intelligence, but maintaining diversity is challenging in the face of social pressure to imitate one's peers. We investigate the role incentives play in maintaining useful diversity through an evolutionary game-theoretic model of collective prediction. We show that market-based incentive systems produce herding effects, reduce the information available to the group, and suppress collective intelligence. In response, we propose a new incentive scheme that rewards accurate minority predictions, and show that this produces optimal diversity and collective predictive accuracy. We conclude that real-world systems should reward those who have demonstrated accuracy when the majority opinion has been in error.
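The following stylized replicator-dynamics simulation illustrates the qualitative effect described above. The payoff rules ("pay all correct predictors" versus "add a bonus for correct minority predictions") and all parameter values are simplified stand-ins of our own choosing, not the paper's model.

    # Stylized sketch: independents follow a private signal of accuracy q;
    # herders copy the majority. A replicator update tracks the share of
    # independents under two reward schemes. Illustrative model only.
    import random

    def avg_payoffs(frac_indep, n=25, q=0.6, bonus=0.0, rounds=200):
        """Monte-Carlo average payoff per strategy at a given population mix."""
        n_ind = max(1, int(frac_indep * n))
        pay_ind = pay_herd = 0.0
        for _ in range(rounds):
            correct = [random.random() < q for _ in range(n_ind)]
            maj_right = sum(correct) > n_ind / 2
            pay_herd += 1.0 if maj_right else 0.0     # herders side with majority
            # Correct independents earn 1; correct dissenters from a wrong
            # majority additionally earn the minority bonus.
            unit = 1.0 if maj_right else 1.0 + bonus
            pay_ind += unit * sum(correct) / n_ind
        return pay_ind / rounds, pay_herd / rounds

    def evolve(bonus, frac=0.5, steps=60):
        """Discrete replicator update on the share of independent predictors."""
        for _ in range(steps):
            pi, ph = avg_payoffs(frac, bonus=bonus)
            mean = frac * pi + (1 - frac) * ph
            if mean > 0:
                frac = min(1.0, max(0.01, frac * pi / mean))
        return frac

    print("independents, no bonus:  ", round(evolve(bonus=0.0), 2))  # herding wins
    print("independents, with bonus:", round(evolve(bonus=5.0), 2))  # diversity kept

Without the bonus, herders free-ride on the usually-correct majority and the independent share collapses; with a large enough minority bonus, dissent pays often enough to sustain the diversity the group's accuracy depends on.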
Gradecast is a simple three-round algorithm presented by Feldman and Micali. The current work presents a very simple algorithm that utilizes Gradecast to achieve Byzantine agreement. Two small variations of the presented algorithm lead to improved algorithms for solving the approximate agreement problem and the multi-consensus problem. An optimal approximate agreement algorithm was presented by Fekete; it supports up to n/4 Byzantine nodes and has message complexity of O(n^k), where n is the number of nodes and k is the number of rounds. Our solution to the approximate agreement problem is optimal and simple, reduces the message complexity to O(k * n^3), and supports up to n/3 Byzantine nodes. Multi-consensus was first presented by Bar-Noy et al. and consists of l consecutive executions of Byzantine consensus. Bar-Noy et al. show an optimal amortized solution to this problem, assuming that all nodes start each consensus instance at the same time, a property that cannot be guaranteed with early stopping. Our solution is simpler, preserves round complexity optimality, allows early stopping, and does not require synchronized starts of the consensus instances.
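For context, here is a sketch of Gradecast's grading rule, reconstructed from the standard Feldman-Micali description: the leader's value is relayed over three rounds, and each node then grades its output by how much support the value gathered. Message handling and Byzantine behavior are elided, and all names are our own.

    # Gradecast grading step (sketch). round3_values is the multiset of values
    # a node received in the third round; n is the node count, f the fault bound.
    from collections import Counter

    def grade(round3_values, n, f):
        """Decide (value, grade) from the round-3 messages.

        grade 2: >= n - f nodes relayed the same value (safe to adopt);
        grade 1: >= f + 1 did (some honest node may have graded it 2);
        grade 0: no value has enough support.
        """
        if not round3_values:
            return None, 0
        value, count = Counter(round3_values).most_common(1)[0]
        if count >= n - f:
            return value, 2
        if count >= f + 1:
            return value, 1
        return None, 0

    print(grade(["v", "v", "v"], n=4, f=1))   # -> ('v', 2)
    print(grade(["v", "v"], n=4, f=1))        # -> ('v', 1)
    print(grade(["v"], n=4, f=1))             # -> (None, 0)

The graded guarantee, that an honest grade of 2 forces every honest node to hold the same value with grade at least 1, is what lets a higher-level loop detect a cheating node and is the hook the presented Byzantine agreement algorithm builds on.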